interview prep · 696 questions
What actually gets asked
Real interview questions mapped to every topic in the course — grouped by module and chapter. Tap any card for how to approach it, what a strong answer covers, a quick self-check, the follow-ups, and the trap. Company tags are best-effort and sourced: a company is shown only when a public source names it; everything else reads “Commonly asked.” The ★ must-know set is the high-yield core — widely asked and easy to get wrong.
Showing 696 of 696 questions
01 Core CS / DSA 90 Q's
1.1 Big-O & complexity reasoning 15
-
Why do we drop constants and lower-order terms in Big-O?
Big-O describes asymptotic growth as n grows large, so the dominant term decides how the cost scales; constants and lower-order terms become negligible in that limit.
3n + 5andnboth grow linearly, so both are O(n). The point is to compare how algorithms scale, not to predict exact wall-clock time.What a strong answer coversBig-O measures asymptotic growth as
n → ∞, not real wall-clock time.The fastest-growing term dominates; constants and lower-order terms vanish in the limit.
3n + 5,n, and500nare all O(n) — the same growth class.The goal is comparing how algorithms scale, not benchmarking a specific machine.
Quick self-checkWhich expression is NOT O(n)?
-
Linear: the constant 2 and the +7 drop, leaving O(n).
-
n dominates log n, so this is O(n).
-
Correct — dividing by a constant doesn't change the class; it's still O(n²), which grows faster than O(n).
-
A constant multiple of n is still O(n).
Follow-ups they push on- When do constants actually matter in practice?
- Is an O(n) algorithm always faster than an O(n log n) one?
Red flag Treating O(2n) or O(n + 100) as meaningfully different from O(n), or implying Big-O predicts real runtime rather than scaling.
source: Tech Interview Handbook — Algorithms / Complexity ↗ -
What's the time complexity of the naive recursive Fibonacci, and why is it so bad?
Naive
fib(n) = fib(n-1) + fib(n-2)is O(2^n) (more precisely O(φ^n)) because each call spawns two more and the recursion tree's size roughly doubles per level, recomputing the same subproblems exponentially many times. Memoizing the results collapses it to O(n) time (eachfib(k)computed once), and an iterative version is O(n) time with O(1) space. This is the canonical 'overlapping subproblems ⇒ use DP' example.What a strong answer coversBranching factor 2 with depth n ⇒ ~
2^ncalls.Same subproblems recomputed repeatedly (overlapping subproblems).
Memoization ⇒ O(n) time; iterative ⇒ O(n) time, O(1) space.
Recursion tree size, not depth, drives the exponential cost.
Quick self-checkWhy is naive recursive Fibonacci exponential while memoized is linear?
-
Wrong — the recurrence is the same; the difference is caching, not branching.
-
Correct — overlapping subproblems are recomputed exponentially without a cache.
-
Wrong — recursion isn't inherently exponential; the redundant recomputation is the cause.
-
Wrong — both versions still add fib(n-1) + fib(n-2).
Follow-ups they push on- How much space does the memoized version use?
- Can you compute Fibonacci faster than O(n)?
Red flag Estimating it as O(n) by counting recursion depth instead of the exponential number of nodes in the call tree.
source: Tech Interview Handbook — Dynamic Programming cheatsheet ↗ -
An algorithm has two separate phases: one O(n) and one O(n^2). What's the overall complexity, and what if a third phase is O(m)?
Sequential phases add, then you keep the dominant term: O(n) + O(n^2) = O(n^2), because the quadratic term swamps the linear one as n grows. When a phase depends on a different input size
m, you cannot fold it into n — the honest answer is O(n^2 + m), keeping both variables because either could dominate depending on the inputs.What a strong answer coversSequential (non-nested) phases add; nested phases multiply.
After adding, drop dominated terms:
O(n) + O(n²) = O(n²).Independent input sizes stay separate:
O(n² + m), notO(n²).Only collapse
mintonif you can provem ≤ n(or similar).
Follow-ups they push on- When is it wrong to assume m ≈ n in a graph problem (V vs E)?
- If the phases were nested instead of sequential, what changes?
Red flag Silently assuming a second input variable equals n, or multiplying sequential phases that should be added.
source: Big-O Cheat Sheet ↗ -
You sort an array (O(n log n)) and then do a single linear scan. What's the combined complexity, and is the sort 'free' because the scan is O(n)?
Combined it's O(n log n) — the sort dominates the linear scan, so the scan is effectively absorbed, but the sort is certainly not free; it sets the overall complexity. A common mistake is to advertise 'an O(n) two-pointer solution' while quietly sorting first, which makes the real cost O(n log n). Always fold the preprocessing cost into the bound you quote.
What a strong answer coversO(n log n) + O(n) = O(n log n)— the larger term wins.The sort is the bottleneck, not 'free'.
A two-pointer pass after a sort is an O(n log n) solution overall.
Quote the cost including preprocessing, not just the hot loop.
Quick self-checkSort (O(n log n)) followed by a separate O(n) scan — overall?
-
Wrong — ignores the dominant sorting step.
-
Correct — the sort dominates; adding O(n) doesn't raise the class.
-
Wrong — the phases are sequential (added), not nested (multiplied).
-
Wrong — there's at least linear work, far more than logarithmic.
Follow-ups they push on- When would an O(n)-space hash approach beat the sort-then-scan approach?
- If the array were already sorted, what would change?
Red flag Claiming an 'O(n) solution' that secretly sorts the input first — the honest bound is O(n log n).
source: Tech Interview Handbook — Algorithms / Sorting ↗ -
Order these from fastest- to slowest-growing: O(n log n), O(1), O(n!), O(log n), O(n^2), O(n), O(2^n).
O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(2^n) < O(n!). The split that matters most in interviews is polynomial (everything up to O(n^2)) versus exponential/factorial (O(2^n), O(n!)), which become intractable for even modest n. Knowing where a candidate algorithm sits on this ladder is usually the first thing an interviewer wants.
Follow-ups they push on- Give a concrete algorithm that lands in each class.
Red flag Putting O(n log n) above O(n^2), or thinking O(2^n) and O(n^2) are close because both 'have an n and a power'.
source: Big-O Cheat Sheet ↗ -
If 100x more data slows an operation ~100x, what's its complexity? What if it slows ~10,000x?
~100x slowdown for 100x data is linear, O(n). A ~10,000x slowdown is 100^2, i.e. O(n^2) — the quadratic term means scaling the input scales the cost by the square. This back-of-envelope reasoning is exactly how you sanity-check whether a measured slowdown matches your assumed complexity.
Red flag Confusing the input multiplier with the time multiplier, or assuming any slowdown larger than linear must be exponential.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗ -
A loop runs `for (i = 1; i < n; i *= 2)`. What's its time complexity, and why?
It's O(log n). Multiplying
iby 2 each iteration meansitakes the values 1, 2, 4, 8, …, so the loop body runs about log2(n) times beforeireaches n. Any loop where the counter is multiplied or divided by a constant factor (rather than added to) is logarithmic — this is the same shape as binary search.What a strong answer coversCounter multiplied/divided by a constant ⇒ logarithmic, not linear.
1, 2, 4, … n means ~
log2(n)iterations.Contrast with
i += 1(ori += c), which is O(n).Nesting this inside an O(n) loop gives O(n log n).
Quick self-checkTime complexity of `for (i = n; i > 1; i /= 2)`?
-
Wrong — the counter is halved each step, not decremented by one.
-
Correct — halving n repeatedly takes ~log2(n) steps to reach 1.
-
Wrong — there's only one loop, no multiplicative nesting.
-
Wrong — the number of iterations grows with n, just slowly.
Follow-ups they push on- What's the complexity if the inner loop instead did `j *= 3`?
- What if you nest this logarithmic loop inside a `for i in 0..n`?
Red flag Calling it O(n) because it's 'a loop up to n', ignoring that the counter grows geometrically, not by one.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗ -
What does Big-O actually bound — Big-O vs Big-Theta vs Big-Omega — and why do people say O(n) when they mean Θ(n)?
Big-O is an upper bound (grows no faster than), Big-Omega (Ω) is a lower bound (grows no slower than), and Big-Theta (Θ) is a tight bound (both at once). Strictly, an O(n) algorithm is also O(n^2) because O is only an upper bound, so the precise claim is usually Θ(n). In interviews people say 'O(n)' loosely to mean the tight bound; it's fine, but knowing the distinction signals rigor.
What a strong answer coversO = upper bound, Ω = lower bound, Θ = tight (both).
An O(n) algorithm is technically also O(n^2) — O doesn't have to be tight.
Θ(n) is the precise statement people usually intend by 'O(n)'.
Worst-case Big-O is the common interview default unless stated otherwise.
Quick self-checkWhich statement is technically correct for an algorithm that always does exactly n steps?
-
Wrong — Θ(n) is correct, but it IS also O(n²) since O is just an upper bound.
-
Correct — it's bounded above by both n and n², and tightly bounded by n.
-
Wrong — n steps does not grow at least as fast as n², so it's not Ω(n²).
-
Wrong — any larger upper bound like O(n²) also holds.
Follow-ups they push on- Give an algorithm whose best and worst cases differ in Big-Theta.
- Is saying 'quicksort is O(n^2)' wrong?
Red flag Insisting O must be a tight bound, or conflating worst-case with Big-O (they're independent axes).
source: MIT OCW 6.006 — Asymptotic notation ↗ -
A function loops `i` from 0 to n and, inside, loops `j` from 0 to `i`. Is that O(n^2)?
Yes, it's O(n^2) — even though the inner loop doesn't always run n times. The total iterations are 0 + 1 + 2 + … + (n-1) = n(n-1)/2, which is ~n^2/2; dropping the constant gives O(n^2). The lesson: a triangular nested loop is still quadratic, because half of a square is still proportional to n^2.
What a strong answer coversTotal work is the arithmetic series
0+1+…+(n-1) = n(n-1)/2.That's
~n²/2⇒ O(n²) after dropping the constant.'Inner loop shorter each time' does not save an order of magnitude.
Same total as comparing all unique pairs of n items.
Quick self-checkTotal iterations of the inner body across the whole run?
-
Wrong — that would be a single linear pass, not a nested loop.
-
Wrong — no halving happens; both loops are arithmetic.
-
Correct — it's the sum 0+1+…+(n-1) = n(n-1)/2 ≈ n²/2.
-
Wrong — that's the full square; the triangular loop runs about half as many times (same O(n²) class though).
Follow-ups they push on- How many distinct pairs `(i, j)` with `i < j` exist among n items?
- What if the inner loop ran to `i*i` instead of `i`?
Red flag Claiming it's O(n) or 'O(n²/2)' — the triangular shape is still Θ(n²) and constants are dropped.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗ -
Explain best, average, and worst case. Which one does Big-O usually refer to, and why?
Best/average/worst describe how cost varies across different inputs of the same size. Interviewers usually mean worst-case Big-O because it's the guarantee that holds regardless of input, but average case matters for things like quicksort (avg O(n log n), worst O(n^2)) and hash maps (avg O(1), worst O(n)). Good practice: state worst-case first, then add the expected/average case with the assumption it relies on.
Follow-ups they push on- What assumption makes a hash-map lookup average O(1)?
- Why might average case be the more honest number for quicksort?
Red flag Quoting average case as if it were a worst-case guarantee, especially for hashing or randomized algorithms.
source: Tech Interview Handbook — Algorithms / Complexity ↗ -
What is amortized complexity? Why is appending to a dynamic array amortized O(1) if a resize is O(n)?
Amortized cost is the average cost per operation across a long sequence, even if individual operations vary. A dynamic array doubles capacity on resize, so a resize costs O(n) but only happens after ~n cheap appends; spreading that O(n) over the n appends gives O(1) per append on average. The doubling (geometric growth) is what makes the total work across n appends O(n), not O(n^2).
Follow-ups they push on- What breaks if you grow the array by a fixed +1 instead of doubling?
- Is amortized O(1) the same as worst-case O(1)?
Red flag Calling a single resizing append O(1), or claiming linear (+1) growth still gives amortized O(1) appends.
source: InterviewPlus — Understanding Amortized Time Complexity ↗ -
What's the time and space complexity of a recursive function that recurses on n/2 and does O(1) work per call?
Halving n each call with O(1) work per level gives O(log n) time (about log2(n) levels). Space is O(log n) too because the call stack holds one frame per level until the base case unwinds — a point candidates often miss when they say O(1) space. This is the binary-search recursion shape.
Follow-ups they push on- How does an iterative version change the space complexity?
Red flag Reporting O(1) space for a recursive solution by forgetting the call stack costs O(depth).
source: InterviewPrep — Algorithm Complexity Interview Questions ↗ -
Two nested loops over the same array of size n look O(n^2). When can nested loops still be O(n)?
Nesting doesn't automatically mean O(n^2) — what matters is total iterations. In a sliding-window or two-pointer pass, the inner pointer advances monotonically and never resets, so across the whole run it moves at most n times total: the two loops combined do O(n) work. Always count how many times the inner body actually runs, not how deeply the loops nest.
Follow-ups they push on- What's the complexity of the sliding-window longest-substring solution?
Red flag Mechanically multiplying loop depths instead of bounding the total number of inner iterations.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗ -
When do constant factors and Big-O 'lie' in practice — i.e. when is a higher-Big-O algorithm actually faster?
Big-O hides constants and cache effects, so for small or medium n a higher-Big-O algorithm with a tiny constant often wins. Classic cases: insertion sort (O(n^2)) beats quicksort on tiny arrays — which is why Timsort/introsort fall back to it; linear scan of a contiguous array can beat a hash map for small n because of cache locality and no hashing overhead; and an O(n log n) algorithm with huge constants can lose to a well-tuned O(n^2) until n is large. The honest senior answer is 'Big-O tells you scaling behavior; profile to know the crossover point for your actual n.'
What a strong answer coversBig-O drops constants and ignores cache locality / memory hierarchy.
Insertion sort beats quicksort for tiny n ⇒ hybrid sorts switch over.
Contiguous linear scan can beat a hash map for small n (locality, no hashing).
There's a crossover n; profile rather than assume the lower class always wins.
Follow-ups they push on- Why does Timsort use insertion sort on small runs?
- How does cache locality favor arrays over linked lists despite equal Big-O?
Red flag Treating Big-O as a real-runtime ranking for all n, ignoring constants, locality, and the crossover point.
source: Tech Interview Handbook — Algorithms / Complexity ↗ -
You can solve a problem in O(n) time with O(n) extra space, or O(n log n) time with O(1) space. How do you decide?
It's a time/space trade-off driven by constraints: if memory is tight (embedded, huge inputs, streaming) favour the O(1)-space version; if latency dominates and memory is cheap, take the O(n)-time version. The strong-signal answer names the constraints out loud, states the assumption (e.g. input fits in memory), and asks the interviewer about input size and environment rather than guessing.
Follow-ups they push on- When would O(n) extra space be a non-starter even if it's faster?
Red flag Optimizing only time and never mentioning the space cost, or picking one without surfacing the trade-off.
source: InterviewPrep — Algorithm Analysis Interview Questions ↗
1.2 Linear structures — when to reach for each 15
-
Why is a doubly linked list often paired with a hash map (e.g. in an LRU cache), and what does each part provide?
The hash map gives O(1) lookup from key to its node; the doubly linked list gives O(1) reordering — unlink a node from anywhere and move it to the front/back using its
prev/nextpointers. Neither alone suffices: a hash map has no order, and a list alone needs O(n) to find a node. Together they back an LRU cache where bothgetandput(including evicting the least-recently-used entry) are O(1). The doubly-linked part is essential because unlinking an interior node in O(1) requires knowing its predecessor, which only a backward pointer provides.What a strong answer coversHash map: O(1) find key → node. Doubly linked list: O(1) reorder/evict.
Map stores pointers to list nodes, not values, so manipulation is direct.
prevpointer is what makes interior unlink O(1) (a singly list can't).Move-to-front on access; evict from the tail (the LRU end).
Quick self-checkIn an LRU cache, why a doubly linked list rather than a singly linked one?
-
Wrong — it uses more (an extra pointer per node).
-
Correct — without prev you'd need an O(n) scan to find the predecessor.
-
Wrong — they can store any payload; the issue is O(1) unlinking.
-
Wrong — LRU order is by recency of use, not by key, and neither list type auto-sorts.
Follow-ups they push on- Why a doubly (not singly) linked list specifically?
- What stale-reference bug appears if you evict from the list but not the map?
Red flag Removing the evicted node from the list but leaving its key in the hash map, leaving a dangling stale reference.
source: LeetCode 146 — LRU Cache (company tags) ↗ -
Array vs linked list: compare index access, insert/delete, and memory. When would you choose each?
Arrays are contiguous: O(1) random index access but O(n) to insert/delete in the middle (shifting). Linked lists give O(1) insert/delete at a known node or the ends, but O(n) to find or index because you must walk the pointers. Reach for an array when you index a lot and the size is roughly known; reach for a linked list when you constantly add/remove at the front (or splice known nodes) and rarely index.
Follow-ups they push on- Why is array access cache-friendly but linked-list traversal often isn't?
- What does a dynamic array (ArrayList/vector) change about this comparison?
Red flag Saying linked lists are 'faster for inserts' without the 'at a known position' caveat — finding the position is still O(n).
source: Tech Interview Handbook — Linked List cheatsheet ↗ -
Merge two sorted linked lists into one sorted list.
Walk both lists with a dummy head and a
tailpointer: at each step append the smaller of the two current nodes and advance that list; when one list runs out, append the remainder of the other. O(n + m) time, O(1) extra space because you splice existing nodes rather than allocate new ones. The dummy node removes the special case for choosing the very first node.What a strong answer coversDummy head +
tailpointer avoids first-node special-casing.Splice existing nodes ⇒ O(1) extra space.
O(n + m) time, one comparison per node.
Attach the leftover tail wholesale once one list empties.
Follow-ups they push on- How does this become the merge step of merge sort on a list?
- Extend to merging k sorted lists efficiently.
Red flag Forgetting to attach the remaining nodes of the non-empty list after the loop ends.
source: LeetCode 21 — Merge Two Sorted Lists (company tags) ↗ -
What's the difference between a singly and a doubly linked list, and what does the second pointer cost and buy you?
A singly linked list has only a
nextpointer per node (forward traversal only); a doubly linked list adds aprevpointer, enabling backward traversal and O(1) deletion of a node given only a reference to it. The cost is one extra pointer of memory per node plus more bookkeeping on every insert/delete (you must fix two links, not one). Choose doubly when you need to walk backward or splice out arbitrary nodes cheaply (LRU caches, browser history); choose singly to save memory when forward-only suffices.What a strong answer coversSingly:
nextonly, forward traversal; Doubly:prev+next.Doubly enables O(1) delete given just the node and backward walks.
Cost: extra pointer per node + dual-link maintenance on every edit.
Use doubly for LRU/history; singly when forward-only and memory-tight.
Quick self-checkWhat does the `prev` pointer in a doubly linked list primarily buy you?
-
Wrong — both list types are O(n) to index; arrays give O(1) random access.
-
Correct — prev lets you fix the predecessor's link without an O(n) search.
-
Wrong — it adds a pointer, increasing memory.
-
Wrong — no list type sorts automatically.
Follow-ups they push on- Can you delete a known node in O(1) in a singly linked list (with a trick)?
- Why do LRU caches specifically need the doubly-linked variant?
Red flag Updating only `next` (or only `prev`) on insert/delete and corrupting one direction of the list.
source: GeeksforGeeks — Doubly Linked List ↗ -
Explain a stack vs a queue vs a deque in one sentence each, and give a real use for each.
A stack is LIFO — last in, first out — used for undo, call stacks, DFS, and expression parsing. A queue is FIFO — first in, first out — used for task/work queues and BFS. A deque is double-ended, O(1) push/pop at both ends, used when you need front-and-back access (and as a faster substitute for inserting at index 0 of an array).
Follow-ups they push on- Which would you use for BFS, and why not the other?
Red flag Mixing up which end stays open, or claiming a stack is good for FIFO ordering.
source: Tech Interview Handbook — Stack cheatsheet ↗ -
Reverse a singly linked list.
Iterate with three pointers —
prev,curr,next— and on each step reverse the link (curr.next = prev) then advance all three; returnprevat the end. This is O(n) time, O(1) space. The recursive version is O(n) space due to the call stack, so mention the iterative one first.Follow-ups they push on- Now reverse only nodes between positions m and n.
- Reverse the list in groups of k.
Red flag Losing the rest of the list by overwriting `curr.next` before saving `next`.
source: LeetCode 206 — Reverse Linked List (company tags) ↗ -
Determine if a string of brackets ()[]{} is validly matched.
Push each opening bracket onto a stack; on a closing bracket, pop and check it matches the expected opener, failing fast on mismatch or empty stack. At the end the string is valid only if the stack is empty. O(n) time, O(n) space — the classic motivating example for why stacks exist.
Follow-ups they push on- Handle the longest valid-parentheses substring.
- What if other characters are interleaved with the brackets?
Red flag Forgetting to check the stack is empty at the end, so unmatched openers like `(((` are wrongly accepted.
source: LeetCode 20 — Valid Parentheses (company tags) ↗ -
Find the middle node of a singly linked list in one pass.
Use the fast/slow pointer trick: advance
slowby one andfastby two each step; whenfastreaches the end,slowsits at the middle. It's O(n) time, O(1) space, and finishes in a single pass — no need to first count the length and then walk halfway. For an even-length list, decide up front whether you return the first or second middle (thefast/fast.nextloop condition controls this).What a strong answer coversTwo pointers, speeds 1 and 2 ⇒ slow lands at the middle in one pass.
O(n) time, O(1) space; no length precomputation.
Even length: loop condition picks first vs second middle.
Same tortoise/hare machinery as cycle detection.
Follow-ups they push on- For even length, how do you choose between the two middles?
- How does this generalize to finding the node n/k of the way through?
Red flag Looping while `fast != null` instead of checking `fast && fast.next`, dereferencing null on even-length lists.
source: LeetCode 876 — Middle of the Linked List (company tags) ↗ -
Remove the nth node from the end of a singly linked list in one pass.
Advance a
fastpointer n nodes ahead, then movefastandslowtogether untilfasthits the end — nowslowis just before the node to remove, so you splice it out. Use a dummy head in front of the real head so removing the first node needs no special case. One pass, O(n) time, O(1) space.What a strong answer coversGap of n between
fastandslowlocates the target in one pass.Dummy node before head removes the edge case of deleting the head.
O(n) time, O(1) space.
Stop
fastat the last node soslowlands on the predecessor.
Follow-ups they push on- Why does the dummy node matter when n equals the list length?
- How would you do it in two passes, and why prefer one?
Red flag Skipping the dummy node and crashing (or returning the wrong head) when the node to remove is the head itself.
source: LeetCode 19 — Remove Nth Node From End of List (company tags) ↗ -
How does a circular buffer (ring buffer) work, and where is it the right choice?
A ring buffer is a fixed-size array with
headandtailindices that wrap around using modulo; you enqueue attailand dequeue athead, both O(1), reusing slots instead of shifting. It's ideal for bounded producer/consumer streams — audio/IO buffering, recent-event logs, fixed-window rate limiting — where memory must be capped and old data can be overwritten. The classic subtlety is distinguishing full from empty when head == tail (track a size/count or leave one slot unused).What a strong answer coversFixed array + wrapping
head/tailvia modulo ⇒ O(1) enqueue/dequeue.No element shifting and no dynamic allocation after setup.
Best for bounded streaming buffers (audio, logs, IO).
full-vs-empty ambiguity at
head == tailneeds a count or a sacrificed slot.
Follow-ups they push on- How do you tell a full buffer from an empty one?
- What happens to the oldest data when the buffer is full and you write?
Red flag Failing to disambiguate full from empty (both have head == tail), corrupting reads/writes.
source: GeeksforGeeks — Circular Queue ↗ -
Given daily temperatures, for each day return how many days until a warmer one (a monotonic-stack problem).
Use a monotonic decreasing stack of indices: scan left to right, and while the current temperature exceeds the temperature at the stack's top index, pop it and record the gap (current index − popped index) as its answer. Push the current index. Each index is pushed and popped at most once, so it's O(n) time, O(n) space — far better than the O(n^2) double loop. The stack pattern answers 'next greater element' style questions generally.
What a strong answer coversStack holds indices awaiting a warmer day, kept decreasing by temperature.
Pop and resolve each index when a warmer day arrives.
Each index pushed/popped once ⇒ O(n) time.
Generalizes to 'next greater/smaller element' problems.
Follow-ups they push on- How does this generalize to 'next greater element' on a circular array?
- Why is the amortized cost O(n) despite the inner while-loop?
Red flag Storing temperatures instead of indices on the stack, losing the distance needed for the answer.
source: LeetCode 739 — Daily Temperatures (company tags) ↗ -
Detect whether a singly linked list has a cycle, using O(1) extra space.
Use Floyd's tortoise-and-hare: a slow pointer moves one step and a fast pointer two steps; if they ever meet there's a cycle, and if fast reaches null there isn't. O(n) time, O(1) space. A hash set of visited nodes also works but costs O(n) space, so lead with Floyd's.
Follow-ups they push on- Return the node where the cycle begins.
- How do you find the cycle's length?
Red flag Advancing the fast pointer without null-checking both `fast` and `fast.next`, causing a crash on even-length lists.
source: LeetCode 141 — Linked List Cycle (company tags) ↗ -
Design a stack that returns its minimum element in O(1) alongside push/pop/top.
Keep a second 'min stack' that records the running minimum in parallel with the main stack; on push you store min(value, currentMin), and on pop you pop both. Every operation stays O(1) time and the structure uses O(n) extra space. The key idea is that each level remembers the min as of when it was pushed, so popping restores the previous min for free.
Follow-ups they push on- Reduce the extra space when many pushed values repeat.
Red flag Storing only a single min variable, which can't recover the previous minimum after the current min is popped.
source: LeetCode 155 — Min Stack (company tags) ↗ -
Why is inserting at the front of a dynamic array O(n), and what should you use instead?
Inserting at index 0 forces every existing element to shift one slot right, which is O(n) per insert. If you frequently add/remove at the front, use a deque (or a linked list), which gives O(1) push/pop at both ends. This is a common hidden-quadratic bug: building a result by repeatedly inserting at the front of an array turns an O(n) loop into O(n^2).
Follow-ups they push on- When is appending to the end of a dynamic array still cheap?
Red flag Reaching for `arr.unshift(...)`/insert-at-0 in a loop and not noticing it makes the whole loop quadratic.
source: MDN — JavaScript Array ↗ -
Implement a FIFO queue using two LIFO stacks.
Keep an
instack for pushes and anoutstack for pops; whenoutis empty, pour everything frominintoout, which reverses the order and exposes the oldest element. Each element is moved at most once between stacks, so dequeue is amortized O(1) even though a single transfer is O(n). This is a clean test of whether a candidate understands LIFO-vs-FIFO and amortized cost.Follow-ups they push on- What's the worst-case (not amortized) cost of a single pop?
Red flag Transferring on every dequeue instead of only when `out` is empty, which makes it O(n) per op.
source: LeetCode 232 — Implement Queue using Stacks (company tags) ↗
1.3 Hashing structures 15
-
Find the length of the longest consecutive sequence of integers in an unsorted array, in O(n).
Put every number in a hash set for O(1) membership, then for each number
xstart counting a run only ifx - 1is absent (so x is a sequence start); from such a start, extend x, x+1, x+2, … while present and track the longest. Starting only at run-beginnings means each number is visited O(1) times overall, giving O(n) time, O(n) space — beating the O(n log n) sort-then-scan.What a strong answer coversSet membership gives O(1) 'is this number present?' checks.
Only begin counting where
x-1is missing (a run start).That guard bounds total work to O(n), not O(n²).
Beats sorting (O(n log n)) by trading time for O(n) space.
Quick self-checkWhy is the algorithm O(n) and not O(n²) despite the inner while-loop?
-
Wrong — a single run can be long; the bound comes from where runs start.
-
Correct — the `x-1 absent` guard ensures each element is visited O(1) times across all runs.
-
Wrong — sorting would make it O(n log n); this approach avoids sorting.
-
Wrong — O(1) lookups alone don't prevent quadratic total work; the run-start guard does.
Follow-ups they push on- Why does the 'only start where x-1 is absent' check keep it O(n)?
- What if duplicates are present in the input?
Red flag Extending a run from every element (O(n^2)) instead of only from numbers that begin a run.
source: LeetCode 128 — Longest Consecutive Sequence (company tags) ↗ -
Why must objects used as hash-map keys be effectively immutable, and what is the equals/hashCode contract?
A hash map places a key in a bucket derived from its hash; if you mutate a key after insertion so its hash changes, the entry is now in the 'wrong' bucket and lookups silently fail to find it. The equals/hashCode contract is the rule that ties them together: if two objects are equal they must have the same hash code, and equal objects must stay equal — so keys should be immutable (or at least their hash-relevant fields must be). Override
hashCodewhenever you overrideequals, or hash-based collections break.What a strong answer coversBucket is chosen from the key's hash at insert time.
Mutating a key's hash-relevant fields strands the entry in the wrong bucket.
Contract: equal objects ⇒ equal hash codes (not vice-versa).
Override
equals⇒ you must overridehashCodetoo.
Quick self-checkYou override `equals` to compare two fields but leave the default `hashCode`. What breaks?
-
Wrong — the map first picks a bucket by hashCode, so equal objects may land in different buckets.
-
Correct — the contract is violated; lookup hashes to a different bucket than where the key sits.
-
Wrong — hash maps don't sort keys at all.
-
Wrong — the failure is correctness (missed lookups), not just performance.
Follow-ups they push on- What goes wrong if hashCode is constant for all keys?
- Why is using a mutable list as a key dangerous?
Red flag Overriding `equals` but not `hashCode` (or mutating a key in place), so lookups for present keys return nothing.
source: Oracle Java SE — Object.hashCode() contract ↗ -
How does a hash map achieve average O(1) lookup, and why is the worst case O(n)?
A hash function maps a key to a bucket index, so with a good hash and a reasonable load factor most buckets hold ~1 entry and lookup is average O(1). The worst case is O(n) when many keys collide into the same bucket (bad hash, adversarial keys, or everything hashing the same), degrading a bucket into a linear scan. The O(1) is therefore an expected/average bound, not a guarantee.
Follow-ups they push on- What makes a hash function 'good'?
- How can an attacker force the worst case (hash flooding)?
Red flag Stating O(1) as a hard worst-case guarantee instead of an average/expected one.
source: Hirist — Top HashMap Interview Questions ↗ -
When should you use a hash set vs a hash map?
A hash set stores keys only and answers 'have I seen this?' — use it for membership, dedup, and presence checks. A hash map stores key→value associations — use it when you also need data attached to each key (counts, indices, last-seen position). Both give average O(1) ops; a set is essentially a map whose values you don't care about. Reach for the map the moment you need to remember *something about* each key, not just *that* you saw it.
What a strong answer coversSet: membership / dedup / 'seen?' — keys only.
Map: key → value — counts, indices, metadata per key.
Both average O(1); a set is a valueless map.
Two Sum needs a map (value→index); 'contains duplicate' needs only a set.
Quick self-checkYou must return the index of a matching earlier element. Set or map?
-
Wrong — a set proves presence but stores no index to return.
-
Correct — you need to recall *where* you saw the value, which requires a stored value.
-
Wrong — only the map retains the index you must return.
-
Wrong — a hash map solves it in one O(n) pass without sorting.
Follow-ups they push on- Which would you use for Two Sum, and why not the other?
- Which for 'does this array contain any duplicate'?
Red flag Using a set when you later need the associated value (e.g. an index), forcing an awkward rework.
source: AlgoArk — Hash Map Patterns for Interviews ↗ -
Determine whether an array contains any duplicate values.
Walk the array once, inserting each value into a hash set; if a value is already present, return true immediately, otherwise return false at the end. O(n) time, O(n) space. Alternatively sort first and check adjacent equal pairs for O(n log n) time and O(1) extra space — a clean time/space trade-off to mention.
What a strong answer coversHash set: insert each, return true on the first repeat.
O(n) time, O(n) space.
Sort-and-scan alternative: O(n log n) time, O(1) extra space.
Early exit on first duplicate; no need to finish the scan.
Follow-ups they push on- What if duplicates only count when within k indices of each other?
- How would you do it with O(1) extra space?
Red flag Comparing all pairs with a double loop (O(n^2)) when a single hash-set pass is O(n).
source: LeetCode 217 — Contains Duplicate (company tags) ↗ -
Given an array and a target, return indices of two numbers that sum to the target.
Walk the array once, and for each value
xcheck a hash map fortarget - x; if present you've found the pair, otherwise storex -> indexand continue. This is O(n) time, O(n) space — the canonical 'use a hash map to remember what you've seen' problem. The brute-force double loop is O(n^2); the hash map trades space for that speedup.Follow-ups they push on- What changes if the array is already sorted?
- How would you return all unique pairs (3Sum-style)?
Red flag Matching an element with itself by checking the map before inserting the current element incorrectly.
source: LeetCode 1 — Two Sum (company tags) ↗ -
Find the first non-repeating character in a string and return its index.
Make one pass to build a hash map (or 26-length array) of character counts, then a second pass over the string returning the index of the first character whose count is 1; return -1 if none. Two linear passes, O(n) time, O(1) space if the alphabet is fixed (at most 26/128 entries). The second pass must walk the original string order, not the map, because a map has no positional order.
What a strong answer coversPass 1: count frequencies in a map/array.
Pass 2: scan the string in order, return first index with count 1.
O(n) time; O(1) space for a fixed alphabet.
Iterate the string (ordered), not the map (unordered), in pass 2.
Follow-ups they push on- Why can't you find the answer by iterating the hash map directly?
- How would you support a streaming version where characters arrive over time?
Red flag Iterating the map instead of the string in pass 2 and returning a non-first unique character because maps lack order.
source: LeetCode 387 — First Unique Character in a String (company tags) ↗ -
Why does iterating a hash map give no guaranteed order, and what should you use if you need ordering?
Entries are placed by hash value into buckets, so iteration order reflects the internal bucket layout — which changes with the hash function, capacity, and resizes — not insertion or sort order. If you need a stable order, use an insertion-ordered map (Java's LinkedHashMap, Python's dict since 3.7) for insertion order, or a tree/sorted map (TreeMap, C++ std::map) for key-sorted order at O(log n) per op. Never rely on a plain hash map's iteration order; it's an implementation detail that can differ across runs or versions.
What a strong answer coversIteration order follows bucket layout, not insertion or sort order.
Order can change after a resize/rehash or across language versions.
Need insertion order ⇒ LinkedHashMap / Python dict.
Need sorted order ⇒ TreeMap / std::map (O(log n) ops).
Quick self-checkYou need keys returned in sorted order on every iteration. Which structure?
-
Wrong — plain hash maps give bucket order, not sorted order.
-
Correct — it maintains keys in sorted order at O(log n) per operation.
-
Wrong — that preserves insertion order, not key-sorted order.
-
Wrong — sets are also unordered; switching set/map doesn't add ordering.
Follow-ups they push on- Python dicts preserve insertion order since 3.7 — is that the same as 'sorted'?
- What ordering does a TreeMap give, and at what cost?
Red flag Depending on a plain hash map's iteration order in tests or logic, then breaking when it changes.
source: AlgoArk — Hash Map Patterns for Interviews ↗ -
Compare separate chaining and open addressing for collision handling.
Separate chaining stores colliding keys in a per-bucket list (or tree, as Java 8+ does past a threshold), so it tolerates high load factors but pays pointer/indirection overhead. Open addressing keeps everything in the array and probes for the next free slot (linear/quadratic probing, double hashing); it's cache-friendlier but degrades sharply as load factor approaches 1 and complicates deletion. The choice trades memory locality against sensitivity to load factor.
Follow-ups they push on- Why does deletion need tombstones in open addressing?
- Why does Java convert long chains into trees?
Red flag Describing chaining and open addressing as interchangeable without noting their load-factor and deletion behaviour differs.
source: GetSDEReady — HashMap & HashSet Interview Questions ↗ -
What is a load factor, and what happens when it's exceeded?
Load factor is entries divided by buckets — a measure of how full the table is (Java's HashMap defaults to 0.75). When it's exceeded the table resizes: capacity roughly doubles and every key is rehashed into the larger array, an O(n) operation that happens rarely, keeping amortized insert O(1). A higher load factor saves memory but raises collision rates and slows lookups; a lower one wastes space.
Follow-ups they push on- Why double the capacity rather than grow by a constant?
Red flag Thinking each insert that crosses the threshold is cheap, or that resize never happens.
source: Hirist — Top HashMap Interview Questions ↗ -
Group a list of strings into anagrams.
Use a hash map keyed by a canonical form of each word and collect words sharing a key. The canonical key is either the sorted characters (O(k log k) per word) or a 26-length character-count signature (O(k) per word); the latter is faster for long strings. Total time is about O(n*k), space O(n*k). The trick the interviewer is probing is choosing a good collision-free key.
Follow-ups they push on- Which key is better when words are long, and why?
Red flag Comparing every pair of words for the anagram relation (O(n^2 * k)) instead of bucketing by a canonical key.
source: LeetCode 49 — Group Anagrams (company tags) ↗ -
Count the number of contiguous subarrays whose sum equals k.
Track a running prefix sum and a hash map of how many times each prefix sum has occurred; at each index, the count of subarrays ending here equals the number of earlier prefix sums equal to
prefixSum - k. Seed the map with{0: 1}to count subarrays starting at index 0. This is O(n) time, O(n) space, versus the O(n^2) brute force.Follow-ups they push on- Why must the map be seeded with prefix sum 0?
Red flag Forgetting the `{0:1}` seed, which drops every subarray that starts at index 0.
source: LeetCode 560 — Subarray Sum Equals K (company tags) ↗ -
When is a hash map the wrong data structure? What do you reach for instead?
A hash map gives no ordering, so it's wrong when you need sorted iteration, the min/max, or range queries ('all keys between A and B'). For those, use an ordered/tree-based map (red-black tree, like Java's TreeMap or C++ std::map) giving O(log n) ordered operations, or a heap when you only need the extreme. Hash maps shine for pure key lookup, dedup, and frequency counting.
Follow-ups they push on- What does a TreeMap give you that a HashMap can't?
Red flag Defaulting to a hash map for problems that need ordering or range scans and then bolting on a sort every query.
source: AlgoArk — Hash Map Patterns for Interviews ↗ -
What makes a good hash function, and what is 'hash flooding' (algorithmic complexity attack)?
A good hash function distributes keys uniformly across buckets, is fast to compute, and is deterministic — minimizing collisions so buckets stay ~O(1). Hash flooding is a denial-of-service attack where an adversary crafts many keys that all hash to the same bucket, collapsing every lookup/insert to O(n) and the whole table to O(n^2) work — historically used to DoS web servers via crafted POST/query parameters. Defenses include per-process randomized/seeded hashing (SipHash) so an attacker can't predict the bucket, and converting long collision chains into balanced trees (Java 8+ does this).
What a strong answer coversGood hash: uniform, fast, deterministic ⇒ low collision rate.
Hash flooding forces worst-case collisions ⇒ O(n) ops, O(n²) total (DoS).
Defense 1: seeded/randomized hashing (e.g. SipHash) hides the mapping.
Defense 2: treeify long chains (O(n) → O(log n) within a bucket).
Follow-ups they push on- Why does a per-process random seed defeat the attack?
- How does treeifying long buckets bound the worst case?
Red flag Assuming worst-case collisions only happen by chance and ignoring that they can be deliberately induced.
source: GeeksforGeeks — Hash Functions and Hashing ↗ -
Design a structure with insert, delete, and getRandom all in average O(1).
Combine a dynamic array (for O(1) random access by index) with a hash map from value to its index in the array. Insert appends and records the index; delete swaps the target with the last element, pops the tail, and fixes the moved element's index; getRandom picks a random array index. The swap-with-last trick is what keeps delete O(1) instead of O(n).
Follow-ups they push on- How do you support duplicate values?
Red flag Deleting by shifting the array (O(n)) instead of swapping the victim with the last element.
source: LeetCode 380 — Insert Delete GetRandom O(1) (company tags) ↗
1.4 Trees 15
-
Find the lowest common ancestor (LCA) of two nodes in a binary tree.
Recurse: if the current node is null or equals either target, return it; otherwise recurse left and right. If both sides return non-null, the current node is the LCA (the targets split here); if only one side does, propagate that side up. O(n) time, O(h) stack space. If it's specifically a BST, you can do better: walk down, going left when both targets are smaller and right when both are larger — the first node that splits them is the LCA, O(h) time.
What a strong answer coversGeneral tree: both subtrees return non-null ⇒ this node is the LCA.
Return the non-null side upward when only one target is found below.
O(n) time, O(h) stack for the general case.
BST shortcut: descend by comparing values, first split node is the LCA.
Follow-ups they push on- How does the BST version beat the general O(n) approach?
- What changes if each node also stores a parent pointer?
Red flag Assuming both targets actually exist in the tree, or applying the BST descent on a non-BST.
source: LeetCode 236 — Lowest Common Ancestor of a Binary Tree (company tags) ↗ -
What property defines a binary search tree, and what are its operation costs when balanced vs degenerate?
In a BST every node's left subtree holds only smaller keys and its right subtree only larger keys, so an in-order traversal yields sorted order. Search/insert/delete are O(log n) when the tree is balanced (height ~log n) but degrade to O(n) when it degenerates into a linked-list shape (e.g. inserting already-sorted data). That fragility is exactly why self-balancing variants exist.
Follow-ups they push on- What insertion order produces a degenerate BST?
- How do you validate that a tree is a proper BST?
Red flag Claiming a BST is always O(log n) without the 'when balanced' qualifier.
source: GeeksforGeeks — Self-Balancing Binary Search Trees ↗ -
Return the level-order traversal of a binary tree (values grouped by level).
Run BFS with a queue: at each step record the current queue size (that's one full level), then dequeue exactly that many nodes, collect their values into a level list, and enqueue their children. Repeat until the queue empties. O(n) time, O(width) space. Snapshotting the queue size per round is the trick that cleanly separates one level from the next.
What a strong answer coversBFS with a queue; snapshot the level size each round.
Process exactly that many nodes to isolate one level.
Enqueue children as you go for the next level.
O(n) time, O(max width) space.
Follow-ups they push on- Produce a zigzag (alternating left-right) level order.
- Return only the rightmost node of each level (right side view).
Red flag Not capturing the level size before the loop, so children enqueued mid-level bleed into the current level.
source: LeetCode 102 — Binary Tree Level Order Traversal (company tags) ↗ -
What is a heap / priority queue, and what are the costs of peek, insert, and extract?
A binary heap is a complete tree (stored in an array) maintaining the heap property — each parent is <= (min-heap) or >= (max-heap) its children — so the extreme element sits at the root. Peek-min/max is O(1); insert and extract are O(log n) because you sift up/down one level at a time. It's the go-to for top-K, scheduling, Dijkstra, and merging K sorted streams.
Follow-ups they push on- Why is building a heap from n items O(n) and not O(n log n)?
Red flag Confusing a heap with a BST, or thinking it keeps all elements fully sorted (it only orders the root).
source: CodeJeet — Heap / Priority Queue Interview Questions ↗ -
Compare the four binary-tree traversals (preorder, inorder, postorder, level-order) and say when you'd use each.
Preorder (node, left, right) visits the root first — good for copying/serializing a tree. Inorder (left, node, right) yields sorted order in a BST — good for validation and producing ordered output. Postorder (left, right, node) visits children before the parent — good for deletion and bottom-up aggregates like subtree sums/heights. Level-order is BFS with a queue, processing tier by tier — good for shortest-depth and 'by level' problems. The first three are DFS (recursion or stack); level-order is BFS (queue).
What a strong answer coversPreorder: serialize/clone (root before children).
Inorder: BST ⇒ sorted output; used for validation.
Postorder: delete / bottom-up subtree aggregates.
Level-order: BFS via queue; depth and per-level problems.
Quick self-checkWhich traversal of a valid BST produces the keys in ascending sorted order?
-
Wrong — preorder visits the root before its left subtree, so it isn't sorted.
-
Correct — left, node, right yields ascending order in a BST.
-
Wrong — postorder visits the root last, not in sorted position.
-
Wrong — BFS order reflects depth, not key ordering.
Follow-ups they push on- Which traversal reconstructs a BST's sorted sequence?
- Why is postorder natural for freeing/deleting a tree?
Red flag Mixing up the visit positions, or using a stack for level-order instead of a queue (that's DFS, not BFS).
source: GeeksforGeeks — Tree Traversals (Inorder, Preorder, Postorder) ↗ -
Compute the diameter of a binary tree (longest path between any two nodes).
Do a single postorder DFS that returns each node's height while updating a global max: at each node, the longest path *through* it is
leftHeight + rightHeight(in edges), so track the maximum of that across all nodes and return1 + max(leftHeight, rightHeight)to the parent. O(n) time, O(h) stack. The key insight is that the answer is a path that bends at some node, computed from its two subtree depths.What a strong answer coversPostorder DFS returns height; a side variable tracks the best diameter.
Path through a node =
leftHeight + rightHeight(edge count).Return
1 + max(left, right)upward as the node's height.O(n) time, O(h) stack — one traversal, not one per node.
Follow-ups they push on- Why compute height and diameter in the same pass instead of two?
- Should the diameter be measured in nodes or edges (be consistent)?
Red flag Recomputing height separately at every node (O(n^2)) instead of folding it into one postorder pass.
source: LeetCode 543 — Diameter of Binary Tree (company tags) ↗ -
What advantage does a trie have over a hash map for storing strings, and what's the catch?
A trie answers prefix queries — 'all words starting with "pre"', autocomplete, longest-prefix matching — which a hash map cannot do without scanning every key, and it shares storage for common prefixes. Lookups are O(m) in the word length, independent of how many words are stored. The catch is memory: each node carries a child map/array (up to alphabet size), so a sparse trie can use far more memory than a hash set of the same words, and it's only worthwhile when prefix operations matter.
What a strong answer coversTrie supports prefix / autocomplete queries; a hash map can't, cheaply.
Lookup is O(m) in word length, not in the number of stored words.
Common prefixes are shared, but each node holds child links.
Catch: high memory overhead; use only when prefixes matter.
Quick self-checkWhat can a trie do that a hash map of the same words fundamentally cannot do efficiently?
-
Wrong — a hash map already does exact lookup in average O(1).
-
Correct — prefix enumeration walks a subtree; a hash map would have to scan all keys.
-
Wrong — tries usually use more memory due to per-node child links.
-
Wrong — neither structure stores duplicate keys by design.
Follow-ups they push on- How would you compress a sparse trie (radix/Patricia trie)?
- When is a plain hash set strictly better than a trie?
Red flag Reaching for a trie when only exact-match lookup is needed — a hash set is simpler and lighter there.
source: GeeksforGeeks — Trie Data Structure ↗ -
Find the kth smallest element in a binary search tree.
Do an inorder traversal (which visits BST keys in ascending order) and stop at the kth visited node — you don't need to traverse the whole tree. An iterative inorder with an explicit stack lets you halt early at O(h + k) time. If the tree is queried for many different k values, augment each node with its left-subtree size so each query becomes O(h) by navigating directly.
What a strong answer coversInorder visits BST keys ascending ⇒ the kth visited is the answer.
Stop early at the kth node; no full traversal needed.
Iterative stack-based inorder ⇒ O(h + k) time.
For repeated queries, store subtree sizes ⇒ O(h) per query.
Follow-ups they push on- How do subtree-size augmentations speed up many repeated queries?
- How would you find the kth largest instead?
Red flag Collecting the entire inorder list and indexing (O(n)) instead of stopping at the kth element.
source: LeetCode 230 — Kth Smallest Element in a BST (company tags) ↗ -
Validate that a binary tree is a valid binary search tree.
Recurse with a valid (min, max) range for each node: the root is unbounded, the left child tightens the max to the parent's value and the right child tightens the min. A node fails if its value violates its range. O(n) time, O(h) stack space. Equivalently, an in-order traversal of a valid BST is strictly increasing, so you can check that the previous visited value is always smaller.
Follow-ups they push on- Why isn't it enough to just compare each node to its two children?
Red flag Only comparing a node against its immediate children, which misses violations deeper in a subtree.
source: LeetCode 98 — Validate Binary Search Tree (company tags) ↗ -
What is a self-balancing tree (AVL / red-black), and where are they used in real systems?
Self-balancing BSTs perform rotations on insert/delete to keep height O(log n), guaranteeing O(log n) operations regardless of input order. AVL trees keep height balance tighter (faster lookups, more rotations); red-black trees balance more loosely (fewer rotations, faster writes). They back ordered maps/sets such as Java's TreeMap and C++ std::map, and red-black trees appear in the Linux process scheduler.
Follow-ups they push on- When would you prefer AVL's stricter balance over red-black?
Red flag Treating AVL and red-black as identical, or not knowing they guarantee O(log n) by construction.
source: AlgoCademy — Introduction to Self-Balancing BSTs ↗ -
Implement a trie (prefix tree) supporting insert, search, and startsWith.
Each node holds a map/array of child links and an
isEndflag; insert walks/creates a path of nodes one character at a time, search walks the path and checksisEnd, and startsWith walks the path without requiringisEnd. All three are O(m) for a word of length m, independent of how many words are stored. Tries shine for autocomplete, spellcheck, and prefix-heavy lookups where a hash map can't answer prefix queries.Follow-ups they push on- Add wildcard '.' matching.
- How would you support delete?
Red flag Conflating 'a word ends here' (`isEnd`) with 'a prefix exists here', which breaks exact-word search.
source: LeetCode 208 — Implement Trie (company tags) ↗ -
Find the kth largest element in an unsorted array.
Maintain a min-heap of size k: push each element, and whenever the heap exceeds k pop the smallest, so the heap ends holding the k largest with the kth largest at its root. That's O(n log k) time, O(k) space. Quickselect gives average O(n) by partitioning around a pivot and recursing into only the relevant side, with O(n^2) worst case — mention both and the trade-off.
Follow-ups they push on- When is quickselect's O(n) average worth its O(n^2) worst case?
Red flag Sorting the whole array (O(n log n)) and indexing, or using a max-heap of size n when a size-k min-heap suffices.
source: LeetCode 215 — Kth Largest Element in an Array (company tags) ↗ -
Why is building a heap from n elements O(n) and not O(n log n)? And how do you do an in-place heapsort?
Bottom-up heapify (sift-down from the last internal node up to the root) is O(n), not O(n log n), because most nodes sit near the leaves and sift down only a tiny distance — summing the work weighted by height gives a convergent series bounded by O(n). (Inserting one-by-one with sift-up is the O(n log n) way.) Heapsort then builds a max-heap in place, repeatedly swaps the root (the max) with the last unsorted element and sifts down the reduced heap — O(n log n) time, O(1) extra space, but not stable.
What a strong answer coversBottom-up heapify is O(n): most nodes are shallow, work sums to O(n).
Repeated sift-up inserts would be O(n log n) — the slower build.
Heapsort: build max-heap, swap root to the end, shrink, sift down.
Heapsort is O(n log n), O(1) space, not stable.
Quick self-checkWhy is bottom-up heap construction O(n) rather than O(n log n)?
-
Wrong — a sift-down can be up to O(log n); the savings come from the height distribution.
-
Correct — the cost is Σ (nodes at height h)·h, which sums to O(n).
-
Wrong — sift-down does compare; the bound is about how far nodes move.
-
Wrong — bottom-up heapify eagerly fixes every subtree.
Follow-ups they push on- Why is sift-down-from-the-bottom cheaper than n separate insertions?
- Why isn't heapsort stable, and when does that matter?
Red flag Claiming heap construction is always O(n log n), conflating the build phase with n individual insertions.
source: GeeksforGeeks — Time Complexity of Building a Heap ↗ -
Why do relational databases use B-trees / B+ trees for indexes instead of a binary search tree?
B-trees are shallow and high-fanout — each node holds many keys, so the tree stays only a few levels deep even for millions of rows, which minimizes expensive disk seeks (disk I/O, not comparisons, is the bottleneck). A binary tree would be far taller and cost many more page reads. In a B+ tree all values live in the leaves, which are linked together, so range scans and ORDER BY can sweep the leaves sequentially without re-walking the tree.
Follow-ups they push on- Why does fanout matter more than tree height in comparisons?
- How does the linked leaf layer of a B+ tree help range queries?
Red flag Justifying B-trees by comparison count rather than by minimizing disk page reads.
source: Use The Index, Luke — Anatomy of an Index (B-tree) ↗ -
Merge k sorted linked lists into one sorted list.
Push the head of each list into a min-heap keyed by node value; repeatedly pop the smallest, append it to the result, and push that node's successor. Each of the n total nodes is pushed/popped once at O(log k) cost, giving O(n log k) time and O(k) heap space. Divide-and-conquer pairwise merging hits the same O(n log k) without a heap.
Follow-ups they push on- Compare the heap approach with pairwise divide-and-conquer merging.
Red flag Concatenating all lists and sorting (O(n log n)) instead of exploiting that each list is already sorted.
source: LeetCode 23 — Merge k Sorted Lists (company tags) ↗
1.5 Graphs 15
-
What is union-find (disjoint set union), and what do union by rank and path compression buy you?
Union-find tracks elements partitioned into disjoint sets via a parent-pointer forest, supporting
find(which set/root an element belongs to) andunion(merge two sets). Path compression flattens the tree by pointing visited nodes straight at the root during find, and union by rank/size always attaches the smaller tree under the larger; together they make each operation nearly O(1) — amortized O(α(n)), the inverse-Ackermann function, effectively constant. It's the tool for dynamic connectivity, counting connected components, cycle detection in undirected graphs, and Kruskal's MST.What a strong answer coversForest of parent pointers;
findreturns the set root,unionmerges.Path compression: repoint nodes to the root during find.
Union by rank/size: attach smaller tree under larger.
Together ⇒ amortized O(α(n)) ≈ constant per operation.
Quick self-checkWith both path compression and union by rank, the amortized cost per operation is:
-
Wrong — that's the bound with only one optimization, not both.
-
Correct — combined, they give inverse-Ackermann amortized cost, ≤ 4 for any practical n.
-
Wrong — that's the naive worst case without optimizations.
-
Wrong — it's amortized near-constant, not a hard per-operation O(1).
Follow-ups they push on- Why is union-find better than BFS/DFS for *dynamic* connectivity queries?
- How does Kruskal's algorithm use union-find?
Red flag Implementing find/union without either optimization, degrading to O(n) per op on adversarial unions.
source: GeeksforGeeks — Disjoint Set (Union-Find) with Rank & Path Compression ↗ -
Adjacency list vs adjacency matrix: compare space and edge-lookup cost, and say when to use each.
An adjacency list stores each node's neighbours, using O(V + E) space — efficient for sparse graphs, which is most real-world graphs. An adjacency matrix is a V x V grid giving O(1) edge-existence checks but O(V^2) space regardless of edge count, so it only pays off for dense graphs or when you constantly test specific edges. Default to the list unless the graph is dense.
Follow-ups they push on- Which representation makes 'is there an edge u-v?' fastest?
Red flag Using a matrix for a large sparse graph and wasting O(V^2) memory on mostly-empty cells.
source: Tech Interview Handbook — Graph cheatsheet ↗ -
BFS vs DFS: how do they differ, and when do you pick each?
BFS explores level by level using a queue and finds the shortest path in an unweighted graph (fewest edges); DFS dives deep along one branch using recursion or an explicit stack and suits connectivity, cycle detection, and topological sort. BFS uses O(width) memory, DFS uses O(depth). Choose BFS when you need shortest hops or level order; choose DFS when you need to fully explore structure or order dependencies.
Follow-ups they push on- Why does BFS, not DFS, give the shortest path in an unweighted graph?
- When does DFS risk a stack overflow?
Red flag Using DFS to find a shortest unweighted path, or forgetting a visited set and looping forever on cycles.
source: Tech Interview Handbook — Graph cheatsheet ↗ -
Why must graph traversals track visited nodes, and what's the cost of forgetting?
Graphs can contain cycles and multiple paths to the same node, so without a visited set a traversal revisits nodes and, on a cycle, loops forever or explodes in work. A visited set makes both BFS and DFS O(V + E) by guaranteeing each node and edge is processed once. (Trees are the special case where you can skip it — they have no cycles.)
Follow-ups they push on- Why is a visited set unnecessary when traversing a tree?
Red flag Copy-pasting tree-traversal code onto a graph and infinite-looping on the first cycle.
source: Tech Interview Handbook — Graph cheatsheet ↗ -
Define directed vs undirected and weighted vs unweighted graphs, with an example of each.
In a directed graph edges have a direction (Twitter 'follows'); in an undirected graph they go both ways (Facebook 'friends'). Weighted edges carry a cost or distance (road network with mileage); unweighted edges just record a connection (a maze of equal steps). These two axes determine your algorithm choice — e.g. unweighted shortest path uses BFS, weighted uses Dijkstra.
Follow-ups they push on- How does each property change which traversal/shortest-path algorithm you pick?
Red flag Modelling a one-way relationship (like 'follows') as an undirected edge and corrupting the graph's meaning.
source: Tech Interview Handbook — Graph cheatsheet ↗ -
Count the number of connected components in an undirected graph. Two ways?
Way 1 — traversal: loop over all nodes; each time you hit an unvisited node, increment the count and BFS/DFS to mark its whole component visited. O(V + E). Way 2 — union-find: start with V components and
unionthe endpoints of every edge; each successful merge of two distinct sets drops the count by one. O(E·α(V)). Union-find shines when edges arrive incrementally or you also need connectivity queries; traversal is simplest for a static graph.What a strong answer coversTraversal: count = number of BFS/DFS launches from unvisited nodes.
Union-find: start at V, decrement on each cross-set union.
Both are near-linear: O(V + E) vs O(E·α(V)).
Prefer union-find for streaming edges / repeated connectivity queries.
Follow-ups they push on- Which approach fits a stream of edges arriving over time, and why?
- How would you also report the size of the largest component?
Red flag Forgetting isolated (degree-0) vertices, which are components of their own and easy to miss.
source: LeetCode 323 — Number of Connected Components in an Undirected Graph (company tags) ↗ -
How does Dijkstra's algorithm work, and why does it break with negative edge weights?
Dijkstra greedily grows a set of finalized shortest distances: a min-heap repeatedly pops the closest unfinalized node, finalizes its distance, and relaxes its outgoing edges. With a binary heap it's O((V + E) log V). It relies on the assumption that once you finalize a node, no later path can be shorter — true only with non-negative weights. A negative edge can make a 'longer-looking' path actually cheaper after the node is already finalized, breaking correctness; for negative edges use Bellman-Ford (O(V·E)), which also detects negative cycles.
What a strong answer coversMin-heap pops the nearest unfinalized node, then relaxes its edges.
O((V + E) log V) with a binary heap.
Correct only because finalized nodes can't be improved — needs non-negative weights.
Negative edges ⇒ use Bellman-Ford (O(V·E)), which finds negative cycles.
Quick self-checkWhy does Dijkstra fail on graphs with negative edge weights?
-
Wrong — heaps handle negative values fine; the issue is the greedy invariant.
-
Correct — the greedy 'finalize the closest' invariant assumes adding edges can't reduce cost.
-
Wrong — it terminates but can return incorrect distances.
-
Wrong — edge weight sign has nothing to do with direction.
Follow-ups they push on- What does Bellman-Ford do that Dijkstra can't?
- How does A* differ from Dijkstra?
Red flag Running Dijkstra on a graph with negative edges and trusting the (silently wrong) result.
source: Tech Interview Handbook — Graph cheatsheet ↗ -
How do you detect a cycle in a graph, and why does the method differ between directed and undirected graphs?
In an undirected graph, DFS finds a cycle if it reaches an already-visited node that isn't the immediate parent (or union-find: an edge joining two nodes already in the same set). In a directed graph a plain visited set is insufficient — you must track nodes currently on the recursion stack (often three colors: white/unvisited, gray/in-progress, black/done); a back edge to a gray node means a cycle. The difference is that in directed graphs revisiting a finished node is fine (it's just a shared descendant), whereas an edge back to an *in-progress* ancestor is the cycle.
What a strong answer coversUndirected: visited neighbor that isn't the parent ⇒ cycle (or union-find).
Directed: need a recursion-stack / gray marker, not just visited.
Back edge to a gray (in-progress) node ⇒ directed cycle.
Revisiting a finished (black) node in a digraph is not a cycle.
Quick self-checkDetecting a cycle in a DIRECTED graph requires tracking which of these beyond a visited set?
-
Wrong — in-degree drives Kahn's topological sort, not DFS color-based detection.
-
Correct — a back edge to an in-progress ancestor is exactly a directed cycle.
-
Wrong — cycle existence is independent of weights.
-
Wrong — a plain visited set yields false positives in digraphs (shared descendants aren't cycles).
Follow-ups they push on- Why isn't a simple visited set enough for directed cycle detection?
- How does topological sort also reveal a directed cycle?
Red flag Reusing the undirected approach (plain visited set) on a directed graph and reporting false cycles.
source: GeeksforGeeks — Detect Cycle in a Directed Graph ↗ -
Given a directed acyclic dependency graph, produce a valid build/task order (topological ordering via Kahn's algorithm).
Compute every node's in-degree, seed a queue with all in-degree-0 nodes (no dependencies), then repeatedly dequeue a node, append it to the order, and decrement its neighbors' in-degrees — enqueuing any that hit zero. O(V + E). If the emitted order contains fewer than V nodes, a cycle exists and no valid ordering is possible, so the same algorithm doubles as cycle detection. This is exactly Course Schedule II / dependency resolution.
What a strong answer coversKahn's: start from in-degree-0 nodes, peel them off layer by layer.
Decrement neighbors' in-degrees; enqueue when they reach 0.
O(V + E) time and space.
Output size < V ⇒ a cycle ⇒ no valid ordering.
Follow-ups they push on- How does the same run tell you the graph has a cycle?
- How would you produce *all* valid topological orders?
Red flag Assuming an ordering always exists and not checking for the cycle case (output shorter than V).
source: LeetCode 210 — Course Schedule II (company tags) ↗ -
Count the number of islands in a grid of land ('1') and water ('0').
Scan every cell; when you hit unvisited land, increment the island count and flood-fill (BFS or DFS) all connected land, marking it visited so you don't recount it. The grid is an implicit graph where each cell connects to its 4 neighbours. O(rows * cols) time and space. The core insight is recognizing a 2D matrix as a graph traversal.
Follow-ups they push on- How would you handle a grid too large to fit in memory?
- Count islands with diagonal connectivity.
Red flag Not marking visited cells (recounting the same island) or only checking diagonal instead of 4-directional neighbours.
source: LeetCode 200 — Number of Islands (company tags) ↗ -
Given course prerequisites, determine whether you can finish all courses.
Model courses as a directed graph and ask whether it has a cycle: if it does, the prerequisites are circular and you can't finish. Use Kahn's algorithm (BFS topological sort — repeatedly remove in-degree-0 nodes; if you can't remove them all, a cycle remains) or DFS cycle detection with a recursion-stack marker. O(V + E) time and space.
Follow-ups they push on- Return a valid course ordering (Course Schedule II).
- BFS vs DFS for detecting the cycle?
Red flag Detecting a cycle with a simple visited set but no 'currently on the recursion stack' distinction, giving false positives.
source: LeetCode 207 — Course Schedule (company tags) ↗ -
What is a topological sort, what graphs admit one, and how do you compute it?
A topological sort is a linear ordering of a directed graph's vertices where every edge u->v has u before v — it exists if and only if the graph is a DAG (no cycles). Compute it with Kahn's algorithm (repeatedly emit in-degree-0 nodes) or via DFS finish times reversed. It models dependency resolution: build systems, task scheduling, course prerequisites.
Follow-ups they push on- How does the same algorithm also tell you the graph has a cycle?
Red flag Claiming any directed graph can be topologically sorted — cycles make it impossible.
source: AlgoMonster — Course Schedule (topological sort) ↗ -
You need the shortest path in an unweighted graph. Which algorithm, and what changes if edges have weights?
Unweighted shortest path is plain BFS — the first time you reach a node is via the fewest edges, so it's optimal at O(V + E). With non-negative weights, BFS no longer works because fewer edges can cost more; switch to Dijkstra's algorithm, which uses a min-heap/priority queue to always expand the cheapest frontier node. The shift from a queue to a priority queue is the key recognition.
Follow-ups they push on- Why does Dijkstra break with negative edge weights?
Red flag Reaching for Dijkstra on an unweighted graph (overkill) or using BFS when edges carry weights (wrong answer).
source: Tech Interview Handbook — Graph cheatsheet ↗ -
Make a deep copy (clone) of a connected undirected graph.
Traverse with BFS or DFS while keeping a hash map from original node to its clone. When you first see a node, create its clone and record it; then for each neighbor, create-or-look-up its clone and wire up the edge. The map serves double duty as both the visited set and the original→copy lookup, which is what prevents infinite loops on cycles. O(V + E) time and space.
What a strong answer coversMap original → clone doubles as the visited set.
Create a clone on first sight; reuse the mapped clone afterward.
Wire each neighbor edge using looked-up clones.
O(V + E) time and space; works via BFS or DFS.
Follow-ups they push on- Why does the original→clone map prevent infinite recursion on cycles?
- How does this change for a directed graph?
Red flag Cloning a neighbor again instead of reusing the mapped clone, producing duplicate nodes and looping on cycles.
source: LeetCode 133 — Clone Graph (company tags) ↗ -
Find the length of the shortest word transformation from beginWord to endWord changing one letter at a time (Word Ladder).
Model each word as a graph node with edges to words differing by one letter, then run BFS from beginWord — the first time you reach endWord, the BFS depth is the shortest transformation length (unweighted shortest path). To find neighbors efficiently, use wildcard patterns like
h*tas buckets so you don't compare every pair of words. BFS guarantees the shortest sequence; bidirectional BFS from both ends prunes the frontier and is a strong optimization to mention.What a strong answer coversWords are nodes; one-letter-apart words are edges ⇒ unweighted graph.
BFS gives the shortest transformation (fewest steps).
Wildcard buckets (
h*t) generate neighbors without all-pairs comparison.Bidirectional BFS searches from both ends to cut the explored frontier.
Follow-ups they push on- Why BFS rather than DFS for the *shortest* sequence?
- How does bidirectional BFS reduce the work?
Red flag Using DFS (finds *a* path, not the shortest) or comparing all word pairs (O(N^2·L)) to build edges.
source: LeetCode 127 — Word Ladder (company tags) ↗
1.6 Algorithm categories — recognize the pattern 15
-
What's the general template for backtracking problems, and how do you prune to avoid exploring dead ends?
Backtracking is DFS over a decision tree: choose an option, explore by recursing, then un-choose (undo the change) before trying the next option. You hit a base case when a full candidate is built (record it) and prune by checking constraints *before* recursing — abandoning a branch the moment it can't lead to a valid solution. Pruning (e.g. skipping a queen placement under attack in N-Queens, or stopping when a partial sum exceeds the target) is what turns brute-force enumeration into something tractable.
What a strong answer coversPattern: choose → explore → un-choose (restore state on the way out).
Base case records a complete candidate.
Prune early: reject a branch before recursing when it can't succeed.
Used for permutations, combinations, N-Queens, Sudoku, word search.
Quick self-checkWhat is the defining structure of a backtracking algorithm?
-
Wrong — that's tabulation/DP, not backtracking's explore-and-undo.
-
Correct — the choose/explore/un-choose cycle over a decision tree is backtracking.
-
Wrong — that's greedy, which by definition doesn't backtrack.
-
Wrong — that describes divide-and-conquer.
Follow-ups they push on- How does N-Queens prune attacked positions?
- Why must you undo the choice after recursing, not before?
Red flag Forgetting to undo the choice on the way back (state leaks across branches), or pruning only after fully building candidates.
source: Tech Interview Handbook — Recursion / Backtracking ↗ -
Find the contiguous subarray with the largest sum (Maximum Subarray / Kadane's algorithm).
Kadane's algorithm: scan once, maintaining
curr = max(x, curr + x)(either start fresh at x or extend the running subarray) and tracking the bestcurrseen. O(n) time, O(1) space. The key decision at each element — extend the previous subarray or restart — is a one-line DP. Watch the all-negative case: initialize the answer to the first element (or -∞), not 0, so you don't wrongly return 0 for an empty subarray.What a strong answer coversPer element:
curr = max(x, curr + x)— extend or restart.Track the maximum
curr; O(n) time, O(1) space.It's a one-variable DP (running best ending here).
All-negative inputs: init answer to first element, never 0.
Quick self-checkWhy initialize Kadane's answer to the first element (or -∞) rather than 0?
-
Wrong — the init value doesn't affect space; it affects correctness.
-
Correct — 0 would imply an empty subarray, which the standard problem disallows.
-
Wrong — the init value doesn't change the iteration count.
-
Wrong — overflow is unrelated to the choice of initial maximum.
Follow-ups they push on- How would you also return the start/end indices of the subarray?
- What changes for the maximum *product* subarray?
Red flag Initializing the max to 0, which returns 0 for an all-negative array instead of the largest (least negative) element.
source: LeetCode 53 — Maximum Subarray (company tags) ↗ -
Recursion vs iteration: what are a base case and the call stack, and when does recursion risk a stack overflow?
Recursion solves a problem by calling itself on smaller inputs until a base case stops the descent; each call pushes a frame onto the call stack and pops it on return. Without a correct base case (or with too-deep recursion) the stack grows until it overflows. Deep recursion on large inputs should be rewritten iteratively (or made tail-recursive where the language optimizes it) to use O(1) instead of O(depth) stack space.
Follow-ups they push on- How would you convert a deep DFS recursion into an iterative one?
Red flag Omitting or mis-ordering the base case (infinite recursion), or ignoring the O(depth) stack cost on large inputs.
source: Tech Interview Handbook — Algorithms cheatsheet ↗ -
Climbing stairs: you can take 1 or 2 steps at a time — how many ways to reach step n? Why is this Fibonacci?
The ways to reach step n equal the ways to reach n-1 (then a 1-step) plus the ways to reach n-2 (then a 2-step):
ways(n) = ways(n-1) + ways(n-2)— the Fibonacci recurrence. Bottom-up DP keeping just the last two values gives O(n) time and O(1) space. Recognizing that the final move splits the problem into independent subproblems is the DP insight; the naive recursion without memoization is exponential.What a strong answer coversways(n) = ways(n-1) + ways(n-2)⇒ Fibonacci shape.Subproblems overlap ⇒ DP, not naive exponential recursion.
Rolling two variables ⇒ O(n) time, O(1) space.
Base cases:
ways(0)=1,ways(1)=1.
Follow-ups they push on- Generalize to taking 1, 2, or 3 steps.
- What if each step has a cost and you minimize total cost (min cost climbing)?
Red flag Solving with naive O(2^n) recursion, or botching the base cases so the count is off by one.
source: LeetCode 70 — Climbing Stairs (company tags) ↗ -
Compare quicksort and mergesort. Why is comparison sorting bounded at O(n log n)?
Quicksort partitions around a pivot in place — average O(n log n), O(log n) stack space, but O(n^2) worst case on bad pivots and not stable. Mergesort splits and merges — guaranteed O(n log n) and stable, but needs O(n) extra space. Any comparison-based sort is bounded below by O(n log n) because there are n! possible orderings and each comparison yields one bit, so you need at least log2(n!) ~ n log n comparisons to distinguish them.
Follow-ups they push on- How do non-comparison sorts like counting/radix beat O(n log n)?
- Why does Timsort (Python/Java) blend mergesort and insertion sort?
Red flag Calling quicksort O(n log n) worst case, or claiming any sort whatsoever beats O(n log n) (only non-comparison ones can).
source: Tech Interview Handbook — Algorithms / Sorting ↗ -
House Robber: maximize the sum of non-adjacent house values along a street.
At each house you either skip it (carry forward the best so far) or rob it (its value plus the best up to two houses back):
dp[i] = max(dp[i-1], dp[i-2] + nums[i]). Keep just the two previous results for O(n) time, O(1) space. The greedy 'rob every other house' fails — the optimal choice depends on values, which is the cue for DP over greedy.What a strong answer coversTransition:
dp[i] = max(dp[i-1], dp[i-2] + nums[i])(skip vs rob).Two rolling variables ⇒ O(n) time, O(1) space.
Greedy 'every other house' is wrong; the answer is value-dependent.
Classic optimal-substructure + overlapping-subproblems DP.
Follow-ups they push on- What changes if the houses are arranged in a circle (House Robber II)?
- Why does a greedy alternating strategy fail here?
Red flag Assuming the answer is just the larger of the even-index vs odd-index sums, which a counterexample breaks.
source: LeetCode 198 — House Robber (company tags) ↗ -
Generate all subsets (the power set) of a set of distinct integers.
Use backtracking: at each index decide to include or exclude that element, recursing on the rest and recording the running subset at every node of the decision tree. There are 2^n subsets, so it's O(n·2^n) time (n to copy each subset) — inherent to the output size. An iterative alternative builds subsets by, for each new element, appending it to every subset seen so far. Passing a
startindex prevents revisiting earlier elements and generating duplicates.What a strong answer coversInclude/exclude decision per element ⇒ binary choice tree of 2^n leaves.
Record the partial subset at every recursion node.
O(n·2^n) — bounded by the output size itself.
A
startindex avoids re-choosing earlier elements (no dup subsets).
Follow-ups they push on- How do you handle duplicate input values (Subsets II)?
- How does this template extend to permutations and combinations?
Red flag Adding the same combination twice by recursing from index 0 instead of advancing a `start` pointer.
source: LeetCode 78 — Subsets (company tags) ↗ -
Compute the product of all elements except self, without using division and in O(n).
Use prefix and suffix products: first pass fills each position with the product of everything to its left; second pass multiplies in the product of everything to its right (tracked in a running variable). O(n) time, O(1) extra space if the output array doesn't count. Division would be the obvious trick but is explicitly banned — and it breaks on zeros anyway, which is exactly why the prefix/suffix approach is the expected answer.
What a strong answer coversLeft-products pass, then a right-products running multiply.
O(n) time, O(1) extra space (output aside).
Avoids division — which the problem bans and which fails on zeros.
Each output = (product of all left) × (product of all right).
Follow-ups they push on- Why is the division approach fragile when the array contains a zero?
- How do you keep it O(1) extra space (reusing the output array)?
Red flag Using division (banned, and breaks with one or more zeros) instead of prefix/suffix products.
source: LeetCode 238 — Product of Array Except Self (company tags) ↗ -
What cues in a problem tell you to reach for binary search? Search a rotated sorted array as an example.
The cue is 'sorted (or monotonic) + find', or a search space you can halve by a yes/no test — binary search gives O(log n). In a rotated sorted array, at each midpoint one half is still sorted; check whether the target lies within that sorted half to decide which side to discard, keeping it O(log n). Binary search also hides in 'find minimum capacity/threshold' problems via binary-search-on-the-answer.
Follow-ups they push on- Find the minimum in a rotated sorted array.
- How do you binary-search on the answer?
Red flag Off-by-one and infinite loops from sloppy mid/low/high updates, or assuming the array must be fully sorted to apply it.
source: LeetCode 33 — Search in Rotated Sorted Array (company tags) ↗ -
When do you use two pointers vs a sliding window? Give the canonical cue for each.
Two pointers fits sorted arrays and pair/triplet problems: move a left and right pointer inward based on a comparison (e.g. pair-sum, removing duplicates). Sliding window fits 'longest/shortest contiguous subarray or substring satisfying a constraint': grow the right edge and shrink the left when the constraint breaks. Both turn an O(n^2) brute force into O(n) by never resetting the pointers backwards.
Follow-ups they push on- What signals a fixed-size window vs a variable-size one?
Red flag Resetting the inner pointer to the window start on each step, which silently reintroduces O(n^2) behaviour.
source: DEV — Two Pointers & Sliding Window ↗ -
Find the length of the longest substring without repeating characters.
Slide a window with two pointers, tracking the characters currently inside in a hash set/map; when the right pointer hits a duplicate, advance the left pointer (removing characters) until the window is valid again, recording the max length along the way. Each character enters and leaves the window at most once, so it's O(n) time, O(min(n, alphabet)) space. The classic sliding-window-plus-hashing problem.
Follow-ups they push on- Generalize to at most k distinct characters.
Red flag Restarting the scan from the duplicate instead of moving the left pointer, degrading to O(n^2).
source: LeetCode 3 — Longest Substring Without Repeating Characters (company tags) ↗ -
How do you recognize a dynamic programming problem, and what's the difference between memoization and tabulation?
DP applies when a problem has overlapping subproblems (the same smaller problem recurs) and optimal substructure (the best answer is built from best sub-answers) — counting paths, min cost, longest subsequence are typical. Memoization is top-down: write the natural recursion and cache results. Tabulation is bottom-up: fill a table in dependency order, avoiding recursion overhead. Both cut exponential brute force to polynomial; choose based on which is clearer.
Follow-ups they push on- When does tabulation let you shrink space to O(1) rows?
Red flag Reaching for greedy on a problem that needs DP (greedy gives a locally optimal but globally wrong answer).
source: NeetCode — Roadmap ↗ -
Given coin denominations and an amount, return the fewest coins to make that amount.
This is bottom-up DP:
dp[a]= fewest coins to make amounta, computed as 1 + min over coins c ofdp[a - c], withdp[0] = 0and unreachable amounts marked infinity. Answer isdp[amount]or -1 if still infinity. O(amount * numCoins) time, O(amount) space. The interviewer is watching you state the subproblem and transition clearly — and notice that greedy (largest coin first) fails for denominations like {1, 3, 4}.Follow-ups they push on- Why does the greedy 'largest coin first' approach fail here?
- Count the number of ways instead of the minimum.
Red flag Using greedy largest-coin-first, which is wrong for arbitrary denominations.
source: LeetCode 322 — Coin Change (company tags) ↗ -
What is 'binary search on the answer', and when do you apply it? (e.g. minimum capacity / Koko eating bananas)
When the input isn't sorted but the answer space is monotonic — a candidate value either works or doesn't, and 'works' is monotone (if capacity X works, every larger capacity also works) — you binary-search over the range of possible answers, using a feasibility check as the comparison. Example: find the minimum eating speed so Koko finishes in H hours — binary-search the speed and test 'can she finish at speed k?' in O(n) each, giving O(n log(max)) overall. The trick is spotting the monotone yes/no boundary you can bisect.
What a strong answer coversSearch the answer range, not the array, when answers are monotone.
Need a feasibility test: 'does candidate value X satisfy the constraint?'
Bisect toward the boundary between feasible and infeasible.
Cost = O(check · log(range)), e.g. O(n log(max)).
Follow-ups they push on- How do you prove the feasibility predicate is monotonic?
- Apply it to 'minimum days to ship all packages within D days'.
Red flag Applying it when the feasibility predicate isn't monotonic, so bisection converges to a wrong boundary.
source: LeetCode 875 — Koko Eating Bananas (company tags) ↗ -
Explain greedy vs divide-and-conquer vs dynamic programming. How do you know greedy is safe?
Divide-and-conquer splits into independent subproblems and combines results (mergesort, binary search). DP is for overlapping subproblems with optimal substructure, caching to avoid recomputation. Greedy makes the locally optimal choice at each step and never revisits it — fast and simple, but only correct when the problem has the greedy-choice property (e.g. interval scheduling, Dijkstra, Huffman). You justify greedy with an exchange argument or by proving the greedy choice is always part of some optimal solution; otherwise fall back to DP.
Follow-ups they push on- Name a problem where greedy looks right but fails, and DP is needed.
Red flag Asserting a greedy strategy is correct without an exchange argument, then being blindsided by a counterexample.
source: NeetCode — Roadmap ↗
02 Backend Engineering 111 Q's
2.1 HTTP/HTTPS deeply 14
-
401 vs 403 vs 404 — when do you return each, and why might a security-conscious API return 404 instead of 403?
401 Unauthorized means 'I don't know who you are' — no/invalid credentials; the right fix is to authenticate. 403 Forbidden means 'I know who you are, but you're not allowed' — re-authenticating won't help. 404 Not Found means the resource doesn't exist.
The security twist: a 403 on a resource you can't see still confirms it exists, leaking information (resource enumeration). Some APIs deliberately return 404 instead of 403 for unauthorized access to private resources, so an attacker can't distinguish 'exists but forbidden' from 'doesn't exist'.
What a strong answer covers401 = not authenticated (who are you?); 403 = authenticated but not permitted.
Despite the name, 401 means unauthenticated, not unauthorized — a historical misnomer.
403 confirms a resource exists, which can leak information.
Returning 404 for forbidden private resources prevents enumeration.
Quick self-checkA logged-in user requests another user's private profile they have no rights to see. What's the most information-leak-resistant response?
-
Wrong — the user IS authenticated; 401 means no/invalid credentials.
-
Correct semantically, but it confirms the resource exists, enabling enumeration.
-
Correct for leak resistance — the attacker can't tell 'forbidden' from 'nonexistent'.
-
Wrong — 200 implies success and may confuse clients about whether data exists.
Follow-ups they push on- Why is 401's name a misnomer?
- When would leaking 'this resource exists' actually matter?
Red flag Returning 401 when the user IS authenticated but lacks permission — that's 403. And 403 on private resources silently leaks their existence.
source: MDN — 403 Forbidden ↗ -
Walk me through the HTTP status code families and name a key code in each.
Five families by first digit: 1xx informational, 2xx success (200 OK, 201 Created, 204 No Content), 3xx redirection (301 permanent, 302 found, 304 Not Modified), 4xx client error (400 bad request, 401 unauthenticated, 403 forbidden, 404 not found, 409 conflict, 422 unprocessable, 429 too many requests), 5xx server error (500 internal, 502 bad gateway, 503 unavailable).
The useful instinct: 4xx means the client must change the request; 5xx means the client can retry the same request later.
Follow-ups they push on- 401 vs 403 — what is the difference?
- When would you return 422 instead of 400?
- What does 304 require the client to have sent?
Red flag Returning 200 with an error body, or using 401 when you mean 403. 401 = not authenticated (who are you?), 403 = authenticated but not allowed.
source: MDN — HTTP response status codes ↗ -
Which HTTP methods are idempotent, and why does it matter?
GET, PUT, DELETE, HEAD, and OPTIONS are idempotent: making the same call N times leaves the server in the same state as making it once. POST is not idempotent — two POSTs typically create two resources.
It matters for safe retries. When a client times out it cannot tell whether the request was processed, so it must retry. Idempotent methods can be retried freely; for non-idempotent POSTs you need an idempotency key so the server can dedupe.
Follow-ups they push on- How is idempotent different from safe?
- How would you make a payment POST safely retryable?
Red flag Conflating idempotent with safe. GET/HEAD/OPTIONS are also safe (no side effects); PUT/DELETE are idempotent but NOT safe. Also: PUT is idempotent by spec even though it changes data.
source: MDN — HTTP request methods ↗ -
What do the SameSite, HttpOnly, and Secure cookie attributes each do?
HttpOnly hides the cookie from JavaScript (
document.cookie), so an XSS payload can't read it — it mitigates token theft. Secure sends the cookie only over HTTPS, so it can't leak over plaintext. SameSite controls whether the cookie rides along on cross-site requests:Strictnever sends it cross-site,Laxsends it only on top-level navigations (the modern browser default), andNonesends it always but then requiresSecure.Together they harden a session cookie: HttpOnly+Secure stop theft and eavesdropping; SameSite is the first line of CSRF defense.
What a strong answer coversHttpOnly → unreadable by JS, blunts XSS-based token theft.
Secure → HTTPS-only transmission.
SameSite → controls cross-site sending;
Laxis the default in modern browsers.SameSite=Nonemust be paired withSecureor the browser rejects it.
Quick self-checkWhich cookie attribute most directly mitigates CSRF?
-
No — it blocks JS access (XSS theft) but the cookie is still auto-sent cross-site.
-
No — it only forces HTTPS; it doesn't restrict cross-site sending.
-
Correct — it restricts whether the cookie is sent on cross-site requests, the core CSRF vector.
-
No — it only controls cookie lifetime, not CSRF.
Follow-ups they push on- Why does SameSite=None require Secure?
- Does HttpOnly do anything against CSRF? (No — the cookie is still auto-sent.)
Red flag Thinking HttpOnly prevents CSRF. It stops JS from reading the cookie, but the browser still attaches it automatically on requests — SameSite/CSRF tokens handle CSRF.
source: MDN — Set-Cookie (SameSite) ↗ -
Trick: is GET guaranteed to have no server-side effects? Is it safe to cache and retry a GET?
By the HTTP spec GET is safe (read-only) and idempotent, so intermediaries (browsers, proxies, CDNs) freely cache and retry it. But 'safe' is a *contract you must honor*, not something the protocol enforces — a poorly designed
GET /delete?id=5will happily delete data.The danger: because GETs are prefetched, cached, and retried, a side-effecting GET can be triggered by a link prefetcher, a crawler, or a retry, causing unintended mutations. Mutations belong on POST/PUT/PATCH/DELETE; keep GET strictly read-only.
What a strong answer coversGET is defined as safe + idempotent, but the server must actually honor that.
Caches, prefetchers, and crawlers will issue GETs without user intent.
A side-effecting GET can fire from a prefetch or retry — a real source of bugs/exploits.
Put all mutations behind non-safe methods.
Quick self-checkWhy is implementing a delete behind GET /delete?id=5 dangerous?
-
False and irrelevant — method choice doesn't determine speed here.
-
Correct — GETs are assumed safe and get fetched automatically.
-
False — GET routinely carries query parameters.
-
False — GET is the most cacheable method.
Follow-ups they push on- How could a crawler or link-prefetcher trigger a side-effecting GET?
- What's the difference between 'safe' and 'idempotent' here?
Red flag Believing the protocol enforces GET's safety. It's a contract — a GET that mutates state is valid HTTP but a design bug that prefetchers and caches will exploit.
source: MDN — Safe (HTTP methods) ↗ -
What is an ETag and how does conditional caching with If-None-Match work?
An
ETagis an opaque validator (often a hash) the server attaches to a response to identify a specific version of a resource. On the next request the client sendsIf-None-Match: <etag>. If the resource is unchanged the server replies 304 Not Modified with no body, saving bandwidth; if it changed it returns 200 with the new body and a new ETag.ETags also enable optimistic concurrency on writes via
If-Match: the write is rejected with 412 Precondition Failed if someone else changed the resource first.Follow-ups they push on- Strong vs weak ETags?
- How does this compare to Last-Modified / If-Modified-Since?
Red flag Thinking 304 carries the body — it does not; the client reuses its cached copy. Also forgetting ETags can prevent lost-update races on PUT.
source: MDN — ETag ↗ -
Explain CORS. Why does a browser block a cross-origin request, and what is a preflight?
CORS (Cross-Origin Resource Sharing) is a browser security mechanism on top of the same-origin policy. By default a page at origin A cannot read responses from origin B unless B opts in via
Access-Control-Allow-Origin.For non-simple requests (custom headers, methods like PUT/DELETE, certain content types) the browser first sends a preflight
OPTIONSrequest. The server answers withAccess-Control-Allow-Methods,Allow-Headers, andAllow-Origin; only then does the browser send the real request.Follow-ups they push on- Does CORS protect the server? (No — it protects the user's browser.)
- What does Access-Control-Allow-Credentials change, and why can't you combine it with '*'?
Red flag Believing CORS is server-side security. It is enforced by the browser; curl, Postman, and a malicious server-to-server call ignore it entirely.
source: MDN — Cross-Origin Resource Sharing (CORS) ↗ -
What does HTTPS/TLS actually add over HTTP, and what is the rough handshake?
TLS adds three things: confidentiality (traffic is encrypted), integrity (tampering is detected), and server identity (the certificate, signed by a CA, proves you are talking to the real host).
Handshake sketch: client sends ClientHello (supported ciphers); server returns its certificate and key-exchange parameters; they use asymmetric crypto (e.g. ECDHE) to agree on a shared symmetric session key; the rest of the connection uses fast symmetric encryption. TLS 1.3 cuts this to roughly one round trip.
Follow-ups they push on- Why switch to a symmetric key after the handshake?
- What does the CA actually vouch for?
Red flag Saying TLS 'encrypts with the certificate'. The cert carries the public key and identity; the bulk data is encrypted with a negotiated symmetric session key.
source: Cloudflare — What happens in a TLS handshake? ↗ -
PUT vs PATCH vs POST for updating a resource — when do you use each?
PUT replaces the resource wholesale and is idempotent — send the full representation. PATCH applies a partial modification (only the changed fields) and is not guaranteed idempotent by spec. POST creates a new subordinate resource or triggers a non-idempotent action.
Rule of thumb: full replace at a known URL → PUT; partial field update → PATCH; create-and-let-the-server-assign-the-id → POST.
Follow-ups they push on- Can PUT create a resource? (Yes, if the client picks the URL/id.)
- How would you make PATCH idempotent?
Red flag Using PUT for partial updates — sending only some fields with PUT semantically blanks the rest. Use PATCH for partial.
source: MDN — PUT ↗ -
What's the difference between Connection: keep-alive and HTTP/2 multiplexing? Why isn't keep-alive enough?
Keep-alive (persistent connections, default in HTTP/1.1) reuses one TCP connection for multiple sequential requests, avoiding a new handshake each time. But requests on that connection are still serialized — request 2 waits for response 1 (head-of-line blocking), which is why browsers open ~6 parallel connections per host.
HTTP/2 multiplexing interleaves many concurrent request/response streams over a single connection, so a slow response doesn't block the others at the application layer. Keep-alive reuses the pipe; multiplexing lets many requests share it simultaneously.
What a strong answer coversKeep-alive reuses one connection but processes requests sequentially.
HTTP/1.1 pipelining tried concurrency but still suffered HOL blocking and was largely abandoned.
HTTP/2 multiplexing runs concurrent streams over one connection.
Browsers opened ~6 connections per host precisely to work around HTTP/1.1 serialization.
Follow-ups they push on- Why did HTTP/1.1 pipelining never catch on?
- Does HTTP/2 multiplexing eliminate ALL head-of-line blocking? (No — TCP-level remains.)
Red flag Conflating keep-alive with multiplexing. Keep-alive just avoids re-handshaking; it does not allow concurrent in-flight requests on the same connection.
source: MDN — Connection management in HTTP/1.x ↗ -
How does HSTS work, and what attack does it prevent that a redirect from HTTP to HTTPS does not?
HSTS (HTTP Strict Transport Security) is a response header (
Strict-Transport-Security: max-age=...) that tells the browser to only ever contact this host over HTTPS for the given duration — the browser upgrades anyhttp://request tohttps://*before* sending it.A plain 301 redirect from HTTP→HTTPS still sends that first request in cleartext, which a man-in-the-middle can intercept and strip (SSL stripping). HSTS closes that window because, after the first secure visit (or via the preload list), the browser never makes the insecure request at all.
What a strong answer coversHSTS forces the browser to upgrade requests to HTTPS before any cleartext goes out.
A redirect leaves the initial request exposed to SSL-stripping MITM.
The HSTS preload list protects even the very first visit.
Set a long
max-age;includeSubDomainsextends it to subdomains.
Quick self-checkWhat does HSTS protect against that a 301 HTTP→HTTPS redirect alone does not?
-
No — HSTS doesn't validate cert expiry.
-
Correct — HSTS upgrades to HTTPS before sending, eliminating the cleartext first hop.
-
No — XSS is unrelated to transport security.
-
No — entirely unrelated to HSTS.
Follow-ups they push on- What is SSL stripping?
- Why does the HSTS preload list matter for the first-ever visit?
Red flag Assuming an HTTP→HTTPS redirect is fully secure. The first cleartext request before the redirect is interceptable; HSTS (ideally preloaded) is what removes that gap.
source: MDN — Strict-Transport-Security ↗ -
A client uploads a large file and the server responds 100 Continue before the body. What is the Expect: 100-continue mechanism for?
When a client is about to send a large request body, it can send the headers first with
Expect: 100-continueand pause before sending the body. The server inspects the headers (auth, content-length limits, content-type) and replies 100 Continue to greenlight the body, or an error status (e.g. 401, 413) to reject it up front.The point is to avoid wasting bandwidth uploading a huge body that the server would only reject anyway. It's part of the 1xx informational family — a provisional response before the final one.
What a strong answer coversExpect: 100-continuelets the client send headers, then wait for a go-ahead.Server replies 100 Continue to accept the body, or an error to reject before upload.
Saves bandwidth on large bodies the server would reject (auth fail, too large).
1xx are provisional/informational responses preceding the final status.
Follow-ups they push on- What status would the server send instead of 100 to reject an oversized upload? (413)
- What other 1xx codes exist? (101 Switching Protocols, 103 Early Hints)
Red flag Treating 100 Continue as a final response. It's provisional — the real status comes after the body is sent and processed.
source: MDN — 100 Continue ↗ -
A client gets a 200 but you suspect the response was served stale. Which headers control caching, and how would you debug it?
Caching is governed by
Cache-Control(max-age,no-store,no-cache,private/public,must-revalidate), plus validatorsETag/Last-Modifiedand the legacyExpires.Debug path: inspect the response
Cache-ControlandAgeheaders; check whether an intermediary (CDN/proxy) addedAgeor anX-Cache: HIT.no-cachemeans 'revalidate before use', not 'do not cache' — that surprises people. To force freshness setno-storeor a shortmax-ageplus an ETag so clients revalidate cheaply with 304s.Follow-ups they push on- no-cache vs no-store — exact difference?
- What does the Vary header do for a shared cache?
Red flag Reading `no-cache` as 'never cache'. It means store but revalidate every time; `no-store` is the one that forbids storing.
source: MDN — Cache-Control ↗ -
What problems did HTTP/2 solve over HTTP/1.1, and what does HTTP/3 change?
HTTP/1.1 suffers head-of-line blocking: one response per connection at a time, so browsers open many TCP connections. HTTP/2 adds multiplexing (many concurrent streams over one connection), header compression (HPACK), and server push, removing application-layer HOL blocking.
But HTTP/2 still rides TCP, so a single lost packet stalls all streams (transport-layer HOL blocking). HTTP/3 runs over QUIC (UDP), giving independent streams, faster connection setup (0-RTT), and seamless connection migration across network changes.
Follow-ups they push on- Why doesn't HTTP/2 multiplexing fully fix HOL blocking?
- What does QUIC do that TCP can't?
Red flag Claiming HTTP/2 eliminated all head-of-line blocking. It removed it at the HTTP layer but TCP still serializes loss recovery — that's why HTTP/3 moved to QUIC.
source: Cloudflare — HTTP/3 vs HTTP/2 ↗
2.2 API design & alternatives 14
-
Why does the N+1 query problem hit GraphQL especially hard, and how do you fix it?
GraphQL resolvers run per-field, per-object. Fetch a list of 10 authors and then ask for each author's
posts, and the naive resolver fires 1 query for the authors + N queries for the posts — the classic N+1 blowup, which gets worse as clients nest deeper.The standard fix is a DataLoader: it batches the individual
postrequests made within one tick of the event loop into a singleWHERE author_id IN (...)query and caches results per request. This collapses N+1 into 2 queries while keeping the per-field resolver model.What a strong answer coversPer-field resolvers mean nested fields each trigger their own query.
A list of N parents requesting a child field → 1 + N queries.
DataLoader batches per-tick requests into one
IN (...)query and caches per request.It's worse in GraphQL than REST because clients control nesting depth dynamically.
Quick self-checkQuerying 50 users and each user's `team` name with a naive resolver issues how many DB queries, and what fixes it?
-
Wrong — the team field resolves once per user.
-
Correct — 1 for users + 50 for teams, batched into 1 with DataLoader (2 total).
-
Wrong count, and an index doesn't eliminate the per-row round trips.
-
Wrong — naive resolvers don't auto-batch, and pagination doesn't address N+1.
Follow-ups they push on- Why is per-request caching (not global) the right scope for DataLoader?
- How does query-depth/complexity limiting relate to this?
Red flag Solving N+1 by eager-loading everything regardless of the query — you over-fetch and lose GraphQL's selectivity. Batch with DataLoader instead.
source: Apollo — Optimizing resolvers with DataLoader ↗ -
Trick: what's wrong with the REST route GET /getUserById?id=5, and how should it look?
It mixes RPC-style verb-in-the-path (
getUserById) with REST, which is redundant and inconsistent. In REST the HTTP method is the verb and the URL names a resource (noun). So fetching user 5 is simplyGET /users/5; theGETalready says 'retrieve', and/users/{id}identifies the resource.Proper resource modeling:
GET /users(list),POST /users(create),GET /users/5,PUT/PATCH /users/5(update),DELETE /users/5. Keep verbs out of paths and use plural nouns consistently.What a strong answer coversThe HTTP method is the verb; the path is a noun/resource identifier.
GET /users/5, notGET /getUserById?id=5.Use plural collection nouns consistently (
/users,/orders).Verb-in-path is an RPC style, not REST.
Quick self-checkWhich is the correct RESTful way to fetch the user with id 5?
-
Wrong — verb in path is RPC-style and redundant with GET.
-
Wrong — reads should be GET, and 'fetch' is a verb in the path.
-
Correct — GET (verb) on /users/5 (resource).
-
Wrong — action query param re-encodes the verb GET already provides.
Follow-ups they push on- How would you model 'cancel an order' RESTfully? (POST /orders/5/cancel or PATCH status)
- When is an RPC-style action endpoint actually acceptable?
Red flag Putting actions/verbs in the URL (`/createUser`, `/deleteOrder`). The method conveys the action; the path names the thing.
source: MDN — REST ↗ -
What does a good API error response look like, and why is a consistent error shape worth enforcing?
Use the right status code to signal the category, then a structured body with a stable machine-readable
code, a humanmessage, and optionaldetails/field-level errors. Keep the shape identical across every endpoint so clients can handle errors generically.Example shape:
{ "error": { "code": "card_declined", "message": "Your card was declined.", "details": [] } }. Stable string codes (not just HTTP numbers) let clients branch on the specific failure without parsing prose. Never leak stack traces or internal identifiers.Follow-ups they push on- 400 vs 422 for validation errors?
- Why include a stable `code` string alongside the HTTP status?
Red flag Returning 200 with `{ success: false }`, or varying the error body per endpoint. Clients then can't handle failures uniformly.
source: Stripe — Error handling ↗ -
Design a URL-shortening API (like bit.ly). Walk me through the endpoints and the redirect.
Two core endpoints:
POST /urlswith the long URL returns a short code (201 + Location);GET /{code}issues a 301/302 redirect to the long URL.Key decisions: generate the code via a base62 encoding of an auto-increment id or a hash (handle collisions); store
code -> longURLin a fast KV store; cache hot codes (read-heavy workload). Discuss 301 (permanent, cacheable, loses analytics) vs 302 (temporary, every hit reaches you for click counts). Add rate limiting and custom-alias support as extensions.Follow-ups they push on- 301 vs 302 for the redirect — which and why?
- How do you guarantee short-code uniqueness at scale?
- How would you add click analytics without slowing the redirect?
Red flag Picking 301 then wondering why click analytics vanish — browsers cache 301 and stop hitting your server.
source: system-design-primer — Design a URL shortener ↗ -
WebSockets vs Server-Sent Events vs long polling — how do you pick for a real-time feature?
Long polling holds an HTTP request open until there's data, then the client reconnects — works everywhere but is request-heavy and laggy. Server-Sent Events (SSE) is a one-way server→client stream over a single long-lived HTTP connection, with built-in auto-reconnect and event IDs — ideal for notifications, live scores, dashboards. WebSockets give a full-duplex bidirectional channel after an HTTP upgrade — needed when the client also pushes frequently (chat, collaborative editing, multiplayer).
Rule of thumb: server-push-only → SSE (simpler, rides plain HTTP); two-way/high-frequency → WebSockets; fallback when neither is available → long polling.
What a strong answer coversSSE is unidirectional (server→client), text-only, with automatic reconnection.
WebSockets are bidirectional full-duplex after an upgrade handshake.
Long polling is the universal but least efficient fallback.
SSE works over plain HTTP/2; WebSockets need their own protocol handling.
Quick self-checkA dashboard only needs the server to push live metric updates to the browser. Best fit?
-
Overkill — full-duplex isn't needed when only the server pushes.
-
Correct — one-way server→client streaming with auto-reconnect is exactly SSE's niche.
-
Wasteful — high request overhead and still laggy.
-
No — that's a one-time fetch, not live updates.
Follow-ups they push on- Why might SSE be a better fit than WebSockets for a notifications feed?
- What HTTP mechanism upgrades a connection to a WebSocket? (101 Switching Protocols)
Red flag Reaching for WebSockets for a one-way notification stream. SSE is simpler, auto-reconnects, and rides ordinary HTTP infrastructure.
source: MDN — Server-sent events ↗ -
What makes gRPC fast, and what are the practical downsides versus REST/JSON?
gRPC rides HTTP/2 (multiplexed, persistent connections) and serializes with Protocol Buffers — a compact binary format with a strict schema, so payloads are smaller and parsing is faster than text JSON. It also generates typed client/server stubs and supports streaming in both directions.
Downsides: it's not natively callable from browsers (you need gRPC-Web + a proxy); the binary payloads aren't human-readable, so debugging needs tooling; and it adds schema/codegen overhead. That's why gRPC dominates internal service-to-service traffic while REST/JSON stays the default for public, browser-facing APIs.
What a strong answer coversHTTP/2 transport + binary Protocol Buffers → small payloads, fast parsing.
Generated typed stubs and first-class bidirectional streaming.
Not browser-native — needs gRPC-Web and a proxy.
Binary payloads are hard to eyeball/debug versus JSON.
Follow-ups they push on- Why can't a browser call a gRPC service directly?
- When is the protobuf schema requirement a benefit vs a burden?
Red flag Choosing gRPC for a public browser-facing API. Its lack of native browser support and opaque payloads make REST/JSON the friendlier public choice.
source: gRPC — Core concepts, architecture and lifecycle ↗ -
How do rate-limit response headers (X-RateLimit-* / RateLimit-*) and 429 + Retry-After help a well-behaved client?
When throttling, return 429 Too Many Requests and tell the client *how* to behave. Limit headers expose the budget: a limit, the remaining count, and a reset time (GitHub uses
X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Reset; the IETFRateLimitdraft standardizes this). On a 429 (or 503) a Retry-After header tells the client exactly how long to wait.This lets a good client self-throttle proactively — slow down as
remainingapproaches zero and back off precisely after a 429 — instead of blindly hammering and guessing.What a strong answer covers429 = rate limited; pair it with
Retry-After(seconds or a date).Limit/Remaining/Reset headers let clients pace themselves before being blocked.
GitHub's API documents
X-RateLimit-*; an IETFRateLimitheader draft standardizes the pattern.Proactive self-throttling beats reactive retry-storms.
Follow-ups they push on- What format can Retry-After take? (delay-seconds or an HTTP date)
- Why surface Remaining/Reset instead of only a 429?
Red flag Returning 429 with no Retry-After or budget headers, leaving clients to guess and retry-storm. Tell them how long to wait and how much budget remains.
source: GitHub REST API — Rate limits ↗ -
REST vs GraphQL vs gRPC vs WebSockets — when do you reach for each?
REST: default for public CRUD over HTTP; cacheable, simple, ubiquitous. GraphQL: client picks exactly the fields it needs — kills over/under-fetching when many clients aggregate data from many resources; cost is caching and query-complexity control. gRPC: high-performance internal service-to-service calls over HTTP/2 + protobuf, with streaming; not browser-native. WebSockets: persistent bidirectional real-time channel (chat, live feeds, multiplayer).
Choose by traffic shape: public+cacheable → REST; flexible client queries → GraphQL; fast internal RPC → gRPC; push/real-time → WebSockets.
Follow-ups they push on- Why is GraphQL harder to cache than REST?
- Why isn't gRPC used directly from browsers?
Red flag Reaching for GraphQL or gRPC by default. For a simple public CRUD API, REST is usually the lower-friction, more cacheable choice.
source: ByteByteGo — REST vs GraphQL vs gRPC ↗ -
Offset pagination vs cursor (keyset) pagination — what breaks with offset at scale?
Offset/limit (
LIMIT 20 OFFSET 10000) is simple but the database must scan and discard every skipped row, so deep pages get slow, and rows shifting between requests cause duplicates or skips.Cursor/keyset pagination passes the last-seen sorted key (
WHERE id > :lastId ORDER BY id LIMIT 20). It uses the index directly, so performance is constant regardless of depth, and it is stable under inserts. Tradeoff: you can't jump to an arbitrary page number. Use cursors for infinite scroll and large/active datasets.Follow-ups they push on- Why does offset pagination skip or duplicate rows under writes?
- How do you build a cursor over a non-unique sort column?
Red flag Using OFFSET for an infinite feed — as users scroll, new inserts shift the window and they see duplicates. Cursors avoid that.
source: Hello Interview — Pagination patterns ↗ -
How do you version a public API, and how do you evolve it without breaking clients?
Three common strategies: URI versioning (
/v1/users) — explicit and cache-friendly, the most common; header versioning (Accept: application/vnd.api.v2+json) — cleaner URLs, harder to test in a browser; and query param (?version=2).The deeper answer is to avoid breaking changes at all: add fields rather than remove, treat unknown fields as ignorable, never repurpose a field's meaning, and only bump the major version for genuinely incompatible changes. Announce deprecations with timelines and
Deprecation/Sunsetheaders.Follow-ups they push on- What counts as a breaking vs non-breaking change?
- How does Stripe version without URL bumps? (dated versions pinned per account)
Red flag Bumping the version for additive changes. Adding an optional field is backward-compatible and shouldn't force clients to migrate.
source: Hello Interview — API design (versioning) ↗ -
Design a bulk-create endpoint that imports 10,000 records. Sync or async, and how do you report results?
Don't process 10k records in a synchronous request — you'll hit timeouts and tie up a worker. Accept the payload, validate it cheaply, enqueue a background job, and return 202 Accepted with a job/status URL (
Location: /imports/{id}). The client polls that URL (or subscribes) for progress and the final per-record outcome.Key decisions: define partial-failure semantics (all-or-nothing transaction vs per-record results so 9,998 succeed and 2 errors are reported), make the import idempotent via a client-supplied batch key so retries don't double-import, and cap batch size with backpressure.
Follow-ups they push on- All-or-nothing vs per-record partial success — which and why?
- How do you make the bulk import idempotent under client retries?
- What status code signals 'accepted but not yet done'? (202)
Red flag Processing the whole batch inline and returning one 200/500. Long requests time out, and a single bad row failing the entire batch is a poor contract — go async with per-record results.
source: MDN — 202 Accepted ↗ -
How do idempotency keys make a payment POST safely retryable? Walk through the server logic.
The client generates a unique key (e.g. a V4 UUID) and sends it in an
Idempotency-Keyheader. The server stores the key with the request's outcome.Logic: on first request for a key, process it and persist the resulting status + response body keyed by that idempotency key (inside the same transaction as the side effect). On any retry with the same key, return the stored response instead of re-charging. Handle the in-flight case (a retry arriving while the first is still processing) with a lock or a 409. Stripe expires keys after 24 hours. This turns a non-idempotent POST into a safely retryable one after a timeout.
Follow-ups they push on- Where do you store the key — same DB transaction as the charge? Why?
- What if two identical requests arrive concurrently?
Red flag Storing the idempotency record separately from the side effect, so a crash between the charge and the record leaves you able to double-charge. Persist them atomically.
source: Stripe — Designing robust APIs with idempotency ↗ -
Design a rate limiter for an API. Which algorithm would you use and why?
The token bucket is the common default — a bucket refills tokens at a fixed rate up to a capacity; each request consumes a token, and an empty bucket means the request is rejected with 429 Too Many Requests (plus a
Retry-Afterheader). It allows short bursts while bounding the average rate. ByteByteGo notes both Amazon and Stripe use this algorithm to throttle their APIs.Alternatives: leaky bucket (smooths to a constant outflow), fixed window (simple but allows 2x bursts at window edges), and sliding window (smooths the edge problem). For a distributed limiter, keep counters in a shared store like Redis (atomic INCR with TTL) so all nodes agree.
Follow-ups they push on- What status code and header do you return when throttled?
- How do you keep the limit consistent across many API servers?
- Why does fixed-window allow a 2x burst?
Red flag Keeping the counter in each server's local memory in a multi-node deployment — clients then get N times the limit. Use a shared/atomic store.
source: ByteByteGo — Design a rate limiter ↗ -
What is HATEOAS, and is it actually used in practice?
HATEOAS (Hypermedia As The Engine Of Application State) is the REST constraint where responses include links to the next available actions, so the client discovers transitions dynamically (
{ "_links": { "cancel": "/orders/42/cancel" } }) instead of hardcoding URLs.In practice it's the least-adopted REST constraint — most 'REST' APIs are really HTTP+JSON without hypermedia. Be honest in interviews: know what it is and the decoupling argument, but acknowledge most teams skip it because clients are coupled to the API anyway and tooling support is thin.
Follow-ups they push on- What would full HATEOAS buy you that plain JSON doesn't?
- What is the Richardson Maturity Model?
Red flag Claiming your API is 'fully RESTful' while having no hypermedia — by Fielding's definition that's level 2, not true REST.
source: MDN — REST ↗
2.3 Auth & security concepts 14
-
What is CSRF, and why does a CSRF attack work even though the attacker never sees the victim's cookie?
CSRF (Cross-Site Request Forgery) tricks a logged-in victim's browser into making a state-changing request to your site. The attacker hosts a page that auto-submits a form (or fires a request) to
yourbank.com/transfer; because the browser automatically attaches the victim's cookies to any request to that origin, the request arrives authenticated — even though the attacker never read the cookie.The core enabler is ambient authority: cookies ride along by default. Defenses:
SameSitecookies (block cross-site sends), anti-CSRF tokens (a secret the attacker's page can't know), and checking Origin/Referer.What a strong answer coversThe browser auto-sends cookies to the target origin — the attacker exploits that, not the cookie value.
Only state-changing requests matter; CSRF can't read the response (same-origin policy).
SameSite=Lax/Strictcookies are the first-line modern defense.Anti-CSRF tokens add a secret the attacker's page cannot supply.
Quick self-checkWhy does a CSRF attack succeed without the attacker ever reading the session cookie?
-
Wrong — no decryption happens; the attacker never accesses the cookie.
-
Correct — ambient cookie authority sends it on the forged request.
-
Wrong — that's XSS; CSRF doesn't read the cookie.
-
Wrong — SOP is browser-enforced and still applies.
Follow-ups they push on- Why are JWTs in the Authorization header less exposed to CSRF than cookie sessions?
- Does CSRF let the attacker read the response? (No — SOP blocks that.)
Red flag Thinking HTTPS or HttpOnly stops CSRF. They don't — the browser still auto-attaches the cookie. SameSite and CSRF tokens are the defenses.
source: OWASP — Cross-Site Request Forgery Prevention Cheat Sheet ↗ -
Authentication vs authorization — state the difference crisply with an example.
Authentication answers 'who are you?' — verifying identity (password, token, passkey). Authorization answers 'what are you allowed to do?' — checking permissions after identity is established.
Example: logging in with your password is authentication; the check that decides you can read but not delete the document is authorization. Authn always precedes authz. The corresponding status codes: 401 Unauthorized = not authenticated; 403 Forbidden = authenticated but not permitted.
Follow-ups they push on- Which HTTP status maps to each failure?
- Where does each typically live in a request pipeline?
Red flag Swapping 401 and 403, or saying 'authorization checks your password'. Authorization assumes identity is already known.
source: Auth0 — Authentication vs Authorization ↗ -
Walk through the three parts of a JWT. What does the signature guarantee — and what does it NOT?
A JWT is
header.payload.signature, each base64url-encoded and joined by dots. The header names the algorithm; the payload holds the claims (sub,exp, roles); the signature is computed over header+payload with a secret (HMAC) or private key (RSA/ECDSA).The signature guarantees integrity and authenticity — the server detects any tampering and confirms the token was issued by a holder of the key. It does not provide confidentiality: the payload is merely encoded, not encrypted, so anyone can base64-decode and read it. Never put secrets in a JWT payload, and always verify the signature server-side.
What a strong answer coversThree parts: header, payload (claims), signature — base64url, dot-separated.
Signature → integrity + authenticity (tamper-evident, proves the issuer).
Payload is encoded, not encrypted — readable by anyone; no secrets in it.
Standard claims:
sub,exp,iat,iss,aud.
Quick self-checkWhat does a valid JWT signature prove?
-
Wrong — the payload is base64-encoded plaintext, not encrypted.
-
Correct — that's integrity and authenticity.
-
Wrong — replay needs short expiry/jti, not the signature.
-
Wrong — claims drive authorization, and they must still be checked.
Follow-ups they push on- Why must the server verify the signature on every request?
- What's the difference between a signed (JWS) and an encrypted (JWE) token?
Red flag Storing sensitive data in the JWT payload assuming it's hidden. It's base64-decodable plaintext — signing protects integrity, not confidentiality.
source: jwt.io — Introduction to JSON Web Tokens ↗ -
Session cookies vs JWTs for API auth — compare the tradeoffs. How do you revoke each?
Sessions: server stores session state, the client holds an opaque session id in an
HttpOnlycookie. Stateful, but revocation is trivial — delete the server-side session. Needs shared session storage to scale horizontally.JWTs: a signed, self-contained token the server verifies without a lookup — stateless and scales easily. The catch is revocation: a valid JWT is honored until it expires, so logout/ban requires a denylist or short expiry + refresh tokens, which reintroduces state. Use short-lived access tokens (minutes) plus a refresh token to limit the blast radius.
Follow-ups they push on- How do you revoke a JWT before it expires?
- Where should the browser store a JWT — localStorage or a cookie?
Red flag Calling stateless JWTs strictly better. Their headline weakness is revocation; any real logout/ban story drags state back in.
source: Auth0 — Token-based vs session-based authentication ↗ -
OAuth2 vs OIDC — what is each actually for? Don't conflate them.
OAuth 2.0 is delegated authorization: 'let app A access my data on service B' without sharing my password — it issues access tokens scoped to resources. It says nothing about who the user is.
OIDC (OpenID Connect) is an authentication layer built on top of OAuth2. It adds an ID token (a JWT) and a standard
/userinfoendpoint, so the app learns *who* logged in — this is what powers 'Log in with Google'. So: OAuth2 = access to resources; OIDC = proof of identity.Follow-ups they push on- What does the ID token contain that the access token doesn't?
- Why is using a raw OAuth2 access token as proof of login a mistake?
Red flag Using a bare OAuth2 access token to authenticate a user. Access tokens are for resource access; identity comes from the OIDC ID token.
source: OpenID Connect — How it works ↗ -
How should passwords be stored, and why is a fast hash like SHA-256 the wrong choice?
Never store plaintext or reversible encryption. Use a slow, salted, adaptive password hash — bcrypt, scrypt, or Argon2 (the current OWASP-preferred). The salt (unique per user) defeats rainbow tables; the deliberate slowness/work factor caps how many guesses an attacker can make per second after a breach.
Fast general-purpose hashes (SHA-256, MD5) are wrong precisely because they're fast — a GPU computes billions per second, making offline brute force cheap. Choose a memory-hard function and raise the cost factor as hardware improves.
Follow-ups they push on- What does the salt protect against specifically?
- Why is Argon2 preferred over bcrypt today?
Red flag Using SHA-256/MD5 (even salted) for passwords. They're built to be fast, which is the opposite of what password hashing needs.
source: OWASP — Password Storage Cheat Sheet ↗ -
Why use short-lived access tokens with refresh tokens instead of one long-lived token?
A stateless access token can't be revoked before it expires, so you want it to live only minutes — that bounds the damage if it leaks. To avoid forcing the user to log in every few minutes, a longer-lived refresh token (stored more securely, server-trackable) is exchanged for fresh access tokens.
This splits concerns: access tokens are stateless and fast to verify; refresh tokens are the revocable, stateful part. Add refresh token rotation (issue a new refresh token each use and invalidate the old one) so a stolen refresh token is detected on reuse.
Follow-ups they push on- What is refresh token rotation and what attack does it catch?
- Where do you store the refresh token vs the access token?
Red flag Issuing a long-lived access token 'for convenience'. If it leaks you have no way to revoke it until expiry.
source: Auth0 — Refresh tokens ↗ -
Debugging: a JWT library accepts a token with alg: none and lets a forged admin token through. What happened?
This is the classic
alg: none/ algorithm-confusion vulnerability. The JWT header declares its own algorithm; if the verifier trusts that field, an attacker setsalg: none(or strips the signature) and the library skips verification, accepting a payload they forged (role: admin). A related attack swapsRS256forHS256, signing with the public key as if it were an HMAC secret.Fix: never let the token dictate the algorithm. Configure the verifier with an allowlist of expected algorithms, reject
none, and validateexp/aud/iss. Treat the header'salgas untrusted input.What a strong answer coversThe bug: the verifier trusts the attacker-controlled
algheader.alg: nonetells naive libraries to skip signature verification entirely.RS256→HS256 confusion lets the public key be abused as an HMAC secret.
Fix: pin the expected algorithm(s) server-side; reject
none; verify standard claims.
Quick self-checkWhat's the root cause of the alg:none JWT bypass?
-
No — with alg:none there's no signature check at all.
-
Correct — it should pin expected algorithms, not read them from the token.
-
No — expiry is unrelated to the signature-skip bug.
-
No — transport security doesn't affect signature verification logic.
Follow-ups they push on- Why is the RS256-to-HS256 swap dangerous when the public key is, well, public?
- Which standard claims should you always validate?
Red flag Calling a generic `verify()` that honors the token's own `alg`. Always pass an explicit algorithm allowlist; never accept `none`.
source: Auth0 — Critical vulnerabilities in JSON Web Token libraries ↗ -
RBAC vs ABAC — what's the difference, and when do you outgrow roles?
RBAC (Role-Based Access Control) grants permissions through roles: a user is an
editor, theeditorrole canupdate:article. Simple, auditable, and enough for most apps. It strains when access depends on context beyond a role — ownership, department, time of day, resource attributes — leading to a 'role explosion' (editor_team_a_readonly_weekends).ABAC (Attribute-Based Access Control) decides via policies over attributes of the user, resource, action, and environment (e.g. 'allow if
user.dept == resource.deptandtimeis business hours'). It's far more expressive but harder to reason about and audit. Start with RBAC; reach for ABAC when contextual, fine-grained rules cause role explosion.What a strong answer coversRBAC: permissions via roles — simple, auditable, sufficient for most apps.
ABAC: policies over user/resource/action/environment attributes — expressive, context-aware.
Role explosion signals you've outgrown pure RBAC.
ABAC trades simplicity/auditability for fine-grained flexibility.
Quick self-checkRequirement: 'a user may edit a document only if they are in the same department as the document.' Which model fits naturally?
-
Leads to role explosion and still can't express the per-resource match cleanly.
-
Correct — it's an attribute relationship, exactly ABAC's strength.
-
Wrong — too broad; grants everyone everything.
-
Wrong — the requirement is explicitly an access rule.
Follow-ups they push on- What's 'role explosion' and what causes it?
- How does ownership-based access (only edit your own posts) fit RBAC vs ABAC?
Red flag Encoding contextual rules as ever-more-specific roles. When permissions depend on resource attributes or context, that's an ABAC need, not more roles.
source: Auth0 — RBAC vs ABAC ↗ -
Why compare password hashes (and tokens) with a constant-time comparison instead of ==?
A normal string
==short-circuits at the first mismatching byte, so it returns faster the earlier the difference. An attacker measuring response timing can exploit this timing side channel to recover a secret (an API token or HMAC) byte by byte — try values until the comparison takes slightly longer, meaning one more byte matched.A constant-time comparison always examines the full length regardless of where bytes differ, leaking no timing information. Use the platform's
crypto.timingSafeEqual/hmac.compare_digestfor tokens, HMAC tags, and similar secrets. (Note: bcrypt/Argon2 verification already handles this for passwords.)What a strong answer covers==short-circuits, so its runtime depends on how many leading bytes match.An attacker can recover a secret byte-by-byte from timing differences.
Constant-time compare scans the full input regardless of mismatches.
Use
crypto.timingSafeEqual/hmac.compare_digestfor token/HMAC checks.
Follow-ups they push on- Why doesn't this timing concern apply to comparing two bcrypt hashes the same way?
- Where else do timing side channels show up?
Red flag Comparing secret tokens or HMAC signatures with ordinary string equality. The early-exit timing leak can let an attacker brute-force the secret one byte at a time.
source: OWASP — Cryptographic Storage Cheat Sheet ↗ -
What is a pepper, and how does it differ from a salt in password hashing?
A salt is a unique random value stored *alongside* each password hash; it ensures identical passwords produce different hashes and defeats precomputed rainbow tables. It's not secret — it lives in the database with the hash.
A pepper is a single secret value mixed into every password before hashing, but kept outside the database (in app config, a secret manager, or an HSM). The point of defense-in-depth: if an attacker steals only the database, the salts don't help them, and without the pepper they still can't crack the hashes offline. Salt = per-user, public, in DB; pepper = global, secret, outside DB.
What a strong answer coversSalt: per-user, random, stored with the hash — kills rainbow tables.
Pepper: global secret, kept out of the DB — defends against DB-only theft.
They're complementary, not alternatives.
Pepper rotation is harder, so it's stored in config/secret manager/HSM.
Quick self-checkWhat distinguishes a pepper from a salt?
-
Backwards — salt is per-user, pepper is global.
-
Correct — that's the defining difference.
-
Wrong — neither describes salt vs pepper.
-
Wrong — they serve different, complementary roles.
Follow-ups they push on- If both leak, does the pepper still help?
- Where should the pepper be stored, and why not in the DB?
Red flag Storing the pepper in the same database as the hashes — that defeats its entire purpose. The pepper's value comes from living somewhere a DB dump won't expose.
source: OWASP — Password Storage Cheat Sheet (Peppering) ↗ -
Walk through the OAuth2 authorization code flow. Why was PKCE added?
Authorization code flow: the app redirects the user to the auth server; the user authenticates and consents; the auth server redirects back with a short-lived authorization code; the app's backend exchanges that code (plus its client secret) for an access token over a back channel. Keeping tokens off the front channel is the point.
PKCE (Proof Key for Code Exchange) hardens this for public clients (SPAs, mobile) that can't keep a secret. The client sends a hashed
code_challengeup front and the originalcode_verifierat exchange time, so a stolen authorization code is useless without the verifier. PKCE is now recommended for all clients.Follow-ups they push on- Why is the implicit flow discouraged now?
- What attack does PKCE specifically stop?
Red flag Using the deprecated implicit flow (tokens in the URL fragment) for SPAs. The modern guidance is auth-code + PKCE.
source: oauth.com — Authorization Code with PKCE ↗ -
Where should a browser store an access token, and how do the choices map to XSS vs CSRF?
localStorageis readable by any JavaScript on the page, so a single XSS flaw leaks the token. AnHttpOnlycookie is invisible to JS (XSS can't read it) but is sent automatically, which opens CSRF.The pragmatic answer: store tokens in
HttpOnly,Secure,SameSite=Lax/Strictcookies and add anti-CSRF defenses (SameSite already blocks most cross-site sends; add a CSRF token for the rest). Keep access tokens short-lived. There's no storage location immune to a compromised front end — defense in depth plus a tight CSP matters more than the slot.Follow-ups they push on- How does SameSite=Strict mitigate CSRF?
- Why doesn't HttpOnly help against CSRF?
Red flag Claiming HttpOnly cookies are 'XSS-proof and safe'. They stop token theft via JS but are auto-sent, so you still need CSRF protection.
source: OWASP — JWT / token storage cheat sheet ↗ -
Common pattern: use OAuth/OIDC to log in, then issue your own session or JWT. Why do that instead of using the provider's token directly?
After OIDC verifies identity, you typically mint your own session/JWT rather than passing Google's token around. Reasons: you control expiry and revocation; you attach your app's roles/permissions and user id; you don't couple every internal service to the external provider's token format or availability; and you avoid leaking a powerful provider token across your backend.
The provider token is used once at login to establish identity; from then on your own credential governs the session.
Follow-ups they push on- What goes in your token that the provider's doesn't?
- How does this help if you later add a second identity provider?
Red flag Forwarding the raw Google/Apple token to every internal service. It couples you to the provider and complicates revocation and authorization.
source: OAuth.com — OAuth 2.0 Simplified ↗
2.4 Application architecture & patterns 14
-
What's the difference between MVC and a layered (controller/service/repository) architecture? Are they the same thing?
They overlap but aren't identical. MVC is a UI-organizing pattern: the Model holds data/state, the View renders it, and the Controller handles input and coordinates the two — its purpose is separating presentation from data.
A layered architecture stacks responsibilities by technical concern (presentation → business/service → data-access/repository), each layer depending only on the one below. In practice a server MVC framework's 'Controller' maps to the presentation layer, and the 'Model' often expands into service + repository layers. So MVC describes the request-handling triangle; layering describes the full vertical stack that the model side usually grows into.
What a strong answer coversMVC separates presentation (view) from data/state (model) via a controller.
Layered architecture separates by technical concern top-to-bottom.
MVC's 'Model' typically expands into service + repository layers.
They're complementary lenses, not competing choices.
Quick self-checkIn a layered backend, where does business logic (e.g. 'a refund can't exceed the original charge') belong?
-
No — the controller handles HTTP I/O and coordination, not domain rules.
-
Correct — business rules and orchestration live in the service layer.
-
No — the repository only abstracts data access.
-
No — that's presentation rendering.
Follow-ups they push on- Where does business logic live in a 'fat model' vs a service layer?
- Why is putting business logic in the controller a smell in both?
Red flag Cramming business logic and data access into the MVC controller. The controller is presentation/coordination; domain logic belongs in services, persistence in repositories.
source: MDN — MVC ↗ -
What is middleware in a web framework, and what does it look like in practice?
Middleware is a function in the request/response pipeline that runs before (and often after) the route handler. Each piece can inspect or mutate the request/response and either pass control to the next link or short-circuit (e.g. reject an unauthenticated request).
Classic uses: logging, authentication, body parsing, CORS, rate limiting, error handling. In Express the signature is
(req, res, next) => { ... next(); }. The ordered chain is what makes cross-cutting concerns composable instead of duplicated in every handler.Follow-ups they push on- How does calling (or not calling) next() control the chain?
- Why is error-handling middleware registered last?
Red flag Forgetting to call next() (or to send a response), which hangs the request silently.
source: Express — Using middleware ↗ -
Give a one-line 'smell it fixes' for each SOLID principle.
S — Single Responsibility: a class has one reason to change; fixes the god-class that mixes parsing, business rules, and DB code. O — Open/Closed: extend behavior without editing existing code; fixes the ever-growing switch you reopen for every new case. L — Liskov Substitution: subtypes must be usable through the base type without surprises; fixes the subclass that throws on a method the parent promises. I — Interface Segregation: many small interfaces over one fat one; fixes clients forced to implement methods they don't use. D — Dependency Inversion: depend on abstractions, not concretions; fixes high-level logic nailed to a specific DB/SDK, which kills testability.
Follow-ups they push on- Which SOLID principle most directly enables unit testing? (DIP)
- Give a concrete Liskov violation.
Red flag Reciting the names without a concrete smell. Interviewers want the problem each one removes, not the dictionary definition.
source: GeeksforGeeks — SOLID principles ↗ -
What does the Repository pattern give you, and what's the risk of a 'leaky' repository?
A Repository is a collection-like abstraction over persistence: the service asks for
userRepo.findActiveByEmail(email)and doesn't know whether that's SQL, a document store, or an in-memory list. It centralizes query logic, decouples the domain from the ORM, and makes services testable with a fake repository.The risk is a leaky abstraction: if the repository exposes
IQueryable, raw SQL fragments, or ORM-specific lazy-loading proxies, persistence concerns bleed into the service and the decoupling is gone. Keep the interface in domain terms — return domain objects, accept domain criteria — so the storage technology stays a private detail.What a strong answer coversCollection-like interface over persistence; hides the storage mechanism.
Decouples domain/service from the ORM and enables fake-based unit tests.
Centralizes query logic instead of scattering SQL across services.
Leak risk: exposing
IQueryable/raw SQL/lazy proxies re-couples callers to the DB.
Follow-ups they push on- Repository vs DAO — what's the conceptual difference?
- Why return domain objects rather than ORM entities directly?
Red flag Returning the ORM's query builder or lazy-loaded entities from the repository. Callers then depend on persistence details, defeating the abstraction.
source: Martin Fowler — Repository ↗ -
Trick: a class has 14 constructor parameters. Which design principle is being violated, and how do you fix it?
A bloated constructor (a 'too many dependencies' smell) usually signals a Single Responsibility Principle violation — the class is doing too many jobs, each pulling in its own collaborators. It's the constructor-injection symptom of a god class.
Fix by decomposing: extract cohesive groups of those dependencies into smaller focused classes (e.g. a
NotificationServicewrapping the email/SMS/push senders) so the original class depends on a few higher-level abstractions instead of fourteen low-level ones. The number of constructor args is a proxy metric; the real fix is restoring single responsibility, not hiding the args behind a service locator or a giant config object.What a strong answer coversMany constructor params → the class has too many responsibilities (SRP violation).
Constructor injection makes the bloat visible, which is a feature, not the bug.
Fix by extracting cohesive collaborators into focused sub-services.
Don't hide it with a service locator/God-config object — that masks the smell.
Quick self-checkA class needs 12 injected dependencies. The healthiest interpretation is:
-
Wrong — that hides the smell and harms testability.
-
Correct — too many dependencies signals too many responsibilities.
-
Wrong — wiring tooling doesn't reduce the responsibilities.
-
Wrong — reintroduces global state and breaks testing.
Follow-ups they push on- Why is hiding the dependencies behind a service locator the wrong fix?
- How does SRP relate to high cohesion?
Red flag 'Fixing' it by switching to a service locator so the dependencies become invisible. That hides the SRP violation instead of resolving it and hurts testability.
source: Refactoring Guru — Large Class smell ↗ -
Composition over inheritance — what does it mean and why is it usually the better default?
Inheritance models 'is-a' and binds a subclass to its parent's implementation at compile time — a rigid, white-box coupling that gets brittle with deep hierarchies (the fragile base class problem) and tempts Liskov violations. Composition builds behavior by holding other objects and delegating to them ('has-a'), which you can vary at runtime and swap for tests.
The guidance 'favor composition over inheritance' (from the Gang of Four) is about flexibility: small composed parts recombine freely, while inheritance hierarchies resist change. Use inheritance for genuine, stable is-a relationships with a real behavioral contract; prefer composition for sharing/reusing behavior.
What a strong answer coversInheritance = compile-time 'is-a', tight white-box coupling to the parent.
Composition = runtime 'has-a', delegate to swappable collaborators.
Deep hierarchies cause fragile-base-class and Liskov problems.
GoF guidance: favor composition; reserve inheritance for true, stable is-a.
Follow-ups they push on- How does the Strategy pattern embody composition over inheritance?
- When is inheritance still the right tool?
Red flag Reaching for inheritance to reuse a method, creating a deep hierarchy that's hard to change. If the relationship isn't a true is-a, compose and delegate instead.
source: Refactoring Guru — Favor composition over inheritance ↗ -
Why is the Singleton pattern considered a testability and design smell?
A Singleton enforces one global instance with global access. The problems: it's global mutable state in disguise, which hides dependencies (a class secretly reaches for
Logger.getInstance()instead of receiving it). That makes unit tests hard — you can't easily substitute a mock, tests share state and leak into each other, and parallel tests interfere.The usual fix is dependency injection: create one instance at the composition root and pass it in. You keep 'one instance' as a lifecycle policy without the hard-coded global lookup.
Follow-ups they push on- How does DI give you 'one instance' without the Singleton anti-pattern?
- When is a Singleton actually fine?
Red flag Defending Singleton as 'just one object'. The cost is the static global access point that hides dependencies and breaks test isolation.
source: GeeksforGeeks — Singleton design pattern ↗ -
Explain dependency injection and how it improves testability.
Dependency injection means a component receives its collaborators from outside (constructor/parameters) instead of constructing them itself. It's the practical expression of the Dependency Inversion Principle: code depends on an interface, and the concrete implementation is wired in at the edge.
Testability win: in a test you inject a fake/mock repository or HTTP client, so you can unit-test the service in isolation with no real database or network. It also decouples modules — swapping Postgres for an in-memory store is a wiring change, not a rewrite.
Follow-ups they push on- How does this relate to the Repository pattern?
- Constructor injection vs a service locator — which is cleaner and why?
Red flag Confusing DI with 'using a DI framework'. DI is just passing dependencies in; the container is optional sugar.
source: Martin Fowler — Inversion of Control & DI ↗ -
Walk through layered architecture (controller → service → repository). What belongs in each layer?
Controller: HTTP concerns only — parse/validate the request, call a service, map the result to a status code and response. Service: the business logic and orchestration — transactions, rules, coordinating multiple repositories; it knows nothing about HTTP. Repository: data access — encapsulates queries behind a collection-like interface so the service depends on an abstraction, not raw SQL.
The payoff is that each layer is testable and replaceable in isolation, and business logic doesn't leak into the web framework or the database.
Follow-ups they push on- Why keep HTTP concerns out of the service layer?
- Where does request validation live, and where do domain rules live?
Red flag Fat controllers with business logic and SQL inline — you lose testability and the logic gets tied to the web framework.
source: Martin Fowler — Patterns of Enterprise Application Architecture ↗ -
Strategy vs Factory vs Adapter — give a one-sentence use case for each.
Strategy: swap interchangeable algorithms behind one interface at runtime — e.g. pluggable payment processors or sort comparators, picked by configuration. Factory: centralize object creation so callers ask for *what* they want, not *how* it's built — e.g.
createParser(fileType). Adapter: wrap an incompatible third-party interface to match the one your code expects — e.g. adapting a legacy SDK to yourPaymentGatewayinterface.Mnemonic: Strategy varies behavior, Factory varies construction, Adapter reconciles interfaces.
Follow-ups they push on- Strategy vs simple if/else — when is the pattern worth it?
- How does Adapter differ from Decorator?
Red flag Applying a pattern for its own sake. A two-branch conditional doesn't need Strategy; patterns earn their cost when the variation is open-ended.
source: Refactoring Guru — Design patterns catalog ↗ -
Explain the Observer (pub/sub) pattern and the Decorator pattern. Give a real backend use of each.
Observer / pub-sub: subjects publish events and any number of subscribers react, with no direct coupling between them — e.g. on
OrderPlaced, the email service, inventory service, and analytics each subscribe independently. It decouples producers from consumers and underlies event-driven systems.Decorator: wrap an object to layer behavior without changing it, preserving the same interface — e.g. wrapping a repository with caching, then logging, then retry. Each layer adds one concern and delegates inward, so you compose features instead of editing the core class.
Follow-ups they push on- How does Observer relate to a message broker like Kafka?
- Decorator vs subclassing for adding logging — why prefer the decorator?
Red flag Confusing Decorator with Adapter. Decorator keeps the same interface and adds behavior; Adapter changes one interface into another.
source: Refactoring Guru — Observer ↗ -
Explain hexagonal (ports & adapters) architecture. What problem does it solve over a plain layered design?
Hexagonal architecture puts the domain/application core at the center and defines ports (interfaces) for everything it talks to. Adapters implement those ports for specific technologies — a Postgres adapter, a REST adapter, a Kafka adapter — and plug in at the edges. The dependency rule points inward: the core never imports a framework or driver.
Versus a strict top-down layered design (where the business layer still depends on a concrete data layer beneath it), hexagonal inverts those edge dependencies so the database, web framework, and message bus are all swappable, interchangeable details. The payoff is testability (drive the core through fake adapters) and decoupling the domain from infrastructure churn.
What a strong answer coversDomain core + ports (interfaces) + adapters (tech-specific implementations).
Dependencies point inward; the core knows nothing about frameworks/drivers.
DB, web, and messaging become swappable adapters, not foundational layers.
Enables testing the core in isolation through fake adapters.
Follow-ups they push on- What's a 'driving' (primary) adapter vs a 'driven' (secondary) adapter?
- How does this relate to the Dependency Inversion Principle?
Red flag Letting domain code import the ORM/web framework directly 'for convenience'. That re-couples the core to infrastructure and defeats the whole ports-and-adapters point.
source: Alistair Cockburn — Hexagonal Architecture ↗ -
What is inversion of control (IoC), and how is dependency injection a specific form of it?
Inversion of control is the general principle that a framework or container — not your code — drives the flow: instead of your code calling into a library, the framework calls your code at the right moments ('don't call us, we'll call you', the Hollywood Principle). Event loops, middleware pipelines, and template method patterns are all IoC.
Dependency injection is one specific kind of IoC: inverting *who supplies a component's dependencies*. Rather than a class constructing its own collaborators, something external (a container or the composition root) provides them. So DI inverts dependency acquisition; IoC is the broader family of 'the framework controls the flow, you fill in the parts'.
What a strong answer coversIoC: the framework controls flow and calls your code ('Hollywood Principle').
DI is a specific form of IoC — inverting how dependencies are supplied.
Other IoC examples: event loops, callbacks, middleware, template method.
DI ≠ a DI container; the container is just one way to do DI.
Quick self-checkWhich statement is correct about IoC and DI?
-
Wrong — DI is a specific form of the broader IoC principle.
-
Correct — DI is a subset of inversion of control.
-
Wrong — an event loop is IoC with no DI container.
-
Wrong — it's a general principle (callbacks, template method, etc.).
Follow-ups they push on- Give a non-DI example of inversion of control.
- Why is 'IoC container' a slightly misleading name for a DI framework?
Red flag Using IoC and DI as synonyms. DI is one instance of IoC (inverting dependency supply); IoC is the broader idea of the framework owning control flow.
source: Martin Fowler — Inversion of Control ↗ -
Give a concrete Liskov Substitution Principle violation and how you'd fix it.
Classic example:
Square extends Rectangle. Setting width and height independently is part of Rectangle's contract, but a Square forces them equal, so code that doesrect.setWidth(5); rect.setHeight(4); assert area == 20breaks when handed a Square. The subtype violates the base type's expectations.Fix: drop the inheritance — model
Shapewith anarea()method and make Square and Rectangle siblings, or use immutable value objects so the mutating contract that conflicts never exists. The lesson: 'is-a' in English isn't enough; the subtype must honor the supertype's behavioral contract.Follow-ups they push on- Why is 'a square is a rectangle' true in math but wrong here?
- How does LSP relate to using exceptions in overridden methods?
Red flag Treating LSP as just 'subclasses should work'. The real test is behavioral substitutability — preconditions can't strengthen, postconditions can't weaken.
source: GeeksforGeeks — Liskov Substitution Principle ↗
2.5 Concurrency & parallelism 13
-
What is a deadlock vs a livelock vs starvation? Distinguish all three.
Deadlock: threads are blocked forever, each waiting on a resource another holds — nobody moves (e.g. the AB/BA lock-ordering cycle). Livelock: threads aren't blocked and keep *changing state* in response to each other, but make no progress — like two people stepping aside in the same direction repeatedly in a hallway. Starvation: a thread *can* run but is perpetually denied the resource because others keep winning it (e.g. a low-priority thread under a greedy scheduler).
Key distinction: deadlock = stuck and idle; livelock = busy but unproductive; starvation = some progress overall, but one thread is unfairly shut out.
What a strong answer coversDeadlock: mutual blocking, zero activity, circular wait.
Livelock: active state changes but no forward progress.
Starvation: a thread is runnable but perpetually denied the resource.
Fairness/aging fixes starvation; lock ordering fixes deadlock.
Quick self-checkTwo threads each detect a conflict, both back off and immediately retry in lockstep, repeating forever without blocking. This is:
-
No — deadlocked threads are blocked and idle, not actively retrying.
-
Correct — they keep changing state in response to each other but make no progress.
-
No — starvation is one thread unfairly denied, not symmetric busy-retrying.
-
No — a race is a timing-dependent correctness bug, not a no-progress loop.
Follow-ups they push on- How can a naive retry-on-conflict loop cause livelock?
- How does priority aging address starvation?
Red flag Calling any 'no progress' situation a deadlock. Livelock threads are actively running, and starvation still has overall progress — different causes, different fixes.
source: GeeksforGeeks — Deadlock, Starvation, and Livelock ↗ -
Concurrency vs parallelism — what's the difference?
Concurrency is about *dealing with* many tasks by interleaving them — making progress on several by switching between them, even on a single core. Parallelism is *doing* many tasks at the same instant on multiple cores.
Rob Pike's line: concurrency is about structure, parallelism is about execution. A single-threaded async server is concurrent but not parallel; a CPU-bound job split across 8 cores is parallel. You can have concurrency without parallelism and vice versa.
Follow-ups they push on- Can you have parallelism without concurrency?
- Where does Node's event loop sit on this axis?
Red flag Using the terms interchangeably. Interleaving on one core is concurrency, not parallelism.
source: GeeksforGeeks — Concurrency vs parallelism ↗ -
Debugging: a Node.js endpoint that does heavy synchronous JSON crypto makes ALL other requests slow. Why, and how do you fix it?
Node runs your JavaScript on a single event-loop thread. A heavy *synchronous* CPU task (a big loop, sync crypto, JSON over megabytes) doesn't yield, so it blocks the event loop — every other pending request, timer, and callback stalls until it finishes. Async I/O isn't the issue; CPU-bound sync work is.
Fixes: move the CPU work off the loop — use a worker thread (or
worker_threads/a child process), the async variant of the crypto API, or offload to a separate service/queue. The rule: never run long synchronous CPU work on the event loop thread.What a strong answer coversNode's JS executes on one event-loop thread; sync CPU work blocks everything.
Async I/O is fine — the culprit is synchronous CPU-bound code.
Fix: worker threads / child process / async crypto / offload to a service.
Symptom: latency spikes across unrelated endpoints during the heavy call.
Quick self-checkA synchronous CPU-heavy handler slows all Node.js requests. The correct fix is to:
-
Wrong — await doesn't yield mid-CPU-loop; the loop still blocks.
-
Correct — that frees the event-loop thread to serve other requests.
-
Wrong — it hides nothing; other requests still queue behind the blocked loop.
-
Wrong — irrelevant; all routes share the one event loop.
Follow-ups they push on- Why doesn't adding more async/await help a CPU-bound loop?
- When would you reach for a separate service vs a worker thread?
Red flag Trying to fix it by sprinkling `async/await`. Awaiting doesn't yield during a synchronous CPU loop — you must move the computation off the event-loop thread.
source: Node.js — Don't block the event loop ↗ -
Thread-per-request vs event-loop (reactive) servers — what's the tradeoff at high concurrency?
Thread-per-request (classic Java/Tomcat, Apache prefork) assigns each connection a thread. The model is simple — blocking code reads top-to-bottom — but each thread costs ~1MB+ of stack and context-switch overhead, so tens of thousands of concurrent connections exhaust memory and the scheduler (the C10k problem).
Event-loop / reactive servers (Node, Netty, nginx) handle many connections on a few threads via non-blocking I/O and callbacks, scaling to huge connection counts with low memory. The cost is programming complexity (callbacks/async) and the danger that any blocking call freezes the loop. Threads suit moderate concurrency with blocking dependencies; event loops suit massive I/O-bound concurrency.
What a strong answer coversThread-per-request: simple blocking code, but per-thread memory + context switches cap concurrency.
Event loop: few threads, non-blocking I/O, scales to huge connection counts.
This is the classic C10k scaling story.
Event loops demand non-blocking code; one blocking call stalls everyone.
Follow-ups they push on- What is the C10k problem?
- How do virtual/green threads (e.g. Java loom, goroutines) blur this divide?
Red flag Assuming 'more threads = more scale'. Past a point, thread memory and context-switching dominate; that's exactly what event-loop models were built to avoid.
source: GeeksforGeeks — Thread per request vs event-driven model ↗ -
Threads vs processes — what's shared, what's isolated, and when do you pick each?
A process has its own isolated memory space; threads within a process share the same heap/address space. Threads are cheaper to create and communicate through shared memory; processes are heavier but isolated — a crash or memory corruption in one process can't directly corrupt another.
Pick threads for fine-grained shared-memory work where communication cost matters; pick processes for isolation and fault containment (and, in languages with a GIL like CPython, to get true CPU parallelism). The tradeoff is shared-memory speed vs. isolation and safety.
Follow-ups they push on- How do processes communicate without shared memory? (IPC, pipes, sockets)
- Why does the GIL push CPython to multiprocessing for CPU-bound work?
Red flag Assuming threads always parallelize CPU work — a global interpreter lock (CPython) serializes bytecode, so threads help I/O but not CPU-bound loops.
source: GeeksforGeeks — Difference between process and thread ↗ -
What is a race condition? Show a classic example and how to fix it.
A race condition is when the result depends on the unpredictable timing/interleaving of concurrent operations on shared state. Classic case: two threads run
balance = balance + 100. That's read-modify-write: both read the same old value, both add 100, both write back — one update is lost.Fix by making the critical section atomic: guard it with a mutex/lock, use an atomic increment, or use a compare-and-swap. The general principle is to serialize access to shared mutable state so only one thread is in the critical section at a time.
Follow-ups they push on- Why isn't `x++` atomic?
- What's a check-then-act race (e.g. 'if not exists, create')?
Red flag Assuming a single statement like `count++` is atomic — it compiles to load/add/store, which can interleave.
source: GeeksforGeeks — Race condition ↗ -
Mutex vs semaphore — define each and when you'd use which.
A mutex provides mutual exclusion: one holder at a time, and ownership matters — the thread that locks it should unlock it. Use it to protect a single shared resource / critical section.
A semaphore is a counter that permits up to N concurrent holders (
acquiredecrements,releaseincrements, block at zero). A binary semaphore (N=1) resembles a lock but has no ownership and is often used for signaling between threads. Use a counting semaphore to cap concurrency — e.g. limit to 10 simultaneous DB connections.Follow-ups they push on- What does 'ownership' give a mutex that a semaphore lacks?
- How would you bound a connection pool with a semaphore?
Red flag Treating a binary semaphore as a drop-in mutex. Without ownership, any thread can release it, which permits subtle bugs a mutex prevents.
source: GeeksforGeeks — Mutex vs semaphore ↗ -
Why does async I/O let a single thread handle thousands of connections?
Most server work is I/O-bound — waiting on the network, disk, or a database. With blocking I/O each connection ties up a thread that just sits idle during the wait, so 10k connections need ~10k threads (expensive memory + context switching).
Non-blocking async I/O flips this: the thread issues the I/O and immediately moves on; the OS notifies it (epoll/kqueue) when data is ready, and a callback/continuation resumes. One thread multiplexes thousands of in-flight waits because no thread blocks on the wait. The catch: a CPU-bound task blocks the loop and starves everyone, so async shines for I/O-bound, not CPU-bound, work.
Follow-ups they push on- When does async hurt? (CPU-bound work blocking the event loop)
- How is this different from a thread-per-request server?
Red flag Believing async is faster for everything. It wins on I/O concurrency; a heavy CPU computation still blocks the single event-loop thread.
source: MDN — Asynchronous JavaScript / event loop ↗ -
What is thread starvation in a connection/thread pool, and how does it cause a 'deadlock' without any locks?
Pool starvation: every thread (or DB connection) in a bounded pool is busy waiting on a resource that can only be supplied by *another* task that's now stuck in the pool's queue with no thread to run it. No mutex is involved, yet the system wedges — a 'pool-induced deadlock'.
Classic case: a request handler holds a pooled thread and synchronously calls back into the same service/pool, which is exhausted; the inner call waits for a thread that will only free up when the outer call returns. Fixes: never block a pooled thread waiting on the same pool, size pools to account for nested calls, separate pools for distinct workloads (bulkheading), and add timeouts so waiters fail fast instead of hanging forever.
What a strong answer coversAll pool threads/connections busy → queued work can't get a worker.
A task blocking on work that needs the same exhausted pool wedges the system.
It looks like a deadlock but has no locks — it's resource exhaustion.
Fixes: bulkhead separate pools, avoid nested same-pool blocking, add timeouts.
Follow-ups they push on- How does the bulkhead pattern prevent one workload from starving others?
- Why do checkout/borrow timeouts help even if they don't fix the root cause?
Red flag Blocking a pooled worker on a call that itself needs a worker from the same exhausted pool. Use separate pools and timeouts, and avoid nested same-pool blocking.
source: Microsoft — Bulkhead pattern ↗ -
Why is double-checked locking for lazy singleton initialization subtly broken without proper memory visibility?
Double-checked locking checks
instance == null, locks only if null, checks again inside the lock, then constructs. The subtle bug is memory visibility / instruction reordering: object construction isn't atomic — the reference can become visible to other threads *before* the constructor's writes are flushed, so a second thread sees a non-null but partially-initialized object.The fix depends on the memory model: in Java, mark the field
volatile(which since JMM 5 establishes the needed happens-before ordering); other languages need their equivalent memory barrier / acquire-release semantics. The deeper lesson: correctness under concurrency needs the language's memory model guarantees, not just mutual exclusion.What a strong answer coversThe flaw is reordering/visibility, not the locking logic itself.
A thread can publish the reference before the constructor's writes are visible.
Fix in Java:
volatilefield (post-Java-5 memory model).Lesson: concurrency correctness requires memory-model guarantees, not just locks.
Quick self-checkWhat makes naive double-checked locking unsafe?
-
Wrong — the lock is taken at most once, only when instance is null.
-
Correct — without volatile/barriers a thread can see a partially-initialized object.
-
Misleading — the issue is visibility/ordering, not the check itself.
-
Wrong — they can; the issue is doing it safely.
Follow-ups they push on- What does `volatile` guarantee that a plain field doesn't?
- Why is a static holder/initialization-on-demand idiom often cleaner than DCL?
Red flag Believing the second null-check alone makes DCL safe. Without volatile/memory barriers, reordering can expose a half-constructed instance.
source: Wikipedia — Double-checked locking ↗ -
What is a deadlock, what four conditions cause it, and how do you prevent it?
A deadlock is when threads wait forever on each other's locks. It needs all four Coffman conditions simultaneously: mutual exclusion, hold-and-wait, no preemption, and circular wait.
Break any one to prevent it. The most practical: impose a global lock-ordering so all threads acquire locks in the same order (kills circular wait); or acquire all locks at once (kills hold-and-wait); or use lock timeouts /
tryLockand back off. Example deadlock: thread A holds lock 1 and wants lock 2 while thread B holds lock 2 and wants lock 1.Follow-ups they push on- Which Coffman condition is easiest to remove in practice? (circular wait via ordering)
- Deadlock vs livelock vs starvation?
Red flag Adding 'just one more lock' to fix a race and creating a deadlock instead. Inconsistent lock acquisition order is the usual culprit.
source: GeeksforGeeks — Deadlock and conditions ↗ -
What is optimistic vs pessimistic locking, and when do you pick each?
Pessimistic locking assumes conflicts are likely, so it locks the row/resource up front (
SELECT ... FOR UPDATE) and others wait. Safe but reduces concurrency and risks deadlocks.Optimistic locking assumes conflicts are rare: read freely, and at write time check a version number (or timestamp/ETag) — if it changed, someone else won, so reject and retry. Great for low-contention, read-heavy workloads; wasteful retries under high contention. Pick pessimistic for hot, highly-contended rows; optimistic for mostly-independent updates.
Follow-ups they push on- How does a version column implement optimistic locking?
- How does this map to HTTP's If-Match / 412?
Red flag Using optimistic locking on a hotly-contended counter — you'll thrash on retries. High contention favors pessimistic locking.
source: Martin Fowler — Optimistic Offline Lock ↗ -
Why are atomic operations and compare-and-swap (CAS) faster than locks for simple shared counters?
A mutex involves OS-level machinery: contended threads may block and be parked/woken by the scheduler, which costs context switches. CAS is a single hardware instruction — 'if memory still holds the value I read, swap in the new value, else fail' — so a lock-free counter just loops
read → compute → CAS, retry on failureentirely in user space with no blocking.Under low-to-moderate contention this is much cheaper. The tradeoff: CAS works for small single-word updates; complex multi-variable invariants still need locks, and very high contention makes CAS retry loops spin wastefully. This is the basis of lock-free/atomic data structures.
What a strong answer coversLocks can block threads → context-switch and scheduler overhead.
CAS is one atomic CPU instruction; lock-free loops stay in user space.
Great for single-word updates (counters, flags); not for multi-variable invariants.
Under heavy contention, CAS retry loops can spin and waste CPU.
Follow-ups they push on- What is the ABA problem in CAS-based algorithms?
- When does a spinning CAS loop perform worse than a mutex?
Red flag Assuming lock-free is always faster. CAS shines for tiny updates under modest contention; complex invariants and extreme contention can favor locks.
source: GeeksforGeeks — Compare and Swap (CAS) ↗
2.6 Messaging & event-driven architecture 13
-
Point-to-point queue vs publish/subscribe — what's the difference and when do you use each?
In a point-to-point queue, each message is delivered to exactly one consumer among possibly many competing workers — it's a work queue for distributing tasks (e.g. resize one image once, no matter how many workers are running). In publish/subscribe, each message is fanned out to *every* subscriber, so N independent services all react to the same event.
Use point-to-point to load-balance work across a pool (competing consumers); use pub/sub to broadcast an event to multiple independent consumers. Kafka models pub/sub via consumer groups: across groups it's fan-out, within a group it's point-to-point load balancing.
What a strong answer coversQueue (point-to-point): one message → exactly one of the competing consumers.
Pub/sub: one message → every subscriber (fan-out).
Queues load-balance work; pub/sub broadcasts events.
Kafka consumer groups: fan-out across groups, load-balance within a group.
Quick self-checkYou need an OrderPlaced event to trigger email, inventory, AND analytics services independently. Which model?
-
Wrong — each message goes to only one consumer; two services miss it.
-
Correct — every subscriber receives its own copy of the event.
-
Wrong — that distributes, not broadcasts; each event hits one service.
-
Wrong — doesn't fan out and recouples the producer.
Follow-ups they push on- How do Kafka consumer groups give you both models?
- What's the 'competing consumers' pattern?
Red flag Using a single shared queue when you actually need every service to see the event — only one consumer will get each message, and the others silently miss it.
source: AWS — Pub/sub messaging vs message queues ↗ -
Kafka vs RabbitMQ vs SQS — what's the conceptual difference and when does each fit?
Kafka is a durable, append-only distributed log: consumers track an offset and can replay; built for high-throughput streaming and event sourcing, with retention so multiple consumer groups read the same stream independently. RabbitMQ is a traditional broker with smart routing (exchanges, queues, bindings) and per-message acks — great for complex routing and classic task queues, but messages typically vanish once consumed. SQS is a fully managed AWS queue — minimal ops, at-least-once delivery, near-infinite scale, but no replay and limited ordering (FIFO queues excepted).
Pick Kafka for streaming/replay/high-throughput, RabbitMQ for rich routing and work queues, SQS when you want managed simplicity on AWS.
Follow-ups they push on- Why can Kafka replay events but RabbitMQ usually can't?
- What does a consumer offset give you that an ack doesn't?
Red flag Calling Kafka 'just a queue'. It's a retained log — consumers read by offset and can replay, which a delete-on-consume queue can't.
source: Hello Interview — Kafka deep dive ↗ -
Trick: does Kafka delete a message once a consumer reads it? What actually controls retention?
No — this is the key mental-model shift from traditional queues. Kafka is a durable log; reading a message does not remove it. The consumer just advances its offset (a bookmark), and the data stays on disk for everyone else to read. Multiple consumer groups can read the same messages independently, and a group can rewind its offset to replay.
Retention is governed by configured policy, not consumption: time-based (
retention.ms, e.g. 7 days) or size-based (retention.bytes), or log compaction (keep the latest value per key). Messages age out by policy regardless of whether anyone consumed them.What a strong answer coversReading does NOT delete — consumers advance an offset (a bookmark).
Data persists for all consumer groups; rewinding the offset replays.
Retention is by time/size policy or log compaction, independent of consumption.
Contrast: a traditional queue typically deletes on consume.
Quick self-checkWhat happens to a Kafka message after a consumer reads it?
-
Wrong — that's a traditional queue, not Kafka's log.
-
Correct — Kafka is a retained log; reading moves a bookmark.
-
Wrong — DLQs are for failures, not normal reads.
-
Wrong — retention is policy-based, not consumption-based.
Follow-ups they push on- What is log compaction and when do you use it?
- How does offset-as-bookmark enable replay and reprocessing?
Red flag Treating Kafka like a delete-on-read queue. Messages persist until the retention policy expires them — consumption only moves an offset.
source: Confluent — Kafka topics and retention ↗ -
Why does at-least-once delivery force you to build idempotent consumers?
Most brokers guarantee at-least-once delivery: if a consumer processes a message but crashes before acking, the broker redelivers it, so duplicates are inevitable. If processing isn't idempotent, a duplicate means double-charging, double-emailing, or double-incrementing.
Make the consumer idempotent: dedupe on a stable message id (record processed ids and skip repeats), or design the operation so reapplying it is a no-op (upsert, set-to-value instead of increment). Then redelivery is harmless.
Follow-ups they push on- How would you dedupe by message id, and where do you store seen ids?
- Why is 'set status = SHIPPED' safer than 'increment count'?
Red flag Assuming each message arrives exactly once. At-least-once is the norm; build for duplicates.
source: AWS — SQS at-least-once delivery ↗ -
What is a dead-letter queue and when does a message land there?
A dead-letter queue (DLQ) is a side queue where messages go after they repeatedly fail to be processed (exceeding a max-receive/retry count) or can't be delivered. It stops a single 'poison' message from being redelivered forever and blocking the main queue.
Operationally you alert on DLQ depth, inspect the failed messages, fix the bug or bad data, and replay them back to the main queue. Without a DLQ, a permanently-failing message either loops endlessly or gets silently dropped.
Follow-ups they push on- How do you decide the max-receive count before dead-lettering?
- What's a poison message and how does a DLQ contain it?
Red flag Having no DLQ, so a poison message either blocks the queue with infinite retries or is lost silently. Always have a parking lot.
source: AWS — Amazon SQS dead-letter queues ↗ -
What is consumer lag in Kafka, and what does growing lag tell you?
Consumer lag is the gap between the latest offset produced to a partition (the log-end offset) and the offset the consumer group has committed — i.e. how many messages are produced-but-not-yet-processed. Steady or near-zero lag means consumers keep up; growing lag means the consumers can't process as fast as producers write.
It's a primary health/alerting signal. Remedies for chronic lag: add consumers (up to the partition count — that's the parallelism ceiling), add partitions, speed up per-message processing, or batch. Spiky lag that drains is fine; monotonically rising lag predicts an eventual backlog blowup.
What a strong answer coversLag = latest produced offset − consumer's committed offset (unprocessed backlog).
Rising lag = consumers slower than producers.
Max parallelism is bounded by partition count — more consumers than partitions sit idle.
A core metric to alert on for streaming health.
Follow-ups they push on- Why can't you scale consumers beyond the partition count?
- How would you reduce lag without adding partitions?
Red flag Adding consumers beyond the number of partitions to cut lag. Extra consumers in a group just idle — you must increase partitions to raise parallelism.
source: Confluent — Monitoring consumer lag ↗ -
Choreography vs orchestration for the Saga pattern — how do you keep a multi-service transaction consistent?
With no distributed ACID transaction across microservices, a Saga breaks a business transaction into a sequence of local transactions, each publishing an event; if a step fails, compensating transactions undo the prior steps (e.g. refund a charge after inventory reservation fails).
Choreography: services react to each other's events with no central coordinator — loosely coupled but the end-to-end flow is implicit and hard to trace. Orchestration: a central orchestrator explicitly drives each step and triggers compensations — easier to reason about and monitor, but the orchestrator is a coupling point. Choose choreography for simple, few-step flows; orchestration as step count and error handling grow.
Follow-ups they push on- What's a compensating transaction, and why isn't it the same as a rollback?
- When does choreography's implicit flow become a liability?
Red flag Trying to span microservices with one ACID transaction (e.g. distributed 2PC everywhere). Sagas with compensations are the practical model; 2PC scales and fails poorly across services.
source: microservices.io — Saga pattern ↗ -
How do RabbitMQ acks and the prefetch (QoS) setting affect throughput and reliability?
With manual acks, RabbitMQ keeps a message 'unacked' until the consumer confirms it; if the consumer dies first, the message is requeued — that's how at-least-once delivery and crash safety work. Auto-ack trades that safety for speed (a crash mid-processing loses the message).
Prefetch (basic.qos) caps how many unacked messages a consumer may hold at once. Prefetch=1 gives the fairest load distribution (a slow consumer won't hoard a backlog) but adds round-trip overhead; a higher prefetch boosts throughput by pipelining but can let one consumer grab a big batch while others idle. Tune prefetch to balance fairness against throughput for your processing time.
What a strong answer coversManual ack = message redelivered if the consumer dies before acking (at-least-once).
Auto-ack is faster but loses in-flight messages on crash.
Prefetch limits unacked messages per consumer.
Low prefetch → fair distribution; high prefetch → throughput but possible hoarding.
Follow-ups they push on- Why does prefetch=1 give the fairest distribution but lower throughput?
- What happens to unacked messages when a consumer connection drops?
Red flag Using auto-ack for work you can't afford to lose, or leaving prefetch unbounded so one consumer grabs the whole queue while others starve.
source: RabbitMQ — Consumer Acknowledgements and Publisher Confirms ↗ -
How does a message queue provide back-pressure and load leveling, and what's the risk if you ignore queue depth?
A queue decouples producer rate from consumer rate: during a spike, messages buffer in the queue instead of overwhelming the downstream service, which keeps processing at its sustainable rate — that's load leveling (the queue-based load-leveling pattern). It smooths bursts into a steady drain.
But a queue is finite. If producers persistently outpace consumers, queue depth grows unbounded: latency climbs (messages wait longer), memory/disk fills, and you risk hitting limits or processing hours-stale data. Back-pressure is signaling producers to slow down (reject, throttle, or block) when depth crosses a threshold. Always monitor and alert on queue depth/age, cap the queue, and decide a shed/back-pressure policy — a queue defers overload, it doesn't eliminate it.
What a strong answer coversQueue buffers bursts so consumers drain at a sustainable rate (load leveling).
Back-pressure = signaling producers to slow when the queue fills.
Unbounded growth → rising latency, stale data, resource exhaustion.
Monitor depth/age; cap the queue and define a shedding/back-pressure policy.
Follow-ups they push on- How do you implement back-pressure when producers and consumers are decoupled?
- Why is a growing queue a latency problem even before it's a capacity problem?
Red flag Treating the queue as infinite elastic buffer. If consumers are chronically slower than producers, the queue just defers the overload while latency and staleness balloon.
source: Microsoft — Queue-Based Load Leveling pattern ↗ -
Is exactly-once delivery real? Explain the nuance.
Exactly-once *network delivery* is generally impossible — you can't simultaneously guarantee no loss and no duplicates across an unreliable network (two-generals problem). What systems offer is exactly-once processing / effective-once: at-least-once delivery plus idempotent or transactional handling so the observable effect happens once.
Kafka's 'exactly-once semantics' works this way: idempotent producers and transactions tie the consume-process-produce cycle together so duplicates don't produce duplicate effects. The honest framing: dedup + transactions give exactly-once *effects*, not magically-once delivery.
Follow-ups they push on- How does Kafka achieve its exactly-once semantics? (idempotent producer + transactions)
- Why is the consumer side still your responsibility for external side effects?
Red flag Claiming a broker delivers exactly once over the wire. Real systems get exactly-once *effects* via idempotency/transactions, not exactly-once delivery.
source: Confluent — Exactly-once semantics in Kafka ↗ -
Event-driven vs request/response — what do you gain and what do you give up?
Request/response is synchronous and simple: the caller waits and gets an answer or an error, with a clear linear flow that's easy to reason about and debug. But it couples services temporally — if the callee is down, the caller fails — and it doesn't absorb spikes.
Event-driven publishes events and lets consumers react asynchronously: it decouples producers from consumers, buffers load (the queue absorbs spikes), and lets you add new consumers without touching the producer. The costs are eventual consistency, harder end-to-end debugging/tracing, and the need for idempotency and ordering handling. Use events for fan-out, decoupling, and load-leveling; use request/response when you need an immediate answer.
Follow-ups they push on- How does a queue provide back-pressure / load-leveling?
- What new failure modes does async introduce?
Red flag Going event-driven everywhere and losing the simple synchronous read paths. Async adds eventual consistency and tracing complexity — use it where decoupling actually pays.
source: Netflix Tech Blog — event-driven architecture ↗ -
How does Kafka preserve message ordering, and what's the catch?
Kafka guarantees ordering only within a partition, not across a topic. Messages with the same partition key (e.g.
userId) always land in the same partition and are consumed in order, so per-key ordering holds.The catch: you only get parallelism by having multiple partitions, and across partitions there's no global order. So you trade total ordering for throughput. If you need strict global ordering you're limited to one partition (no parallelism) — the usual move is to choose a partition key that makes per-key ordering sufficient.
Follow-ups they push on- How do you pick a partition key for per-entity ordering?
- Why can't you both have many partitions and total ordering?
Red flag Assuming a Kafka topic is globally ordered. Ordering is per-partition; cross-partition order is undefined.
source: Hello Interview — Kafka deep dive ↗ -
How would you reliably publish an event after committing a DB write (the dual-write problem)?
The trap (a dual write) is committing to the DB and then publishing to the broker as two separate steps — a crash in between leaves them inconsistent (event lost, or published but DB rolled back).
The standard fix is the transactional outbox: in the same DB transaction as the business write, insert the event into an
outboxtable. A separate relay (polling or change-data-capture like Debezium) reads the outbox and publishes to the broker, marking rows sent. Because the write and the outbox insert commit atomically, the event is never lost; the relay gives at-least-once publishing, so consumers stay idempotent.Follow-ups they push on- Why not just publish then write, or write then publish?
- How does CDC / Debezium read the outbox?
Red flag Doing a naive dual write (commit DB, then send to Kafka). A failure between the two desynchronizes your DB and your event stream.
source: microservices.io — Transactional Outbox ↗
2.7 Distributed systems & scaling 15
-
Why is an idempotency key essential for a client retry after a timeout, and what subtlety makes timeouts dangerous?
A timeout is ambiguous: when a client's request times out, it cannot tell whether the server never received it, processed it but the response was lost, or is still processing. So a retry might be a true retry or an accidental duplicate of a request that already succeeded.
That's why a non-idempotent operation (charge a card, place an order) needs an idempotency key: the client sends a stable key, the server dedupes on it, and a retry of an already-applied request returns the original result instead of re-applying it. Without the key, the safe-looking retry can double-charge. The subtlety: the failure you can see (timeout) hides whether the side effect happened.
What a strong answer coversA timeout doesn't tell you if the operation succeeded — it's inherently ambiguous.
Retrying a non-idempotent op risks a duplicate side effect.
An idempotency key lets the server dedupe and return the first result.
Idempotent methods (GET/PUT/DELETE) are safe to retry without a key.
Follow-ups they push on- Why can a request that 'timed out' have actually succeeded server-side?
- How does this connect to at-least-once delivery in messaging?
Red flag Treating a timeout as a definite failure and blindly retrying a charge/order. The request may have completed; without an idempotency key you double-apply it.
source: AWS Builders' Library — Making retries safe with idempotent APIs ↗ -
Explain the CAP theorem. Under a partition, what are you actually choosing between?
CAP says that when a network partition happens, a distributed data store must choose between Consistency (every read sees the latest write) and Availability (every request gets a non-error response). You can't have both during a partition; without a partition you get both.
So it's really a choice made *when partitioned*. CP systems (e.g. HBase, MongoDB in its default config) refuse or block to stay consistent; AP systems (e.g. Cassandra, CouchDB) keep serving and reconcile later (eventual consistency). Important caveat: CAP says nothing about latency or scalability — it's strictly about behavior under partition.
Follow-ups they push on- Why is the 'pick 2 of 3' framing misleading?
- What does PACELC add to CAP?
Red flag Saying 'pick 2 of 3' as if you choose freely. Partition tolerance is mandatory in a distributed system; the real choice is C vs A only when a partition occurs.
source: system-design-primer — CAP theorem ↗ -
Compare load-balancing algorithms: round robin, least connections, and consistent hashing. When does each shine?
Round robin sends each request to the next server in rotation — simple and fine when requests are uniform and servers identical, but blind to actual load. Least connections routes to the server with the fewest active connections — better when request durations vary, since it adapts to real load instead of assuming uniformity.
Consistent hashing (hash the client/key to a server) keeps a given key/session on the same server — essential for cache affinity or sticky routing, and it minimizes remapping when servers are added/removed. Round robin for stateless uniform work; least connections for variable work; consistent hashing when affinity/locality matters.
What a strong answer coversRound robin: simple rotation, ignores load; good for uniform requests.
Least connections: adapts to variable request durations.
Consistent hashing: routes a key to a stable server (cache/session affinity).
Weighted variants account for heterogeneous server capacity.
Quick self-checkRequests have highly variable processing times. Which LB algorithm adapts best to real server load?
-
Wrong — it ignores how busy each server actually is.
-
Correct — it routes to the least-busy server, adapting to variable durations.
-
No better than round robin at tracking actual load.
-
Not a real load-aware strategy; concentrates load.
Follow-ups they push on- When does round robin distribute poorly?
- Why does consistent hashing help cache hit rates behind a load balancer?
Red flag Defaulting to round robin when request costs vary wildly — a few expensive requests pile onto one server while others idle. Least connections adapts better.
source: Cloudflare — What is load balancing? ↗ -
What is the difference between latency and throughput, and why can optimizing one hurt the other?
Latency is how long a single operation takes (time per request); throughput is how many operations complete per unit time. They're related but distinct — a system can have high throughput and high latency at once.
They trade off because techniques that raise throughput often add per-request latency: batching many requests amortizes overhead (more throughput) but each request waits for the batch to fill (more latency); deep queues keep workers busy (throughput) but messages wait longer (latency). The discipline is to measure latency as a distribution (p50/p95/p99), not a mean, since tail latency is what users feel, and to choose the tradeoff per workload.
What a strong answer coversLatency = time per operation; throughput = operations per unit time.
Batching/queuing raise throughput but add per-request latency.
Report latency as percentiles (p95/p99), not averages — tails matter.
Little's Law links them: concurrency ≈ throughput × latency.
Quick self-checkWhy is p99 latency usually more informative than mean latency?
-
Wrong — percentiles are harder to compute than a mean.
-
Correct — averages mask the worst-case requests.
-
Wrong — p99 is the 99th percentile, not the max.
-
Wrong — it specifically surfaces the high-latency tail.
Follow-ups they push on- Why report p99 instead of the mean?
- How does batching trade latency for throughput?
Red flag Reporting only average latency. A good mean can hide a terrible p99 that real users hit; and maxing throughput via batching can quietly wreck per-request latency.
source: Hello Interview — Latency vs throughput ↗ -
Horizontal vs vertical scaling — and why does statelessness matter for scaling out?
Vertical scaling means a bigger machine (more CPU/RAM) — simple but bounded by the largest box and a single point of failure. Horizontal scaling means more machines behind a load balancer — effectively unbounded and fault-tolerant, but only if requests can hit any node.
That's why statelessness matters: if a server keeps user session state in local memory, the load balancer must pin a user to one node (sticky sessions), which breaks failover and uneven load. Push state to a shared store (Redis/DB) so any node can serve any request, and horizontal scaling becomes trivial.
Follow-ups they push on- What breaks if you keep sessions in local server memory?
- How do load balancers route — round robin, least connections, hashing?
Red flag Storing session state in process memory and then trying to scale horizontally — you're forced into sticky sessions, which undermine failover and balancing.
source: system-design-primer — Scalability ↗ -
What is eventual consistency, and why do distributed systems accept it?
Eventual consistency means replicas may temporarily disagree, but if writes stop, they converge to the same value given enough time. AP systems accept it as the price of staying available and low-latency under partitions and across regions.
Why accept it: strong consistency requires coordination (consensus, quorums) on every write, which adds latency and reduces availability when nodes can't reach each other. For many features — a like count, a social feed, a shopping cart — a few seconds of staleness is fine, and the availability/latency win is worth it. For money movement you choose strong consistency instead.
Follow-ups they push on- Give a feature where eventual consistency is fine and one where it's not.
- What is read-your-own-writes consistency?
Red flag Using eventual consistency for invariants that must hold immediately (e.g. account balances). Match the consistency level to the business need.
source: AWS — Eventual consistency ↗ -
What roles do an API gateway and service discovery play in a microservices system?
An API gateway is the single entry point for clients: it routes to the right service and handles cross-cutting concerns — auth, rate limiting, TLS termination, request aggregation, and sometimes response shaping — so each service doesn't reimplement them and clients don't need to know the internal topology.
Service discovery lets services find each other's network locations as instances scale up/down and move. A registry (Consul, Eureka, or Kubernetes DNS/Services) maps a logical service name to current healthy instances, so callers resolve a name instead of hardcoding IPs. Together they decouple clients from the shifting set of backend instances.
Follow-ups they push on- Client-side vs server-side discovery — what's the difference?
- What concerns belong in the gateway vs each service?
Red flag Putting business logic in the gateway. It handles routing and cross-cutting concerns; domain logic stays in the services.
source: microservices.io — API gateway & service discovery ↗ -
Design read scaling for a heavily-read database. How do replication and the read-your-writes problem interact?
For read-heavy load, add read replicas: writes go to the primary, which asynchronously replicates to replicas that serve reads, spreading read load and adding redundancy. The catch is replication lag — a replica may be milliseconds-to-seconds behind, so a user who just wrote can read a replica and not see their own change.
Fix the read-your-writes experience by routing a user's reads to the primary for a short window after they write, pinning their session to the primary, tracking a write timestamp/LSN and only reading replicas caught up past it, or using synchronous replication for the critical path (at a latency cost). Layer caching and, if writes also dominate, consider sharding.
Follow-ups they push on- What is replication lag and how do you measure it?
- When would you shard instead of (or in addition to) adding replicas?
Red flag Sending a user's immediate post-write read to an async replica and showing them stale data ('I just saved it — where did it go?'). Route recent writers to the primary or track their write position.
source: system-design-primer — Replication & federation ↗ -
Trick: a service adds aggressive client retries to improve reliability and the whole system gets less reliable under load. What happened?
This is a retry storm / metastable failure. When a dependency slows or briefly fails under load, every client retries — often multiplying traffic 3x or more right when the service is least able to handle it. The added load keeps the service overloaded, so it keeps failing, so clients keep retrying: a self-sustaining feedback loop that doesn't recover even after the original trigger passes.
Fixes: bound retries with a retry budget (cap retries as a fraction of traffic, not per-request), add exponential backoff with jitter, use circuit breakers to fail fast, and only retry idempotent operations. Retries help with isolated transient blips; unbounded retries under systemic load amplify the failure.
What a strong answer coversMass retries multiply load exactly when the service is already struggling.
Creates a self-sustaining (metastable) overload that outlasts the trigger.
Fix: retry budgets, backoff + jitter, circuit breakers, retry only idempotent ops.
Retries help isolated blips, not systemic overload.
Quick self-checkAggressive unconditional retries make a system LESS reliable under load because:
-
Not the core mechanism — the problem is amplified load, not memory.
-
Correct — that's the retry-storm / metastable failure loop.
-
Irrelevant and false; routing doesn't cause the storm.
-
False — retries don't disable load balancing.
Follow-ups they push on- What is a retry budget and why cap retries as a fraction of total traffic?
- How does a circuit breaker break the feedback loop?
Red flag Adding per-request retries everywhere as a blanket reliability boost. Under correlated failure they amplify load into a retry storm — bound them with budgets, backoff, and breakers.
source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗ -
What is sharding (horizontal partitioning), and why is choosing a good shard key the hard part?
Sharding splits one logical dataset across multiple databases/nodes by a shard key, so each shard holds a subset and the system scales writes and storage beyond one machine. Reads/writes route to the shard owning the key.
The shard key is the hard part because a bad one creates hotspots — picking a low-cardinality or monotonically-increasing key (like a timestamp) funnels traffic to one shard, defeating the point. You want a key that spreads load evenly *and* keeps commonly-joined data co-located so you avoid expensive cross-shard queries. Cross-shard transactions and re-sharding as you grow are the recurring pains, which is why teams delay sharding until replicas and caching are exhausted.
What a strong answer coversSharding = horizontal partitioning by a shard key across nodes.
Scales writes/storage past a single machine.
Bad shard key → hotspots (monotonic or low-cardinality keys are traps).
Cross-shard joins/transactions and re-sharding are the ongoing costs.
Follow-ups they push on- Why does a timestamp or auto-increment id make a poor shard key?
- How does consistent hashing reduce re-sharding pain?
Red flag Sharding on a monotonically increasing key (timestamp/sequence id) so all new writes hit the newest shard — a hotspot that recreates the single-node bottleneck.
source: MongoDB — Sharding and shard keys ↗ -
Monolith vs microservices — what are the real tradeoffs, and why not default to microservices?
A monolith is one deployable: simpler local dev, easy refactors across boundaries, in-process calls, one transaction — at the cost of coupled deploys and scaling the whole app together. Microservices give independent deploys, team autonomy, and targeted scaling — but you pay with distributed-systems tax: network failures, eventual consistency, distributed transactions/sagas, harder debugging and tracing, and heavy ops.
The seasoned answer: don't reach for microservices by default. Most teams should start with a well-modularized monolith and split out services only when a clear scaling, team-ownership, or deploy-cadence boundary justifies the added operational cost.
Follow-ups they push on- What forces a split — scaling, team size, or deploy cadence?
- What is a 'distributed monolith' and why is it the worst of both?
Red flag Starting greenfield with microservices for resume-driven reasons, inheriting distributed-systems complexity before you have the scale or teams to need it.
source: Martin Fowler — Monolith First ↗ -
Why retry failed calls with exponential backoff AND jitter? What goes wrong without jitter?
Retries handle transient failures, but naive retries cause two problems. Exponential backoff (wait 1s, 2s, 4s…) stops a struggling service from being hammered every few milliseconds while it tries to recover.
Jitter (randomizing each wait) prevents a thundering herd: if many clients fail at the same instant and all back off by the exact same schedule, they retry in synchronized waves that keep knocking the service over. Adding randomness spreads the retries out. Pair this with retry budgets/circuit breakers and only retry idempotent or idempotency-keyed operations.
Follow-ups they push on- Why only retry idempotent operations?
- What does a circuit breaker add on top of backoff?
Red flag Backoff without jitter — synchronized clients retry in lockstep, creating a self-reinforcing herd that prevents recovery.
source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗ -
What is a circuit breaker and how does it protect a distributed system?
A circuit breaker wraps calls to a dependency and tracks failures. In closed state calls pass through; once failures cross a threshold it opens and fails fast (returns an error or fallback immediately) instead of piling up calls on a sick service. After a cooldown it goes half-open, lets a trial request through, and closes again if it succeeds.
It prevents cascading failures: without it, a slow dependency exhausts the caller's threads/connections waiting on timeouts, which then takes the caller down, propagating upstream. Failing fast contains the blast radius and lets the dependency recover.
Follow-ups they push on- Walk through closed → open → half-open transitions.
- How does this complement timeouts and bulkheads?
Red flag Relying on retries alone with no breaker — retries against a failing dependency amplify load and accelerate the cascade.
source: Martin Fowler — CircuitBreaker ↗ -
What is consistent hashing and why do distributed caches and databases use it?
With plain
hash(key) % N, changing the number of nodes N remaps almost every key — catastrophic for a cache (mass misses) or a sharded DB (mass data movement). Consistent hashing maps both keys and nodes onto a ring; a key belongs to the next node clockwise. Adding or removing a node only relocates the keys in that node's arc — about 1/N of keys — instead of nearly all of them.Virtual nodes (each physical node placed at many ring positions) smooth out uneven distribution. This is why Cassandra, DynamoDB, and Memcached-style caches use it.
Follow-ups they push on- What do virtual nodes solve?
- How much data moves when you add the (N+1)th node?
Red flag Using modulo hashing for a sharded cluster, so adding one node reshuffles nearly all keys and stampedes the backing store.
source: system-design-primer — Consistent hashing ↗ -
Why are leader election and quorum used in distributed coordination?
Many tasks need exactly one node in charge (assigning work, ordering writes) to avoid conflicts — so the cluster elects a leader via a consensus protocol (Raft, Paxos, ZooKeeper/ZAB). If the leader dies, a new one is elected.
To agree despite failures, decisions use a quorum — a majority (
N/2 + 1). Requiring a majority for writes and reads guarantees any two quorums overlap, so the system never commits two conflicting decisions and can tolerate a minority of nodes failing. This is the backbone of consistent distributed stores and coordination services.Follow-ups they push on- Why a majority specifically? (overlapping quorums prevent split decisions)
- What is split-brain and how does quorum prevent it?
Red flag Allowing writes without a majority quorum, enabling split-brain where two partitions both think they have a leader and diverge.
source: The Raft Consensus Algorithm ↗
2.8 Caching 14
-
On a write, should you update the cache in place or delete (invalidate) the key? Why is delete usually safer?
Prefer delete (invalidate) over update-in-place. Updating the cache directly on write opens a race: two concurrent writers can set the cache in the opposite order from how they hit the DB, leaving the cache holding the older value permanently. Deleting the key sidesteps that — the next read just repopulates from the source of truth.
Delete is also cheaper (you don't recompute a value that may never be read) and avoids caching an intermediate state. The cost is one guaranteed cache miss after each write. For very hot keys you can refresh asynchronously, but the default rule is invalidate, don't update.
What a strong answer coversUpdate-in-place risks concurrent writers leaving a stale value forever.
Delete forces the next read to repopulate from the source of truth.
Delete avoids recomputing values that may never be read.
Cost: one cache miss after each write.
Quick self-checkOn a database write, the more robust cache strategy is usually to:
-
Risky — concurrent writers can reorder and leave a stale value permanently.
-
Correct — avoids the reorder race and stale-value pinning.
-
Wrong — that prolongs staleness, not fixes it.
-
Wrong — relying solely on TTL serves stale data until expiry.
Follow-ups they push on- Walk through the concurrent-writer race that update-in-place causes.
- When might you refresh the cache asynchronously instead of deleting?
Red flag Writing the new value straight into the cache on every update. Concurrent writes can reorder and pin a stale value; deleting the key is the robust default.
source: AWS Builders' Library — Caching challenges and strategies ↗ -
Where can caches live across a request's path, and what does each layer cache?
Caching exists at many layers, each closer to the user is cheaper: browser cache (per-user, static assets via Cache-Control/ETag); CDN / edge (shared, geo-distributed static and cacheable responses); application / in-process (a local in-memory map — fastest but per-instance and not shared); distributed cache (Redis/Memcached — shared across app servers); and the database query/buffer cache.
The instinct: cache as close to the user as the data's freshness allows. Each layer trades reach (shared vs per-instance) against latency and invalidation difficulty.
Follow-ups they push on- Trade-off of in-process vs distributed cache?
- Why is CDN caching great for static but tricky for personalized content?
Red flag Caching personalized/private data in a shared CDN or proxy layer, leaking one user's data to another. Mark it private/no-store.
source: system-design-primer — Caching ↗ -
Compare cache-aside, write-through, and write-back. When do you use each?
Cache-aside (lazy loading): app checks cache; on a miss it reads the DB, populates the cache, and returns. Most common for read-heavy workloads; only requested data is cached, but the first hit is a miss and stale data is possible without invalidation.
Write-through: writes go to cache and DB together, so the cache is always consistent — at the cost of higher write latency and caching data that may never be read. Write-back (write-behind): write to cache immediately and flush to the DB asynchronously — fast writes and great for write-heavy bursts, but a cache crash before flush loses data. Pick cache-aside by default; write-through when reads must never be stale; write-back when write latency dominates and you can tolerate the durability risk.
Follow-ups they push on- Which strategy risks data loss and why?
- How do you keep cache-aside from serving stale data?
Red flag Using write-back for data you can't afford to lose. An async-flush cache that dies before flushing loses the unwritten writes.
source: AWS Builders' Library — Caching challenges and strategies ↗ -
What makes a good cache key, and why is cache hit ratio the metric that matters most?
A good cache key is deterministic (same logical request → same key), specific enough to avoid collisions (include the parameters that change the result — user, locale, version), and normalized (sort query params, lowercase where appropriate) so equivalent requests share a key. Over-specific keys (including irrelevant params like a request id) fragment the cache and tank the hit rate; under-specific keys serve the wrong data.
Hit ratio (hits / total lookups) is the headline metric because a cache only pays off when most reads avoid the backend. A low hit ratio means you're spending memory and adding a layer for little benefit — investigate whether keys are too granular, TTLs too short, or the working set exceeds cache capacity.
What a strong answer coversKeys must be deterministic, normalized, and include exactly the result-affecting params.
Over-specific keys fragment the cache; under-specific keys serve wrong data.
Hit ratio = hits / lookups — the core measure of cache value.
Low hit ratio → keys too granular, TTL too short, or working set > capacity.
Follow-ups they push on- How can including a request id or timestamp in the key destroy the hit ratio?
- What does a sudden hit-ratio drop usually indicate?
Red flag Baking a unique/volatile value (request id, current timestamp) into the cache key, so every request misses — you've added overhead with a near-zero hit ratio.
source: AWS — Caching best practices ↗ -
Debugging: after deploying a new code version, users report seeing old data that won't refresh. Where would you look in the caching layers?
Stale data after a deploy almost always means a cache somewhere is serving the old version. Walk the layers from the client inward: the browser cache (
Cache-Control/Expireson the asset — a missing hash/versioned filename means the browser reuses the old bundle), the CDN/edge (needs a purge/invalidation for changed assets), the application/distributed cache (Redis/Memcached entries not invalidated by the deploy), and finally the DB query cache.Debug tooling: inspect response headers for
Age,X-Cache: HIT, andCache-Control; force-reload to bypass the browser; check whether the CDN was purged. The durable fix for static assets is cache-busting — content-hashed filenames so a new version is a new URL and old caches simply don't apply.What a strong answer coversCheck layers client→server: browser → CDN/edge → app/distributed cache → DB.
Inspect
Age,X-Cache, andCache-Controlto locate the serving cache.Static assets need content-hashed filenames (cache-busting) per deploy.
A CDN may need an explicit purge/invalidation after deploy.
Quick self-checkUsers keep getting the old JS bundle after a deploy. The most reliable fix is to:
-
Wrong — unscalable and doesn't fix CDN/proxy caches.
-
Correct — old caches simply don't match the new URL.
-
Partial at best — still serves stale until expiry and wastes revalidation.
-
Throws away all caching benefits; unnecessary with cache-busting.
Follow-ups they push on- How does a content hash in the filename make cache invalidation automatic?
- What does the Age header tell you about which cache served the response?
Red flag Serving versioned JS/CSS under a stable filename with a long max-age, so browsers and CDNs keep the old bundle after deploy. Hash the filename so a new build is a new URL.
source: MDN — HTTP caching (cache busting) ↗ -
What's the difference between read-through and cache-aside caching? Who is responsible for the database read in each?
Both are lazy-loading read strategies, but they differ in *who* loads on a miss. In cache-aside (lazy loading) the application owns the logic: it checks the cache, and on a miss it reads the DB and writes the result back to the cache itself. The cache is just a dumb store; the app is the orchestrator.
In read-through, the cache sits inline and loads from the DB on a miss transparently — the application only ever talks to the cache, and a provider/loader function populates it. Read-through centralizes the load logic (less duplicated code, consistent behavior) but needs cache support for it; cache-aside is more flexible and the most common pattern. Both still need a write strategy and TTLs to manage staleness.
What a strong answer coversCache-aside: the application reads the DB on a miss and populates the cache.
Read-through: the cache loads from the DB on a miss transparently.
Read-through centralizes load logic; cache-aside is more flexible/common.
Both are lazy (load on miss) and still need write/TTL strategies.
Quick self-checkIn cache-aside, who reads the database on a cache miss?
-
Wrong — that describes read-through, not cache-aside.
-
Correct — the app orchestrates the miss in cache-aside.
-
Wrong — databases don't push into caches in this pattern.
-
Wrong — a miss triggers a load, just by the application here.
Follow-ups they push on- Why is cache-aside more common despite read-through's cleaner app code?
- How does write-through pair with read-through?
Red flag Conflating the two and assuming the cache auto-loads in cache-aside. In cache-aside the application must explicitly read the DB and repopulate on every miss.
source: AWS — Database caching strategies (lazy loading vs read-through) ↗ -
LRU vs LFU vs FIFO eviction, plus TTL — how do you choose?
When a cache is full it evicts by policy. LRU drops the least-recently-used entry — the default; great when recent access predicts future access (temporal locality). LFU drops the least-frequently-used — better when some items are persistently hot regardless of recency, but it can keep stale 'once-popular' items. FIFO evicts the oldest inserted regardless of use — simple but ignores access patterns.
TTL is orthogonal: it bounds staleness by expiring entries after a time, independent of capacity pressure. Typical setup: LRU for capacity eviction plus a TTL for freshness.
Follow-ups they push on- When does LFU beat LRU?
- How does TTL interact with an eviction policy?
Red flag Treating TTL as an eviction policy. TTL bounds staleness over time; LRU/LFU/FIFO decide what to drop under memory pressure — they solve different problems.
source: GeeksforGeeks — Cache eviction policies ↗ -
Implement an LRU cache with O(1) get and put. What data structures do you use?
Combine a hash map (key → node) with a doubly linked list ordered by recency. The map gives O(1) lookup; the linked list gives O(1) move-to-front and O(1) eviction at the tail.
On
get: look up the node in the map, unlink it, move it to the head (most recent), return its value. Onput: if present, update and move to head; if new, insert at head and add to the map; if over capacity, remove the tail node and delete its key from the map. Both operations are O(1) because every step is a constant-time pointer/map update.Map (key -> node) + DLL: head=newest ... tail=evictFollow-ups they push on- Why a doubly (not singly) linked list?
- How would you make it thread-safe?
- In an interview, can you use a language built-in like LinkedHashMap?
Red flag Using an array or scanning the list to find the LRU item — that's O(n). The hash map + DLL pairing is what keeps both operations O(1).
source: LeetCode — LRU Cache (146) ↗ -
Redis vs Memcached — when would you pick each?
Memcached is a simple, multithreaded, in-memory key→blob cache — extremely fast and easy to scale for pure caching of opaque values. Redis is a richer in-memory data store: it has data structures (lists, sets, sorted sets, hashes, streams), optional persistence, replication, pub/sub, Lua scripting, and clustering.
Pick Memcached when you just need a fast, large, simple cache and want multithreaded throughput per node. Pick Redis when you need those data structures, durability, atomic operations, pub/sub, rate-limiter counters, leaderboards, or built-in replication/clustering — which is most modern use cases.
Follow-ups they push on- When does Memcached's multithreading actually win?
- What Redis features make it more than a cache?
Red flag Saying 'Redis is just a faster Memcached'. The real difference is Redis's data structures, persistence, and clustering, not raw speed.
source: AWS — Redis vs Memcached ↗ -
Why does Redis need a persistence and eviction policy, and what's the difference between RDB and AOF?
Redis holds data in memory, so two policies matter. Eviction (
maxmemory-policy) decides what happens when memory fills —noeviction(reject writes),allkeys-lru,volatile-ttl, etc. Pick LRU/LFU variants when using Redis as a cache;noevictionwhen it's a primary store you can't silently drop from.Persistence decides what survives a restart. RDB takes periodic point-in-time snapshots — compact, fast to load, but you lose writes since the last snapshot. AOF (append-only file) logs every write operation — far better durability (down to per-write fsync) at the cost of larger files and slower restart. Many run both: AOF for durability, RDB for fast restores. Treating Redis purely as a cache means you may not need persistence at all.
What a strong answer coversEviction policy governs behavior at
maxmemory; choose LRU/LFU for cache use.RDB = periodic snapshots: compact and fast to load, but loses recent writes.
AOF = append-only write log: stronger durability, bigger/slower.
Often run both; a pure cache may skip persistence entirely.
Follow-ups they push on- When would you choose noeviction over allkeys-lru?
- What's the durability/performance tradeoff of AOF fsync-everysec vs always?
Red flag Running Redis as a primary datastore with `noeviction` unset and no persistence, then losing data on a restart or silently dropping writes at maxmemory.
source: Redis — Persistence ↗ -
When is adding a cache the WRONG move? Name cases where caching hurts more than it helps.
Caching is not free — it adds a consistency problem and an extra failure mode. It's the wrong move when the data changes more often than it's read (you invalidate constantly, getting a near-zero hit ratio while paying the cost), when staleness is unacceptable (account balances, inventory at checkout, anything where a wrong value causes real harm), when the working set is far larger than memory so you thrash with evictions, or when the backend is already fast enough that the cache only adds complexity and a coherence bug surface.
The instinct: reach for caching when reads dominate writes and a little staleness is tolerable; otherwise the extra layer buys complexity, not speed.
What a strong answer coversWrite-heavy / frequently-changing data → constant invalidation, low hit ratio.
Strict-correctness data (balances, inventory) → staleness causes real harm.
Working set >> cache memory → thrashing evictions, little benefit.
Already-fast backend → cache adds complexity and a new failure/coherence surface.
Quick self-checkFor which workload is adding a cache LEAST likely to help?
-
Wrong — this is the ideal case for caching.
-
Correct — constant invalidation yields a near-zero hit ratio and added cost.
-
Wrong — caching a hot read key is highly effective.
-
Wrong — caching memoizes that cost; a clear win.
Follow-ups they push on- Why does a write-heavy workload defeat most caching strategies?
- How do you decide the read:write ratio threshold where caching pays off?
Red flag Adding a cache reflexively 'for performance' on write-heavy or correctness-critical data. You inherit invalidation bugs and a new failure mode for little or negative gain.
source: AWS Builders' Library — Caching challenges and strategies ↗ -
What is a cache stampede (thundering herd) and how do you prevent it?
A cache stampede happens when a hot key expires and many concurrent requests all miss simultaneously, then all hit the database at once to recompute the same value — a spike that can overload the backend.
Mitigations: request coalescing / locking so only one request recomputes while others wait for or briefly serve the old value; early/probabilistic expiration so one request refreshes the key slightly before it expires; stale-while-revalidate, serving the old value while refreshing in the background; and jittering TTLs so many keys don't expire at the same instant.
Follow-ups they push on- How does a per-key lock prevent the dogpile?
- What is probabilistic early expiration?
Red flag Giving many hot keys the same fixed TTL, so they expire together and trigger a synchronized backend stampede. Add jitter and coalesce recomputation.
source: AWS Builders' Library — Caching challenges and strategies ↗ -
Why is cache invalidation hard, and what are the failure modes (stale reads)?
There are really two hard problems: deciding *when* a cached value is no longer valid, and making the cache and source of truth agree across concurrent updates. With cache-aside, a classic race: reader gets a miss and starts loading the old value; a writer updates the DB and deletes the cache key; the reader then writes its stale value back — now the cache is wrong indefinitely.
Approaches: delete (don't update) the key on write so the next read repopulates fresh; use short TTLs to bound staleness; version keys; or for tighter consistency use write-through. There's no free lunch — you trade consistency, latency, and complexity.
Follow-ups they push on- Why delete the key on write instead of updating it?
- How does a short TTL bound the damage?
Red flag Updating the cache in place on writes (instead of deleting) and ignoring the read-load/write-delete interleaving — you cache a stale value that never self-heals.
source: AWS Builders' Library — Caching challenges and strategies ↗ -
What are cache penetration and cache avalanche, and how do they differ from a stampede?
Cache penetration: requests for keys that don't exist anywhere always miss the cache and hit the DB — common in scraping/attacks. Fix by caching the negative result (a short-TTL 'null' marker) or screening with a Bloom filter of valid keys.
Cache avalanche: a large set of keys expire at once (or the cache itself goes down), so traffic floods the DB en masse. Fix by jittering TTLs, layering caches, and adding rate limiting / circuit breakers in front of the DB. Versus a stampede, which is many concurrent requests for *one* expiring hot key — avalanche is many keys at once, penetration is keys that never exist.
Follow-ups they push on- How does a Bloom filter stop penetration cheaply?
- Why does jittering TTLs help avalanche?
Red flag Not caching negative lookups, so a flood of requests for nonexistent keys bypasses the cache entirely and hammers the database.
source: GeeksforGeeks — Cache penetration, avalanche, stampede ↗
03 Databases 98 Q's
3.1 Relational model & SQL basics 14
-
What's the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
COUNT(*)counts rows, including rows where every column is NULL.COUNT(col)counts rows wherecolis not NULL — NULLs are skipped.COUNT(DISTINCT col)counts the number of distinct non-NULL values.So on a column with NULLs,
COUNT(*)>=COUNT(col)>=COUNT(DISTINCT col). This trips people up in 'how many customers placed an order' style questions, where a LEFT JOIN leaves NULLs andCOUNT(*)over-counts.What a strong answer coversCOUNT(*)counts every row regardless of NULLs.COUNT(col)ignores rows wherecol IS NULL.COUNT(DISTINCT col)ignores NULLs and collapses duplicates.After a LEFT JOIN, count a non-null right-side column (not
*) to avoid counting unmatched rows.
Quick self-checkA `votes(id, choice)` column has 5 rows; `choice` is NULL in 2 of them, and the 3 non-null values are 'A','A','B'. What are COUNT(*), COUNT(choice), COUNT(DISTINCT choice)?
-
No — COUNT(choice) skips the 2 NULLs, so it's 3, not 5.
-
Correct — 5 rows total; 3 non-null values; 2 distinct non-null values (A, B).
-
No — COUNT(*) counts all 5 rows, and there are only 2 distinct values.
-
No — only two distinct non-null values exist ('A' and 'B'), so COUNT(DISTINCT) is 2.
Follow-ups they push on- After a LEFT JOIN, why does COUNT(*) over-report and COUNT(right_col) fix it?
- Is COUNT(1) any different from COUNT(*)? (no — same thing)
Red flag Assuming COUNT(col) counts all rows like COUNT(*), or using COUNT(*) after a LEFT JOIN and counting the NULL-filled unmatched rows.
source: PostgreSQL docs — Aggregate Functions ↗ -
What is the difference between a PRIMARY KEY, a UNIQUE constraint, and a FOREIGN KEY?
A primary key uniquely identifies a row: it is
UNIQUEandNOT NULL, and there is exactly one per table.A UNIQUE constraint also forbids duplicates but *does* allow a
NULL(one, in most engines), and a table can have many of them.A foreign key is a column whose values must exist as a key in another table — it enforces referential integrity (you cannot insert an order for a customer that does not exist, and the DB can block/cascade deletes).
Follow-ups they push on- What is a composite key?
- Can a foreign key reference a UNIQUE column instead of a primary key?
Red flag Saying a primary key is 'just a unique column' and forgetting the implicit NOT NULL, or claiming a table can have several primary keys (it has one, possibly composite).
source: DataLemur — Amazon SQL Interview Questions ↗ -
What is the difference between CHAR, VARCHAR, and TEXT, and when does the choice matter?
CHAR(n)is fixed-length — it pads with spaces ton, so it suits truly fixed codes (a 2-char country code, a fixed hash).VARCHAR(n)is variable-length with a declared max, erroring if you exceed it.TEXTis variable-length with no practical limit.In PostgreSQL there is no performance difference between them — the manual recommends
textorvarcharand noteschar(n)is usually the *slowest* due to padding. The length limit is mainly a data-integrity constraint, not an optimization. (In some other engines, like older MySQL row formats, fixed vs variable length had storage implications.)What a strong answer coversCHAR(n): fixed length, space-padded — only for genuinely fixed-width values.VARCHAR(n): variable length with an enforced maximum.TEXT: variable length, effectively unlimited.In Postgres these perform the same; a length cap is a constraint, not a speed win.
Follow-ups they push on- Does a VARCHAR(255) store faster than VARCHAR(1000) in Postgres? (no)
- When is CHAR(n) actually the right choice?
Red flag Believing a smaller VARCHAR(n) is faster or saves space in Postgres, or using CHAR for general text and getting surprised by trailing-space padding.
source: PostgreSQL docs — Character Types ↗ -
How do you classify employees into salary bands ('low'/'mid'/'high') in a single SELECT?
Use a
CASEexpression, which is SQL's inline if/else:SELECT name, salary, CASE WHEN salary < 50000 THEN 'low' WHEN salary < 100000 THEN 'mid' ELSE 'high' END AS band FROM employee;The searched
CASEevaluatesWHENbranches top-to-bottom and returns the first match, so order the boundaries carefully. With noELSE, unmatched rows getNULL. You can also wrapCASEinside an aggregate (SUM(CASE WHEN … THEN 1 ELSE 0 END)) for conditional counts — the classic pivot trick.What a strong answer coversCASE WHEN … THEN … [WHEN …] ELSE … ENDreturns the first matching branch.Branches are evaluated in order — overlapping conditions resolve to the first true one.
Omitting
ELSEyieldsNULLfor unmatched rows.SUM(CASE WHEN cond THEN 1 ELSE 0 END)does conditional counting / pivoting.
Follow-ups they push on- Rewrite a conditional COUNT using SUM(CASE WHEN …).
- What's the difference between a simple CASE and a searched CASE?
Red flag Ordering CASE branches so a broad condition shadows a narrower one, or forgetting that without ELSE the result is NULL, not 0.
source: PostgreSQL docs — Conditional Expressions (CASE) ↗ -
Write a query to return all employees in the Engineering department earning more than 100000, sorted by salary descending.
Straight
SELECT … WHERE … ORDER BY:SELECT name, salary FROM employees WHERE department = 'Engineering' AND salary > 100000 ORDER BY salary DESC;Watch the clause order —
WHEREfilters rows,ORDER BYruns last. String literals are single-quoted; double quotes mean an identifier in standard SQL.Follow-ups they push on- Add a tie-breaker so equal salaries sort by name.
- Return only the top 5 — LIMIT vs TOP vs FETCH FIRST?
Red flag Using double quotes around the string literal (an identifier in standard SQL/Postgres), or putting ORDER BY before WHERE.
source: PG Exercises — Basic ↗ -
What is the difference between WHERE and HAVING, and why can't you put an aggregate in WHERE?
WHEREfilters individual rows before grouping;HAVINGfilters groups after theGROUP BYruns.An aggregate like
COUNT(*)is not known until rows are grouped, so it cannot appear inWHERE— it belongs inHAVING. Example:SELECT dept, COUNT(*) FROM emp WHERE active = true GROUP BY dept HAVING COUNT(*) > 5;—activeis filtered per-row, the head-count per-group.Follow-ups they push on- Logical order of evaluation of FROM/WHERE/GROUP BY/HAVING/SELECT/ORDER BY?
- Can you reference a SELECT alias in HAVING?
Red flag Putting `WHERE COUNT(*) > 5`, or believing HAVING is just 'WHERE for the GROUP BY query' with no semantic difference.
source: PostgreSQL docs — GROUP BY and HAVING ↗ -
What does NULL mean in SQL, and why does `WHERE col = NULL` return nothing?
NULLis 'unknown', not a value. Any comparison withNULLusing=/<>yieldsUNKNOWN(not true), so the row is dropped —WHERE col = NULLalways returns zero rows.Use the dedicated operators:
WHERE col IS NULL/IS NOT NULL. Note aggregates skip NULLs (AVG,COUNT(col)) butCOUNT(*)counts the row regardless.Follow-ups they push on- What does `NULL = NULL` evaluate to?
- How does NULL behave inside NOT IN (subquery) — and why is that a trap?
Red flag Treating NULL as a value you can equality-test, or assuming `NOT IN` works when the subquery can yield a NULL (it then returns no rows).
source: PostgreSQL docs — Comparison Functions and Operators ↗ -
Find all duplicate email addresses in a Person table (emails appearing more than once).
Group by the column and keep groups of size > 1:
SELECT email FROM person GROUP BY email HAVING COUNT(*) > 1;This is the canonical 'GROUP BY + HAVING COUNT' pattern. To actually delete dupes you would keep
MIN(id)per group and remove the rest.Follow-ups they push on- Now delete the duplicates, keeping the row with the smallest id.
- Could a self-join solve this too? Compare it to GROUP BY.
Red flag Using `WHERE COUNT(*) > 1`, or `SELECT DISTINCT` (which hides duplicates rather than finding them).
source: LeetCode 196 — Duplicate Emails ↗ -
What is the difference between DELETE, TRUNCATE, and DROP?
DELETEremoves rows one at a time, can have aWHERE, fires triggers, is fully transactional and rollback-able.TRUNCATEempties the whole table in one fast metadata operation — no per-row WHERE, usually resets identity counters, minimal logging.DROPremoves the table definition itself (and its data) from the schema.Mnemonic: DELETE = some/all rows, TRUNCATE = all rows fast, DROP = the table is gone.
Follow-ups they push on- Is TRUNCATE transactional in Postgres? (Yes.) In other engines?
- Which of these can you roll back?
Red flag Claiming TRUNCATE can take a WHERE clause, or that DELETE and TRUNCATE are interchangeable (triggers, identity reset, and speed differ).
source: PostgreSQL docs — TRUNCATE ↗ -
What is the difference between UNION and UNION ALL, and which is faster?
UNIONconcatenates two result sets and removes duplicates (an implicit DISTINCT, which costs a sort/hash).UNION ALLkeeps every row, including duplicates.UNION ALLis faster because it skips the dedup step — prefer it whenever you know the inputs are already disjoint or duplicates are acceptable. Both require the same column count and compatible types in each branch.Follow-ups they push on- When is UNION (with dedup) actually required?
- Difference between UNION and a FULL OUTER JOIN?
Red flag Defaulting to UNION everywhere and paying for a needless dedup, or assuming the column lists must have identical names (only count/type must match).
source: StrataScratch — Meta SQL Interview Questions ↗ -
Find the employee(s) with the highest salary in each department.
The robust way is a window rank so ties are kept:
WITH r AS (SELECT name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk FROM employee) SELECT name, department, salary FROM r WHERE rk = 1;A correlated-subquery form also works:
WHERE salary = (SELECT MAX(salary) FROM employee e2 WHERE e2.department = e1.department). UseRANK(notROW_NUMBER) so two employees tied at the top of a department both appear.What a strong answer coversPartition by department, order by salary DESC, keep the top rank.
RANK() = 1keeps all ties;ROW_NUMBER() = 1arbitrarily keeps just one.Equivalent correlated subquery: compare each salary to that department's
MAX.A global
ORDER BY salary DESC LIMIT 1is wrong — it returns one row overall, not one per department.
Follow-ups they push on- Why RANK rather than ROW_NUMBER if ties should all be returned?
- Rewrite it without a window function using a correlated subquery.
Red flag Using `ORDER BY salary DESC LIMIT 1` (top earner overall, not per department) or ROW_NUMBER, which silently drops tied top earners.
source: LeetCode 184 — Department Highest Salary ↗ -
What does `WHERE status NOT IN ('shipped', 'delivered')` do to rows where status is NULL, and why?
It excludes them — a row with
status = NULLis *not* returned, even though NULL is obviously 'not shipped and not delivered' to a human.NOT IN (...)expands tostatus <> 'shipped' AND status <> 'delivered'. ComparingNULL <> anythingyieldsUNKNOWN, so the wholeANDisUNKNOWN, andWHEREkeeps only rows that areTRUE. To include NULLs you must say so explicitly:WHERE status NOT IN (...) OR status IS NULL.What a strong answer coversNOT INis sugar for a chain of<>comparisons joined byAND.Any comparison with NULL is
UNKNOWN, andWHEREkeeps onlyTRUErows — so NULL rows drop.The far more dangerous case: a NULL inside the list makes
NOT INreturn *no rows at all*.Add
OR col IS NULLto include NULLs, or preferNOT EXISTSwhich is NULL-safe.
Quick self-check`orders` has statuses 'shipped', 'pending', and NULL. `SELECT * FROM orders WHERE status NOT IN ('shipped');` returns…
-
Correct — `NULL <> 'shipped'` is UNKNOWN, so NULL rows fail the WHERE and only 'pending' survives.
-
No — NULL doesn't satisfy the comparison; it evaluates to UNKNOWN, not TRUE.
-
No — that catastrophe happens when a NULL is *inside the IN list*, not when a row's column is NULL.
-
No — the NULL rows are also dropped, so it isn't simply 'everything but shipped'.
Follow-ups they push on- What happens if the value list itself contains a NULL (e.g. from a subquery)?
- How does NOT EXISTS avoid this NULL trap?
Red flag Expecting NULL rows to satisfy a `NOT IN`/`<>` filter, or using `NOT IN (subquery)` where the subquery can yield NULL and silently returning zero rows.
source: PostgreSQL docs — Row and Array Comparisons (IN / NOT IN and NULL) ↗ -
Write a SQL query to get the average star rating for each product for each month.
Extract the month from the timestamp, group by it and the product, average the stars:
SELECT EXTRACT(MONTH FROM submit_date) AS mth, product_id, ROUND(AVG(stars), 2) AS avg_stars FROM reviews GROUP BY mth, product_id ORDER BY mth, product_id;Every non-aggregated SELECT column must appear in GROUP BY. ROUND tidies the output.
Follow-ups they push on- Group by year-and-month so January 2023 and January 2024 don't collapse together.
- How would NULL stars affect AVG?
Red flag Selecting a column that is neither grouped nor aggregated (errors in Postgres, silently picks an arbitrary row in old MySQL), or grouping by month only so different years merge.
source: DataLemur — Amazon 'Average Review Ratings' ↗ -
What is the logical order of evaluation of a SELECT's clauses, and why does it explain why you can't use a SELECT alias in WHERE?
Although you *write*
SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY, the engine *logically* evaluates them as:FROM/joins ->WHERE->GROUP BY->HAVING->SELECT(where aliases are assigned) ->DISTINCT->ORDER BY->LIMIT.Because
SELECTruns afterWHERE, an alias defined inSELECTdoesn't yet exist whenWHEREis evaluated — soWHERE total > 100referencing aSELECT … AS totalerrors.ORDER BYruns last, which is why it *can* see SELECT aliases. (MySQL leniently allows aliases in some clauses as an extension, but the standard doesn't.)What a strong answer coversLogical order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.
Aliases are created in the
SELECTstep, soWHERE/GROUP BY/HAVINGgenerally can't see them.ORDER BYis the one clause that *can* reference SELECT aliases — it runs after SELECT.Workaround: repeat the expression, or wrap the query in a subquery/CTE and filter on the alias outside.
Quick self-checkGiven `SELECT price * qty AS total FROM line_items WHERE total > 100;`, what happens?
-
No — WHERE runs before SELECT assigns the `total` alias, so the alias isn't visible there.
-
Correct — the alias is created in the SELECT step, which is evaluated after WHERE.
-
No — the database raises an error rather than silently ignoring an unknown column.
-
No — Postgres errors; MySQL has extensions that allow aliases in some clauses, so behavior differs.
Follow-ups they push on- Why can ORDER BY use a SELECT alias but WHERE cannot?
- Does MySQL deviate from this? (it allows aliases in GROUP BY/HAVING as an extension)
Red flag Assuming the written clause order is the execution order, then being confused why a SELECT alias is 'not recognized' in WHERE or GROUP BY.
source: PostgreSQL docs — SELECT (clause processing order) ↗
3.2 JOINs 16
-
Walk through INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOIN — what rows does each keep?
All join two tables on a predicate; they differ in which unmatched rows survive:
- INNER — only rows that match on both sides.
- LEFT (OUTER) — all left rows; right columns areNULLwhere no match.
- RIGHT (OUTER) — all right rows; left columnsNULLwhere no match (a LEFT JOIN with the tables swapped).
- FULL OUTER — all rows from both sides;NULLon whichever side is missing.
- CROSS — every left row paired with every right row (Cartesian product, noON).Mental model: INNER is the intersection, LEFT/RIGHT keep one side whole, FULL keeps the union, CROSS multiplies.
What a strong answer coversINNER = matches only; the unmatched rows on both sides are dropped.
LEFT keeps all left rows; RIGHT keeps all right rows (mirror images).
FULL OUTER keeps unmatched rows from both sides, padding the missing side with NULL.
CROSS produces n*m rows with no join condition.
RIGHT JOIN is rarely written by hand — people flip the tables and use LEFT for readability.
Quick self-check`customers` has 10 rows; `orders` has 4 rows, all belonging to just 2 of those customers (one order each... wait, 4 orders across 2 customers). How many rows does `customers LEFT JOIN orders ON customers.id = orders.cust_id` return?
-
No — the 2 customers with orders contribute one row per order, so they expand beyond a single row each.
-
Correct — 8 customers with no order give 8 rows (NULL order), plus 4 order rows for the matched customers = 12.
-
That's what an INNER JOIN returns; LEFT JOIN also keeps the 8 customers with no orders.
-
No — only the 8 unmatched customers add NULL rows; total is 8 + 4 = 12, not 14.
Follow-ups they push on- Which join would you use to find rows present in A but missing in B?
- How do you reproduce a FULL OUTER JOIN in MySQL, which lacks it?
Red flag Describing LEFT/RIGHT as 'returns more rows' rather than 'preserves the unmatched rows of one side', or thinking CROSS JOIN needs an ON clause.
source: PostgreSQL docs — Joined Tables (join types) ↗ -
What's the difference between an INNER JOIN and a LEFT JOIN, and what's the classic LEFT JOIN bug?
INNER JOINkeeps only rows that match in both tables;LEFT JOINkeeps all left rows, fillingNULLwhere the right side has no match.The bug: a
WHEREpredicate on a *right-table* column silently turns a LEFT JOIN into an INNER JOIN, becauseNULLfails the filter and those unmatched rows vanish. Fix by moving the condition into theONclause:LEFT JOIN orders o ON o.cust_id = c.id AND o.status = 'paid'keeps customers with no paid order.Follow-ups they push on- Where do you put a filter on the *left* table — does it matter?
- Emulate a FULL OUTER JOIN in MySQL, which lacks it.
Red flag Saying LEFT JOIN 'returns more rows' instead of 'preserves unmatched left rows', and not catching the WHERE-vs-ON filter trap.
source: DataLemur — SQL Interview Questions ↗ -
What does a CROSS JOIN do, and name a legitimate use for it.
A
CROSS JOINproduces the Cartesian product — every row of A paired with every row of B (n*m rows, no ON clause). 10k x 10k = 100M rows, so it's usually a bug from a missing join condition.Legit uses: generating a complete grid (every store x every day to fill gaps for a report), pairing each row against a small constants/calendar table, or building combinations. Often written deliberately as
CROSS JOIN generate_series(...).Follow-ups they push on- How does an unintended CROSS JOIN usually sneak in?
- Difference between CROSS JOIN and an INNER JOIN with `ON 1=1`?
Red flag Not recognizing that a comma-join with no WHERE join condition is effectively a CROSS JOIN that explodes row counts.
source: PostgreSQL docs — Joined Tables (CROSS JOIN) ↗ -
A LEFT JOIN with `WHERE right_table.col = 'x'` returns fewer rows than expected. What happened, and what's the fix?
The
WHEREon a right-table column silently demotes the LEFT JOIN to an INNER JOIN. Unmatched left rows haveNULLinright_table.col, andNULL = 'x'isUNKNOWN, soWHEREdiscards exactly the rows the LEFT JOIN was meant to preserve.Fix: move the predicate into the
ONclause —LEFT JOIN r ON r.fk = l.id AND r.col = 'x'— so it filters which right rows *match* without dropping unmatched left rows. The rule: conditions that should *preserve* the outer side go inON; conditions that should *filter the final result* go inWHERE. (AWHERE right.col IS NULLis the deliberate exception — that's the anti-join idiom.)What a strong answer coversA WHERE predicate on the null-able (right) side turns LEFT JOIN into INNER JOIN.
Cause:
NULL = 'x'evaluates to UNKNOWN, so the padded unmatched rows are filtered out.Fix: put the right-side condition in
ON, notWHERE.ONcontrols matching (preserves the outer side);WHEREfilters the joined result.Exception:
WHERE right.col IS NULLis intentional — it's the anti-join pattern.
Quick self-checkYou want every customer plus their 2024 orders (customers with no 2024 order should still appear). Which is correct?
-
Wrong — the WHERE on o.year drops customers with no 2024 order (their o.year is NULL), making it effectively an INNER JOIN.
-
Correct — moving the year condition into ON keeps all customers, attaching only 2024 orders (NULL where none).
-
Wrong — INNER JOIN drops customers who have no 2024 order entirely.
-
Close but flawed — it keeps customers with no orders at all, yet still drops customers whose only orders are non-2024.
Follow-ups they push on- Why is a predicate on the LEFT (preserved) table the same in ON or WHERE here?
- How does this differ for an INNER JOIN, where ON vs WHERE are interchangeable?
Red flag Putting a right-table filter in WHERE and not realizing you've turned an outer join into an inner join, losing the unmatched rows you wanted.
source: PostgreSQL docs — Joined Tables (ON vs WHERE for outer joins) ↗ -
Find customers who have never placed an order — and explain three ways to write it.
This is the canonical anti-join. Three idioms:
1. LEFT JOIN / IS NULL:
SELECT c.name FROM customers c LEFT JOIN orders o ON o.cust_id = c.id WHERE o.id IS NULL;
2. NOT EXISTS (usually the planner's favourite, and NULL-safe):SELECT name FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
3. NOT IN — works *only* if the subquery column can't be NULL:WHERE c.id NOT IN (SELECT cust_id FROM orders WHERE cust_id IS NOT NULL);Prefer
NOT EXISTSfor safety and performance; reach forLEFT JOIN … IS NULLwhen you also need columns from the joined table.What a strong answer coversAnti-join = 'rows in A with no match in B'.
LEFT JOIN +
WHERE matched_col IS NULLkeeps only the unmatched left rows.NOT EXISTSis NULL-safe and typically optimizes to an efficient anti-join.NOT INbreaks (returns nothing) if the subquery yields a single NULL — guard withWHERE col IS NOT NULL.
Follow-ups they push on- Why is NOT EXISTS safer than NOT IN here?
- Which form lets you also return data from the orders table?
Red flag Using `NOT IN (SELECT cust_id FROM orders)` when `cust_id` can be NULL — one NULL makes the predicate UNKNOWN for every row and the query returns nothing.
source: LeetCode 183 — Customers Who Never Order ↗ -
What's the difference between joining in the ON clause versus filtering in WHERE for an INNER JOIN — does it matter?
For an INNER JOIN, a predicate produces the same result whether you put it in
ONorWHERE— both filter the matched set, and the optimizer treats them equivalently.For OUTER joins it matters enormously: an
ONcondition decides which rows *match* (unmatched outer rows are still kept and padded with NULL), while aWHEREcondition filters the *final* result *after* the NULLs are added — which can erase the preserved rows. So the safe habit is: join keys and match conditions inON; result-set filters inWHERE; and remember the distinction only collapses for INNER joins.What a strong answer coversINNER JOIN: ON vs WHERE give identical results — equivalent to the optimizer.
OUTER JOIN: ON affects *matching* (preserves unmatched rows); WHERE filters *after* padding.
Best practice: put the relationship/keys in ON, post-join filters in WHERE.
The 'it doesn't matter' rule applies *only* to inner joins.
Quick self-checkFor `a INNER JOIN b ON a.id = b.aid AND b.active = true` vs `a INNER JOIN b ON a.id = b.aid WHERE b.active = true`, the results are…
-
Correct — for an INNER JOIN the extra predicate filters the matched set the same way in either clause.
-
No — that's the OUTER-join behavior; for INNER joins both forms are equivalent.
-
No — both filter to the same matched rows for an inner join.
-
No — the optimizer treats them equivalently; there's no inherent performance difference for inner joins.
Follow-ups they push on- Show a case where moving a predicate from WHERE to ON changes a LEFT JOIN's output.
- Does the optimizer reorder ON vs WHERE predicates for an inner join?
Red flag Over-generalizing 'ON and WHERE are the same' from inner joins to outer joins, where they produce different result sets.
source: Use The Index, Luke! — Join Operations ↗ -
Using a USING clause or NATURAL JOIN instead of ON — what are they and why are they risky?
JOIN … USING (col)joins on equally-named columns and merges them into one output column (so you writecol, nota.col).NATURAL JOINgoes further and joins on all identically-named columns automatically, with noON/USINGat all.USINGis fine and concise.NATURAL JOINis dangerous: adding an unrelated same-named column later (acreated_atoridon both tables) silently changes the join key and corrupts results with no error. Most style guides banNATURAL JOINand prefer an explicitON(orUSING) so the join condition is visible and stable against schema changes.What a strong answer coversUSING (col)joins on a shared column name and collapses it to a single output column.NATURAL JOINauto-joins on *every* commonly-named column — implicit and fragile.A later schema change (new same-named column) silently alters a NATURAL JOIN's keys.
Prefer explicit
ON;USINGis acceptable,NATURAL JOINis widely discouraged.
Quick self-checkWhy do most style guides discourage NATURAL JOIN?
-
No — performance is the same; the objection is correctness/maintainability, not speed.
-
Correct — the join condition is implicit, so schema changes can alter results with no error.
-
No — it uses indexes like any equi-join; the issue is the hidden, mutable join condition.
-
No — NATURAL JOIN uses *all* commonly-named columns, which is precisely the problem.
Follow-ups they push on- How does USING change which columns appear in `SELECT *`?
- Why can adding a column break an existing NATURAL JOIN with no error?
Red flag Relying on NATURAL JOIN, then having a future migration add a same-named column that silently joins on it and quietly changes the result set.
source: PostgreSQL docs — Joined Tables (USING and NATURAL) ↗ -
Identify the top two highest-grossing products within each category in 2022, returning category, product, and total spend.
Aggregate spend per (category, product), rank within each category, keep rank <= 2:
WITH g AS (SELECT category, product, SUM(spend) AS total FROM product_spend WHERE EXTRACT(YEAR FROM tx_date) = 2022 GROUP BY category, product), r AS (SELECT *, RANK() OVER (PARTITION BY category ORDER BY total DESC) AS rk FROM g) SELECT category, product, total FROM r WHERE rk <= 2;This is the 'top-N-per-group' pattern: GROUP BY for the metric, a window RANK to rank within partitions.
Follow-ups they push on- RANK vs DENSE_RANK vs ROW_NUMBER for breaking ties on equal spend?
- Why can't you filter on the window function in the same SELECT's WHERE?
Red flag Using a global ORDER BY + LIMIT 2 (gives the top 2 overall, not per category), or referencing the window alias in WHERE instead of wrapping it in a CTE/subquery.
source: DataLemur — Amazon 'Highest-Grossing Items' ↗ -
Write a self-join to list each employee alongside their manager's name from an employees(id, name, manager_id) table.
Join the table to itself with two aliases:
SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;Use
LEFT JOIN(not INNER) so the CEO, whosemanager_idis NULL, still appears with a NULL manager. Aliases (e,m) are mandatory to disambiguate the two copies.Follow-ups they push on- How would you go more than one level up (whole chain to the CEO)?
- Recursive CTE for an arbitrary-depth org chart?
Red flag Using INNER JOIN and silently dropping the top-level employee, or forgetting aliases so the columns are ambiguous.
source: PG Exercises — JOINs ↗ -
MySQL has no FULL OUTER JOIN. How do you emulate one?
Take the union of a LEFT JOIN and a RIGHT JOIN:
SELECT * FROM a LEFT JOIN b ON a.id = b.id UNION SELECT * FROM a RIGHT JOIN b ON a.id = b.id;The
LEFThalf gives all ofaplus matches; theRIGHThalf gives all ofbplus matches;UNION(not UNION ALL) dedups the rows that matched on both sides.Follow-ups they push on- Why UNION and not UNION ALL here?
- How to find rows present in exactly one side (anti-join / symmetric difference)?
Red flag Using UNION ALL and double-counting matched rows, or assuming MySQL silently supports FULL OUTER JOIN.
source: PostgreSQL docs — Joins (table expressions) ↗ -
Find products that exist in Amazon's catalog but NOT in the partner's catalog (an anti-join).
Three idiomatic ways; the
LEFT JOIN … IS NULLanti-join is the workhorse:SELECT a.product FROM amazon a LEFT JOIN partner p ON a.product = p.product WHERE p.product IS NULL;Alternatives:
NOT EXISTS (SELECT 1 FROM partner p WHERE p.product = a.product)(NULL-safe, often the planner's favourite) orEXCEPT. PreferNOT EXISTSoverNOT INwhen the right column can be NULL.Follow-ups they push on- Why is NOT IN dangerous when the subquery may return a NULL?
- Performance: NOT EXISTS vs LEFT JOIN/IS NULL vs EXCEPT?
Red flag Using `NOT IN` with a nullable column (a single NULL makes the whole predicate return no rows), or forgetting the `IS NULL` filter in the LEFT-JOIN form.
source: StrataScratch — Amazon 'Exclusive Amazon Products' ↗ -
Why can a JOIN return more rows than either input table, and how do you avoid accidental row explosion?
A join multiplies rows wherever the join key is not unique on the other side: if one customer has 3 orders, joining customers->orders yields 3 rows for that customer. A many-to-many join multiplies both sides — fan-out.
This silently corrupts aggregates:
SUM(amount)double-counts if you joined in a second one-to-many table first. Guard against it by joining on unique/PK columns, pre-aggregating one side in a CTE before joining, or checking the grain of every join.Follow-ups they push on- What is the 'grain' of a result set and why track it?
- A CROSS JOIN of 10k x 10k rows — how many rows, and when is that intentional?
Red flag Blaming 'duplicate data' when the real cause is joining on a non-unique key, or summing a measure after a fan-out join and reporting inflated totals.
source: StrataScratch — Amazon SQL Interview Questions ↗ -
An index exists on the join column of one table but the JOIN is still slow. What index considerations apply to joins?
For a nested-loop join the engine iterates the outer table and probes the inner table once per row, so the index that matters is on the inner table's join column — the side being looked up. If only the outer table's column is indexed, each probe still scans the inner table.
Checklist: (1) index the inner/probed side's join key; (2) make the join columns the same type — an implicit cast (e.g.
intvsvarchar) makes the predicate non-sargable and skips the index; (3) for big unindexed equi-joins a hash join may be the right plan, not a fix; (4) readEXPLAINto see whether it chose nested-loop vs hash and whether the index is actually used.What a strong answer coversNested-loop joins need the index on the inner (probed) table's join column.
Mismatched column types force an implicit cast -> non-sargable -> index ignored.
Both join columns should share a type and, ideally, collation.
A hash join on large unindexed inputs can be the correct plan, not a bug.
Use EXPLAIN to confirm which join algorithm and index the planner actually picked.
Follow-ups they push on- Which table's column should carry the index in a nested-loop join?
- How does an int-vs-varchar join key defeat an index?
Red flag Indexing only the driving (outer) table and expecting fast probes, or joining columns of different types and silently losing the index to an implicit cast.
source: Use The Index, Luke! — Nested Loops / indexing joins ↗ -
Why does summing a measure go wrong after joining two one-to-many tables, and how do you fix the double-counting?
Joining a parent to two child tables (orders has many line_items *and* many payments) creates a Cartesian fan-out: each order's rows = items x payments. Now
SUM(payment.amount)is multiplied by the number of line items, andSUM(item.qty)is multiplied by the number of payments — every total is inflated.Fix by pre-aggregating each child to the parent's grain in its own subquery/CTE before joining:
WITH it AS (SELECT order_id, SUM(qty) q FROM items GROUP BY order_id), pm AS (SELECT order_id, SUM(amount) a FROM payments GROUP BY order_id) SELECT … FROM orders o LEFT JOIN it … LEFT JOIN pm …. Each child is now one row per order, so no fan-out. Always know the grain of each table you join.What a strong answer coversJoining one parent to two one-to-many children multiplies rows (items x payments).
Aggregates over the fanned-out rows double/triple-count.
Fix: pre-aggregate each child to the parent grain in separate CTEs/subqueries, *then* join.
COUNT(DISTINCT …)can patch a single measure but doesn't fix multiple measures cleanly.Track the 'grain' (one row per what?) at every join step.
Follow-ups they push on- Why doesn't COUNT(DISTINCT) fully solve it when you need two sums?
- What is the 'grain' of a result set and how do you reason about it?
Red flag Joining several one-to-many tables in one flat query and trusting SUM — the totals are inflated by the cross-product of the child rows.
source: StrataScratch — SQL JOIN Interview Questions ↗ -
Explain the three physical join algorithms (nested loop, hash join, merge join) and when a planner picks each.
Nested loop: for each outer row, probe the inner table — great when one side is tiny or there's an index on the inner join key; O(n*m) without an index.
Hash join: build a hash table on the smaller input's key, probe with the larger — best for large, unindexed equality joins; needs memory and only does equi-joins.
Merge join: sort both inputs on the key, then walk them in lockstep — wins when inputs are already sorted (e.g. from an index) or for range conditions.
The planner chooses by estimated row counts and available indexes; you see them in
EXPLAIN.Follow-ups they push on- Why can't a hash join serve `a.x < b.y`?
- How does a missing index push a join from nested-loop to a costly hash join?
Red flag Thinking the JOIN keyword maps to one fixed algorithm — the optimizer picks the physical operator based on stats and indexes.
source: PostgreSQL docs — Planner / Optimizer (join methods) ↗ -
For each Friday, count the total likes a post received from the poster's friends, where the like happened after the post was created.
Friendship is usually stored one-directional, so first symmetrize it with
UNION ALLof (a,b) and (b,a). Then join posts to that friend list and to likes, requiring the liker to be a friend andlike_ts > post_ts, and filter to Fridays:... WHERE EXTRACT(DOW FROM like_date) = 5 AND like_ts > post_tsthenGROUP BY like_date. UseCOUNT(DISTINCT …)if a friend could like the same post twice.This is a Meta-style 'SQL as a tool for product reasoning' question: the schema modelling (bidirectional friendship, temporal ordering) is the real test.
Follow-ups they push on- Why UNION ALL rather than UNION when symmetrizing friendships?
- How does the day-of-week number differ across MySQL/Postgres?
Red flag Treating friendship as already bidirectional and undercounting, or forgetting the `like_ts > post_ts` temporal guard.
source: StrataScratch — Meta "Friday's Likes Count" ↗
3.3 Advanced querying 14
-
Why can't you put a window function in a WHERE clause, and how do you filter on its result?
Window functions are computed in the SELECT step, which logically runs *after*
WHERE,GROUP BY, andHAVING. SoWHERE rn = 1referencingROW_NUMBER() … AS rnerrors — the window result doesn't exist yet whenWHEREis evaluated.The fix is to compute the window function in an inner query (a subquery or CTE) and filter on its alias in the outer query:
WITH r AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn FROM emp) SELECT * FROM r WHERE rn = 1;. This 'rank-then-filter' wrapper is the single most common window-function pattern in interviews.What a strong answer coversWindow functions evaluate in SELECT, after WHERE/GROUP BY/HAVING.
Referencing a window alias in the same query's WHERE/HAVING is an error.
Wrap it in a CTE/subquery and filter on the alias in the outer query.
This 'rank in inner, filter in outer' is the top-N-per-group backbone.
Quick self-checkYou want the single highest-paid employee per department. Which is valid?
-
Invalid — `r` is a window alias evaluated in SELECT, not visible to WHERE.
-
Correct — the window function is computed in the inner query, and the outer query filters on its alias.
-
Invalid — HAVING also runs before SELECT, so the window function can't be filtered there either.
-
This computes a boolean column but doesn't filter rows — you'd still get every employee.
Follow-ups they push on- Could you ever use a window function in HAVING? (no — same reason)
- How does this relate to the top-N-per-group pattern?
Red flag Writing `WHERE ROW_NUMBER() OVER (...) = 1` directly and being surprised by a syntax/semantic error instead of wrapping it in a subquery.
source: PostgreSQL docs — Window Function Processing ↗ -
Find the second-highest distinct salary in an Employee table; return NULL if there isn't one.
Order distinct salaries and skip the top one:
SELECT (SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1) AS second_highest;Wrapping it in an outer
SELECTmakes the resultNULL(not an empty set) when there's no second salary. Alternative:DENSE_RANK() OVER (ORDER BY salary DESC)and keep rank = 2.DISTINCT/DENSE_RANKmatters so duplicate top salaries don't count as two ranks.Follow-ups they push on- Generalize to the Nth-highest salary.
- Why DENSE_RANK rather than RANK or ROW_NUMBER here?
Red flag Using `MAX(salary) WHERE salary < MAX(salary)` incorrectly, or forgetting DISTINCT so two employees tied at the top hide the real second salary; also returning an empty set instead of NULL.
source: LeetCode 176 — Second Highest Salary ↗ -
What's the difference between EXISTS and IN with a subquery, and when does each win?
IN (subquery)materializes the subquery's values and checks membership;EXISTS (subquery)is a correlated semi-join that returns true as soon as one matching row is found (short-circuits).Semantically the big difference is NULL handling:
NOT INreturns no rows if the subquery yields a NULL, whereasNOT EXISTSis NULL-safe — so preferNOT EXISTSfor anti-joins. Performance-wise, modern optimizers often rewrite both into the same semi-/anti-join, butEXISTStends to win when the subquery is large (it can stop early) andINreads fine for small, NULL-free value lists. UseEXISTSwhen you only test *existence*; useINfor a short, known set.What a strong answer coversINtests membership in a value set;EXISTStests whether any correlated row exists (short-circuits).NOT IN+ a NULL in the subquery returns zero rows;NOT EXISTSis NULL-safe.Optimizers frequently rewrite both to semi-joins, so results — not raw form — usually drive the plan.
Rule of thumb: EXISTS for existence tests / large subqueries; IN for small NULL-free lists.
Quick self-check`SELECT * FROM a WHERE a.x NOT IN (SELECT b.y FROM b)` where `b.y` contains one NULL. Result?
-
No — the NULL poisons the comparison; you don't get the 'sensible' answer.
-
Correct — `x NOT IN (…, NULL)` evaluates to UNKNOWN for every row (never TRUE), so nothing is returned.
-
No — the NULL makes the predicate UNKNOWN, not TRUE, so rows are filtered out, not all kept.
-
No — that's the point: NOT EXISTS would be NULL-safe and return the sensible rows, unlike NOT IN here.
Follow-ups they push on- Show the NULL case where NOT IN and NOT EXISTS diverge.
- Why can EXISTS stop scanning after the first match?
Red flag Treating IN and EXISTS as always identical and getting burned by `NOT IN` with a nullable subquery column returning no rows.
source: PostgreSQL docs — Subquery Expressions (EXISTS / IN) ↗ -
Pivot a tall table (one row per month) into a wide one (a column per month) in SQL.
The portable, engine-agnostic way is conditional aggregation —
SUM(CASE WHEN …)per target column:SELECT product, SUM(CASE WHEN month = 'Jan' THEN revenue END) AS jan, SUM(CASE WHEN month = 'Feb' THEN revenue END) AS feb FROM sales GROUP BY product;Each
CASEisolates one month's value; theGROUP BYcollapses to one row per product. You must enumerate the target columns explicitly — SQL's result shape is fixed at plan time, so a truly dynamic pivot needs generated SQL or an engine extension (Postgrescrosstab, SQL ServerPIVOT).What a strong answer coversConditional aggregation: one
SUM(CASE WHEN key = 'X' THEN val END)per output column.GROUP BYthe row dimension; each CASE picks out one pivot value.Output columns must be hard-coded — SQL can't return a runtime-variable number of columns.
Dynamic pivots need generated SQL or extensions (Postgres
crosstab, T-SQLPIVOT).
Follow-ups they push on- How would you handle a column set that isn't known until query time?
- How do you un-pivot (wide back to tall)?
Red flag Expecting a single SQL statement to produce a dynamic, data-dependent number of columns — the column list is fixed at plan time.
source: PostgreSQL docs — tablefunc (crosstab / pivot) ↗ -
Compare INTERSECT, EXCEPT (MINUS), and UNION — and how do they handle duplicates?
All three are set operators combining two result sets with matching column counts/types, and all remove duplicates by default (each has an
ALLvariant to keep them):-
UNION— rows in either set.
-INTERSECT— rows in both sets.
-EXCEPT(Oracle calls itMINUS) — rows in the first set not in the second.They compare whole rows and treat
NULLs as equal to each other for this purpose (unlike=).EXCEPTis a clean way to express an anti-join, andINTERSECTa semi-join, when you're comparing identically-shaped queries.What a strong answer coversUNION = either, INTERSECT = both, EXCEPT/MINUS = first-minus-second.
All dedup by default;
UNION ALL/INTERSECT ALL/EXCEPT ALLkeep duplicates.They match on the entire row and treat NULL = NULL (unlike
=).EXCEPT is a tidy anti-join; INTERSECT a tidy semi-join for same-shaped queries.
Oracle uses
MINUS; most others useEXCEPT.
Quick self-check`SELECT id FROM a EXCEPT SELECT id FROM b` returns…
-
No — that's INTERSECT; EXCEPT subtracts the second set.
-
Correct — EXCEPT returns first-set rows absent from the second and dedups by default.
-
No — that's UNION.
-
No — plain EXCEPT dedups; you'd need EXCEPT ALL to keep duplicates.
Follow-ups they push on- How do set operators treat NULLs differently from a `=` comparison?
- Rewrite an EXCEPT query as a NOT EXISTS anti-join.
Red flag Forgetting these dedup by default (surprising row counts), or assuming `EXCEPT` exists in Oracle, where it's `MINUS`.
source: PostgreSQL docs — Combining Queries (UNION/INTERSECT/EXCEPT) ↗ -
Use NTILE / percentile window functions to bucket users into quartiles by spend.
NTILE(n)splits ordered rows intonroughly-equal buckets and labels each row 1..n:SELECT user_id, spend, NTILE(4) OVER (ORDER BY spend DESC) AS quartile FROM users;Quartile 1 is the top quarter of spenders. NTILE distributes any remainder to the earliest buckets, so groups can differ by one row. If you instead want a *value* threshold (the spend at the 90th percentile), use
PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY spend)(an ordered-set aggregate), not NTILE — NTILE buckets *rows*, percentiles compute a *value*.What a strong answer coversNTILE(n) OVER (ORDER BY …)assigns each row a bucket number 1..n of near-equal size.Uneven counts: the first buckets get the extra rows.
NTILE labels *rows by rank position*; it does not compute a threshold value.
For a percentile *value*, use
PERCENTILE_CONT/PERCENTILE_DISC … WITHIN GROUP.
Follow-ups they push on- Difference between NTILE(4) and PERCENTILE_CONT(0.25)?
- How does NTILE distribute rows when the count isn't divisible by n?
Red flag Using NTILE to get a percentile *threshold value* (it returns bucket labels, not the value at a percentile) or assuming all NTILE buckets have exactly equal size.
source: PostgreSQL docs — Window Functions (NTILE) & Aggregate (percentile) ↗ -
What's the difference between a correlated and a non-correlated subquery, and why does it matter for performance?
A non-correlated subquery is self-contained — it runs once and its result is reused (e.g.
WHERE salary > (SELECT AVG(salary) FROM emp)).A correlated subquery references a column from the outer query, so conceptually it re-runs once per outer row (e.g.
WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept = e1.dept)). That can be O(n) executions and slow, though modern planners often rewrite simple cases into joins.Follow-ups they push on- Rewrite a correlated subquery as a JOIN or window function.
- When is EXISTS preferable to IN with a subquery?
Red flag Calling every subquery 'correlated', or assuming a correlated subquery always re-executes literally (optimizers may decorrelate it).
source: LeetCode 185 — Department Top Three Salaries (correlated subquery) ↗ -
When would you use a CTE (WITH clause) over a subquery or a temp table?
A CTE names an intermediate result so you can reference it (sometimes multiple times) and read the query top-to-bottom — mainly a readability win, and the only way to write a recursive query (
WITH RECURSIVE).Vs a subquery: same logic, clearer structure. Vs a temp table: a CTE is scoped to the single statement and (usually) not materialized to disk. Note: in some engines a CTE is an optimization fence (older Postgres materialized them); Postgres 12+ inlines non-recursive CTEs unless you say
MATERIALIZED.Follow-ups they push on- Write a recursive CTE to walk an org hierarchy.
- When does a CTE act as an optimization barrier?
Red flag Claiming CTEs are always faster — pre-12 Postgres materialized them, which could be slower than an inlined subquery.
source: PostgreSQL docs — WITH Queries (Common Table Expressions) ↗ -
How does a window function differ from GROUP BY?
GROUP BYcollapses each group into one row — you lose the individual rows. A window function (… OVER (PARTITION BY …)) computes an aggregate/rank across a window of rows but keeps every row, attaching the result alongside.So to show each employee *and* their department's average salary in the same row, you need
AVG(salary) OVER (PARTITION BY dept), not GROUP BY. Window functions also give youROW_NUMBER/RANK/LAG/LEADand running totals, which GROUP BY can't express.Follow-ups they push on- Give a running total with `SUM(x) OVER (ORDER BY d)`.
- Difference between PARTITION BY and a plain GROUP BY?
Red flag Saying they're interchangeable — GROUP BY reduces row count, a window function preserves it.
source: PostgreSQL docs — Window Functions ↗ -
What's the difference between ROW_NUMBER, RANK, and DENSE_RANK on tied values?
On a tie of two rows ranked 1st:
-
ROW_NUMBER— always unique, arbitrary among ties: 1, 2, 3, 4 …
-RANK— ties share a rank, then it skips: 1, 1, 3, 4 …
-DENSE_RANK— ties share a rank, no gap: 1, 1, 2, 3 …Pick ROW_NUMBER for 'one row per group / dedup', RANK/DENSE_RANK for leaderboards. 'Top 3 salaries including ties' usually wants DENSE_RANK <= 3.
Follow-ups they push on- Which one for 'top N salaries, ties count as one place'?
- How to make ROW_NUMBER deterministic when the ORDER BY has ties?
Red flag Using ROW_NUMBER for a 'top N including ties' question and arbitrarily dropping tied rows, or confusing RANK's gaps with DENSE_RANK's continuity.
source: StrataScratch — Amazon 'Top-Rated Support Employees' (DENSE_RANK) ↗ -
Write a running (cumulative) total of daily sales ordered by date.
A windowed
SUMwith anORDER BYgives a running total:SELECT sale_date, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM sales;Adding
ORDER BYinsideOVERswitches the default frame from 'whole partition' to 'start … current row', which is exactly a cumulative sum. AddPARTITION BY regionto get one running total per region.Follow-ups they push on- Default window frame with vs without ORDER BY?
- RANGE vs ROWS for the frame — when do they differ?
Red flag Omitting ORDER BY in OVER (you get the grand total on every row, not a running one), or being surprised by RANGE's behavior on duplicate dates.
source: PostgreSQL docs — Window Function Calls (frames) ↗ -
Compute the month-over-month percentage change in revenue using a window function.
Aggregate to monthly revenue, then use
LAGto reach the previous month:WITH m AS (SELECT DATE_TRUNC('month', tx) AS mth, SUM(amount) AS rev FROM orders GROUP BY 1) SELECT mth, ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY mth)) / LAG(rev) OVER (ORDER BY mth), 2) AS pct_change FROM m ORDER BY mth;LAG(rev) OVER (ORDER BY mth)pulls the prior row's value; the first month is NULL (no prior). Multiply by100.0to force float division.Follow-ups they push on- Use LEAD instead — what changes?
- Why might integer division give you 0% everywhere?
Red flag Integer division truncating the ratio to 0, or self-joining the table to itself on month-1 instead of the cleaner LAG.
source: StrataScratch — Amazon 'Monthly Percentage Difference' ↗ -
Find users with three or more consecutive days of activity (a gap-and-islands problem).
Classic 'gaps and islands': subtract a
ROW_NUMBERfrom the date to give every consecutive run the same anchor:WITH d AS (SELECT user_id, day, day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day))::int AS grp FROM activity) SELECT user_id, COUNT(*) AS streak FROM d GROUP BY user_id, grp HAVING COUNT(*) >= 3;Within one user, on consecutive days both
dayand the row number increase by 1, soday - row_numberis constant across a streak and changes at each gap — grouping on it isolates each island.Follow-ups they push on- Adapt it to find the *longest* streak per user.
- How would LAG-based gap detection compare?
Red flag Trying to detect consecutiveness with a single self-join on day+1 (breaks for runs longer than 2), or forgetting to PARTITION BY user.
source: StrataScratch — Meta 'User Streaks' (gap-and-island, LAG/DENSE_RANK) ↗ -
What is a recursive CTE, and how does it walk an org hierarchy down to arbitrary depth?
A recursive CTE has two parts joined by
UNION [ALL]: an anchor (the starting rows) and a recursive member that references the CTE itself, iterating until it adds no new rows.WITH RECURSIVE chain AS (SELECT id, name, manager_id, 1 AS lvl FROM emp WHERE id = :root UNION ALL SELECT e.id, e.name, e.manager_id, c.lvl + 1 FROM emp e JOIN chain c ON e.manager_id = c.id) SELECT * FROM chain;The anchor seeds the root; each pass joins employees onto the rows found so far, descending one level. It's the standard way to traverse trees/graphs (org charts, category trees, bill-of-materials) that a fixed number of self-joins can't handle.
What a strong answer coversStructure: anchor member
UNION ALLrecursive member that references the CTE.The recursive step runs repeatedly, feeding its output back in, until it produces no new rows.
Used for arbitrary-depth trees/graphs: org charts, category trees, BOM, threaded comments.
Guard against cycles (a depth cap or a visited-set) or the recursion never terminates.
Follow-ups they push on- How do you prevent infinite recursion if the hierarchy has a cycle?
- UNION vs UNION ALL in a recursive CTE — what changes?
Red flag Forgetting the termination behavior on cyclic data (infinite loop) or trying to express arbitrary depth with a fixed chain of self-joins.
source: PostgreSQL docs — WITH Queries (Recursive Queries) ↗
3.4 Indexes & query performance 15
-
What is the difference between a clustered and a non-clustered (secondary) index, and how does it affect lookups?
A clustered index *is* the table: the rows are physically stored in the index's key order (the leaf nodes hold the full row). There's at most one per table — in MySQL/InnoDB it's the primary key. A range scan on the clustered key reads contiguous data, which is very fast.
A non-clustered (secondary) index is a separate structure whose leaves hold the key plus a pointer back to the row. So a secondary-index lookup that needs columns not in the index does a second hop — a bookmark lookup (InnoDB: look up the PK, then the clustered index). That's why a *covering* secondary index (one that includes all needed columns) is so much faster: it skips the second hop.
What a strong answer coversClustered index = the table's rows stored in key order; at most one per table (InnoDB: the PK).
Secondary index = separate B-tree of key -> row locator; many allowed.
A secondary lookup needing extra columns does a second fetch (bookmark / clustered-index lookup).
InnoDB secondary indexes store the PK as the row pointer — so a fat PK bloats every secondary index.
Postgres heap tables differ: there's no clustered index, just heap + index TIDs.
Quick self-checkIn MySQL/InnoDB, what does a secondary index's leaf node store as the pointer to the full row?
-
No — InnoDB uses the primary key as the row locator, not a physical address (which would break on row moves).
-
Correct — InnoDB secondary indexes store the PK, so a non-covering lookup then probes the clustered (PK) index.
-
No — that's the clustered index's leaf; secondary leaves store key + PK pointer.
-
No — the secondary index does store a locator (the PK); it doesn't fall back to a scan.
Follow-ups they push on- In InnoDB, why does a large primary key make every secondary index bigger?
- How does Postgres's storage model differ from InnoDB's clustered table?
Red flag Assuming every index is a separate copy with a fixed pointer (engine-specific), or ignoring that a secondary lookup may need a costly second fetch to the base row.
source: MySQL docs — Clustered and Secondary Indexes ↗ -
What is connection pooling and what problem does it solve?
Opening a DB connection is expensive — TCP handshake, TLS, auth, a backend process/thread. Doing it per request adds latency and can exhaust the DB's connection limit under load.
A connection pool keeps a set of pre-opened connections and hands one to each request, returning it to the pool when done. This caps total connections (protecting the DB), amortizes setup cost, and smooths spikes. Tools: PgBouncer (external), HikariCP (Java), plus most ORMs/drivers' built-in pools.
Follow-ups they push on- How do you size a pool, and why isn't bigger always better?
- Transaction vs session pooling mode in PgBouncer?
Red flag Thinking a bigger pool always means more throughput — past the DB's CPU/IO capacity it causes contention; serverless functions multiplying pools is a classic source of connection exhaustion.
source: PostgreSQL docs — Connections and Authentication ↗ -
What is a B-tree index and why does it support range queries when a hash index doesn't?
A B-tree keeps keys in sorted order across a balanced, shallow tree, giving O(log n) lookups. Because the keys are ordered, the engine can do equality and range scans (
>,<,BETWEEN), prefixLIKE 'abc%', and serveORDER BYfor free.A hash index maps a key to a bucket via a hash — O(1) equality lookups, but the hash destroys ordering, so it can only do
=, never ranges or sorts. B-tree is the default for exactly this versatility.Follow-ups they push on- Why can a B-tree help `ORDER BY` avoid a sort step?
- When is a hash index actually the better choice?
Red flag Saying hash indexes are 'always faster' (only for point equality) or that B-trees are O(1).
source: Use The Index, Luke! — Anatomy of an Index ↗ -
What's the difference between a B-tree, a hash, and a GIN/inverted index, and what is each best for?
B-tree — the default; ordered keys, serves equality, ranges, prefix
LIKE, andORDER BY. Use for almost everything scalar.Hash — O(1) equality only, no ranges or ordering; rarely worth it over a B-tree except niche equality-heavy cases.
GIN (Generalized Inverted Index) — maps each *element* of a composite value to the rows containing it, so it's built for 'contains' queries over multi-valued columns: full-text search (
tsvector), JSONB containment (@>), array membership. (GiST is its cousin for ranges/geometry/nearest-neighbor.)Choose by the query shape: scalar ranges/sorts -> B-tree; element-membership in documents/arrays/text -> GIN.
What a strong answer coversB-tree: ordered, the all-purpose default (equality, range, sort, prefix LIKE).
Hash: equality-only, no ordering — niche.
GIN: inverted index for multi-valued columns — full-text, JSONB
@>, array containment.GiST: ranges, geometry, nearest-neighbor / fuzzy search.
Match the index type to the query operator, not by habit.
Follow-ups they push on- Why can't a B-tree efficiently answer JSONB `@>` containment?
- When would you reach for GiST over GIN?
Red flag Putting a plain B-tree on a JSONB or array column and wondering why containment queries don't use it — those need a GIN index.
source: PostgreSQL docs — Index Types ↗ -
Explain the leftmost-prefix rule for a composite index on (a, b, c). Which queries can it serve?
A concatenated index is sorted by
a, thenb, thenc— like a phone book by surname then first name. So it serves queries that filter on a leading prefix:a;a AND b;a AND b AND c.It does not efficiently serve
balone or(b, c)— there's no leadinga, so the order is useless and the engine falls back to a scan. Practical rule: put equality/most-selective columns first, the column you range/sort on last.Follow-ups they push on- Where do you place a column you do a range scan on within the index?
- Can the index still help an `a = ? AND c = ?` query (skipping b)?
Red flag Believing a composite index helps any subset of its columns, especially trailing ones like `(b, c)` without `a`.
source: Use The Index, Luke! — Concatenated Keys ↗ -
What is a covering index / index-only scan, and why is it fast?
A covering index contains every column a query needs — both the filter/sort columns and the selected columns — so the engine can answer entirely from the index and never touch the heap/table. That's an 'index-only scan'.
It's fast because it skips the random table fetch per matched row (the most expensive part of an index lookup). Postgres lets you tack non-key payload columns on with
INCLUDE (...); MySQL/InnoDB exposes it via the EXPLAIN 'Using index' note.Follow-ups they push on- Difference between key columns and INCLUDE columns in a covering index?
- Why might Postgres still hit the heap despite a covering index (visibility map)?
Red flag Thinking any index that matches the WHERE is 'covering' — it must also contain the SELECTed columns to avoid the table fetch.
source: Use The Index, Luke! — Index-Only Scan ↗ -
When do indexes hurt rather than help?
Indexes are not free:
- Writes slow down — every
INSERT/UPDATE/DELETEmust update every affected index.
- Storage — each index is a copy of its columns plus row pointers.
- Low cardinality — an index on a boolean/statuswith few distinct values rarely beats a scan (the planner may ignore it).
- Tiny tables — a seq scan of a few pages is faster than an index round-trip.
- Unused/redundant indexes still cost on every write.So index for read patterns you actually have, and drop ones EXPLAIN never picks.
Follow-ups they push on- How would you find unused indexes in production?
- Why might the planner ignore an index on a low-selectivity column?
Red flag 'Just index every column' — it bloats writes and storage and the planner won't use most of them.
source: Use The Index, Luke! — The Where Clause / index downsides ↗ -
What is the N+1 query problem and how do you fix it?
An ORM lazy-loads a relationship inside a loop: 1 query for the list of N parents, then 1 more query per parent to fetch its children —
1 + Nround-trips. With 100 posts you fire 101 queries, each paying network + planning latency.Fix with eager loading — pull the children in a single JOIN or batched
INquery: Rails.includes, Djangoselect_related/prefetch_related, HibernateJOIN FETCH, SQLAlchemyjoinedload. It's a query-count problem, not a slow-query problem; detect it by counting queries per request, not by EXPLAINing one.Follow-ups they push on- How would you detect N+1 in a running app?
- Trade-off: a single huge JOIN vs a batched IN(...) of two queries?
Red flag Trying to optimize the individual child query when the real issue is firing it N times; or not knowing the eager-load API for the ORM in use.
source: Use The Index, Luke! — N+1 problem (Join Operations) ↗ -
Your query filters on `status = 'active'` (95% of rows) and the planner does a Seq Scan instead of using the index. Is that a bug?
No — that's the planner being correct. The predicate is low-selectivity: it matches almost every row, so an index scan would do millions of random single-row fetches plus the index read, which is *slower* than one sequential pass. Indexes win only when they eliminate most of the table.
If instead you query the rare value (
status = 'pending', 0.1% of rows), the index becomes worthwhile — that asymmetry is why a partial index (CREATE INDEX … WHERE status = 'pending') is the right tool for skewed columns. Verify withEXPLAIN (ANALYZE, BUFFERS); if the planner *wrongly* avoids an index, suspect stale stats and runANALYZE.What a strong answer coversLow selectivity (matching most rows) makes a full scan cheaper than scattered index fetches.
Indexes pay off when they exclude the large majority of rows.
Skewed columns: a partial index on the rare value(s) beats a full-column index.
If the planner avoids an index it *should* use, suspect stale statistics — run ANALYZE.
Quick self-checkA boolean `is_deleted` column is true for 0.5% of rows. The best index strategy for `WHERE is_deleted = true` is…
-
Suboptimal — it indexes the 99.5% of false rows too, wasting space; the planner may still scan for `= false` queries.
-
Correct — it indexes only the rare true rows, staying tiny and highly selective for that query.
-
No — the true rows are rare (selective), so an index genuinely helps for that predicate.
-
Poor — low cardinality means few buckets; it offers no advantage over a partial B-tree and can't help ranges.
Follow-ups they push on- What is selectivity, and roughly what threshold flips the planner to a seq scan?
- When does a partial index beat a full index on the same column?
Red flag Force-hinting an index onto a non-selective predicate and making the query slower, or assuming 'index not used' is always a problem.
source: PostgreSQL docs — Partial Indexes ↗ -
For `WHERE status = 'active' AND created_at > '2024-01-01' ORDER BY created_at`, what composite index would you build and in what column order?
Index
(status, created_at)— equality column first, range/sort column last. The leadingstatus =narrows the index to the matching slice, and within that slice the entries are already ordered bycreated_at, so the engine satisfies both the range filter and theORDER BYfrom the index with no separate sort.Flip it to
(created_at, status)and the leading range column scatters thestatusvalues, so it can't use the index for the equality efficiently and may need a sort. The rule (from Use The Index, Luke!): equalities first, then the one range/order-by column — and you only get a sort-freeORDER BYif its column trails the equality columns in the index.What a strong answer coversOrder: equality predicate columns first, then the range/
ORDER BYcolumn.(status, created_at)lets one index serve filter + range + ordering with no sort step.A leading range column ruins the ability to use trailing equality columns and to skip the sort.
Only one range column can be 'used' to bound the scan; further columns only refine within it.
Quick self-checkBest single composite index for `WHERE status = 'active' AND created_at > :d ORDER BY created_at`:
-
Wrong — the leading range column scatters status values and can't cleanly serve the equality; a sort may still be needed.
-
Correct — equality first narrows the slice, and created_at trailing gives the range filter and ORDER BY for free.
-
Incomplete — it filters status but leaves the created_at range and the ORDER BY to a scan/sort.
-
Inferior — the engine usually picks one or does a costly bitmap combine; a composite handles all three needs in order.
Follow-ups they push on- Why can an index serve ORDER BY only when the sort column trails the equality columns?
- What happens to this index if you add a second range predicate?
Red flag Putting the range/sort column before the equality column, which forces a scan-and-sort and wastes the composite index.
source: Use The Index, Luke! — The Equality-First Rule (concatenated keys, ORDER BY) ↗ -
Why is keyset (seek) pagination better than OFFSET for deep pages?
LIMIT 20 OFFSET 100000still reads and discards the first 100,000 rows before returning 20 — cost grows linearly with the page number, so deep pages crawl. It can also skip or repeat rows if data changes between page loads.Keyset (seek) pagination remembers the last row's sort key and asks for the next slice directly:
WHERE (created_at, id) < (:last_ts, :last_id) ORDER BY created_at DESC, id DESC LIMIT 20. With an index on the sort key the database *seeks* straight to the spot — constant time regardless of depth — and it's stable under concurrent inserts. The cost is you can't jump to an arbitrary page number, only next/previous.What a strong answer coversOFFSET scans and throws away all skipped rows — O(offset), so deep pages get slower.
Keyset filters on the last seen sort key and seeks via the index — roughly constant time.
Keyset is stable when rows are inserted/deleted between page requests; OFFSET can skip/duplicate.
Trade-off: keyset supports next/prev, not random 'jump to page N'.
Needs a unique, indexed tiebreaker (e.g. id) appended to the sort key.
Follow-ups they push on- Why include `id` as a tiebreaker in the keyset comparison?
- When is OFFSET pagination still acceptable?
Red flag Using OFFSET for infinite scroll / deep pages (slow and prone to skipping rows under concurrent writes) and not realizing seek pagination needs a unique sort tiebreaker.
source: Use The Index, Luke! — Paging Through Results (seek method) ↗ -
The estimated rows in EXPLAIN say 12 but actual says 4,000,000. What's wrong and how do you fix it?
A large gap between estimated and actual rows means the planner is working from stale or missing statistics, so it's likely choosing a bad plan (e.g. a nested loop sized for 12 rows that actually runs 4M times).
First fix:
ANALYZE the_table;(orVACUUM ANALYZE) to refresh the stats the planner samples. If it's still off, the column may have correlated predicates the default per-column stats can't model — create extended statistics (CREATE STATISTICS … (dependencies/ndistinct)), or raise the sampling resolution withALTER TABLE … ALTER COLUMN … SET STATISTICS. Always read these numbers withEXPLAIN (ANALYZE, BUFFERS)so you compare estimate vs actual on the same run.What a strong answer coversEstimate-vs-actual divergence = the planner's row-count model is wrong, usually stale stats.
First action:
ANALYZEto refresh statistics.Correlated columns defeat per-column stats — use extended statistics (
CREATE STATISTICS).Bad estimates cause bad join-method/order choices (nested loop where a hash join was right).
Use
EXPLAIN (ANALYZE, BUFFERS)to see estimate, actual rows, and real I/O together.
Follow-ups they push on- Why does autovacuum sometimes not keep stats fresh enough on a hot table?
- What are extended statistics and when do you need them?
Red flag Rewriting the query when the real problem is stale stats, or trusting the planner's row estimate without checking it against EXPLAIN ANALYZE's actual rows.
source: PostgreSQL docs — Row Estimation / Statistics Used by the Planner ↗ -
How would you find and fix slow queries in a production database?
Find: turn on query collection —
pg_stat_statements(Postgres) or the slow query log (MySQL) — and sort by total time (frequency x latency), not just single slowest, since a moderately slow query run millions of times dominates. APM traces help spot N+1 patterns.Diagnose: run
EXPLAIN (ANALYZE, BUFFERS)on the worst offenders; look for seq scans on big tables, bad row estimates, nested loops over many rows, and high buffer reads.Fix: add/adjust an index (composite, covering, partial), rewrite to be sargable, fix N+1 with eager loading, refresh stats with
ANALYZE, or cache/materialize expensive aggregates. Then re-measure — optimize the query that costs the most aggregate time first.What a strong answer coversCapture queries with
pg_stat_statements/ the slow query log; rank by total time, not single-run time.Diagnose the top offenders with
EXPLAIN (ANALYZE, BUFFERS).Common fixes: indexing, sargable rewrites, fixing N+1, refreshing stats, caching/materializing.
Re-measure after each change — never optimize blind.
A medium-slow query run constantly often beats the single slowest in total cost.
Follow-ups they push on- Why rank by total time rather than the single slowest query?
- How do you catch an N+1 that no single EXPLAIN reveals?
Red flag Optimizing the single slowest query while ignoring a moderately slow one executed orders of magnitude more often, or tuning without measuring before/after.
source: PostgreSQL docs — pg_stat_statements ↗ -
You run EXPLAIN and see a Seq Scan with 'Rows Removed by Filter: 9,900,000'. What does that tell you and what do you do?
A sequential scan read the whole table and the filter threw away almost all of it — the query is selective but there's no index for it, so it's reading 10M rows to keep 100. Add an index on the filtered column(s) so the planner can do an index scan instead.
Read the plan bottom-up (inner nodes run first). Watch for a big gap between estimated and actual rows — that means stale statistics, so run
ANALYZE. Remember thecost=numbers are arbitrary planner units, not milliseconds; useEXPLAIN (ANALYZE, BUFFERS)for real timings.Follow-ups they push on- Estimated rows say 5, actual say 5,000,000 — what's wrong and what's the fix?
- When is a Seq Scan actually the right plan?
Red flag Reading cost as milliseconds, ignoring the estimate-vs-actual divergence, or 'optimizing' a query the planner already handles well.
source: PostgreSQL docs — Using EXPLAIN ↗ -
A query filters `WHERE LOWER(email) = 'a@b.com'` (or `WHERE created_at::date = '2024-01-01'`) and ignores the index on the column. Why, and how do you fix it?
Wrapping the indexed column in a function makes the predicate non-sargable — the index is sorted on
email, not onLOWER(email), so the engine can't use it and seq-scans.Fixes: (1) create a functional/expression index matching the expression:
CREATE INDEX ON users (LOWER(email));. (2) Rewrite to keep the column bare: for the date case,WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02'is sargable and uses the plain index. Same trap with leading-wildcardLIKE '%x'.Follow-ups they push on- Why is `LIKE 'abc%'` sargable but `LIKE '%abc'` not?
- Implicit type casts (string column compared to a number) — same problem?
Red flag Not recognizing that a function/cast on the indexed column defeats the index, and reaching for query hints instead of an expression index or a sargable rewrite.
source: Use The Index, Luke! — Functions / sargable predicates ↗
3.5 Schema design & transactions 14
-
Optimistic vs pessimistic concurrency control — how do they work and when do you pick each?
Pessimistic: assume conflicts are likely, so lock the row up front (
SELECT … FOR UPDATE) and hold it until commit; others wait. Correct and simple, but locks reduce concurrency and risk deadlocks and lock-wait timeouts.Optimistic: assume conflicts are rare, so don't lock — read a
version/timestamp, and at write time doUPDATE … WHERE id = ? AND version = :read_version. If zero rows update, someone else changed it: abort and retry. No locks held during the user's think-time.Pick pessimistic for high contention / short critical sections where retries would thrash; optimistic for low contention and long read-think-write cycles (web forms, APIs) where holding a lock across a round-trip is unacceptable.
What a strong answer coversPessimistic = lock first (
FOR UPDATE); others block until commit.Optimistic = no lock; detect conflict at write via a version/timestamp check, then retry.
High contention favors pessimistic (avoid retry storms); low contention favors optimistic.
Optimistic avoids holding a lock across user think-time / network round-trips.
Both need a transaction; optimistic additionally needs retry logic in the app.
Quick self-checkA web 'edit profile' form is open for minutes before submit; conflicts are rare. Best concurrency strategy?
-
Bad — it would hold a row lock for the entire minutes-long edit, blocking others and risking timeouts.
-
Correct — no lock is held across think-time; the rare conflict is caught at submit and retried.
-
Overkill and impractical — it doesn't help hold state across a stateless web round-trip and adds abort overhead.
-
Risky — silently overwrites a concurrent edit (lost update) with no detection.
Follow-ups they push on- How does a `version` column implement optimistic locking?
- Why can optimistic locking thrash under high contention?
Red flag Using optimistic locking under heavy contention (constant retry/abort churn), or holding a pessimistic lock across a user's think-time and serializing everyone.
source: PostgreSQL docs — Concurrency Control / Explicit Locking ↗ -
What does referential integrity mean, and what are ON DELETE CASCADE / RESTRICT / SET NULL?
Referential integrity is the guarantee that a foreign key always points at a row that exists (or is NULL) — you can't have an order for a customer who was deleted. The DB enforces it for you.
The
ON DELETE(andON UPDATE) clause decides what happens to children when the parent is deleted:-
RESTRICT/NO ACTION— block the delete if children exist (the safe default).
-CASCADE— delete the children too.
-SET NULL— keep children but null out their FK (requires a nullable column).
-SET DEFAULT— set the FK to its default.Choose CASCADE only when children are truly owned by the parent (an order's line items); use RESTRICT for shared/important references to avoid accidental mass deletes.
What a strong answer coversReferential integrity: every FK value must match an existing PK (or be NULL).
RESTRICT/NO ACTIONblocks deleting a parent that still has children.CASCADEdeletes the children with the parent — powerful but easy to mass-delete by accident.SET NULL/SET DEFAULTkeep the child but clear/replace its FK.FK enforcement requires an index (often the child FK column) for the check to be efficient.
Follow-ups they push on- Why is CASCADE risky in production, and how do you make deletes auditable?
- Does the child's FK column need its own index? (yes, for the check and for joins)
Red flag Adding `ON DELETE CASCADE` everywhere and triggering a surprise mass-delete, or forgetting that SET NULL needs the FK column to be nullable.
source: PostgreSQL docs — Foreign Keys (referential actions) ↗ -
Explain 1NF, 2NF, and 3NF each in a sentence, with an example violation.
1NF — atomic columns, no repeating groups/arrays in a cell (violated by a comma-separated
phonescolumn).2NF — 1NF plus no non-key column depends on only part of a composite key (in
(order_id, product_id) -> product_name,product_namedepends onproduct_idalone — split it out).3NF — 2NF plus no transitive dependency: non-key columns depend only on the key (storing
zipandcity, wherezip -> city, is a transitive dependency; move it to a zip table).Mnemonic: 'the key, the whole key, and nothing but the key.'
Follow-ups they push on- What does BCNF add over 3NF?
- Give an anomaly (insert/update/delete) that normalization removes.
Red flag Reciting the names without being able to name a concrete violation, or conflating 2NF (partial dependency) with 3NF (transitive dependency).
source: Wikipedia — Database normalization ↗ -
What is the difference between surrogate and natural primary keys, and what are the trade-offs?
A natural key is a real-world attribute already unique (SSN, ISBN, email, country code). A surrogate key is a system-generated, meaningless identifier (auto-increment
id, UUID) added solely to identify the row.Surrogates win in practice: they're stable (a natural key like email can change, breaking every FK referencing it), compact, and uniform. Naturals avoid an extra column and can prevent duplicate business rows. Common pattern: use a surrogate PK for joins/FKs and a
UNIQUEconstraint on the natural key to enforce business uniqueness. Note the UUID choice matters: random UUIDv4 PKs fragment a clustered index (random insert order); UUIDv7/ULID are time-ordered to avoid that.What a strong answer coversNatural key = meaningful real-world attribute; surrogate = synthetic id (serial/UUID).
Surrogates are stable under business changes; natural keys can mutate and cascade.
Best practice: surrogate PK + a UNIQUE constraint on the natural key.
Random UUIDv4 as a clustered PK hurts insert locality; prefer UUIDv7/ULID or bigserial.
Follow-ups they push on- Why does a random UUIDv4 primary key hurt write performance on a clustered table?
- When is a composite natural key genuinely the better PK?
Red flag Using a mutable natural key (email/phone) as the PK so a single change cascades through every foreign key, or choosing random UUIDv4 PKs and fragmenting the clustered index.
source: Wikipedia — Surrogate key ↗ -
When would you deliberately denormalize a schema?
Denormalize to trade write/consistency cost for read speed when reads dominate and joins are the bottleneck. Common cases: duplicating a
category_nameonto anorderstable to avoid a join on every report; precomputed counts/totals (acomment_countcolumn) to skip aggregation; materialized views; read-optimized analytics tables.The cost: every duplicated fact must be kept in sync on write (triggers, app logic, or background jobs), risking drift. Rule of thumb: normalize first for correctness, denormalize surgically where a measured read path demands it.
Follow-ups they push on- How do you keep denormalized copies consistent?
- Materialized view vs a denormalized column — trade-offs?
Red flag Denormalizing prematurely 'for performance' without a measured hot path, then fighting update anomalies and data drift.
source: Wikipedia — Denormalization ↗ -
What does ACID stand for, and what does each property actually guarantee?
Atomicity — a transaction is all-or-nothing; partial failure rolls the whole thing back.
Consistency — a committed transaction moves the DB from one valid state to another, preserving constraints/invariants.
Isolation — concurrent transactions don't see each other's uncommitted, in-flight state (degree set by the isolation level).
Durability — once committed, the change survives a crash (write-ahead log / fsync).
Classic example: a bank transfer must debit and credit atomically, never leaving money half-moved.
Follow-ups they push on- Which property does the isolation level tune?
- How is durability implemented (WAL / fsync)?
Red flag Conflating ACID's 'Consistency' (constraint preservation) with the distributed-systems 'consistency' of CAP — different concepts.
source: PostgreSQL docs — Transactions ↗ -
ORM vs raw SQL — what are the trade-offs and when do you drop to raw SQL?
ORM wins on productivity and safety: less boilerplate, parameterized queries (SQL-injection resistant by default), migrations, mapping rows to objects, DB portability.
Raw SQL wins on control and performance: complex joins, window functions, CTEs, query-plan tuning, and bulk operations the ORM expresses poorly or N+1's.
Practical stance: ORM for the 90% of CRUD, drop to raw/handwritten SQL (most ORMs allow it) for hot, complex, or analytical queries. The ORM's biggest footgun is hidden N+1 queries.
Follow-ups they push on- How does an ORM protect against SQL injection?
- Name an ORM performance pitfall besides N+1.
Red flag Treating it as religious all-or-nothing, or not knowing the ORM's N+1 / lazy-loading traps and over-fetching.
source: StrataScratch — SQL Interview Questions: The Ultimate Guide ↗ -
What is BCNF and how does it differ from 3NF? Give a case where a table is in 3NF but not BCNF.
BCNF (Boyce-Codd Normal Form) is a stricter 3NF: for *every* non-trivial functional dependency
X -> Y,Xmust be a superkey. 3NF allows a narrow exception — a dependency is OK if its right side is a *prime* attribute (part of some candidate key) — and BCNF removes that exception.The textbook case needs overlapping candidate keys. Table
(student, course, instructor)where each course is taught by one instructor (instructor -> course) and a student takes a course with one instructor ({student, course} -> instructor). Candidate keys are{student, course}and{student, instructor}. The dependencyinstructor -> coursehas a non-superkey left side, so it violates BCNF — yet the table is in 3NF becausecourseis a prime attribute. Fix: split into(instructor, course)and(student, instructor).What a strong answer coversBCNF: every non-trivial FD's determinant (left side) must be a superkey — no exceptions.
3NF permits a dependency whose right side is a prime (key) attribute; BCNF forbids it.
Violations require overlapping/composite candidate keys.
BCNF decomposition can occasionally sacrifice dependency-preservation — a real trade-off.
Follow-ups they push on- Why is dependency preservation sometimes lost when decomposing to BCNF?
- When is staying at 3NF the pragmatic choice over BCNF?
Red flag Claiming 3NF and BCNF are identical — they diverge precisely when a non-key attribute determines a prime attribute under overlapping candidate keys.
source: Wikipedia — Boyce-Codd normal form ↗ -
Why should long-running transactions be avoided, especially under MVCC?
Under MVCC, an
UPDATE/DELETEdoesn't overwrite — it creates a new row version and leaves the old one as a 'dead tuple' until no transaction could still need it. A long-running (or idle-in-transaction) transaction holds an old snapshot open, so the vacuum/garbage-collector can't reclaim those dead tuples — leading to table/index bloat, slower scans, and transaction-ID wraparound pressure in Postgres.Long transactions also hold locks longer (more contention and deadlock risk) and amplify lost-update windows. The fix: keep transactions short, never leave one open across user think-time or external API calls, batch large mutations, and watch for
idle in transactionconnections.What a strong answer coversMVCC keeps old row versions until no open snapshot needs them.
A long/idle transaction pins an old snapshot, blocking VACUUM from reclaiming dead tuples -> bloat.
It also holds locks longer (contention, deadlocks) and, in Postgres, raises wraparound risk.
Keep transactions short; never span user think-time or slow external calls; batch big writes.
Follow-ups they push on- What is 'idle in transaction' and why is it dangerous?
- How does table bloat hurt query performance, and how do you measure it?
Red flag Opening a transaction, then making a slow external API call or waiting on user input inside it — pinning the MVCC snapshot, blocking vacuum, and bloating the table.
source: PostgreSQL docs — Routine Vacuuming (dead tuples / bloat) ↗ -
Define dirty read, non-repeatable read, and phantom read, and map each to the isolation level that prevents it.
Dirty read — you read another transaction's uncommitted change (which may be rolled back). Prevented at
READ COMMITTEDand above.Non-repeatable read — you read a row twice and get different values because another committed transaction updated it between reads. Prevented at
REPEATABLE READand above.Phantom read — you re-run a range query and new rows appear (or vanish) because another transaction inserted/deleted matching rows. Prevented at
SERIALIZABLE.So the ladder is READ UNCOMMITTED -> READ COMMITTED -> REPEATABLE READ -> SERIALIZABLE, each forbidding one more anomaly.
Follow-ups they push on- Postgres prevents phantoms at REPEATABLE READ — why is that stronger than the SQL standard?
- What is a write-skew anomaly and which level stops it?
Red flag Swapping non-repeatable (an UPDATE to existing rows) with phantom (INSERT/DELETE changing which rows match), or assuming every engine maps the levels identically.
source: PostgreSQL docs — Transaction Isolation ↗ -
PostgreSQL's REPEATABLE READ prevents phantom reads, which the SQL standard doesn't require at that level. Why?
Because Postgres implements isolation with MVCC + snapshots, not range locks. At
REPEATABLE READit takes one consistent snapshot at the first statement and every read in the transaction sees the database exactly as of that snapshot — so new rows inserted by others are invisible, eliminating phantoms too.The SQL standard only *requires* REPEATABLE READ to block dirty + non-repeatable reads; Postgres is strictly stronger. (Its
SERIALIZABLEadds Serializable Snapshot Isolation to also catch write-skew.) Takeaway: the named levels are minimum guarantees — engines often exceed them, so verify per-engine.Follow-ups they push on- What anomaly does Postgres SERIALIZABLE catch that REPEATABLE READ still allows (write skew)?
- How does MySQL/InnoDB REPEATABLE READ differ (gap locks)?
Red flag Assuming the SQL-standard anomaly table is literally true for every database — engine implementations (MVCC vs locking) change the real guarantees.
source: PostgreSQL docs — Repeatable Read Isolation Level ↗ -
Explain shared vs exclusive locks and how a deadlock arises.
A shared (read) lock lets many transactions hold it at once but blocks writers. An exclusive (write) lock is held by exactly one transaction and blocks everyone else on that resource. Shared/shared is compatible; anything with exclusive is not.
A deadlock is a cycle of waits: T1 holds A and wants B; T2 holds B and wants A — neither can proceed. The DB detects the cycle and aborts one transaction (the 'deadlock victim'); your code should catch the error and retry. Avoid them by acquiring locks in a consistent order and keeping transactions short.
Follow-ups they push on- How does lock ordering prevent deadlocks?
- Optimistic vs pessimistic locking — when to pick each?
Red flag Thinking the DB hangs forever on a deadlock — it detects the cycle and kills a victim; the app must handle the retry. Also confusing a deadlock with a long lock-wait.
source: PostgreSQL docs — Explicit Locking / Deadlocks ↗ -
Two users buy the last item in stock at the same time and you oversell. How do you prevent the race with the database?
It's a lost-update / check-then-act race: both read
stock = 1, both decrement. Fixes:- Pessimistic lock:
SELECT stock FROM items WHERE id = ? FOR UPDATEinside a transaction — the second buyer blocks until the first commits, then sees0.
- Atomic conditional write:UPDATE items SET stock = stock - 1 WHERE id = ? AND stock > 0and check the affected-row count — zero rows means it was already sold out. No separate read needed.
- Optimistic concurrency: aversioncolumn,UPDATE … WHERE version = ?; retry on conflict. Best under low contention.The atomic conditional UPDATE is usually the simplest correct answer.
Follow-ups they push on- Optimistic vs pessimistic — which under high contention?
- Where do isolation levels alone fail to save you here?
Red flag Doing read-then-write in application code without a lock or atomic update and assuming the transaction wrapper alone prevents the lost update (it doesn't at READ COMMITTED).
source: PostgreSQL docs — Explicit Locking (Row-Level Locks / FOR UPDATE) ↗ -
What is a write-skew anomaly, and why can it slip past REPEATABLE READ / snapshot isolation?
Write skew: two transactions each read an overlapping set of rows, each checks an invariant that currently holds, then each writes to a different row — and the combined result violates the invariant that neither saw broken.
Classic case: a hospital requires >=1 doctor on call. Two on-call doctors each run 'if more than one is on call, I can go off-call', both read 2-on-call (true), both update their own row, and now zero are on call. Snapshot isolation /
REPEATABLE READdoesn't catch it because the two transactions write *disjoint* rows — there's no write-write conflict, only a read-write dependency cycle. OnlySERIALIZABLE(in Postgres, Serializable Snapshot Isolation) detects the dependency cycle and aborts one.What a strong answer coversWrite skew: concurrent transactions read overlapping data, then write *disjoint* rows, breaking an invariant.
Snapshot isolation misses it because there's no write-write conflict to detect.
It's a read-write dependency cycle, not a lost update.
Only SERIALIZABLE (SSI in Postgres) prevents it; or use explicit
SELECT … FOR UPDATEto materialize the conflict.
Quick self-checkWhich isolation level is required to reliably prevent write skew?
-
No — it doesn't even prevent non-repeatable reads, let alone write skew.
-
No — write skew specifically survives snapshot isolation because the writes touch disjoint rows.
-
Correct — only serializable execution (e.g. Postgres SSI) detects the read-write dependency cycle and aborts a transaction.
-
No — that's the weakest level; it allows dirty reads and certainly write skew.
Follow-ups they push on- How does SELECT … FOR UPDATE turn a write-skew into a detectable conflict?
- What is Serializable Snapshot Isolation and how does it differ from two-phase locking?
Red flag Believing REPEATABLE READ/snapshot isolation prevents all anomalies — it still allows write skew, which needs SERIALIZABLE or explicit locking.
source: PostgreSQL docs — Serializable Isolation Level (write skew) ↗
3.6 NoSQL 14
-
State the CAP theorem and explain why 'CA' isn't a real choice for a distributed database.
CAP says that when a network partition (P) happens, a distributed system can preserve at most one of Consistency (every read sees the latest write) and Availability (every request gets a non-error response) — you must drop one.
'CA' isn't a meaningful pick because partitions *will* happen in any real network — you don't get to opt out of P. So the real choice during a partition is CP (refuse/error to stay consistent — e.g. a leader-based store rejecting writes it can't replicate) or AP (answer with possibly-stale data and reconcile later — Dynamo-style stores). When there's *no* partition, a good system gives both C and A; CAP only forces the trade *during* a partition. PACELC extends it: else (no partition) you still trade latency vs consistency.
What a strong answer coversUnder a partition you choose Consistency or Availability, not both.
Partitions are unavoidable in real networks, so P isn't optional — 'CA' is a non-choice.
CP = stay consistent, reject/err during partition; AP = stay available, serve stale, reconcile.
CAP only bites *during* a partition; PACELC adds the latency-vs-consistency trade for normal operation.
Quick self-checkDuring a network partition, a payment system that must never double-charge should behave as…
-
Risky for payments — serving/accepting under partition can produce conflicting writes (double charges) to reconcile.
-
Correct — for financial correctness you sacrifice availability during the partition to preserve consistency.
-
Wrong — CA isn't achievable in a distributed system; you can't opt out of partitions.
-
Wrong — CAP applies to any distributed data store, SQL or NoSQL.
Follow-ups they push on- What does PACELC add to CAP?
- Give a real CP store and a real AP store and the workload each suits.
Red flag Treating CAP as 'pick any two' (you can't drop P) or thinking it forces a permanent global trade rather than one that only applies during a partition.
source: Wikipedia — CAP theorem ↗ -
Name the four main NoSQL families and a use case where each beats a relational DB.
Document (MongoDB) — flexible JSON-like docs; content, catalogs, user profiles where the shape varies.
Key-value (Redis, DynamoDB) — fastest by-key access; caching, sessions, leaderboards, rate counters.
Wide-column (Cassandra, HBase) — massive distributed write scale; time-series, IoT, event logs.
Graph (Neo4j) — relationship-heavy traversals; social graphs, fraud rings, recommendations.
The through-line: each optimizes a specific access pattern that relational tables + joins serve poorly at scale.
Follow-ups they push on- Why is a graph DB better than SQL recursive joins for 'friends of friends of friends'?
- Document vs wide-column — how do their data models differ?
Red flag Treating 'NoSQL' as one thing, or claiming it's 'schemaless so always better' — each family has a narrow sweet spot.
source: MongoDB — Types of NoSQL Databases ↗ -
When is a graph database the right tool, and why does it beat relational recursive joins for deep traversals?
Use a graph DB (Neo4j) when relationships are first-class and traversals are deep/variable-length: social graphs ('friends of friends of friends'), fraud rings, recommendation paths, dependency/permission graphs.
In a relational store, each 'hop' is another self-join, and a 4-hop query means 4 joins whose cost compounds with table size — the optimizer re-finds matching rows by index lookup each level. A graph DB uses index-free adjacency: each node directly stores pointers to its neighbors, so traversing one more hop is O(neighbors of the current node), independent of total graph size. That makes variable-depth path queries (shortest path, reachability) both fast and natural to express (Cypher's
MATCH (a)-[:FRIEND*1..4]->(b)).What a strong answer coversGraph DBs shine when relationships and multi-hop traversal are the core workload.
Index-free adjacency: nodes point straight at neighbors, so a hop is local, not a global index lookup.
Relational deep traversal = N self-joins whose cost compounds with table size.
Variable-length paths (shortest path, reachability) are awkward in SQL, native in graph query languages.
Follow-ups they push on- What is index-free adjacency, concretely?
- Could a recursive CTE handle this in SQL, and where does it fall down at scale?
Red flag Forcing a deeply-connected, variable-depth traversal into repeated SQL self-joins/recursive CTEs and watching it degrade as hop count and table size grow.
source: Neo4j — Graph Database Concepts (index-free adjacency) ↗ -
Cache-aside vs write-through vs write-behind — compare the caching strategies.
Cache-aside (lazy): the app checks the cache; on a miss it reads the DB and populates the cache, and on writes it updates the DB and *invalidates* the key. Simple and resilient (cache down ≠ data loss), but the first read after a miss/eviction is slow and there's a brief staleness window.
Write-through: writes go to cache and DB synchronously, so the cache is always fresh — at the cost of higher write latency and caching data that may never be read.
Write-behind (write-back): writes hit the cache and are flushed to the DB asynchronously — lowest write latency, highest throughput, but risks data loss if the cache fails before flushing and adds complexity. Cache-aside is the common default for read-heavy web workloads.
What a strong answer coversCache-aside: app-managed, populate on miss, invalidate on write — simple, resilient, can serve stale briefly.
Write-through: write cache+DB together — always fresh, slower writes, may cache unread data.
Write-behind: async flush to DB — fastest writes, but risks loss on cache failure.
Default to cache-aside for read-heavy systems; reserve write-behind for write-heavy, loss-tolerant cases.
Quick self-checkWhich strategy has the **lowest write latency** but the **highest risk of data loss**?
-
No — it writes to the DB directly (then invalidates), so no loss, but it's not the lowest write latency.
-
No — it writes cache and DB synchronously, so writes are durable but slower, not the fastest.
-
Correct — it acks the write from cache and flushes to the DB asynchronously: fastest writes, but unflushed data is lost if the cache fails.
-
No — read-through governs cache population on reads, not write latency.
Follow-ups they push on- Why does write-behind risk data loss, and how do you mitigate it?
- How do you avoid a cache stampede when a hot cache-aside key expires?
Red flag Choosing write-behind for data you can't afford to lose, or running cache-aside without an invalidation step so the cache serves stale data after every update.
source: AWS — Caching strategies (lazy loading / write-through) ↗ -
A NoSQL store is 'schemaless' — what does that actually mean, and what's the catch?
'Schemaless' means the database doesn't enforce a fixed schema — different documents in a collection can have different fields, and you can add a field without a migration. It's better called schema-on-read: the structure is interpreted by the application when it reads, rather than enforced by the database on write.
The catch is the schema doesn't disappear — it moves into your application code, which must handle missing fields, mixed types, and old document shapes (versioning) forever. Without DB-level constraints you can silently write inconsistent data, so mature NoSQL stores add optional validation (MongoDB JSON Schema validators) and teams still enforce structure in code. 'Flexible' is the upside; 'no guardrails' is the downside.
What a strong answer coversSchemaless = the DB doesn't enforce structure; really 'schema-on-read'.
The schema moves into application code, which must tolerate missing/old/variant shapes.
Flexibility speeds iteration but removes the DB's data-integrity guardrails.
Mitigate with optional validators (MongoDB schema validation) and explicit document versioning.
Quick self-checkA 'schemaless' document store most accurately means…
-
No — the data still has structure; it's enforced/interpreted by the app on read, not by the DB on write.
-
Correct — 'schemaless' is schema-on-read; the burden moves to application code.
-
No — the point is the DB does *not* enforce a fixed schema; documents can vary.
-
No — it stores rich structured documents; it just doesn't impose one uniform schema.
Follow-ups they push on- How does schema-on-read differ from schema-on-write?
- How do you evolve millions of existing documents to a new shape?
Red flag Believing 'schemaless' means no schema to manage — the schema is just enforced (or not) in application code, where inconsistencies accumulate silently.
source: MongoDB — Schema Validation ↗ -
In MongoDB, when do you embed a sub-document vs reference another collection?
Embed when the child is owned by and always read with the parent, the relationship is one-to-few, and the embedded data doesn't grow unbounded — e.g. a user's addresses inside the user document. One read fetches everything; no join.
Reference (store an ObjectId, join with
$lookupor a second query) when the child is large, shared across parents (many-to-many), updated independently, or the array would grow without bound (a celebrity's millions of followers). This avoids the 16MB document cap and write amplification.Rule: model around your access patterns, not entities — 'data that is accessed together should be stored together.'
Follow-ups they push on- What's MongoDB's document size limit, and how does it force referencing?
- How would you model a comments-on-posts relationship?
Red flag Reflexively normalizing like a relational schema, or embedding an unbounded growing array that eventually hits the 16MB document limit.
source: MongoDB — Data Modeling Introduction ↗ -
What is BASE and how does it differ from ACID?
BASE = Basically Available, Soft state, Eventual consistency. It's the consistency model many NoSQL/distributed stores choose: stay available and partition-tolerant, accept that replicas converge *eventually* rather than being instantly consistent.
Vs ACID, which insists every transaction leaves the DB strongly consistent and isolated. BASE relaxes that to gain availability and horizontal scale. It's the practical face of the CAP theorem: under a network partition you pick availability (BASE/AP) or consistency (ACID/CP). Use BASE where stale-by-seconds reads are fine (feeds, product views); use ACID where they aren't (payments).
Follow-ups they push on- State the CAP theorem and which corner BASE sits in.
- Give a feature where eventual consistency is unacceptable.
Red flag Equating 'NoSQL' with 'no transactions' — many (MongoDB, DynamoDB) now offer ACID transactions; BASE is a choice, not an inherent limitation.
source: MongoDB — ACID Transactions / Database Consistency ↗ -
What is the MongoDB aggregation pipeline, and how does it map to SQL?
The aggregation pipeline passes documents through ordered stages, each transforming the stream and feeding the next — like Unix pipes for data.
Rough SQL mapping:
$match~ WHERE,$group~ GROUP BY (+ aggregates),$project~ SELECT (shape columns),$sort~ ORDER BY,$limit/$skip~ LIMIT/OFFSET,$lookup~ LEFT JOIN,$unwind~ flatten an array into rows.Stage order matters for performance: put
$matchand$sortearly so they can use indexes and shrink the working set before expensive$group/$lookup.Follow-ups they push on- Why put $match as early as possible in the pipeline?
- What does $unwind do and when is it needed before $group?
Red flag Ordering stages so `$match` comes after a `$group`/`$lookup`, defeating index use and processing far more documents than necessary.
source: MongoDB — Aggregation Pipeline ↗ -
Why use Redis for caching, and what are the main eviction/expiry concerns?
Redis is an in-memory key-value store, so reads/writes are microsecond-fast — ideal as a cache in front of a slower primary DB, plus sessions, rate limiters, and leaderboards (sorted sets).
Key concerns: set a TTL (
EXPIRE) so stale data ages out; pick an eviction policy for when memory is full (allkeys-lru,allkeys-lfu,volatile-ttl, etc.); and have a cache-invalidation strategy on writes (write-through, or delete-on-update). Watch for stampede — many requests recomputing a hot key the instant it expires — mitigated by locks or jittered TTLs.Follow-ups they push on- Cache-aside vs write-through vs write-behind?
- What is a cache stampede / thundering herd, and how do you avoid it?
Red flag Caching without a TTL or invalidation plan (serving stale data forever), or ignoring eviction so the cache silently drops keys under memory pressure.
source: Redis — Key eviction (docs) ↗ -
When would you NOT use NoSQL — i.e., when is a relational database still the right call?
Choose relational when you need strong multi-row transactions / ACID (money, inventory, bookings), flexible ad-hoc queries and joins across well-structured related data, constraints and referential integrity enforced by the DB, and a stable schema.
NoSQL earns its place for huge scale on a known access pattern, flexible/evolving document shapes, or relationship-traversal workloads. The honest senior answer is 'it depends on access patterns and consistency needs' — and modern Postgres (JSONB, partitioning, logical replication) covers many cases people reach for NoSQL for.
Follow-ups they push on- How does Postgres JSONB blur the SQL/NoSQL line?
- Polyglot persistence — when is mixing both justified?
Red flag Picking NoSQL for hype/scale you don't have, then reimplementing joins and transactions in application code; or assuming relational 'can't scale'.
source: MongoDB — NoSQL vs SQL Databases ↗ -
Why is NoSQL data modeling driven by access patterns, and what does DynamoDB single-table design illustrate?
Relational modeling normalizes by entity and joins at read time. NoSQL stores (especially DynamoDB) have no joins and charge for every access, so you model queries first: list the access patterns, then design keys so each query is a single, indexed key lookup.
Single-table design takes this to the extreme — multiple entity types (users, orders, items) share one table, distinguished by a composite primary key (a generic partition key + sort key, often
PK/SKwith prefixes likeUSER#123/ORDER#456). Related items share a partition so one query fetches them together without a join, and secondary indexes (GSIs) serve alternate patterns. The cost is a rigid, query-specific schema that's painful to change when access patterns evolve.What a strong answer coversNo joins + per-request cost -> design around queries, not entities.
List access patterns first, then shape partition/sort keys so each is one key lookup.
Single-table design co-locates related items in a partition via prefixed composite keys.
Secondary indexes (GSIs) add alternate access patterns; the schema is rigid to new ones.
Follow-ups they push on- How does a composite (partition + sort) key let one query return several related items?
- What's the downside when a brand-new access pattern appears later?
Red flag Modeling a NoSQL store like a normalized relational schema and then needing joins the database can't do, forcing N round-trips or client-side joins.
source: AWS docs — DynamoDB single-table design / data modeling ↗ -
What is eventual consistency, and how do read-your-writes and quorum reads/writes fit in?
Eventual consistency: replicas may temporarily disagree, but with no new writes they all converge to the same value. The window means a read just after a write can return stale data.
Stronger guarantees layer on top. Read-your-writes ensures *you* always see your own latest write (route your reads to a replica known to have it, or to the leader). Quorum tunes consistency per operation: with N replicas, require W acks on write and R replicas on read; if R + W > N the read and write sets overlap, so a read is guaranteed to see the latest acknowledged write (strong consistency) — at the cost of latency/availability. Dynamo-style systems expose W/R so you trade consistency against speed per call.
What a strong answer coversEventual consistency: replicas converge once writes stop; reads can be briefly stale.
Read-your-writes: a session always sees its own latest write.
Quorum: pick W (write acks) and R (read replicas) out of N.
R + W > Nguarantees overlap -> a read sees the latest committed write (strong consistency).Higher R/W means stronger consistency but more latency and less availability.
Quick self-checkWith N=3 replicas, which (W, R) configuration guarantees strongly-consistent reads?
-
No — R + W = 2, not > 3; the read and write sets may not overlap, so reads can be stale.
-
Correct — R + W = 4 > 3, so the read set and write set must share at least one replica with the latest write.
-
No — R + W = 3, which is not strictly greater than N=3; overlap isn't guaranteed.
-
Misleading — R=3 with W=1 does give R+W=4>3, but the framing 'only if' is wrong; the quorum math is what guarantees it, and this option understates W's role.
Follow-ups they push on- Why does R + W > N guarantee a read sees the newest write?
- What's a tunable example — W=N for strong writes vs W=1 for fast writes?
Red flag Assuming eventual consistency means 'never consistent', or thinking any single quorum value is right — R/W are a per-workload latency-vs-consistency dial.
source: Wikipedia — Eventual consistency ↗ -
Explain sharding vs replication vs partitioning. How are they different?
Replication — keep copies of the same data on multiple nodes (leader-follower). Goal: high availability + read scaling + durability. It does *not* increase write capacity (one leader takes writes).
Sharding — split the dataset into disjoint pieces across nodes by a shard key, each node owning a subset. Goal: scale writes and storage beyond one machine.
Partitioning — the general term for splitting a table: horizontal = rows split across partitions (sharding is horizontal partitioning across servers); vertical = columns split into separate tables.
In practice you combine them: shard for write scale, then replicate each shard for HA.
Follow-ups they push on- Why does replication alone not scale writes?
- How do you choose a shard key, and what's a hot-shard / hotspot?
Red flag Using sharding and replication interchangeably — replicas are full copies (HA + reads), shards are disjoint subsets (write/storage scale).
source: MongoDB — Sharding ↗ -
How would you choose a shard key, and what goes wrong with a bad one?
A good shard key has high cardinality, even write distribution, and matches your query pattern so most queries hit one shard (targeted, not scatter-gather).
Failure modes: a monotonically increasing key (timestamp, auto-increment id) sends all new writes to one shard — a 'hot shard'. A low-cardinality key (country, status) can't split finely enough. A key that doesn't appear in queries forces every query to fan out to all shards (broadcast).
Mitigations: hashed shard keys to spread writes, or compound keys (e.g.
user_id+ time) to keep related data together while distributing load.Follow-ups they push on- Why is a hashed shard key better for write distribution but worse for range queries?
- What is a scatter-gather query and why is it slow?
Red flag Picking an auto-increment or timestamp shard key and creating a permanent hotspot, or a key absent from common queries forcing broadcasts.
source: MongoDB — Choose a Shard Key ↗
3.7 Stored routines, views & triggers 11
-
What is SQL injection, and how do stored procedures and parameterized queries relate to preventing it?
SQL injection happens when user input is concatenated into a query string, so input like
' OR 1=1 --becomes executable SQL. The real defense is parameterized queries / prepared statements: the SQL text and the data travel separately, so input is always treated as a value, never as code.Stored procedures help *only if* they use parameters internally — a procedure that builds and
EXECUTEs a string from its arguments (dynamic SQL) is just as injectable. So 'use stored procedures' is not itself the fix; 'never interpolate untrusted input into SQL' is. ORMs parameterize by default, which is a big part of why they're safer out of the box.What a strong answer coversInjection = untrusted input concatenated into SQL text and executed as code.
Fix = parameterized queries / prepared statements: code and data sent separately.
Stored procedures are safe only when parameterized; dynamic SQL inside them is still vulnerable.
ORMs parameterize by default; the danger returns the moment you build raw SQL by string concat.
Quick self-checkWhich most reliably prevents SQL injection?
-
Fragile — easy to miss cases (numeric contexts, encodings); not a robust defense on its own.
-
Correct — separating SQL code from data means input can never be parsed as SQL.
-
Insufficient — a procedure that builds dynamic SQL from its inputs is still injectable.
-
Good defense-in-depth that limits blast radius, but it doesn't prevent the injection itself.
Follow-ups they push on- How can a stored procedure still be injectable (dynamic SQL / EXECUTE)?
- Why isn't escaping/quoting input a reliable substitute for parameterization?
Red flag Believing 'we use stored procedures, so we're safe from injection' — a procedure that concatenates input into dynamic SQL is exactly as vulnerable as inline string-building.
source: OWASP — SQL Injection Prevention Cheat Sheet ↗ -
What is the difference between a view and a materialized view, and when would you use each?
A regular view is just a saved query — it stores no data. Every read re-runs the underlying
SELECTagainst the live tables, so results are always current but you pay the full query cost on each access.A materialized view stores the computed result on disk, so reads are cheap — but the data is a snapshot that goes stale until you
REFRESH MATERIALIZED VIEW.Use a plain view to centralize/simplify a query, present a stable interface, or restrict columns for security. Use a materialized view when the query is expensive and slightly-stale results are acceptable: dashboards, reporting rollups, precomputed aggregates.
Follow-ups they push on- How do you refresh a materialized view without blocking readers?
- Can you put an index on a materialized view? (yes)
Red flag Believing a regular view caches its results (it doesn't — it re-executes every time), or treating a materialized view as always up to date.
source: PostgreSQL — Materialized Views ↗ -
Can you INSERT/UPDATE/DELETE through a view?
Sometimes. A simple view — one base table, no aggregation,
DISTINCT,GROUP BY, window functions, or set operations — is automatically updatable: writes pass straight through to the base table. Complex views (joins, aggregates) are not directly writable; you make them writable with anINSTEAD OFtrigger that translates the change to the right base tables.Add
WITH CHECK OPTIONso an INSERT/UPDATE can't create a row that would fall outside the view'sWHEREand silently disappear.Follow-ups they push on- What does WITH CHECK OPTION protect against?
- How does an INSTEAD OF trigger make a multi-table view writable?
Red flag Assuming any view is updatable, then being surprised when a write to a join/aggregate view errors out.
source: PostgreSQL — CREATE VIEW (updatable views) ↗ -
What are triggers good for, and why are they dangerous in production?
A trigger runs a function automatically on
INSERT/UPDATE/DELETE(BEFORE,AFTER, orINSTEAD OF). Legitimate uses: writing audit/history rows, enforcing invariants the schema can't express, maintaining a derived/denormalized column, or keeping a summary table in sync.The danger is that triggers are invisible side effects. They fire on every row change, hide business logic away from the application, add latency to every write, can cascade or recurse, and quietly make bulk operations slow. They're powerful but easy to abuse.
Follow-ups they push on- BEFORE vs AFTER vs INSTEAD OF — when do you reach for each?
- How do you prevent a trigger from recursively firing on itself?
Red flag Burying critical business logic in triggers so behavior becomes 'spooky action at a distance', or ignoring their per-row cost on large bulk writes.
source: PostgreSQL — CREATE TRIGGER ↗ -
A table's writes are mysteriously slow and some rows change on their own — how do you debug it?
Symptoms like 'an UPDATE touched rows I never wrote', unexplained slow writes, or
stack depth limit exceededalmost always trace back to a trigger.Steps: list the triggers on the table (
\d tablein psql, orinformation_schema.triggers), read the trigger function, noteBEFOREvsAFTERand which events fire it, and look for a trigger that writes back to the same table (recursion) or a per-row trigger running during a bulk operation. AddRAISE NOTICEto trace, and temporarilyALTER TABLE ... DISABLE TRIGGERto isolate the culprit.Follow-ups they push on- How do you stop a trigger from recursively re-firing on its own writes?
- Row-level vs statement-level triggers for a million-row update?
Red flag Debugging the application for hours when an AFTER trigger is the real cause — or disabling a trigger in production to test and forgetting to re-enable it.
source: PostgreSQL — Overview of Trigger Behavior ↗ -
What's the difference between a stored function and a stored procedure in PostgreSQL?
A function returns a value (scalar, row, or set) and is meant to be *called inside* a SQL statement —
SELECT my_fn(x). Because it runs *within* the calling query's transaction, it cannot issueCOMMIT/ROLLBACK.A procedure (added in Postgres 11) is invoked with
CALL my_proc(...), may return nothing, and crucially can manage transactions — it canCOMMIT/ROLLBACKmid-body, which is what makes procedures right for batch jobs that process and commit in chunks. So: need a value inside a query -> function; need explicit transaction control for multi-step/batch work -> procedure.What a strong answer coversFunction: returns a value, called inside SQL (
SELECT f(...)), no transaction control.Procedure: called with
CALL, can COMMIT/ROLLBACK in its body.Procedures (PG 11+) suit batch jobs that commit in chunks; functions suit computed values.
A function runs inside the caller's transaction; it can't open/close one.
Quick self-checkYou need a routine that processes a million rows in batches, committing every 10,000. In Postgres you should write a…
-
Wrong — a function runs inside the caller's transaction and can't COMMIT mid-body.
-
Correct — only a procedure can issue COMMIT/ROLLBACK, enabling chunked batch commits.
-
Wrong — a trigger fires per row/statement on DML events; it's not a batch driver and can't manage the outer transaction.
-
Wrong — a view is a stored query that returns rows; it executes no procedural batch logic.
Follow-ups they push on- Why can a procedure but not a function COMMIT mid-execution?
- What does VOLATILE vs STABLE vs IMMUTABLE tell the planner about a function?
Red flag Trying to COMMIT inside a function (errors), or assuming 'function' and 'procedure' are just two names for the same thing.
source: PostgreSQL docs — CREATE PROCEDURE (transaction control) ↗ -
When do you reach for a BEFORE, AFTER, or INSTEAD OF trigger?
BEFORE fires before the row change and can modify or veto it — use it to validate, normalize/derive a column (set
updated_at, lowercase an email), orRETURN NULLto skip the operation. The row isn't written yet, so you can't see its final generated id.AFTER fires once the change is committed to the row — use it for side effects that depend on the final state: writing an audit/history row, enqueuing a notification, maintaining a summary table. It can see the new id.
INSTEAD OF applies only to views: it replaces the (impossible) direct write with custom logic, which is how you make a complex/multi-table view updatable.
What a strong answer coversBEFORE: validate / mutate / cancel the row before it's written (can RETURN NULL to skip).
AFTER: react to the committed change — audit logs, notifications, summary maintenance.
INSTEAD OF: only on views; substitutes custom DML to make a non-updatable view writable.
BEFORE can't see auto-generated values (id/serial); AFTER can.
Quick self-checkYou want to reject or normalize a value before it's stored. Which trigger timing fits?
-
Wrong — the row is already written; you can't cleanly veto or alter the incoming values then.
-
Correct — BEFORE runs before the write, so it can validate, modify, or RETURN NULL to cancel the operation.
-
Wrong — INSTEAD OF triggers apply to views, not base tables.
-
Wrong — deferring checks the condition at commit; it doesn't let you normalize the value pre-write like BEFORE does.
Follow-ups they push on- Why can't a BEFORE INSERT trigger see the new serial id?
- Row-level vs statement-level triggers — when does each fire?
Red flag Using an AFTER trigger to try to alter the row (too late) or a BEFORE trigger to read the generated primary key (not assigned yet).
source: PostgreSQL docs — Overview of Trigger Behavior (BEFORE/AFTER/INSTEAD OF) ↗ -
When is a materialized view the wrong tool, and what would you use instead?
A materialized view recomputes its *entire* result on
REFRESH— there's no built-in incremental update in core Postgres. So it's the wrong tool when you need near-real-time freshness or the base data is huge and changes constantly: each full refresh is expensive and the data is stale between refreshes.Better alternatives by need: for freshness, maintain a summary/rollup table updated incrementally by triggers or in the write path (
comment_count); for ad-hoc speed without staleness, just add the right indexes to the plain view's query; for genuinely incremental materialization, reach for an external tool or an extension (e.g. continuous aggregates in TimescaleDB). Materialized views fit *expensive, periodically-refreshed reporting* — dashboards that tolerate minutes/hours of lag.What a strong answer coversCore Postgres materialized views refresh in full — no incremental maintenance.
Wrong for near-real-time needs or huge, constantly-changing base data.
Freshness alternative: an incrementally-maintained summary table (triggers / write-path updates).
Speed-without-staleness alternative: index the plain view's underlying query.
Right fit: expensive, periodically-refreshed reporting that tolerates lag.
Follow-ups they push on- How would you keep a comment_count fresh without a materialized view?
- What do TimescaleDB continuous aggregates add over a plain materialized view?
Red flag Using a materialized view for data that must be fresh, then refreshing it constantly and paying a full recompute each time instead of maintaining an incremental summary table.
source: PostgreSQL docs — Materialized Views (refresh is full recompute) ↗ -
How do you prevent a row-level trigger from recursively firing on its own writes?
If an
AFTER UPDATEtrigger on a table issues anotherUPDATEon the same table, that write fires the trigger again — risking infinite recursion and astack depth limit exceedederror.Guards: (1) make the trigger's write a no-op when nothing changed — in a BEFORE trigger,
IF NEW IS DISTINCT FROM OLD THEN … ELSE RETURN NULLstops the cascade once values stabilize; (2) only re-write when a condition flips, so the second pass changes nothing and the chain ends; (3) usepg_trigger_depth()to act only at depth 1; (4) restructure so the trigger updates a *different* table. The cleanest fix is usually a BEFORE trigger that mutatesNEWin place (no second UPDATE needed at all) rather than issuing a recursive write.What a strong answer coversA trigger that writes back to its own table re-fires itself -> potential infinite recursion.
Symptom:
stack depth limit exceeded.Guard with a 'did anything actually change?' check (
NEW IS DISTINCT FROM OLD).Or gate on
pg_trigger_depth(), or update a different table.Best: a BEFORE trigger that edits
NEWdirectly — no recursive UPDATE at all.
Follow-ups they push on- Why does mutating NEW in a BEFORE trigger avoid recursion entirely?
- What does pg_trigger_depth() return and how do you use it?
Red flag Writing an AFTER trigger that UPDATEs the same table unconditionally, causing it to re-fire forever and hit the stack-depth limit.
source: PostgreSQL docs — Trigger Procedures / recursion behavior ↗ -
When should business logic live in stored procedures/functions versus the application?
Pushing logic into the database keeps it close to the data: fewer round trips, atomic multi-statement work, reuse across apps and languages, and often faster set-based processing.
The costs: logic is now split across two codebases, it's harder to version/test/debug, you take on DB-vendor lock-in, and it burns scarce DB CPU that's hard to scale horizontally.
The modern default is to keep business logic in the application and reserve DB routines for data-intensive, set-based, or integrity-critical work where the round-trip or consistency win is real.
Follow-ups they push on- Function vs procedure in Postgres — which can control transactions?
- How would you version-control and test stored procedures?
Red flag Either extreme: cramming all business logic into the DB (unmaintainable, unscalable), or chatty app code looping row-by-row over work that should be one set-based statement.
source: PostgreSQL — CREATE PROCEDURE ↗ -
How do you refresh a materialized view without blocking reads?
A plain
REFRESH MATERIALIZED VIEW mvtakes an exclusive lock and blocks every reader until it finishes. UseREFRESH MATERIALIZED VIEW CONCURRENTLY mvinstead: it rebuilds without blockingSELECTs. The trade-offs are that it requires aUNIQUEindex on the view (so it can diff rows) and it runs slower.Schedule refreshes off-peak (cron /
pg_cron) or kick them off right after the upstream load completes. If you need near-real-time freshness, full refresh is the wrong tool — maintain a trigger-updated summary table instead.Follow-ups they push on- Why does CONCURRENTLY require a unique index?
- When is an incrementally-maintained summary table better than a materialized view?
Red flag Running a plain (non-concurrent) refresh on a hot view during business hours and locking out every reader.
source: PostgreSQL — REFRESH MATERIALIZED VIEW ↗
04 Node.js Internals 86 Q's
4.1 The event loop & async model 16
-
What prints, and in what order? console.log("A"); setTimeout(() => console.log("B"), 0); queueMicrotask(() => console.log("C")); Promise.resolve().then(() => console.log("D")); console.log("E")
A E C D B.Sync code first:
A,E. Then the microtask queue drains before any macrotask.queueMicrotaskandPromise.resolve().thenfeed the same Promise/microtask queue, so they run in registration order:Cwas queued first, thenD. Finally thesetTimeoutmacrotask fires in the timers phase:B.The point:
queueMicrotaskis not a separate higher-priority queue likenextTick— it shares the Promise microtask queue and is the standards-based way to schedule a microtask.What a strong answer coversSync runs to completion first:
A, thenE.queueMicrotaskandPromise.thenshare one microtask queue, drained in FIFO/registration order:CthenD.All microtasks drain before any macrotask, so
setTimeout'sBis last.Unlike
process.nextTick,queueMicrotaskhas no priority over Promise callbacks — same queue.
Quick self-checkWhat is the output order?
-
Correct — sync (A,E), then microtasks in registration order (C,D), then the timer (B).
-
Wrong: C was queued before D, and they share one FIFO queue, so C precedes D.
-
Wrong: the setTimeout macrotask runs after the microtask queue is fully drained, not before.
-
Wrong: E is synchronous and runs before any microtask (C, D).
Follow-ups they push on- Where would a process.nextTick callback land relative to C and D?
- Why prefer queueMicrotask over Promise.resolve().then() for scheduling a microtask?
Red flag Treating queueMicrotask as a separate, higher-priority queue — it shares the Promise microtask queue and runs in registration order.
source: MDN — queueMicrotask ↗ -
What prints, and in what order? console.log("A"); setTimeout(() => console.log("B"), 0); Promise.resolve().then(() => console.log("C")); process.nextTick(() => console.log("D")); console.log("E")
A E D C B.First the synchronous code runs top to bottom:
A, thenE. The other three are deferred. Before the event loop advances to its next phase, Node drains its microtask queues, andprocess.nextTickhas its own queue that runs before the Promise microtask queue, soDthenC. Finally thesetTimeoutcallback fires in the timers phase:B.The rule to memorize: nextTick > Promise microtasks > macrotasks (timers/immediate/I/O).
Follow-ups they push on- Why does process.nextTick run before the Promise callback even though it was scheduled later in the code?
- What happens if a nextTick callback schedules another nextTick — can it starve the loop?
Red flag Saying the order follows the source-code order, or putting `C` (Promise) before `D` (nextTick).
source: Node.js docs — Event loop, timers, and nextTick ↗ -
What prints? for (let i = 0; i < 3; i++) { setTimeout(() => console.log(i), 0); } for (var j = 0; j < 3; j++) { setTimeout(() => console.log(j), 0); }
0 1 2 3 3 3.The first loop uses
let, which is block-scoped: each iteration gets a fresh binding ofi, so the three closures capture0,1,2respectively. The second loop usesvar, which is function-scoped: all three closures capture the *same*j, and by the time the timers fire (after the synchronous loops finish)jis already3— so it prints3three times.This is the classic closures-in-a-loop trap. The timers all queue with delay 0 and fire in order after the synchronous code completes.
What a strong answer coversletis block-scoped → a fresh binding per iteration → captures 0, 1, 2.varis function-scoped → one shared binding → all closures see the final value 3.All callbacks are deferred (setTimeout), so they read the variable after the loop finishes.
Fix for var: an IIFE per iteration, or just use
let.
Quick self-checkWhat is the output?
-
Correct — let gives per-iteration bindings (0,1,2); var shares one binding, which is 3 by the time the timers fire.
-
Wrong: the var loop's closures all reference the same j, which has reached 3.
-
Wrong: the let loop creates a fresh binding each iteration, so it prints 0,1,2.
-
Wrong: after the var loop completes, j is 3 (the condition failed at 3), not 2.
Follow-ups they push on- How would you make the var loop print 0 1 2 without changing var to let?
- Would using Promise.resolve().then instead of setTimeout change the captured values?
Red flag Expecting both loops to print 0 1 2 — the var loop captures one shared, function-scoped binding.
source: Lydia Hallie — javascript-questions ↗ -
At the top level of a module: setTimeout(() => console.log("timeout"), 0); setImmediate(() => console.log("immediate")). Which logs first?
It is not guaranteed — the order is non-deterministic at the top level.
setTimeout(0)is clamped to a 1ms timer, so whether the timers phase or the check phase reaches its callback first depends on how long process setup took. Run it twice and you may see different orders.The twist interviewers want: move both into an I/O callback, e.g. inside
fs.readFile(...), andsetImmediatealways wins. After an I/O (poll-phase) callback, the loop goes straight to the check phase (setImmediate) before looping back to timers.Follow-ups they push on- Why does setImmediate become deterministic once you nest both inside an fs.readFile callback?
- Where in the phase order do timers and check sit relative to the poll phase?
Red flag Claiming setImmediate or setTimeout always wins at the top level — the whole point is that it is non-deterministic there.
source: Node.js docs — setImmediate vs setTimeout ↗ -
Name the phases of the Node.js event loop in order, and say what runs in each.
Six phases, run in this order each iteration ("tick"):
1. timers —
setTimeout/setIntervalcallbacks whose threshold has elapsed.
2. pending callbacks — a few deferred system/OS callbacks (e.g. some TCP errors).
3. idle, prepare — internal to libuv; you never schedule here.
4. poll — retrieve new I/O events and run their callbacks; the loop may block here waiting for I/O.
5. check —setImmediatecallbacks.
6. close callbacks — e.g.socket.on("close", ...).Between every callback (and between phases) Node drains the microtask queues: the
process.nextTickqueue first, then the Promise/queueMicrotaskqueue.Follow-ups they push on- In which phase does the loop actually block waiting for work?
- Are microtasks a phase of the loop? (No — they drain between callbacks.)
Red flag Listing microtasks (Promises) as an event-loop phase — they are not; they run between phases.
source: Node.js docs — Event loop, timers, and nextTick ↗ -
Node.js is "single-threaded," yet it handles thousands of concurrent connections. How? Where do background threads come from?
There is one JavaScript thread that runs all your code on the event loop. Concurrency comes from not waiting: when you do I/O (network, disk, DNS), Node hands the work to the OS or to libuv and registers a callback, then immediately moves on. When the I/O completes, its callback is queued and runs later on the JS thread.
Most network I/O uses the OS's async primitives directly (epoll/kqueue/IOCP) — no extra thread. A few things that lack an async OS API run on libuv's thread pool (default size 4,
UV_THREADPOOL_SIZE): file-system ops, DNSlookup, and somecrypto/zlibwork.So: one thread for JS, the OS + a small libuv pool for the blocking bits.
Follow-ups they push on- Which built-in operations actually use the libuv thread pool?
- What is UV_THREADPOOL_SIZE and when would you raise it?
Red flag Saying every async operation spawns a thread, or that the thread pool handles network sockets (it usually does not).
source: Node.js docs — Don't block the event loop ↗ -
What is the difference between process.nextTick() and setImmediate(), despite the confusing names?
The names are backwards from what you would guess.
-
process.nextTick(cb)runscbbefore the event loop continues — as soon as the current operation finishes, before returning to the loop. It is a microtask, higher priority than Promises. "Next tick" here means "before the next loop phase," i.e. almost immediately.
-setImmediate(cb)schedulescbfor the check phase of the *next* loop iteration. Despite "immediate," it is later thannextTick.Node docs themselves recommend
setImmediatefor most cases because it is easier to reason about and cannot starve the loop the way recursivenextTickcan.Follow-ups they push on- Why can recursive process.nextTick starve I/O but recursive setImmediate cannot?
- Which one would you use to defer work to 'after this function returns but before any I/O'?
Red flag Assuming setImmediate runs before nextTick because of the name — it is the opposite.
source: Node.js docs — Understanding setImmediate() ↗ -
A request handler runs a synchronous for-loop summing 1 to 10 billion. What happens to every other in-flight request, and why?
Every other request stalls until the loop finishes. There is one JS thread, and a synchronous CPU-bound loop never yields to the event loop — no timers fire, no I/O callbacks run, no new connections are accepted. The whole server appears frozen.
Fixes, in order of preference:
1. Offload the CPU work to a Worker thread (or a child process / external service).
2. Chunk the work and yield between chunks withsetImmediateso the loop can service I/O.
3. Push it out of the request path entirely (a job queue).The mental model: async I/O is free concurrency, but CPU work is not — it must be moved off the main thread.
Follow-ups they push on- How would you detect event-loop blocking in production? (event-loop lag / monitoring.)
- Why is Worker threads better than just adding more setTimeout calls here?
Red flag Thinking async/await or wrapping the loop in a Promise makes synchronous CPU work non-blocking — it does not.
source: Node.js docs — Don't block the event loop ↗ -
What prints? async function f() { console.log(1); await null; console.log(2); } console.log(3); f(); console.log(4)
3 1 4 2.console.log(3)runs. Thenf()is *called* and runs synchronously up to theawait: it logs1. Atawait null, the function suspends and its continuation (console.log(2)) is scheduled as a microtask; control returns to the caller, which logs4. The synchronous stack is now empty, so the microtask queue drains:2.The insight: code before the first
awaitruns synchronously; everything afterawaitis a microtask, even when you await an already-resolved value likenull.Follow-ups they push on- Does it matter that we awaited `null` instead of a real Promise? (No — await always yields.)
- Where would a process.nextTick scheduled in main code land relative to console.log(2)?
Red flag Treating the body after `await` as still synchronous and printing `1 2` together.
source: Lydia Hallie — JavaScript Visualized: Promises & Async/Await ↗ -
What is UV_THREADPOOL_SIZE, what is its default, and what symptom tells you it's too small?
UV_THREADPOOL_SIZEis the environment variable that sets the size of libuv's thread pool, which backs the handful of operations that lack an async OS API: file-system I/O, DNSlookup, and somecrypto/zlibwork. The default is 4.The symptom of it being too small: those specific operations start queuing behind each other even though the CPU is idle and the event loop is free. For example, fire 5 concurrent
crypto.pbkdf2calls with a pool of 4 and the 5th does not start until one of the first four finishes — added latency that looks mysterious because nothing is "blocked."Raise it (e.g.
UV_THREADPOOL_SIZE=8) when you do heavy concurrent fs/crypto work, but it must be set before the pool is created (at process start).What a strong answer coversSets libuv's thread pool size; default 4.
Backs fs I/O,
dns.lookup, and somecrypto/zlib— not network sockets (those use the OS directly).Symptom of too-small: those ops serialize/queue while CPU and event loop sit idle.
Must be set at process startup — changing it after the pool spins up has no effect.
Follow-ups they push on- Why doesn't raising UV_THREADPOOL_SIZE help an HTTP server doing pure network I/O?
- How would you tell pool saturation apart from event-loop blocking?
Red flag Raising the pool size to fix latency on network I/O — sockets don't use the pool, so it does nothing.
source: Node.js docs — UV_THREADPOOL_SIZE ↗ -
What prints? const fs = require("fs"); fs.readFile(__filename, () => { setTimeout(() => console.log("timeout"), 0); setImmediate(() => console.log("immediate")); });
immediatethentimeout— deterministically, every run.The
readFilecallback runs in the poll phase. From the poll phase the loop advances next to the check phase, wheresetImmediatecallbacks live — soimmediatefires first. Only after wrapping back around to the timers phase does thesetTimeout(0)callback run:timeout.This is the famous twist: at the top level
setTimeout(0)vssetImmediateordering is non-deterministic, but inside an I/O callbacksetImmediatealways wins because check immediately follows poll.What a strong answer coversThe I/O callback runs in the poll phase; the loop goes poll → check → (wrap) → timers.
check (setImmediate) comes right after poll, so
immediateruns beforetimeout.This ordering is deterministic inside an I/O callback (unlike at the top level).
It demonstrates the phase order, not a race —
setImmediatereliably beatssetTimeout(0)here.
Quick self-checkWhat prints, and is it deterministic?
-
Correct — poll is followed by check, so setImmediate runs before the loop wraps back to timers.
-
Wrong: timers come after check when entering from the poll phase, so setImmediate wins.
-
Wrong: that's only true at the top level; inside an I/O callback the phase order makes it deterministic.
-
Wrong: both callbacks run; the timer fires on the next loop iteration.
Follow-ups they push on- Why is the same pair non-deterministic at the top level of the module?
- Where does a process.nextTick scheduled inside the readFile callback run relative to these two?
Red flag Saying setTimeout wins or that it's non-deterministic — inside an I/O callback, setImmediate is guaranteed first.
source: Node.js docs — setImmediate() vs setTimeout() ↗ -
Can recursive process.nextTick() starve the event loop? Contrast with recursive setImmediate().
Yes — recursive
process.nextTickcan starve the loop. The nextTick queue is drained completely between phases, and a callback that schedules another nextTick keeps re-filling that queue, so the loop never advances to timers, poll, or I/O. Your server stops accepting connections and firing timers while the CPU spins on nextTicks.Recursive
setImmediatedoes not starve I/O.setImmediatecallbacks run in the check phase, and each loop iteration runs the immediates queued *before* this iteration started — newly-scheduled ones wait for the *next* iteration. So the loop still visits the poll phase between iterations and services I/O.This is exactly why Node's docs recommend
setImmediateovernextTickfor deferring work in most cases.What a strong answer coversnextTick queue drains fully between phases; recursive nextTick re-fills it and blocks the loop from advancing.
Recursive setImmediate yields each iteration — newly-queued immediates wait for the next tick, so I/O still runs.
Starvation symptom: timers don't fire and new connections aren't accepted while CPU is busy.
Docs recommend setImmediate for deferral precisely because it can't starve the loop.
Follow-ups they push on- Why does a newly-scheduled setImmediate wait for the next loop iteration but a newly-scheduled nextTick does not?
- When is process.nextTick still the right tool despite the starvation risk?
Red flag Using recursive nextTick for chunked work — it can lock out all I/O; use setImmediate to chunk safely.
source: Node.js docs — process.nextTick() ↗ -
How does the event loop in Node differ from the one in the browser? Name two concrete differences.
They share the core idea — a single JS thread, a macrotask queue, and a microtask queue drained between tasks — but differ in details:
1. Extra microtask queue: Node has
process.nextTick, which runs before the Promise microtask queue. The browser has only the Promise/queueMicrotaskqueue.
2. Phases andsetImmediate: Node's loop is libuv's multi-phase loop (timers, poll, check, …) and exposessetImmediate(the check phase). The browser has nosetImmediate; its closest analog is task scheduling viasetTimeout/messaging, and rendering steps (style/layout/paint,requestAnimationFrame) are interleaved into its loop — Node has no rendering.So: Node = libuv phases +
nextTick+setImmediate, no rendering; browser = task/microtask + a render step, nonextTick/setImmediate.What a strong answer coversNode has two microtask queues (nextTick before Promises); the browser has only the Promise queue.
Node's loop has libuv phases and
setImmediate; the browser has neither.The browser interleaves rendering (rAF, style/layout/paint); Node has no render step.
Both: single JS thread, microtasks drain to empty between macrotasks.
Follow-ups they push on- What's the browser's closest equivalent to setImmediate?
- Where does requestAnimationFrame sit relative to microtasks in the browser?
Red flag Assuming setImmediate or process.nextTick exist in the browser, or that the two loops are identical.
source: MDN — The event loop ↗ -
What is 'event-loop lag' (event-loop delay), why does it matter, and how do you measure it?
Event-loop lag is the extra time between when a callback (e.g. a timer) was *supposed* to run and when it *actually* runs. A timer set for 0ms that fires 80ms late means the loop spent ~80ms busy elsewhere — almost always a synchronous, CPU-bound task blocking the single thread.
It matters because it is the single best health signal for a Node service: high lag means requests are queuing and latency is spiking for *everyone*, even if CPU and memory look fine. It is the symptom of "don't block the event loop."
Measure it precisely with the built-in
perf_hooks.monitorEventLoopDelay()histogram (min/max/percentiles), or the crude classic: a recurringsetIntervalthat records how far past its scheduled time it fires.What a strong answer coversLag = actual minus scheduled callback time; reflects how long the loop was busy.
High lag almost always means synchronous CPU work blocking the one JS thread.
It's a leading indicator of latency for all requests, not just one.
Measure with
perf_hooks.monitorEventLoopDelay()(a histogram) or a self-timing setInterval.
Follow-ups they push on- What's a healthy lag threshold for an HTTP service, and what would you alert on?
- How does monitorEventLoopDelay differ from just timing a setInterval?
Red flag Diagnosing latency with CPU/memory only — a blocked loop can show low CPU yet high lag and timeouts.
source: Node.js docs — perf_hooks.monitorEventLoopDelay ↗ -
What prints? console.log("start"); setTimeout(() => console.log("timeout"), 0); Promise.resolve().then(() => { console.log("promise1"); process.nextTick(() => console.log("nextTick-in-promise")); }); process.nextTick(() => console.log("nextTick")); console.log("end")
start end nextTick promise1 nextTick-in-promise timeout.Sync first:
start,end. Then the microtask drain begins. The nextTick queue runs to completion first:nextTick. Then the Promise queue:promise1— which itself schedules a new nextTick. The drain is exhaustive: after the Promise queue, Node re-checks the nextTick queue and findsnextTick-in-promise, running it before leaving the microtask phase. Only once both microtask queues are empty does the loop reach timers:timeout.Key idea: microtasks added while draining are processed in the same drain, before any macrotask.
Follow-ups they push on- Could this pattern (nextTick scheduling nextTick) starve the timers phase indefinitely?
- Where does queueMicrotask sit relative to process.nextTick?
Red flag Running `timeout` before `nextTick-in-promise` — newly-queued microtasks still drain before any timer.
source: Node.js docs — Event loop, timers, and nextTick ↗ -
Are microtasks (Promise callbacks) part of the event loop's phases? When exactly do they run?
No — microtasks are not one of the libuv phases. There are two microtask queues (the
process.nextTickqueue, then the Promise/queueMicrotaskqueue) that Node drains completely between every callback and at each phase boundary.Concretely: run one callback from a phase, then fully drain nextTick, then fully drain Promises, then run the next callback. Because the drain is exhaustive, a flood of microtasks (or recursive
nextTick) can delay the loop from ever reaching the next macrotask — a real starvation risk.In the browser the model is similar but there is only the Promise microtask queue (no
nextTick).Follow-ups they push on- How does this differ between Node and the browser?
- What is queueMicrotask and why prefer it over Promise.resolve().then for scheduling?
Red flag Describing microtasks as 'the last phase' of the loop — they interleave between callbacks, not at the end.
source: Node.js docs — Event loop, timers, and nextTick ↗
4.2 Async evolution & error handling 14
-
What does Node do by default when a promise rejects with no handler? Has this changed across versions?
In current Node (the
--unhandled-rejections=throwdefault since v15), an unhandled rejection is treated like an uncaught exception: Node prints the error and terminates the process with a non-zero exit code.This was a deliberate hardening. Older Node (≤ v14) only logged an
UnhandledPromiseRejectionWarningand kept running — which let silent, half-broken state accumulate. The change forces you to handle rejections.You can still observe them via the
process.on("unhandledRejection", ...)event (log/flush before exit), or override the mode with--unhandled-rejections=warn, but the right fix is toawait/.catchthe promise. Treat a crash here as a real bug, not noise.What a strong answer coversCurrent default (
throw, since v15): an unhandled rejection crashes the process with a non-zero code.Node ≤ v14 only logged a warning and kept running — the old, dangerous behavior.
Hook
process.on('unhandledRejection')to log/flush, but exit; don't swallow.The real fix is upstream:
await,return, or.catchthe promise.
Quick self-checkBy default in current Node, an unhandled promise rejection will:
-
Wrong: that was the pre-v15 behavior; it's no longer the default.
-
Correct — the default mode is `throw`, treating it like an uncaught exception.
-
Wrong: Node never silently ignores rejections; at minimum it warns, and now it crashes.
-
Wrong: Node has no retry mechanism for rejected promises.
Follow-ups they push on- Why was 'log and keep running' considered dangerous enough to change the default?
- What's the difference between the unhandledRejection and rejectionHandled events?
Red flag Assuming an unhandled rejection just logs a warning — in modern Node it terminates the process.
source: Node.js docs — --unhandled-rejections=mode ↗ -
Trace the evolution callbacks → Promises → async/await. What problem did each step solve?
Callbacks: the original async primitive — pass a
function(err, result). The error-first convention is the norm, but nesting dependent async steps creates the deeply-indented "callback hell" / pyramid of doom, and error handling is manual at every level.Promises (ES2015): a first-class object representing a future value with
.then/.catch. They flatten nesting into chains and give one.catchfor the whole chain. Composition helpers:Promise.all,race,allSettled,any.async/await (ES2017): syntactic sugar over Promises.
awaitlets you write asynchronous code that *reads* synchronously, and ordinarytry/catchhandles errors. Under the hood it is still Promises and microtasks.Follow-ups they push on- Is async/await just Promises under the hood? (Yes.)
- When would you still reach for raw Promise combinators over await?
Red flag Claiming async/await makes code run on a background thread — it is the same single-threaded microtask machinery.
source: MDN — Asynchronous JavaScript ↗ -
What prints, and does the program crash? Promise.reject(new Error("boom")).catch(() => console.log("caught")); console.log("sync")
syncthencaught, and it does not crash.The
.catchis attached synchronously, in the same expression — so the rejection has a handler from the start; it's never "unhandled." The handler runs as a microtask, after the synchronousconsole.log("sync"). So order issync, thencaught.Contrast with
const p = Promise.reject(...); ... attach .catch later: as long as the handler is attached within the same tick, Node still treats it as handled. The danger is a rejection that reaches the end of a tick with *no* handler attached — that's what firesunhandledRejection.What a strong answer covers.catchis attached in the same expression, so the rejection is handled — no crash.The catch handler runs as a microtask, after synchronous code:
syncthencaught.A rejection is 'unhandled' only if no handler is attached by the end of the tick.
Attaching
.catcheven a few lines later (same tick) still counts as handled.
Quick self-checkWhat is the output, and does it crash?
-
Wrong order: the .catch handler is a microtask, so it runs after synchronous `sync`.
-
Correct — sync runs first, the catch microtask runs after, and the rejection is handled.
-
Wrong: a .catch is attached, so the rejection is handled and there's no crash.
-
Wrong: the synchronous console.log('sync') always runs.
Follow-ups they push on- What would change if you removed the .catch entirely?
- Does attaching .catch in a later setTimeout still prevent unhandledRejection?
Red flag Thinking a synchronously-caught rejection crashes — it's handled, and the handler is just a microtask.
source: MDN — Promise.prototype.catch ↗ -
Why is `[1,2,3].forEach(async (x) => { await save(x); })` a trap? What happens to errors and ordering?
forEachignores the return value of its callback. Your callback returns a Promise, butforEachdiscards it — so nothingawaits the saves. The result:- No waiting: code after the
forEachruns *before* anysavefinishes; you can't sequence anything after it.
- Lost errors: each callback's promise floats; a rejection becomes anunhandledRejectionrather than something you can catch.
- No ordering guarantee relative to the surrounding code.Use
for...ofwithawaitfor sequential, orawait Promise.all(arr.map(fn))for concurrent. Both actually wait and let errors propagate.``
``
for (const x of [1,2,3]) await save(x); // sequential
await Promise.all([1,2,3].map((x) => save(x))); // concurrentWhat a strong answer coversforEachdiscards the callback's returned Promise — the awaits are never awaited by the caller.Code after the forEach runs before the saves complete (no sequencing).
Rejections float → unhandledRejection, not catchable at the call site.
Use
for...of+ await (sequential) orPromise.all(map(...))(concurrent).
Follow-ups they push on- Which replacement gives sequential vs concurrent execution?
- Do .map and .filter have the same async problem as forEach?
Red flag Passing an async function to forEach and assuming the loop waits — it doesn't, and errors are lost.
source: MDN — Array.prototype.forEach (Caveats / async) ↗ -
What is a "floating promise," and why is it dangerous? Show a version of fetchUser() that silently loses errors.
A floating promise is a Promise you create but never
await,return, or attach.catchto. If it rejects, the rejection is unhandled — the error vanishes (and in modern Node, crashes the process).``
`
function handler(req, res) {
saveToDb(req.body); // floating — no await, no .catch
res.send("ok"); // responds 200 even if the DB write throws
}The client gets 200 OK
while the write may have failed silently. Fixes:await saveToDb(...)(and wrap in try/catch), orreturnit, or attach.catch. Lint rules like@typescript-eslint/no-floating-promises` catch these.Follow-ups they push on- What does Node do by default on an unhandled rejection in current LTS?
- How does the `no-floating-promises` lint rule help?
Red flag Assuming an un-awaited async call's errors will surface somewhere — they are lost unless explicitly handled.
source: Node.js docs — process 'unhandledRejection' ↗ -
Why doesn't this try/catch catch the error? try { setTimeout(() => { throw new Error("boom"); }, 0); } catch (e) { console.log("caught"); }
It does not catch anything — the program crashes with an uncaught exception.
try/catchonly guards the synchronous execution of its block. By the time thesetTimeoutcallback actually runs (a later event-loop tick), thetryblock has long since returned and its stack frame is gone. The thrown error has no surrounding catch, so it becomes anuncaughtException.To handle it, the try/catch must live inside the async callback, or use a Promise and
.catch/await:``
``
setTimeout(() => {
try { throw new Error("boom"); } catch (e) { console.log("caught"); }
}, 0);Follow-ups they push on- Why does try/catch around an `await`ed Promise work, but not around a bare callback?
- What is the last-resort safety net for uncaught exceptions, and why shouldn't you keep running after one?
Red flag Believing a synchronous try/catch can catch errors thrown from a later callback.
source: Node.js docs — process 'uncaughtException' ↗ -
Compare Promise.all, Promise.allSettled, Promise.race, and Promise.any. When would you pick each?
-
Promise.all— resolves with an array of all results; rejects on the first rejection (fail-fast). Use when you need *every* task to succeed (e.g. fan-out queries that all must return).
-Promise.allSettled— never rejects; resolves with{status, value|reason}for each. Use when you want *all* results regardless of individual failures (e.g. notify N services, report which failed).
-Promise.race— settles (resolve or reject) as soon as the first promise settles. Use for timeouts: race the work against a timer.
-Promise.any— resolves with the first fulfilled value; rejects only if *all* reject (with anAggregateError). Use for redundancy: first successful mirror/replica wins.Follow-ups they push on- How do you implement a timeout with Promise.race?
- With Promise.all, do the other promises stop running when one rejects? (No — they keep going.)
Red flag Confusing `race` (first to settle, including rejection) with `any` (first to fulfill), or assuming `all` cancels siblings on rejection.
source: MDN — Promise.allSettled ↗ -
What prints, and how long does it take? const a = await slow(1000); const b = await slow(1000); — vs — const [a, b] = await Promise.all([slow(1000), slow(1000)])
The sequential version takes ~2000ms; the
Promise.allversion takes ~1000ms.In the first snippet each
await*pauses* until that promise settles before the next call even starts — the twoslow(1000)calls run back-to-back. In the second, bothslow(1000)calls are invoked first (kicking off concurrently), andawait Promise.allwaits for both — so they overlap.The lesson:
awaitin a sequence serializes independent work. If tasks do not depend on each other, start them together and await the aggregate.Follow-ups they push on- How would you write this so b's input depends on a's result? (Then sequential is correct.)
- What's the bug in `for (const url of urls) await fetch(url)` when order doesn't matter?
Red flag Awaiting independent operations one-by-one in a loop, turning parallelizable work into serial latency.
source: MDN — Using Promises ↗ -
How do you handle errors in async/await code, and what's the difference between unhandledRejection and uncaughtException?
Within an async function, wrap awaited calls in
try/catch; thecatchreceives whatever the awaited promise rejected with. For fire-and-forget chains, attach.catch. At the boundary (e.g. an Express route), funnel errors to a central error handler.The two process-level events:
-
unhandledRejection— a Promise rejected with no handler. Usually a bug (a floating promise). In current Node it terminates the process by default.
-uncaughtException— a synchronous (or callback) error bubbled to the top with no try/catch.Both should be treated as last-resort: log, flush, and exit. The process is in an unknown state, so do not silently continue serving traffic.
Follow-ups they push on- Why is it unsafe to keep the process alive after an uncaughtException?
- Where should the single catch-all error handler live in an Express app?
Red flag Using process.on('uncaughtException') to swallow errors and keep running — that hides corruption and leaks.
source: Node.js docs — process events ↗ -
What does util.promisify do, and why is the error-first callback convention what makes it possible?
util.promisify(fn)wraps a function that follows Node's error-first callback convention —fn(...args, (err, result) => ...)— and returns a version that returns a Promise instead. The promise rejects witherrif it's truthy, otherwise resolves withresult.It works *only* because the callback shape is standardized: error first, single result second.
promisifyknows exactly where the error and value are, so it can mechanically translate callback → Promise. Functions with a different callback shape (multiple results, or callback-first) needpromisify.customor manual wrapping.In practice you reach for it less now because most core modules ship promise variants (
fs.promises,dns.promises,timers/promises), but it's still the bridge for legacy callback APIs.What a strong answer coversConverts an error-first callback API into one that returns a Promise.
Rejects on truthy
err, resolves on the singleresult— exactly the error-first shape.Non-standard callback shapes need
util.promisify.custom.Often unnecessary today:
fs.promises,dns.promises,timers/promisesexist.
Follow-ups they push on- How would you promisify a callback that returns multiple result values?
- When would you still use util.promisify instead of the promise-native API?
Red flag Promisifying a function whose callback isn't error-first (or callback-first) — the wrapper resolves/rejects wrongly.
source: Node.js docs — util.promisify ↗ -
Why is it unsafe to keep the process alive after an 'uncaughtException'? What's the correct response?
An
uncaughtExceptionmeans an error escaped all try/catch and bubbled to the top. At that point you have no idea what state the program is in — a half-finished write, a held lock, a corrupted in-memory structure, a leaked connection. Continuing to serve traffic on top of that corruption risks silently wrong results and resource leaks.Node's own docs are explicit: the handler is for synchronous cleanup, not for resuming normal operation. The correct pattern is to log the error, flush logs/metrics, release critical resources, and exit with a non-zero code — then let your process manager (systemd, Kubernetes, PM2) restart a fresh, clean process.
For graceful handling of *expected* errors, catch them where they occur;
uncaughtExceptionis the last-resort net, not a control-flow mechanism.What a strong answer coversAfter uncaughtException the process state is unknown/corrupt — locks, writes, structures may be half-done.
Node docs: the handler is for sync cleanup, not for resuming work.
Correct response: log, flush, release resources, exit non-zero; let a supervisor restart.
Use it as a last-resort net; handle expected errors at their source.
Follow-ups they push on- What process-level supervisor would restart the exited process in a container?
- How does the domain module / AsyncLocalStorage relate to error isolation?
Red flag Using process.on('uncaughtException') to swallow and continue — it masks corruption and leaks.
source: Node.js docs — Warning: Using 'uncaughtException' correctly ↗ -
What does async/await actually compile to, and why does that mean two awaits in a row are slower than Promise.all?
await expris syntactic sugar for taking the promiseexprresolves to and suspending the function until it settles, scheduling the continuation as a microtask — roughlyPromise.resolve(expr).then(continuation). The function literally pauses at eachawaitand resumes only after that promise settles.So
const a = await f(); const b = await g();cannot startg()untilf()has fully settled — they are serialized, total time ≈ time(f) + time(g). WithPromise.all([f(), g()]), bothf()andg()are *invoked synchronously first* (kicking off concurrently), and you await the aggregate — total ≈ max(f, g).The mental model:
awaitis a pause point, not a parallelizer. Start independent work before you await it.What a strong answer coversawait≈ pause the function and resume its continuation as a microtask once the promise settles.Sequential awaits serialize: each starts only after the previous settles.
Promise.allinvokes all the calls first, then awaits the aggregate → overlap.Use sequential awaits only when later work *depends* on the earlier result.
Follow-ups they push on- How would you start two awaits concurrently without Promise.all? (Assign the promises first, await later.)
- Does code before the first await run synchronously? (Yes.)
Red flag Treating await as 'fire concurrently' — it's a suspension point; independent awaits run one after another.
source: MDN — await ↗ -
How do you add a timeout to an async operation that has no built-in timeout, and what's the catch with AbortController?
The classic pattern is
Promise.racebetween the work and a timer that rejects:``
`
function withTimeout(p, ms) {
return Promise.race([
p,
new Promise((_, rej) =>
setTimeout(() => rej(new Error("timeout")), ms)),
]);
}The catch: Promise.race
only stops waiting — it does not cancel the underlying work. The original promise keeps running (the request still completes, the socket stays open), and the leftoversetTimeoutkeeps the loop alive unless youclearTimeoutit. So you can leak timers and in-flight requests.The better tool when the API supports it is AbortController
: pass controller.signaltofetch/streams/etc. and callcontroller.abort()on timeout to actually cancel the work and release resources.AbortSignal.timeout(ms)` is the built-in shorthand. The catch with AbortController: it only works if the callee honors the signal — it can't cancel arbitrary code that ignores it.What a strong answer coversPromise.race([work, timeoutReject])is the standard timeout pattern.race stops *waiting* but does not cancel the underlying work — it keeps running.
Clear the timer (
clearTimeout) or it can keep the event loop alive / leak.Prefer
AbortController/AbortSignal.timeout(ms)to truly cancel — but only if the callee honors the signal.
Follow-ups they push on- Why does the work keep running after Promise.race rejects on timeout?
- What does AbortController give you that Promise.race can't?
Red flag Assuming Promise.race cancels the slow operation — it only stops awaiting it; the work and timer can leak.
source: MDN — AbortController ↗ -
You have an array of IDs and want to fetch each, but the upstream API rate-limits you. Why is `await Promise.all(ids.map(fetchOne))` risky, and what's a better pattern?
Promise.all(ids.map(fetchOne))fires all requests at once. With thousands of IDs you can exhaust sockets, blow memory, and trip the upstream rate limit — every request fails together.Better: bound the concurrency. Process in fixed-size batches, or use a concurrency-limiter (e.g.
p-limit) so at most N run at a time:``
`
const limit = pLimit(5);
const results = await Promise.all(
ids.map((id) => limit(() => fetchOne(id)))
);This keeps Promise.all
's aggregate semantics while capping in-flight requests at 5. For pure sequential needs, a plainfor...ofwithawait` works but is slow.Follow-ups they push on- How would you also add retry-with-backoff for the rate-limit 429s?
- Why is allSettled sometimes better than all here?
Red flag Unbounded Promise.all over a large array — it looks elegant but is a classic source of overload and rate-limit failures.
source: MDN — Promise.all ↗
4.3 Streams & buffers 14
-
Name the four stream types in Node and give a concrete example of each.
- Readable — you read data out of it. Example:
fs.createReadStream(file), an incoming HTTP request (req).
- Writable — you write data into it. Example:fs.createWriteStream(file), an HTTP response (res),process.stdout.
- Duplex — readable *and* writable, two independent channels. Example: a TCP socket (net.Socket).
- Transform — a Duplex where the output is a function of the input. Example:zlib.createGzip(), a crypto cipher, or a custom parser.The value of streams: process data in chunks as it arrives instead of buffering the whole payload in memory.
Follow-ups they push on- How is a Transform stream different from a plain Duplex?
- Which stream type is an HTTP request, and which is the response?
Red flag Saying Duplex and Transform are the same — Transform's output is derived from its input; a Duplex's two sides are unrelated.
source: Node.js docs — How to use streams ↗ -
Sketch a custom Transform stream that uppercases text. What are the _transform and _flush methods for?
Subclass
Transform(or pass atransformoption) and implement_transform(chunk, encoding, callback): process each incoming chunk,pushany output, and callcallback()to signal you're ready for the next chunk (orcallback(err)to error the stream).``
`
import { Transform } from "node:stream";
const upper = new Transform({
transform(chunk, _enc, cb) {
this.push(chunk.toString().toUpperCase());
cb();
},
});_flush(callback)
is optional and runs once, after the last chunk but before the stream ends — use it to emit any buffered/trailing data (e.g. the final piece of a line-splitter that has a partial line left over). _transformis per-chunk;_flush` is the one-time finalizer.What a strong answer covers_transform(chunk, enc, cb)runs per chunk: process,this.push(...), thencb().Call
cb(err)to propagate errors; calling cb signals readiness for the next chunk (backpressure-aware)._flush(cb)runs once after the last chunk to emit any buffered/trailing output.Pass
{ transform, flush }options or subclass — both work.
Follow-ups they push on- When is _flush essential? (Buffered/partial data like the last incomplete line.)
- How does calling the callback relate to backpressure on the readable side?
Red flag Forgetting to call the _transform callback — the stream stalls because it never asks for the next chunk.
source: Node.js docs — Implementing a Transform stream ↗ -
An Express handler does `fs.readFile(bigFile, (e, data) => res.send(data))` and the server OOMs under load. What's the streaming fix?
fs.readFilebuffers the entire file into memory before sending. Under concurrency, N simultaneous requests for a big file means N full copies in RAM at once — the heap balloons and the process OOMs.The fix is to stream the file straight to the response, so only small chunks are in memory and backpressure throttles reads to the client's download speed:
``
`
import { pipeline } from "node:stream/promises";
await pipeline(fs.createReadStream(bigFile), res);pipeline
wires backpressure (a slow client pauses the file read) and cleans up/propagates errors. Memory stays ~highWaterMark-sized per request, independent of file size. (Frameworks expose this asres.sendFile/reply.send(stream)`, which stream under the hood.)What a strong answer coversfs.readFileloads the whole file into RAM; N concurrent requests = N full copies → OOM.Streaming sends chunks, so per-request memory ≈ highWaterMark regardless of file size.
Backpressure throttles disk reads to the client's download rate.
Use
pipeline(createReadStream, res)(orres.sendFile) for error handling + cleanup.
Follow-ups they push on- Why does pipeline matter here over a bare .pipe to res?
- What does a slow client do to a streamed response vs a buffered one?
Red flag Buffering whole files with readFile in a request handler — fine in dev, OOMs under concurrent load.
source: Node.js docs — How to use streams ↗ -
You must read a 10GB file, transform each line, and write the result — on a box with 512MB RAM. How?
Stream it; never load the whole file. Build a pipeline of a Readable → Transform → Writable so only small chunks are in memory at any moment, with backpressure keeping the buffers bounded:
``
`
import { pipeline } from "node:stream/promises";
await pipeline(
fs.createReadStream("in"),
someLineTransform,
fs.createWriteStream("out")
);pipeline
wires backpressure (the read pauses when the write is slow) and — crucially — propagates errors and cleans up every stream (destroying them) if any stage fails. Memory stays ~highWaterMark-sized, independent of the 10GB total.fs.readFile` would try to allocate 10GB and crash.Follow-ups they push on- Why prefer pipeline() over chaining .pipe()? (Error handling + cleanup.)
- How would you split the stream into lines before the transform?
Red flag Reaching for fs.readFile / reading into one big Buffer — it cannot fit and OOMs the process.
source: Node.js docs — stream.pipeline ↗ -
What is a Buffer, and why does Node need it when JavaScript already has strings and arrays?
A Buffer is a fixed-length chunk of raw binary memory outside the V8 heap — Node's way of handling bytes (files, TCP packets, images, crypto) that pre-date
TypedArrayin the language. It is a subclass ofUint8Array.JavaScript strings are UTF-16 text, not bytes; a regular array is boxed and heap-heavy. Binary protocols, file contents, and network frames are sequences of bytes — Buffer gives you direct, efficient access to them and lets you control the encoding when converting to/from strings (
buf.toString("utf8"),Buffer.from(str, "base64")).Gotcha: a multi-byte UTF-8 character can be split across two chunks; decode with
StringDecoderor accumulate beforetoString.Follow-ups they push on- What goes wrong if you call buf.toString() on a chunk that splits a multi-byte character?
- Why is Buffer allocated off the V8 heap?
Red flag Treating chunk boundaries as character boundaries — concatenating decoded chunks can corrupt multi-byte UTF-8.
source: Node.js docs — Buffer ↗ -
This streaming code occasionally crashes the whole server with no stack trace pointing at user code. What's the most likely cause?
An unhandled
'error'event on a stream. Streams areEventEmitters, andEventEmitterhas a special rule: if an'error'event is emitted and there is no'error'listener, Node *throws* — crashing the process.With streams this is easy to hit: a read fails (file gone, socket reset), the source emits
error, nothing is listening, and the server dies. The fix is to handleerroron every stream, or — better — usepipeline(), which routes errors to one place and destroys the streams.``
``
rs.on("error", handle); // not optionalFollow-ups they push on- Why does an EventEmitter throw specifically on an unhandled 'error' event?
- How does pipeline() remove the need to attach error handlers to each stream?
Red flag Handling 'data'/'end' but forgetting 'error' — the one event whose absence crashes the process.
source: Node.js docs — Error handling with streams ↗ -
What's the difference between Buffer.alloc(n) and Buffer.allocUnsafe(n), and why does the 'unsafe' one exist?
Buffer.alloc(n)allocatesnbytes and zero-fills them — safe, predictable, but it pays the cost of writing zeros across the whole buffer.Buffer.allocUnsafe(n)allocatesnbytes without initializing them, so the memory may contain leftover bytes from previously freed allocations — potentially old data (passwords, keys, other requests). It's faster precisely because it skips the zero-fill.The 'unsafe' version exists for hot paths where you're about to fully overwrite the buffer immediately (e.g. you
copy/fillinto allnbytes before reading). The danger is forgetting to overwrite some region and then sending/logging it — leaking stale memory. Default toBuffer.alloc; reach forallocUnsafeonly when you'll write every byte before reading and have measured a real win.Never use the deprecated
new Buffer(n)constructor — it's unsafe and removed/forbidden.What a strong answer coversalloczero-fills (safe);allocUnsafeskips initialization (faster, may expose old memory).allocUnsafe may contain sensitive leftover bytes from freed allocations.
Only safe when you fully overwrite every byte before any read.
Avoid the deprecated
new Buffer()constructor entirely.
Quick self-checkWhich statement about Buffer.allocUnsafe(n) is correct?
-
Wrong: the speed comes precisely from NOT zero-filling.
-
Correct — it skips initialization, so old heap contents can remain.
-
Wrong: that's not its defining behavior; the key point is uninitialized memory.
-
Wrong: Buffers are off-heap regardless of which allocator you use.
Follow-ups they push on- What real security bug can leak from sending an under-written allocUnsafe buffer?
- Why was the old `new Buffer(n)` constructor deprecated?
Red flag Using allocUnsafe and not overwriting every byte — you can leak stale heap memory into output.
source: Node.js docs — Buffer.allocUnsafe ↗ -
What does stream.finished() / the 'end' vs 'finish' vs 'close' events tell you, and which fires for readable vs writable?
Three lifecycle events that interviewers conflate:
-
'end'— fires on a Readable when there's no more data to read (the source is exhausted).
-'finish'— fires on a Writable afterend()is called and all data has been flushed to the underlying system.
-'close'— fires when the stream and its resources (file descriptor, socket) are destroyed/closed; it's the cleanup signal, on both kinds.Because getting these right by hand is error-prone,
stream.finished(stream, cb)(and its promise form) gives you one callback that resolves when a stream is no longer readable/writable or errors — abstracting over end/finish/close/error. It's the robust way to know "this stream is truly done."What a strong answer covers'end'→ Readable exhausted (no more data to read).'finish'→ Writable flushed everything afterend().'close'→ underlying resource destroyed; cleanup signal on either side.stream.finished()unifies end/finish/close/error into one done-or-failed callback.
Follow-ups they push on- Why might 'finish' fire but 'close' not, or vice versa?
- How is stream.finished safer than listening for 'end' yourself?
Red flag Listening for 'end' on a Writable (it never fires there) or 'finish' on a Readable — wrong event for the side.
source: Node.js docs — stream.finished() ↗ -
What are object-mode streams, and async iteration over a stream (for await...of)? When would you use each?
Object mode (
{ objectMode: true }) lets a stream's chunks be arbitrary JS values (objects, numbers) instead of Buffers/strings. Useful for pipelines of parsed records — e.g. a CSV row parser emitting objects into a Transform that validates them. In object modehighWaterMarkcounts objects, not bytes (default 16).Async iteration: a Readable is async-iterable, so you can consume it with
for await...of:``
`
for await (const chunk of fs.createReadStream(file)) {
process(chunk);
}This reads chunks one at a time with built-in backpressure (the loop body's await pauses reading) and lets you use ordinary try/catch
for errors — far more readable than wiring'data'/'end'/'error'` by hand. Use it whenever you'd otherwise write event-handler boilerplate to consume a stream sequentially.What a strong answer coversObject mode: chunks are arbitrary JS values, not Buffers/strings; highWaterMark counts objects (default 16).
Readables are async-iterable:
for await...ofconsumes chunk-by-chunk.Async iteration has built-in backpressure and lets try/catch handle errors.
Use object mode for record pipelines; async iteration to avoid 'data'/'end'/'error' boilerplate.
Follow-ups they push on- How does for await...of provide backpressure automatically?
- What happens to the stream if you break out of the for await loop early?
Red flag Assuming chunks are always Buffers — in object mode they're whatever you pushed, and toString() would mangle them.
source: Node.js docs — Consuming readable streams with async iterators ↗ -
Why can `chunk.toString()` on each stream chunk corrupt text, and how do you decode multi-byte data safely?
Stream chunks split at arbitrary byte boundaries, not character boundaries. A multi-byte UTF-8 character (emoji, accented letters, CJK) can land with its first byte at the end of one chunk and the rest at the start of the next. Calling
chunk.toString("utf8")on each chunk independently then decodes a partial character — producing the replacement char `or mojibake — and you can't fix it by concatenating the broken strings afterward.Safe options:
- Use string_decoder.StringDecoder, which buffers incomplete multi-byte sequences across chunks and only emits complete characters.
- Or set the stream's encoding with setEncoding("utf8")(which uses StringDecoder internally) so 'data'yields decoded strings.Buffer.concat(...).toString()` once at the end (fine for small data, not for huge streams).
- Or accumulate the raw Buffers andWhat a strong answer coversChunks break on byte boundaries; a multi-byte char can straddle two chunks.
chunk.toString()per chunk decodes partial characters → garbled output you can't repair by concatenation.Use
StringDecoder(buffers incomplete sequences) orstream.setEncoding('utf8').Alternatively
Buffer.concatall chunks and decode once — only for small payloads.
Follow-ups they push on- Why can't you just concatenate the per-chunk decoded strings to fix it?
- When is Buffer.concat-then-decode acceptable vs StringDecoder?
Red flag Decoding each chunk with toString() independently — multi-byte characters spanning chunk boundaries corrupt.
source: Node.js docs — StringDecoder ↗ -
What is backpressure? What does it mean when stream.write() returns false, and what is the 'drain' event for?
Backpressure is the feedback that a fast producer is outpacing a slow consumer. Each writable stream has an internal buffer with a
highWaterMark. Whenwrite()pushes the buffer past that threshold, it returnsfalse— a signal saying "stop writing, I'm full."If you ignore it and keep writing, the buffer grows unbounded and memory balloons. The correct response: pause the source and wait for the
drainevent, which fires once the buffer has emptied below the mark, then resume.You rarely wire this by hand —
pipe()andpipeline()implement the pause/resume dance for you, which is exactly why they are preferred.Follow-ups they push on- How does pipe() handle backpressure automatically?
- What is highWaterMark and what happens if you set it very high?
Red flag Writing in a loop while ignoring write()'s return value — unbounded memory growth under load.
source: Node.js docs — Stream backpressuring ↗ -
Why is pipeline() preferred over chaining .pipe()? What does each do about errors?
a.pipe(b).pipe(c)handles backpressure but not errors: ifbemitserror,pipedoes not forward it or destroy the other streams. You are left with un-destroyed streams (leaked file descriptors/sockets) and an unhandlederrorevent — which crashes the process if no listener exists.stream.pipeline(a, b, c, cb)(or the promise formnode:stream/promises) wires the same backpressure and: forwards the first error to the callback/rejection, and destroys every stream in the chain on completion or failure. That cleanup is the whole reason to prefer it.Rule of thumb: use
pipelinefor anything with real error/cleanup needs; bare.pipeonly for trivial throwaway cases.Follow-ups they push on- What resource leaks when a .pipe chain errors mid-way?
- What does the promise version of pipeline let you do with async/await?
Red flag Using long .pipe chains in production and assuming an error anywhere is handled — it is not.
source: Node.js docs — stream.pipeline ↗ -
What are the two reading modes of a Readable stream (flowing vs paused), and how do you switch between them?
A Readable stream is in one of two modes:
- Paused (pull) — you explicitly call
read()to pull chunks. This is the default for a freshly created stream.
- Flowing (push) — chunks are pushed at you as fast as they arrive via'data'events.It switches to flowing when you attach a
'data'listener, callresume(), orpipe()it. It goes back to paused withpause()or by removing the'data'listener (andunpipe).The practical takeaway: attaching a
'data'handler starts the firehose immediately — if your consumer is slow you must respect backpressure (or just usepipe/pipeline, which manages the mode for you).Follow-ups they push on- What starts a stream flowing the moment you attach a 'data' listener?
- Which mode does pipe() put the source in?
Red flag Adding a 'data' listener and assuming the stream waits for you — it starts pushing chunks immediately.
source: Node.js docs — Two reading modes ↗ -
What is highWaterMark on a stream, and what actually happens if you set it very high vs very low?
highWaterMarkis the buffer threshold that drives backpressure. For a Writable it's the byte (or object) count at whichwrite()starts returningfalse; for a Readable it's how much data the stream buffers ahead via internalread()calls. Default is 64 KB for byte streams (16 objects in object mode).- Set it very high: the stream buffers a lot before signaling backpressure, so more data sits in memory. You get fewer pause/resume cycles (possibly slightly higher throughput) at the cost of a bigger memory footprint — and a huge value can defeat the point of streaming.
- Set it very low: backpressure kicks in almost immediately, memory stays tiny, but you pay more overhead in frequent pause/resume andreadcalls, hurting throughput.It's a memory-vs-throughput knob; the 64 KB default is a sensible balance for most workloads.
What a strong answer coversThe buffer threshold that triggers backpressure (write() → false; readable buffers ahead).
Default 64 KB for byte streams, 16 for object mode.
Higher → more in-memory buffering, fewer pause/resume cycles, bigger footprint.
Lower → tighter memory, more overhead from frequent backpressure signaling.
Follow-ups they push on- How does highWaterMark interact with the drain event?
- Why might a very high highWaterMark partially defeat the purpose of streaming?
Red flag Cranking highWaterMark up to 'go faster' — it just buffers more in memory and can reintroduce OOM risk.
source: Node.js docs — Buffering / highWaterMark ↗
4.4 Modules & packages 14
-
package.json: dependencies vs devDependencies vs peerDependencies — what's the distinction and when does each install?
-
dependencies— packages your code needs at runtime (Express, the DB driver). Installed for everyone who installs your package.
-devDependencies— needed only to build/test/lint (TypeScript, jest, eslint). Installed for local dev, but skipped withnpm install --omit=dev(production installs).
-peerDependencies— a package your plugin expects the host project to provide, to avoid duplicate/clashing copies (e.g. a React component library listsreactas a peer so it uses the app's single React).Getting this wrong: a runtime package in devDeps breaks production; a build tool in deps bloats the production image.
Follow-ups they push on- What breaks if you put your web framework in devDependencies?
- Why do React component libraries list react as a peerDependency rather than a dependency?
Red flag Putting runtime libs in devDependencies — works locally, then crashes in a --omit=dev production install.
source: npm docs — package.json dependencies ↗ -
CommonJS vs ES Modules: name the real differences (syntax, loading, this, __dirname, top-level await).
- Syntax: CJS uses
require()/module.exports; ESM usesimport/export.
- Loading: CJS is synchronous and loads at runtime, sorequire()can be conditional/dynamic. ESM is asynchronous with a static parse phase — imports are hoisted and resolved before the body runs (use dynamicimport()for conditional loading).
- Bindings: CJS exports a *copied value*; ESM exports *live bindings* (re-exported values stay in sync).
-this: top-levelthisismodule.exportsin CJS, butundefinedin ESM.
-__dirname/__filename: available in CJS; in ESM you derive them fromimport.meta.url.
- Top-level await: allowed in ESM, not in CJS.Node picks the mode from
"type"in package.json (or.cjs/.mjsextension).Follow-ups they push on- How do you get __dirname in an ES module?
- Why can you require() conditionally but not top-level-import conditionally?
Red flag Saying they are interchangeable — sync vs async loading and live-bindings vs copied-values cause real behavioral differences.
source: Node.js docs — Modules: ECMAScript modules ↗ -
What is a transitive dependency, and why can `npm audit` report dozens of vulnerabilities you didn't install directly?
A transitive (indirect) dependency is a package your dependencies depend on — not something you listed in your
package.json. A modern app with a handful of direct deps routinely pulls in hundreds of transitive packages, and the lockfile records the whole tree.npm auditscans that entire tree against a vulnerability database, so most reported issues live deep in transitive packages you never named. That's also the supply-chain risk surface: you trust not just your deps but everything they trust.Fixing them:
npm audit fixbumps within allowed ranges; a transitive fix may require the direct dependency to update, or anoverridesentry in package.json to force a patched version. And weigh severity in context — a vuln in a dev-only or unreachable code path isn't always exploitable in your app.What a strong answer coversTransitive = a dependency of your dependencies; you didn't list it directly.
Apps pull in hundreds of transitive packages; the lockfile captures the full tree.
npm auditscans the whole tree, so most findings are in indirect packages.Fix via
npm audit fix, upgrading the direct dep, oroverridesto pin a patched version.
Follow-ups they push on- When would you use the `overrides` field to force a transitive version?
- Why isn't every audit 'high severity' finding actually exploitable in your app?
Red flag Treating every npm audit finding as a critical blocker, or assuming you can only fix direct dependencies.
source: npm docs — npm audit ↗ -
How do you get __dirname and __filename in an ES module, and why aren't they available like in CommonJS?
In CommonJS,
__dirnameand__filenameare injected into every module's wrapper scope. ESM has no such wrapper — modules run in a standard scope where those magic variables don't exist. Instead, ESM gives youimport.meta.url, the file's URL (afile://string).Derive the paths from it:
``
`
import { fileURLToPath } from "node:url";
import { dirname } from "node:path";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);fileURLToPath
is required becauseimport.meta.urlis a URL, not a filesystem path (and on Windows or with spaces/special chars, naive string slicing breaks). Recent Node also exposesimport.meta.dirname/import.meta.filename` as conveniences.What a strong answer coversCJS injects
__dirname/__filenamevia the module wrapper; ESM has no wrapper.ESM exposes
import.meta.url(afile://URL) instead.Convert with
fileURLToPath(import.meta.url)thenpath.dirname(...).Don't string-slice the URL —
fileURLToPathhandles Windows/encoding correctly.
Follow-ups they push on- Why is fileURLToPath needed instead of just stripping the file:// prefix?
- What are import.meta.dirname and import.meta.filename?
Red flag Hand-parsing import.meta.url by slicing 'file://' — breaks on Windows paths and URL-encoded characters.
source: Node.js docs — import.meta.url ↗ -
Why does committing node_modules vs relying on the lockfile matter, and what makes `npm ci` deterministic where `npm install` isn't?
You normally don't commit
node_modules(huge, platform-specific native builds, churns the diff); you commit the lockfile and rebuild from it. The lockfile +npm ciis what gives reproducibility without the bloat.What makes them differ:
-
npm installtreatspackage.jsonas the source of truth: it resolves ranges, may update the lockfile, and reuses/patches an existingnode_modules. Two installs at different times can yield different trees if a new in-range version was published.
-npm citreats the lockfile as authoritative: it deletesnode_modulesfirst, installs the exact pinned versions, and errors ifpackage.jsonand the lockfile disagree. No range re-resolution, so the tree is byte-identical every run — ideal for CI/prod.So determinism comes from
npm cirefusing to re-resolve ranges and always starting from a clean slate.What a strong answer coversCommit the lockfile, not node_modules (bloat + platform-specific native builds).
npm installmay update the lockfile and reuse node_modules → can drift over time.npm ciwipes node_modules and installs the exact lockfile versions, erroring on mismatch.Determinism = no range re-resolution + clean-slate install.
Follow-ups they push on- Why might committing node_modules with native addons break on a teammate's machine?
- What happens with npm ci if you forgot to update the lockfile after editing package.json?
Red flag Using npm install in CI (non-deterministic, can silently bump versions) instead of npm ci.
source: npm docs — npm ci ↗ -
What's the difference between `exports = foo` and `module.exports = foo` in CommonJS? Which one actually works, and why?
Only
module.exports = fooworks to replace the whole export.At module start, Node does roughly
exports = module.exports = {}—exportsis just a *local variable pointing at the same object* asmodule.exports. What gets returned to the requirer ismodule.exports.-
exports.foo = ...works because you are mutating the shared object.
-exports = fooonly reassigns the local variableexports;module.exportsstill points at the original{}, so the requirer gets an empty object.
-module.exports = foocorrectly replaces what is returned.Rule: use
exports.x = ...to add properties, butmodule.exports = ...to export a single thing.Follow-ups they push on- After `module.exports = foo`, does `exports.bar = 1` still affect the export? (No.)
- Why does `exports.foo = ...` work but `exports = {...}` not?
Red flag Reassigning `exports = ...` and wondering why the importer gets `{}` — you broke the alias to module.exports.
source: Node.js docs — module.exports vs exports ↗ -
In semver, what versions does "^1.2.3" allow, and how does that differ from "~1.2.3"? When is each dangerous?
Semver is
MAJOR.MINOR.PATCH.-
^1.2.3(caret) allows everything up to but not including the next MAJOR —>=1.2.3 <2.0.0. So1.9.0is fine;2.0.0is not. (Special case: for0.x,^0.2.3is treated as>=0.2.3 <0.3.0— a0.xminor bump can break.)
-~1.2.3(tilde) allows only PATCH bumps —>=1.2.3 <1.3.0.Caret is the npm default. The risk: a sloppy maintainer ships a breaking change in a *minor*, and your caret range silently pulls it in. That is exactly why
package-lock.jsonpins exact resolved versions for reproducible installs.Follow-ups they push on- Why is the lockfile essential even though you specified a range?
- What does ^0.2.3 resolve to, and why is the 0.x rule special?
Red flag Thinking ^ and ~ are the same, or trusting that minor bumps are always non-breaking.
source: npm docs — About semantic versioning ↗ -
What does package-lock.json do, and why should you commit it? What's the difference between `npm install` and `npm ci`?
package-lock.jsonrecords the *exact* version, resolved URL, and integrity hash of every package in the tree (including transitive deps). Becausepackage.jsononly specifies ranges, the lockfile is what makes installs reproducible — everyone and CI get byte-identical trees. Commit it.-
npm installreadspackage.json, may update the lockfile to satisfy ranges, and adds/removes packages. Good for development.
-npm ciinstalls strictly from the lockfile, errors ifpackage.jsonand the lock disagree, and wipesnode_modulesfirst. Deterministic and faster — the right choice for CI and production builds.Follow-ups they push on- Why does npm ci fail if package.json and the lockfile are out of sync?
- What integrity field in the lockfile protects against tampered packages?
Red flag Gitignoring the lockfile (irreproducible builds) or using `npm install` in CI instead of `npm ci`.
source: npm docs — npm ci ↗ -
What prints? // counter.js: let c = 0; module.exports = { inc: () => ++c, get: () => c }; // app.js: const a = require("./counter"); const b = require("./counter"); a.inc(); console.log(b.get())
It prints
1.CommonJS caches modules by resolved path. The first
require("./counter")executes the file once and caches itsmodule.exports; the secondrequirereturns the same cached object — no re-execution. Soaandbare the *same* object sharing the *same*c.a.inc()makesc1, andb.get()reads that samec:1.This is why a module is effectively a singleton — handy for shared config/connections, but a trap if you expect a fresh instance per require.
Follow-ups they push on- What key does the module cache use, and how can the same file be loaded twice?
- How would you force a fresh module instance? (Bust require.cache — and why that's usually a smell.)
Red flag Expecting each require to give a fresh module — it returns the cached singleton.
source: Node.js docs — Modules caching ↗ -
What does the "exports" field in package.json do, and how do conditional exports (import/require/default) work?
The
exportsfield defines a package's official entry points and, crucially, encapsulates it: once you declareexports, consumers can import only the paths you list — deep imports into internal files (pkg/lib/secret.js) are blocked. It supersedesmain.Conditional exports map one specifier to different files depending on how it's loaded:
``
`
{
"exports": {
".": {
"import": "./index.mjs",
"require": "./index.cjs",
"default": "./index.mjs"
}
}
}Node picks import
when loaded viaimport/import(),requirewhen loaded viarequire(), anddefaultas the fallback. This is how a package ships both an ESM and a CJS build from one entry point (the "dual package"). Conditions are matched in order, so put more specific ones first;default` must be last.What a strong answer coversexportsdeclares entry points and encapsulates internals (blocks deep imports).Conditional exports map a specifier to different files by condition.
importvsrequirelets one package ship both ESM and CJS builds (dual package).Conditions match in order, most-specific first;
defaultis the last-resort fallback.
Follow-ups they push on- What's the 'dual package hazard' and how do conditional exports relate to it?
- How does the exports field break tools that relied on deep-importing internal files?
Red flag Adding an exports field and accidentally breaking consumers who deep-imported internal paths.
source: Node.js docs — Packages: conditional exports ↗ -
What prints? // a.mjs: export let count = 0; export function inc() { count++; } // main.mjs: import { count, inc } from "./a.mjs"; inc(); console.log(count)
It prints
1.ESM exports are live bindings, not copied values. The imported
countis a read-only *view* of the exporter'scountvariable — not a snapshot taken at import time. Wheninc()mutatescountinsidea.mjs, the importer's view reflects the new value, soconsole.log(count)reads1.Contrast with CommonJS:
const { count } = require("./a")copies the value at require time, so callinginc()would not change your localcount(it'd still be0). Note you can *read* the live binding but not reassign it from the importer (count = 5throws — imports are read-only).What a strong answer coversESM imports are live, read-only bindings to the exporter's variables.
Mutating the exported variable inside its module is visible to all importers.
CommonJS copies values at require time, so it would still print 0.
Importers can read the live value but cannot reassign the binding (TypeError).
Quick self-checkWhat does main.mjs print?
-
Wrong: that's the CommonJS copy-by-value behavior; ESM uses live bindings.
-
Correct — ESM imports are live bindings, so inc()'s mutation of count is visible.
-
Wrong: count is exported and initialized to 0, never undefined.
-
Wrong: reading count is fine; only reassigning the import would throw.
Follow-ups they push on- What's the CommonJS equivalent and why does it print 0 instead?
- Why can't you reassign an imported binding in the importing module?
Red flag Assuming ESM imports are value snapshots like CJS — they're live bindings, so mutations show through.
source: MDN — export (live bindings) ↗ -
How does Node resolve `require("foo")` (a bare specifier) vs `require("./foo")` (a relative path)?
Relative/absolute (
./foo,../foo,/abs/foo): resolve against the current file. Node tries the exact path, thenfoo.js/foo.json/foo.node, thenfoo/as a directory (itspackage.jsonmain/exports, elseindex.js).Bare specifier (
foo): Node walksnode_modulesoutward —./node_modules/foo, then the parent'snode_modules, up to the filesystem root — and uses the first match. Core modules (fs,path, ornode:fs) short-circuit this and win immediately.This outward walk is why a dependency can resolve a different copy of a package than your app, and why
node_modulescan nest.Follow-ups they push on- Why might two packages each get their own copy of a shared dependency?
- What does the `exports` field in package.json change about resolution?
Red flag Assuming bare specifiers resolve from one global location — Node searches node_modules up the directory tree.
source: Node.js docs — Modules: all-together resolution ↗ -
What is a circular dependency between two CommonJS modules, and what does the importer actually receive?
A circular dependency is
a.jsrequiringb.jswhileb.jsrequiresa.js. CommonJS doesn't deadlock — it returns a partially-completedmodule.exports.When
astarts loading and requiresb,bbegins executing; ifbthen requiresa, Node seesais already in progress and handsbthe **partial exports ofaas they exist *right now* (whateverahad assigned before therequire(b)line). Ifahadn't exported the thingbneeds yet,bseesundefined.So behavior depends on statement order** and is fragile. Symptom: a value is mysteriously
undefinedonly when modules load in a particular order. Fixes: restructure to break the cycle, extract the shared piece into a third module, or require lazily (inside the function that uses it). ESM handles cycles better via live bindings but can still hit temporal-dead-zone errors.What a strong answer coversCJS doesn't deadlock; it returns the partial exports of the in-progress module.
What
bsees ofadepends on whatahad exported before itsrequire(b)line.Symptom: a dependency value is
undefineddepending on load order.Fix: break the cycle, extract a shared module, or require lazily inside a function.
Follow-ups they push on- How does ESM's live-binding model change circular-dependency behavior?
- Why does moving the require() to the bottom of the file sometimes 'fix' it?
Red flag Assuming a circular require throws or deadlocks — it silently returns half-initialized exports.
source: Node.js docs — Modules: Cycles ↗ -
How do you import a CommonJS package from an ES module, and an ESM-only package from CommonJS? Why is one harder?
CJS → from ESM: easy.
import pkg from "cjs-package"works — Node treats the module'smodule.exportsas the default export. Named imports work for statically-detectable named exports, but the whole object is reliably available as the default.ESM-only → from CJS: harder, because
require()of an ESM module is restricted. ESM is asynchronous (it can use top-level await) andrequireis synchronous, so historicallyrequire("esm-only-pkg")threwERR_REQUIRE_ESM. The portable workaround is dynamicimport(), which returns a promise:``
`
const { thing } = await import("esm-only-pkg");(Recent Node versions added synchronous require()
of ESM that has no top-level await, but dynamicimport()is the safe, version-independent answer.)The asymmetry comes from sync-vs-async loading: pulling async ESM into a sync require` is the fundamentally awkward direction.
What a strong answer coversCJS from ESM:
import x from 'cjs'— module.exports becomes the default export.ESM from CJS:
require()is restricted (ESM is async), classicallyERR_REQUIRE_ESM.Portable fix for ESM-from-CJS: dynamic
import()(returns a promise).Asymmetry stems from ESM being async (top-level await) vs require being synchronous.
Follow-ups they push on- Why is dynamic import() the version-safe way to load ESM from CJS?
- What does it mean that newer Node can require() ESM without top-level await?
Red flag Trying to require() an ESM-only package and hitting ERR_REQUIRE_ESM — reach for dynamic import().
source: Node.js docs — Interoperability with CommonJS ↗
4.5 Globals, events & CPU concurrency 14
-
What's on the `process` global that you actually use? Cover argv, env, exit codes, and the on() events.
processis the interface to the running Node process:-
process.argv— CLI arguments array;[0]is the node binary,[1]is the script, real args from[2].
-process.env— environment variables (always strings); the standard place for config/secrets (process.env.NODE_ENV,DATABASE_URL).
-process.exit(code)— terminate now with an exit code (0success, non-zero failure). Prefer letting the loop drain naturally;exit()can cut off in-flight I/O.
- Events:process.on("SIGTERM"/"SIGINT", ...)for graceful shutdown, plus"uncaughtException"and"unhandledRejection"as last-resort handlers.It is also an
EventEmitter, which is why thoseon(...)hooks exist.Follow-ups they push on- Why does process.argv start your real arguments at index 2?
- How do you implement graceful shutdown on SIGTERM in a containerized service?
Red flag Calling process.exit() in the middle of request handling and truncating pending writes/logs.
source: Node.js docs — process ↗ -
Why should you read configuration from environment variables (process.env) instead of hardcoding it or committing a config file?
It is the twelve-factor practice: keep config in the environment, separate from code. Benefits:
- One build, many environments — the same artifact runs in dev/staging/prod by swapping env vars; no code change or rebuild per environment.
- Secrets stay out of git — DB passwords and API keys never land in the repo (a top cause of credential leaks).
- Ops-friendly — platforms (Docker, Kubernetes, Cloudflare, CI) all inject env vars natively.Practical notes:
process.envvalues are always strings (coerce numbers/booleans yourself), use a.envfile (gitignored) locally, and validate required vars at startup so a missingDATABASE_URLfails fast rather than at 3am.Follow-ups they push on- Why validate env vars at boot instead of where they're used?
- What type are process.env values, and what bug does that cause with `process.env.PORT`?
Red flag Committing secrets in a config file, or assuming process.env.PORT is a number (it's a string).
source: Node.js docs — process.env ↗ -
How do you implement graceful shutdown on SIGTERM in a containerized Node service, and why does it matter?
When an orchestrator (Kubernetes, Docker, a process manager) stops your container, it sends
SIGTERMand gives a grace period beforeSIGKILL. Without handling it, in-flight requests are cut off, connections drop, and writes can be left half-done.Graceful shutdown on
SIGTERM:``
`
process.on("SIGTERM", async () => {
server.close(); // stop accepting new connections
await drainInFlightRequests(); // let current ones finish
await db.end(); await redis.quit(); // close pools/connections
process.exit(0);
});Steps: stop accepting new work (server.close()
), let in-flight requests drain (with a timeout fallback so a stuck request can't hang shutdown forever), close DB/cache/queue connections, then exit0. This avoids dropped requests during deploys/scaling and prevents connection-pool leaks. Also handleSIGINT` for local Ctrl-C.What a strong answer coversOrchestrators send SIGTERM, then SIGKILL after a grace period.
On SIGTERM: stop accepting new connections (
server.close), drain in-flight, close pools, exit 0.Add a timeout fallback so a stuck request can't block shutdown indefinitely.
Prevents dropped requests during deploys and connection-pool leaks; handle SIGINT too.
Follow-ups they push on- Why do you need a timeout fallback around draining in-flight requests?
- What happens to open requests if you ignore SIGTERM until SIGKILL?
Red flag Calling process.exit(0) immediately on SIGTERM, truncating in-flight requests instead of draining first.
source: Node.js docs — Signal events ↗ -
What globals are available in Node without require (e.g. globalThis, Buffer, __dirname, setTimeout), and which are NOT truly global?
Genuinely global (available anywhere, no import):
globalThis,process,Buffer,console, the timer functions (setTimeout/setInterval/setImmediateand theirclear*),queueMicrotask,URL/URLSearchParams,TextEncoder/TextDecoder, and (in modern Node)fetch,structuredClone, andAbortController.The trap — these look global but are actually module-scoped variables injected by the CommonJS wrapper, not properties of
globalThis:__dirname,__filename,require,module,exports. That's exactly why they don't exist in ES modules (no wrapper) — you useimport.meta.urland staticimportinstead.So: timers/process/Buffer are true globals; the
require/module/__dirnamefamily are per-module wrapper locals.What a strong answer coversTrue globals:
globalThis,process,Buffer,console, timers,fetch,URL,AbortController, etc.__dirname,__filename,require,module,exportsare module-wrapper locals, not on globalThis.That's why those CJS locals are absent in ESM (no module wrapper).
Many former-polyfill APIs (fetch, structuredClone) are now built-in globals.
Follow-ups they push on- Why are __dirname and require unavailable in ES modules?
- Is `fetch` available globally in current Node without a library?
Red flag Calling __dirname/require 'global' — they're injected per-module by the CJS wrapper and absent in ESM.
source: Node.js docs — Global objects ↗ -
Explain the EventEmitter pattern. What's special about the 'error' event, and what's the 'newListener' / max-listeners warning about?
EventEmitteris Node's pub/sub primitive: register handlers withon(event, fn)(oronce) and fire them withemit(event, ...args). Synchronous by default — listeners run in registration order on the same tick. Much of Node's API (streams, HTTP servers, sockets) is built on it.Two gotchas interviewers probe:
-
'error'is special: if youemit("error")and there is no error listener, the emitter throws and crashes the process. Always handle'error'.
- Max listeners: adding more than 10 listeners for one event logs a *MaxListenersExceededWarning* — a heuristic for a listener leak (e.g. adding a handler per request and never removing it). Raise the limit withsetMaxListenersonly if it is genuinely intentional.Follow-ups they push on- Why does an unhandled 'error' event crash, while other unhandled events are silent?
- What real bug does the 'more than 10 listeners' warning usually indicate?
Red flag Treating the max-listeners warning as noise and bumping the limit, instead of finding the leak.
source: GreatFrontend — JS interview questions by ex-interviewers ↗ -
What prints? const EventEmitter = require("events"); const e = new EventEmitter(); e.on("x", () => console.log("A")); e.on("x", () => console.log("B")); console.log("before"); e.emit("x"); console.log("after")
before A B after.EventEmitter listeners are synchronous —
emitcalls each registered handler in order, on the same tick, beforeemitreturns. Soconsole.log("before")runs, thenemit("x")invokes the two listeners immediately (A, thenB, in registration order), and only then doesconsole.log("after")run.This surprises people who assume events are deferred/async like DOM events or
setTimeout. If you need a listener to yield, you must defer it yourself (e.g.setImmediateinside the handler).Follow-ups they push on- How would you make a listener run asynchronously without blocking emit?
- In what order do multiple listeners for the same event fire?
Red flag Assuming emit is asynchronous and printing `before after A B`.
source: Node.js docs — EventEmitter emit ↗ -
What's the difference between EventEmitter's on() and once(), and why is a per-request on() handler a classic leak?
on(event, fn)registers a handler that fires on every emission until you remove it.once(event, fn)fires exactly once and then auto-removes itself.The leak: code that does
emitter.on("data", handler)per request (or per connection) on a long-lived emitter, without ever callingremoveListener/off. Each request adds another handler that's never cleaned up; the array of listeners grows unbounded, the closures pin everything they captured, and memory climbs. Node's heuristic warns at >10 listeners (MaxListenersExceededWarning) precisely to catch this.Fixes: use
oncewhen you only need the next event; remove handlers when the request ends (off); or useAbortSignal/{ signal }to auto-detach. The warning is a symptom — find and remove the accumulating listener, don't just raisesetMaxListeners.What a strong answer coversonfires on every emission until removed;oncefires once and auto-removes.Per-request
on()on a long-lived emitter without cleanup accumulates listeners → leak.Captured closures keep referenced objects alive; >10 listeners triggers the warning.
Fix with
once, explicitoff, or an AbortSignal — not by bumping setMaxListeners.
Quick self-checkWhich best describes the difference between on() and once()?
-
Wrong: once() fires the handler a single time, period.
-
Correct — that auto-removal is exactly why once() avoids the accumulation leak.
-
Wrong: both invoke handlers synchronously during emit.
-
Wrong: the auto-removal after one call is a real behavioral difference.
Follow-ups they push on- How does passing an AbortSignal help auto-remove a listener?
- Why does the leak grow memory and not just listener count?
Red flag Adding a listener per request and never removing it, then silencing the max-listeners warning instead of fixing it.
source: Node.js docs — emitter.once() ↗ -
Worker threads vs child processes vs cluster — what does each give you, and when do you pick which?
All add parallelism, but for different jobs:
- Worker threads — multiple JS threads in one process, can share memory via
SharedArrayBuffer, cheap to spawn, message-passing for the rest. Pick for CPU-bound JS work (image resize, parsing, hashing) you want to keep in-process.
- Child processes (spawn/fork/exec) — full separate OS processes, total isolation, can run any program (not just Node). Pick to run an external binary (ffmpeg, git) or to isolate untrusted/risky work.
- Cluster — forks multiple copies of your server that share one listening port; the OS load-balances connections across them. Pick to use all CPU cores for an I/O-bound server (the classic way to scale an HTTP server).Shorthand: CPU-bound in-process → worker; external program / isolation → child process; scale a server across cores → cluster.
Follow-ups they push on- Why is cluster the wrong tool for a single CPU-heavy computation?
- How do worker threads share data without copying it?
Red flag Reaching for cluster to speed up one CPU-bound task — cluster scales request throughput, not a single computation.
source: Node.js docs — Worker threads ↗ -
How do worker threads communicate with the main thread, and what data can/can't cross the boundary?
Worker threads talk over a message channel:
worker.postMessage(value)/parentPort.postMessage(value), received via"message"events. There is no shared scope — each thread has its own V8 isolate, globals, and module registry.What can cross:
- Structured-cloneable values — objects, arrays, Maps, Sets, typed arrays, etc. are copied (deep clone).
- Transferable objects (ArrayBuffer,MessagePort) can be moved in the transferList: ownership transfers and the sender's copy is detached (zero-copy, but no longer usable on the sender).
-SharedArrayBufferis genuinely shared (both threads see the same bytes; coordinate withAtomics).What can't cross: functions, closures, class instances with methods, DOM-like handles — anything not structured-cloneable throws a
DataCloneError. So you pass data, not behavior; the worker loads its own code from a file/string.What a strong answer coversCommunicate via
postMessage+'message'events; no shared scope between threads.Plain data is deep-copied via structured clone.
ArrayBuffer/MessagePort can be transferred (detached on sender, zero-copy).
Functions/closures/methods can't be sent (DataCloneError); SharedArrayBuffer truly shares memory.
Follow-ups they push on- What's the difference between transferring an ArrayBuffer and copying it?
- Why can't you postMessage a function to a worker?
Red flag Trying to postMessage a function or class instance with methods — only structured-cloneable data crosses.
source: Node.js docs — worker.postMessage() ↗ -
What prints, and is it on the main thread? const { Worker, isMainThread } = require("worker_threads"); if (isMainThread) { new Worker(__filename); console.log("main"); } else { console.log("worker"); }
Both
mainandworkerprint —mainfrom the main thread,workerfrom the spawned worker — and the relative order is non-deterministic (the worker starts asynchronously, somainusually prints first, but don't rely on it).The pattern is the standard self-referencing worker: the file checks
isMainThread. On first run it'strue, so the branch spawns aWorker(__filename)— which re-executes the same file in a new thread whereisMainThreadisfalse, taking theelsebranch and printingworker. The worker has its own module instance, globals, and event loop; it does not share memory with the main thread (only message-passing / SharedArrayBuffer).What a strong answer coversBoth branches run:
mainon the main thread,workerin the spawned thread.new Worker(__filename)re-executes the file withisMainThread === false.Relative print order is non-deterministic (worker starts async).
The worker has its own isolate/globals/event loop — no shared memory by default.
Quick self-checkWhat does this program output?
-
Wrong: the new Worker re-runs the file and prints 'worker' too.
-
Correct — the file runs twice (main + worker thread) and the worker starts asynchronously.
-
Wrong: the worker starts async, so 'main' typically prints first; order isn't guaranteed either way.
-
Wrong: self-referencing Worker(__filename) is the standard pattern and works.
Follow-ups they push on- Why is the order of 'main' vs 'worker' not guaranteed?
- How would the worker send a result back to the main thread?
Red flag Assuming only one line prints, or that the worker shares the main thread's variables/globals.
source: Node.js docs — worker_threads isMainThread ↗ -
Why does cluster scale an I/O-bound server but not a single CPU-bound computation, and how does it share a port?
Cluster forks N worker processes (typically one per core), each a full Node instance with its own event loop. They all share one listening socket: the primary process creates the listener and hands incoming connections to workers (by default the OS/round-robin distributes them). So N independent event loops handle requests in parallel — that's why it scales an I/O-bound server across cores: more loops = more concurrent request handling and CPU utilization.
It does nothing for a single CPU-bound computation, because that one task runs on one worker's single thread; the other workers can't help compute it — they're separate processes handling *other* requests. Cluster scales throughput (requests/sec across many requests), not the latency of one heavy computation. For that, you need worker threads (split the work) or an algorithmic fix.
What a strong answer coversCluster = N processes, each its own event loop, sharing one listening socket.
Primary distributes connections (round-robin by default) → parallel request handling across cores.
Scales I/O-bound throughput, not the latency of a single computation.
One CPU-bound task still runs on one thread; use worker threads to split it.
Follow-ups they push on- How does the primary process distribute incoming connections to workers?
- When would worker threads beat cluster for the same workload?
Red flag Expecting cluster to speed up one heavy computation — it multiplies request handlers, not the single task.
source: Node.js docs — Cluster ↗ -
A worker thread is meant to share a big array with the main thread to avoid copying. How do you actually share memory, and what's the catch?
Ordinary
postMessage(data)copies via structured clone (or *transfers* anArrayBuffer, leaving the sender's copy detached). To truly share memory you use aSharedArrayBuffer(often viewed through a typed array): both threads see the same bytes, no copy.The catch: shared mutable memory reintroduces data races. Two threads writing the same slot need coordination — use the
AtomicsAPI (Atomics.add,Atomics.wait/notify) for safe reads/writes and signaling. You can only share raw binary buffers this way, not arbitrary JS objects.So: copy is the safe default;
SharedArrayBuffer + Atomicsis the zero-copy path you reach for only when the data is large and the synchronization is worth it.Follow-ups they push on- What's the difference between transferring an ArrayBuffer and sharing a SharedArrayBuffer?
- Why do you need Atomics rather than just writing to the shared buffer directly?
Red flag Assuming postMessage shares memory — by default it copies (or transfers), and SharedArrayBuffer still needs Atomics for safety.
source: Node.js docs — Worker threads ↗ -
Cluster forks one worker per core, but in-memory session state and a request counter behave oddly across requests. Why, and how do you fix it?
Each cluster worker is a separate process with its own memory — they share the listening socket, not application state. A request lands on whichever worker the OS hands it to, so an in-process counter or in-memory session is only correct on the worker that happened to handle the *previous* request. Across workers you see stale/jumping values.
The fix is to externalize shared state: put sessions and counters in Redis (or a DB), so all workers read/write one source of truth. As a stopgap you can enable sticky sessions (route a client to the same worker), but that just pins the problem rather than solving shared state — and it breaks if that worker restarts.
General rule: cluster (and any horizontally-scaled service) must be stateless; keep state in a shared store.
Follow-ups they push on- Why don't cluster workers share a single counter variable?
- What do sticky sessions buy you, and why aren't they a real substitute for external state?
Red flag Keeping sessions/counters in process memory under cluster and expecting consistency across workers.
source: Node.js docs — Cluster ↗ -
spawn vs exec vs execFile vs fork in child_process — what distinguishes them, and which can blow up on large output?
All run a child process, differently:
-
spawn(cmd, args)— launches a process and streams its stdout/stderr. No output-size limit; use it for long-running processes or large output (e.g. piping ffmpeg).
-exec(cmdString)— runs the command in a shell and buffers all output, handing it to a callback. Convenient, but the buffer is capped (maxBuffer, default 1 MB) — exceed it and the child is killed with an error. Shell parsing also opens command-injection risk if you interpolate untrusted input.
-execFile(file, args)— likeexec(buffers output) but runs the binary directly, no shell — safer against injection, no shell features.
-fork(modulePath)— a specializedspawnfor a new Node.js process running a JS file, with a built-in IPC channel (child.send/process.on("message")).The trap:
exec/execFilebuffer, so big output OOMs or tripsmaxBuffer; stream withspawninstead.What a strong answer coversspawnstreams output (no size cap) — best for large/long output.execruns in a shell and buffers (default 1 MB maxBuffer) → kills child on overflow; injection risk.execFilebuffers too but skips the shell — safer, no shell features.forkspawns a child Node process with an IPCmessagechannel.
Follow-ups they push on- Why is execFile safer than exec against command injection?
- What error do you get when exec output exceeds maxBuffer?
Red flag Using exec for a command with large output — it buffers and either OOMs or hits maxBuffer; stream with spawn.
source: Node.js docs — Child process ↗
4.6 V8, memory & frameworks 14
-
What is Express middleware? Walk through what next() does and how the chain executes.
Express middleware is a function
(req, res, next)that sits in a chain between the incoming request and the route handler. Each request flows through the registered middleware in order; a middleware can read/modifyreq/res, end the response, or callnext()to pass control to the next one.- Call
next()→ continue to the next middleware/handler.
- Callnext(err)→ skip ahead to the error-handling middleware (the special 4-arg form(err, req, res, next)).
- Call neither and don't send a response → the request hangs (a common bug).Uses: logging, body parsing, auth, CORS, and a final centralized error handler. Order matters — auth must run before the protected handler.
Follow-ups they push on- Why must the error-handling middleware have four arguments?
- What happens if a middleware neither sends a response nor calls next()?
Red flag Forgetting to call next() (request hangs) or registering middleware in the wrong order (auth after the handler).
source: Express docs — Using middleware ↗ -
What are the most common causes of memory leaks in a long-running Node service?
A leak in a GC'd runtime is memory the GC can't reclaim because something still references it. Usual suspects:
- Unbounded caches / maps — a module-level
Mapyou only ever add to; it grows forever. Use an LRU with a size cap or TTLs.
- Forgotten event listeners / timers — adding a listener (orsetInterval) per request/connection and never removing it; the closures pin everything they captured (hence the max-listeners warning).
- Growing module-level (global) state — pushing onto an array that is never trimmed.
- Closures capturing big objects — a long-lived callback that closes over a large buffer keeps it alive.Diagnose by watching RSS/heap trend upward over hours, then take two heap snapshots and diff what grew.
Follow-ups they push on- How do you confirm a leak vs normal heap growth? (Two snapshots, diff retained objects.)
- Why is an unbounded cache the textbook leak?
Red flag Believing garbage collection makes leaks impossible — reachable-but-unused references defeat the GC.
source: Node.js docs — Memory diagnostics ↗ -
In Express, why doesn't a thrown error inside an async route handler reach your error-handling middleware (in Express 4)?
In Express 4, the router only catches errors thrown synchronously. An
asynchandler returns a promise; if it rejects (or youawaitsomething that throws), the rejection happens on a later tick after the handler already returned — Express never sees it, so your(err, req, res, next)middleware isn't invoked and the request hangs (and you get anunhandledRejection).Fixes:
- Forward errors explicitly:try { ... } catch (e) { next(e); }.
- Wrap handlers in an async helper that catches and callsnext(or useexpress-async-errors).Express 5 fixes this: it automatically forwards a rejected promise from a handler to the error middleware, so a plain
throw/rejection in an async handler is caught. Know which major version you're on — this is a very common production gotcha.What a strong answer coversExpress 4's router only catches synchronous throws; a rejected async handler escapes it.
Result: error middleware isn't called, the request hangs, and you get unhandledRejection.
Fix in v4:
try/catch+next(err), an async wrapper, orexpress-async-errors.Express 5 auto-forwards rejected promises to the error handler.
Follow-ups they push on- How does an async-handler wrapper forward rejections to next()?
- What changed in Express 5 around async error handling?
Red flag Assuming Express 4 catches async/await errors automatically — it doesn't; the request hangs.
source: Express docs — Error handling (async) ↗ -
What does dependency injection in NestJS actually solve as a codebase grows, compared to manually constructing services?
Without DI you wire dependencies by hand: each class
news the things it needs, which hardcodes concrete implementations and threads constructor arguments through the whole tree. As the app grows this becomes brittle — changing a service's dependencies means editing every call site, and substituting a fake for tests is painful.NestJS DI inverts that: you declare a class
@Injectable()and ask for its dependencies in the constructor; an IoC container constructs and caches them (singletons by default) and injects them where declared. Benefits:- Decoupling — depend on an abstraction/token, swap the concrete provider in one place.
- Testability — override a provider with a mock in the test module; no monkey-patching.
- Lifecycle/scoping — the container manages singletons (and request-scoped instances) consistently.The payoff is at scale: wiring lives in module metadata, not scattered
newcalls, so large teams can reason about and replace pieces independently.What a strong answer coversManual wiring hardcodes concretes and threads constructor args through the tree.
Nest's IoC container constructs, caches (singleton by default), and injects dependencies.
Decouples via tokens/abstractions — swap a provider in one place.
Makes testing easy (override providers with mocks) and centralizes lifecycle/scoping.
Follow-ups they push on- How would you inject a mock repository in a Nest unit test?
- What's the difference between a singleton and a request-scoped provider?
Red flag Dismissing DI as ceremony — its payoff (decoupling, testability) shows up as the dependency graph grows.
source: NestJS docs — Providers / Dependency injection ↗ -
How do you debug and profile a Node process — say it's leaking memory or pinning the CPU in production?
Start with the built-in inspector: run with
--inspect(or--inspect-brk) and connect Chrome DevTools (chrome://inspect) or VS Code.- CPU pinned: take a CPU profile (DevTools Profiler, or
--prof/--cpu-prof) and read the flame graph for the hot function. Also watch event-loop lag — high lag means something is blocking the loop.
- Memory leak: take two heap snapshots minutes apart under load and use the Comparison view to see which object types keep growing and what retains them.process.memoryUsage()(RSS/heapUsed) shows the trend.In production, prefer low-overhead options:
--cpu-prof/--heap-profto dump profiles to disk, or APM tools. The first move is almost always: snapshot/profile, then diff.Follow-ups they push on- How do you find what retains a leaked object in a heap snapshot? (Retainers path.)
- What is event-loop lag and how would you measure it?
Red flag Guessing at the hot path or leak instead of taking a profile / two heap snapshots and diffing.
source: Node.js docs — Debugging with --inspect ↗ -
Express vs Fastify vs NestJS — at a concept level, what differentiates them?
- Express — the minimal, unopinionated classic: a thin router + middleware model. Huge ecosystem, you assemble structure yourself. Great default; less guidance on large-app architecture.
- Fastify — Express-like but built for performance and developer ergonomics: a faster router, schema-based validation/serialization (JSON Schema) that also speeds up responses, and a first-class plugin/encapsulation system. Pick when throughput and built-in validation matter.
- NestJS — an opinionated framework (Angular-inspired) layered on top of Express *or* Fastify: TypeScript-first, modules/controllers/providers, dependency injection, decorators. Pick for large, structured teams/codebases that want enforced architecture out of the box.Trade-off axis: Express (minimal, flexible) → Fastify (fast, validated) → Nest (structured, batteries-included).
Follow-ups they push on- What does Fastify's schema-based serialization buy you over plain JSON.stringify?
- What problem does NestJS's dependency injection solve as a codebase grows?
Red flag Calling them interchangeable — they sit at very different points on the minimal-vs-opinionated spectrum.
source: Fastify docs — Benchmarks & overview ↗ -
How does single-threaded Node serve high concurrency, and where does that model fall down?
Node wins at I/O-bound concurrency because the one JS thread never *waits* on I/O — it dispatches the request to the OS/libuv and serves other requests while the bytes are in flight. Thousands of mostly-idle connections (each waiting on a DB or network) cost little: no thread-per-connection overhead, just registered callbacks. That is the sweet spot: APIs, proxies, real-time/websocket servers.
Where it falls down: CPU-bound work. One synchronous heavy computation (image processing, big JSON crunch, sync crypto) blocks the single thread and stalls *every* connection. The fixes are the concurrency tools — worker threads for in-process CPU work, cluster to use all cores for throughput, or offloading to a separate service/queue.
Summary: brilliant for I/O concurrency, weak for CPU parallelism — so keep CPU work off the event-loop thread.
Follow-ups they push on- Why is thread-per-connection (classic blocking servers) less memory-efficient for many idle connections?
- Which workloads should you NOT put on a plain single-process Node server?
Red flag Claiming Node is fast for everything — it shines for I/O concurrency, not CPU parallelism.
source: Node.js docs — Don't block the event loop ↗ -
Name the four classic Node 'gotchas' that bite teams in production, and how each manifests.
The recurring four:
1. Blocking the event loop — synchronous CPU work (or
*Syncfs calls) on the request path freezes the whole server; symptom is rising latency/timeouts across all requests at once. Offload to a worker or chunk withsetImmediate.
2. Unhandled stream errors — a stream emits'error'with no listener and crashes the process. Handle'error'on every stream / usepipeline.
3. Floating promises — an un-awaited async call whose rejection is lost (or now crashes viaunhandledRejection); symptom is silent failures or sudden exits. Always await/return/.catch.
4. Unhandled rejections / uncaught exceptions — treated as last-resort: log and exit, don't swallow and keep serving a corrupted process.These map directly onto the earlier chapters — they are the failure modes of the event loop, streams, and async model.
Follow-ups they push on- Which of these would a linter (no-floating-promises) catch automatically?
- Why is 'log and continue' the wrong response to an uncaughtException?
Red flag Treating these as edge cases — they are the single most common ways production Node services fall over.
source: Node.js docs — Don't block the event loop ↗ -
Why does JIT compilation make microbenchmarks misleading, and what does V8 do with 'hot' functions?
V8 runs JS through a tiered pipeline: an interpreter (Ignition) runs bytecode immediately, and an optimizing compiler (TurboFan, with a mid-tier Maglev) recompiles 'hot' functions — ones called often — into fast machine code, using runtime type feedback to specialize them.
This makes naive microbenchmarks misleading two ways: (1) the first runs are slow (cold, interpreted) before optimization kicks in, so timing a few iterations measures warmup, not steady state; (2) if a function later sees an unexpected type, V8 deoptimizes it back to slower code — a benchmark with uniform inputs won't reveal the real-world deopt cost. Also dead-code elimination can delete a benchmark whose result is unused.
Takeaways: warm up before measuring, run many iterations, use a real benchmarking harness, and keep functions monomorphic (consistent argument shapes) so V8 can keep them optimized.
What a strong answer coversV8 tiers: Ignition (interpret) → Maglev/TurboFan (optimize hot functions) using type feedback.
Cold runs are slow; timing few iterations measures warmup, not steady state.
Type changes trigger deoptimization; uniform-input benchmarks hide that cost.
Warm up, run many iterations, keep functions monomorphic; beware dead-code elimination.
Follow-ups they push on- What is a 'deopt' and what kinds of code commonly trigger it?
- Why does keeping object shapes consistent (monomorphic) help V8?
Red flag Trusting a few-iteration microbenchmark — you're measuring cold interpreted code, not optimized steady state.
source: V8 blog — Firing up the Ignition interpreter / TurboFan ↗ -
What does Fastify's schema-based serialization buy you over returning a plain object that gets JSON.stringify'd?
When you attach a response JSON Schema to a Fastify route, Fastify compiles a specialized serializer (via
fast-json-stringify) tailored to that exact shape. Instead of the genericJSON.stringifyreflecting over the object at runtime, it runs straight-line code that knows the fields and types ahead of time — measurably faster serialization, the main reason for the speedup on JSON-heavy endpoints.Two more wins: the schema acts as an output contract — fields not in the schema are stripped, which prevents accidentally leaking internal/sensitive properties — and combined with request schemas you get validation at the boundary. So: faster responses, an explicit contract, and a safety filter against over-exposure.
Trade-off: you must keep the schema in sync with the response, and a field you forget to declare silently disappears from the output.
What a strong answer coversCompiles a shape-specific serializer (fast-json-stringify) — faster than generic JSON.stringify.
Strips fields not in the schema → prevents leaking internal/sensitive properties.
Pairs with request schemas for boundary validation and an explicit contract.
Trade-off: undeclared fields silently vanish; the schema must stay in sync.
Follow-ups they push on- How does schema-based serialization prevent accidentally leaking a password field?
- What's the risk of forgetting to add a field to the response schema?
Red flag Forgetting that fields absent from the response schema are silently dropped from the output.
source: Fastify docs — Validation and Serialization ↗ -
WeakMap and WeakRef exist partly to avoid memory leaks. How does a WeakMap-keyed cache differ from a Map-keyed one?
A
Mapholds strong references to its keys. If you use objects as keys in a long-livedMapcache and neverdeletethem, those keys (and their values) can never be garbage-collected — the Map itself keeps them alive. That's the textbook unbounded-cache leak.A
WeakMapholds its keys weakly: an entry does not prevent its key object from being collected. Once nothing else references the key, the GC can reclaim the key and its associated value, and the entry vanishes automatically. So aWeakMapkeyed by an object (e.g. caching per-request or per-element metadata) cleans itself up when the key dies — no manual eviction.Caveats:
WeakMapkeys must be objects, it's not enumerable (no.size, no iteration — because collection timing is non-deterministic), and it's a tool for associating data with object lifetimes, not a general size-bounded cache (use an LRU for that).WeakRef/FinalizationRegistryare the lower-level primitives for individual weak references.What a strong answer coversMapkeys are strong references → object keys live as long as the Map (leak risk).WeakMapkeys are weak → key + value are GC'd once nothing else references the key.WeakMap keys must be objects; it's not iterable and has no
.size.Great for per-object metadata tied to lifetime; use an LRU for size-bounded caches.
Quick self-checkWhy can a WeakMap-keyed cache avoid a leak that a Map-keyed one causes?
-
Wrong: WeakMap has no size limit or LRU eviction; collection is tied to key reachability.
-
Correct — weak keys let the GC reclaim entries automatically when the key object dies.
-
Wrong: it's a normal heap structure; the difference is reference strength, not location.
-
Wrong: there's no compression; the mechanism is weak references to keys.
Follow-ups they push on- Why can't a WeakMap be iterated or report its size?
- When is a WeakMap the wrong choice and an LRU cache the right one?
Red flag Using a plain Map with object keys as a long-lived cache and never evicting — it pins keys/values forever.
source: MDN — WeakMap ↗ -
Give a rough picture of how V8 manages memory and garbage collection. What's the generational heap?
V8 (the JS engine in Node and Chrome) compiles JS to machine code and manages a generational heap on the generational hypothesis: most objects die young.
- Young generation (new space) — small; new allocations go here. Collected often by a fast Scavenge (copying) collector. Cheap because it touches little memory.
- Old generation (old space) — objects that survive a couple of scavenges are *promoted* here. Collected less often by Mark-Sweep-Compact (mark reachable objects, sweep the rest, compact to fight fragmentation).Much of this runs concurrently/incrementally to keep pauses short. The heap has a default cap (historically ~1.5–2GB for old space) tunable via
--max-old-space-size. The practical takeaway: short-lived allocations are nearly free; long-lived retained objects are what cost you.Follow-ups they push on- Why is collecting the young generation so much cheaper than the old generation?
- What does --max-old-space-size change, and when do you raise it?
Red flag Describing GC as one big stop-the-world sweep — modern V8 is generational and largely incremental/concurrent.
source: Node.js docs — Memory diagnostics ↗ -
What's the difference between RSS, heapTotal, and heapUsed in process.memoryUsage(), and which one reveals a leak?
process.memoryUsage()returns several numbers:-
rss(Resident Set Size) — total physical RAM the process holds: V8 heap + native allocations + Buffers (off-heap) + code/stack. The OS-level footprint.
-heapTotal— memory V8 has reserved for its JS object heap.
-heapUsed— the portion of that heap actually in use by live JS objects.
- (external/arrayBuffers) — memory used by C++ objects and ArrayBuffers/Buffers bound to V8, outside the JS heap.For a leak, watch the trend over time, not a single reading. A steadily-climbing
heapUsedthat never drops after GC points to a JS-object leak (caches, listeners). A climbingrsswith flatheapUsedpoints to off-heap/native growth (Buffers, native addons). SoheapUsedfor JS leaks,rss/externalfor off-heap ones.What a strong answer coversrss= total physical RAM (heap + native + Buffers + code) — the OS footprint.heapTotal= V8 heap reserved;heapUsed= live JS objects within it.Climbing
heapUsedthat survives GC → JS-object leak (caches, listeners).Climbing
rss/externalwith flat heapUsed → off-heap/native (Buffer) growth.
Follow-ups they push on- Why might rss grow while heapUsed stays flat? (Off-heap Buffers / native memory.)
- Why look at the trend across snapshots rather than one reading?
Red flag Diagnosing all leaks via heapUsed — off-heap Buffer/native growth shows up in rss/external, not the JS heap.
source: Node.js docs — process.memoryUsage() ↗ -
What does --max-old-space-size control, and why does raising it sometimes hide a leak rather than fix it?
--max-old-space-size=<MB>raises the cap on V8's old-generation heap (where long-lived objects live). When the old space approaches this limit, V8 runs aggressive GC; if memory still can't be reclaimed, the process dies withFATAL ERROR: ... JavaScript heap out of memory. The default is well under modern machine RAM (historically ~2 GB on 64-bit), so legitimately large workloads sometimes need it raised.The trap: bumping it to make OOM crashes "go away" when the real problem is a leak. If memory grows without bound, a bigger cap just postpones the crash — it grows to the new limit and dies again, now with bigger GC pauses along the way. Raise it when working set is genuinely large and bounded; for unbounded growth, profile and fix the leak (heap snapshots, retainer paths) instead.
What a strong answer coversSets V8's old-generation heap cap; hitting it → 'JavaScript heap out of memory' crash.
Default is below machine RAM, so large legitimate workloads may need it raised.
For a real leak, a higher cap just delays the crash (and worsens GC pauses).
Raise for genuinely-large bounded working sets; profile/fix for unbounded growth.
Follow-ups they push on- How do you tell a real leak from a legitimately large working set?
- What's the downside of a very large old-space heap on GC pause times?
Red flag Cranking --max-old-space-size to stop OOM crashes that are actually a leak — it postpones, not fixes.
source: Node.js docs — --max-old-space-size ↗
05 Frontend 67 Q's
5.1 How the browser works 16
-
How does the browser build the DOM and the CSSOM, and how do they combine into the render tree?
The browser tokenizes the HTML bytes into nodes and assembles them into the DOM tree — a complete model of the markup. In parallel it parses CSS (inline,
<style>, and external) into the CSSOM, a tree of style rules with the cascade resolved.The render tree combines the two: it walks the DOM and attaches computed styles, but includes only the nodes that will be painted. Nodes with
display:noneare excluded entirely;<head>and<script>are not visual so they are absent too.visibility:hiddennodes stay in the tree (they occupy space).The render tree then feeds layout, which computes each node's geometry.
What a strong answer coversDOM = full parsed markup; CSSOM = parsed style rules with the cascade applied.
The render tree = DOM nodes that will be displayed, each annotated with computed styles.
display:nonenodes are excluded from the render tree;visibility:hiddennodes are kept (they still take space).The CSSOM cannot be built incrementally the way the DOM can — CSS is treated as render-blocking until fully parsed.
Quick self-checkWhich node is present in the DOM but NOT in the render tree?
-
Still in the render tree — it occupies layout space, just isn't painted.
-
Correct — display:none nodes are excluded from the render tree entirely.
-
Still rendered and laid out; it's just fully transparent.
-
A normal visible element — present in the render tree.
Follow-ups they push on- Why is the render tree not a 1:1 copy of the DOM?
- Why does an element with display:none not appear in the render tree but visibility:hidden does?
Red flag Saying the render tree is just the DOM, or that display:none and visibility:hidden are treated the same here. display:none drops the node entirely; visibility:hidden keeps it (with its box).
source: web.dev — Constructing the Object Model (CRP) ↗ -
What is the difference between the DOMContentLoaded and load events?
DOMContentLoadedfires when the HTML is fully parsed and the DOM is built — deferred scripts have run, but it does not wait for stylesheets, images, or subframes.loadfires later, when the page and all dependent resources (images, stylesheets, iframes) have finished loading.Most app initialization that only needs the DOM should run on
DOMContentLoaded(or just usedefer); reserveloadfor logic that needs final layout or image dimensions.Follow-ups they push on- Does DOMContentLoaded wait for async scripts?
- When would you actually need the load event?
Red flag Thinking DOMContentLoaded waits for images, or putting all init in load and delaying interactivity unnecessarily.
source: MDN — Document: DOMContentLoaded event ↗ -
Walk me through what happens from typing a URL to seeing the page render.
DNS resolves the host, TCP+TLS connect, the browser requests the HTML and parses it into the DOM; CSS is parsed into the CSSOM; DOM + CSSOM combine into the render tree. Then layout (reflow) computes geometry, paint fills pixels, and composite assembles layers on the GPU.
Note that CSS is render-blocking and
<script>is parser-blocking unless markedasyncordefer. This whole sequence is the critical rendering path.Follow-ups they push on- Why can transform/opacity animations skip layout and paint?
- Where does the JS engine block the parser, and how do async/defer change that?
Red flag Forgetting the CSSOM, or conflating reflow (layout) with repaint (paint). Saying the DOM alone produces pixels.
source: web.dev — Critical rendering path ↗ -
How does a browser repaint at 60fps, and what is the ~16ms frame budget? Where does requestAnimationFrame fit?
At a 60Hz refresh rate the browser aims to produce a new frame every ~16.7ms (1000/60). Within that budget it must run any JS, recalculate style, lay out, paint, and composite — so a long-running task that overruns 16ms causes a dropped frame (jank).
requestAnimationFrame(cb)schedulescbto run right before the next paint, so visual updates align with the frame instead of firing at arbitrary times (assetTimeoutwould). It is the correct place to do animation work and DOM writes that should be visible next frame.Real budget is less than 16ms because the browser itself needs some of it; aim to keep main-thread work well under that.
What a strong answer covers60fps means a frame roughly every 16.7ms (1000ms / 60).
All per-frame work (JS, style, layout, paint, composite) must fit the budget or a frame drops.
requestAnimationFrameruns callbacks just before the next repaint, syncing visual updates to the frame.Prefer rAF over setTimeout for animation; setTimeout isn't aligned to the refresh cycle.
Follow-ups they push on- Why is requestAnimationFrame better than setTimeout for animations?
- What happens to rAF callbacks in a background (hidden) tab?
Red flag Using setTimeout for smooth animation (not frame-aligned), or assuming you have the full 16ms — the browser's own work eats into it.
source: MDN — Window: requestAnimationFrame() ↗ -
How do async, defer, and type="module" scripts differ in download and execution timing?
A plain
<script>blocks the parser: download and run happen inline, halting DOM construction.async: downloads in parallel; runs as soon as it arrives, possibly interrupting parsing, in no guaranteed order.defer: downloads in parallel; runs after the document is parsed, just beforeDOMContentLoaded, in document order.type="module"scripts are deferred by default (no attribute needed) and execute in order; addingasyncto a module makes it run as soon as it and its imports are ready. Modules are also always strict mode and have their own scope.Quick rule: app/UI code →
defer(or a module); independent third-party (analytics) →async.What a strong answer coversPlain script: parser-blocking download + execute.
async: parallel download, run on arrival, unordered.defer: parallel download, run after parse in order (before DOMContentLoaded).type="module": deferred by default, ordered, strict mode, scoped.
Quick self-checkBy default (no async/defer attribute), when does a <script type="module"> execute?
-
Modules are deferred by default; they don't block parsing.
-
Correct — module scripts behave like deferred scripts by default.
-
That's async behavior; modules default to defer, not async.
-
Deferred scripts run before DOMContentLoaded, well ahead of load.
Follow-ups they push on- Why is a module script deferred even without the defer attribute?
- What ordering guarantees do you lose with async?
Red flag Adding `defer` to a module thinking it's required (it's already deferred), or assuming async preserves execution order.
source: MDN — <script> type=module / async / defer ↗ -
What is the difference between reflow and repaint?
Reflow (layout) recomputes element geometry — sizes and positions. It is expensive because changing one element can cascade to its ancestors, descendants, and siblings. Triggers: width/height, margin/padding, font-size, adding/removing DOM nodes, reading
offsetHeight.Repaint redraws pixels without changing geometry — e.g.
color,background-color,visibility. Cheaper than reflow.Composite-only changes (
transform,opacity) can skip both layout and paint and run on the GPU's compositor thread, which is why they animate smoothly.Follow-ups they push on- Why does reading offsetWidth in a loop after writing styles cause layout thrashing?
- How would you batch DOM reads and writes to avoid forced synchronous layout?
Red flag Claiming color changes cause reflow, or that all CSS animations are cheap. Animating `top`/`left`/`width` triggers reflow every frame; `transform` does not.
source: web.dev — Critical rendering path ↗ -
Why is CSS render-blocking, and why is a plain <script> parser-blocking?
CSS is render-blocking because the browser will not paint until it has the CSSOM — rendering with incomplete styles would cause a flash of unstyled content. So it blocks the first render, though not DOM construction.
A plain
<script>is parser-blocking: when the parser hits it, it stops building the DOM, fetches (if external) and executes the script, then resumes. Scripts can read and mutate the DOM, so the browser cannot safely keep parsing past them. This is why scripts are traditionally placed at the end of<body>.Follow-ups they push on- What do async and defer change about this?
- What is a render-blocking resource vs a parser-blocking one?
Red flag Saying CSS blocks DOM construction (it blocks render, not the DOM), or that all scripts block the parser regardless of attributes.
source: web.dev — Critical rendering path ↗ -
What is the difference between async and defer on a script tag?
Both download the script in parallel without blocking the parser; they differ in when execution happens and whether order is preserved.
defer: execute after the HTML is fully parsed, just beforeDOMContentLoaded, and in document order. Good for scripts that depend on the DOM or on each other.async: execute as soon as the download finishes, which can interrupt parsing, and in no guaranteed order. Good for independent scripts like analytics.A plain script (no attribute) blocks the parser while it downloads and runs.
Follow-ups they push on- Which would you use for a third-party analytics snippet, and which for an app bundle?
- Do async/defer affect inline scripts?
Red flag Swapping the two, or claiming async preserves order. async is order-independent; defer preserves order. (async/defer are ignored on inline scripts.)
source: MDN — <script>: async and defer ↗ -
Why can the browser parse HTML and discover sub-resources before the document is fully loaded? What is the preload scanner?
Modern browsers run a secondary preload scanner (also called a lookahead pre-parser) that races ahead of the main HTML parser. While the main parser may be blocked executing a synchronous
<script>, the preload scanner scans the raw markup for resources —<img>,<link>,<script src>— and starts fetching them early.This is why a render-blocking script does not also stall *network* discovery of later assets. It is also why CSS injected by JavaScript (rather than declared in markup) can hurt performance: the preload scanner cannot see it, so the fetch starts late.
Takeaway: keep critical resources in the initial HTML as plain
<link>/<img>so the scanner can find them.What a strong answer coversThe preload scanner pre-parses raw HTML to discover and fetch sub-resources ahead of the main parser.
It keeps the network busy even when the main parser is blocked on a synchronous script.
It only sees resources declared in the markup — JS-injected assets are invisible to it.
Declaring critical assets as plain tags (or
<link rel=preload>) lets discovery start as early as possible.
Follow-ups they push on- Why might lazy-loading or injecting your LCP image via JS hurt LCP?
- How does <link rel=preload> interact with the preload scanner?
Red flag Assuming a blocking script also blocks all network discovery — the preload scanner keeps fetching declared resources. Hiding critical assets behind JS injection defeats it.
source: web.dev — How the browser's preload scanner speeds up page loads ↗ -
What is the difference between a render-blocking resource and a parser-blocking resource?
Render-blocking resources prevent the browser from painting the first frame until they are processed — chiefly CSS (and synchronous CSS in
<head>). The DOM may keep being built, but nothing is shown until the CSSOM is ready.Parser-blocking resources halt DOM construction itself. A synchronous
<script>is the classic case: the parser stops, fetches and runs the script, then resumes — because the script coulddocument.writeor mutate the not-yet-built DOM.They overlap (a blocking script is effectively both, since stopping the parser also delays render), but the mental model differs: CSS blocks *painting*, scripts block *parsing*.
What a strong answer coversRender-blocking (CSS): DOM keeps building, but first paint waits for the CSSOM.
Parser-blocking (sync
<script>): DOM construction itself pauses until the script runs.async/defermake scripts non-parser-blocking;mediaqueries /printcan make a stylesheet non-render-blocking.A synchronous in-
<head>script behind a stylesheet is doubly bad: it waits for the CSS, then blocks the parser.
Follow-ups they push on- Why might a synchronous script wait for a preceding stylesheet to load?
- How do you make a stylesheet non-render-blocking with the media attribute?
Red flag Conflating the two: CSS blocks render (not DOM construction); a plain script blocks parsing (and therefore render too).
source: web.dev — Render blocking resources ↗ -
What is the compositor thread, and how is the browser's main thread different from it?
The main thread runs JavaScript, parses HTML/CSS, computes style, layout, and paint. If it is busy (a long task), the page cannot respond to input or update the DOM — this is what hurts INP.
The compositor thread runs separately and assembles already-painted layers into the final frame, handling scrolling and
transform/opacityanimations on the GPU. Because it does not need the main thread, scrolling and compositor-driven animations stay smooth even while JS is busy — until they need a property that forces layout/paint, which bounces work back to the main thread.This split is why
transform/opacityanimate at 60fps and why heavy JS tanks responsiveness but not necessarily scroll.What a strong answer coversMain thread: JS execution, style, layout, paint — a single thread that blocks the whole page when busy.
Compositor thread: stitches painted layers, handles scroll and
transform/opacityoff the main thread (often GPU-accelerated).Compositor-only changes (
transform,opacity) skip layout and paint, so they animate even during main-thread work.Long main-thread tasks block input handling and DOM updates, degrading responsiveness (INP).
Follow-ups they push on- Why does animating `top`/`left` re-involve the main thread every frame?
- How does breaking up long tasks improve responsiveness?
Red flag Believing all animations run off the main thread, or that the compositor can recompute layout. It only composites already-painted layers.
source: web.dev — Inside look at modern web browser (the compositor) ↗ -
What does this code do to rendering performance, and how would you fix it? for (const el of items) { el.style.width = el.offsetWidth + 10 + 'px'; }
Each iteration writes a style (
el.style.width = ...) and then the next read ofoffsetWidthforces the browser to flush layout so the read is accurate — a forced synchronous layout on every pass. With N items you get N reflows: classic layout thrashing.Fix: split into a read phase then a write phase so layout is computed at most once.
const widths = items.map((el) => el.offsetWidth);items.forEach((el, i) => { el.style.width = widths[i] + 10 + 'px'; });Now all reads happen against one stable layout, and all writes are batched before the next reflow.
What a strong answer coversReading
offsetWidthafter a style write forces a synchronous layout so the value is fresh.Interleaving read/write per iteration = one reflow per item = layout thrashing.
Fix: batch all reads first, then all writes (read/write separation).
requestAnimationFramecan schedule the write phase to align with the next frame.
Quick self-checkWhy is the original loop slow?
-
Correct — the write invalidates layout and the next read flushes it, N times.
-
It dirties layout, but the slowness is the forced read-after-write, not paint.
-
Loop form is irrelevant; the layout thrashing dominates.
-
offsetWidth returns a number; parsing isn't the bottleneck.
Follow-ups they push on- Which properties besides offsetWidth force a synchronous layout when read?
- How would FastDOM or requestAnimationFrame help here?
Red flag Thinking the cost is the loop itself rather than the read-after-write pattern that forces a reflow each iteration.
source: web.dev — Avoid large, complex layouts and layout thrashing ↗ -
What is a layer (compositor layer), and what is the tradeoff of promoting elements with will-change?
The browser can split the page into compositor layers — separate bitmaps the GPU can transform and blend independently. Promoting an element to its own layer lets the compositor move it (via
transform) without repainting, which is what makes such animations cheap.will-change: transform(oropacity) hints the browser to promote an element ahead of time so the first frame is not janky. The tradeoff: each layer costs GPU memory, and too many layers add management overhead that can make things slower, not faster.Rule of thumb: apply
will-changejust before an animation and remove it after; never blanket it onto many elements.What a strong answer coversA compositor layer is an independently rasterized surface the GPU can move/blend without repaint.
will-changeproactively promotes an element so animations start smoothly.Each layer consumes GPU memory; over-promotion causes overhead and can regress performance.
Apply
will-changenarrowly and temporarily, not as a global optimization.
Follow-ups they push on- How can you inspect layers in DevTools (the Layers panel)?
- Why is `will-change: transform` on every element a bad idea?
Red flag Treating `will-change` as a free speed-up and applying it everywhere — it inflates memory and can hurt performance.
source: MDN — will-change ↗ -
Why do animating transform and opacity perform better than animating top/left or width/height?
top/left/width/heightchange geometry, so every animation frame triggers layout (reflow), then paint, then composite — on the main thread.transformandopacitycan be handled by the compositor: the element is promoted to its own layer and the GPU moves/blends it without re-running layout or paint. The work happens off the main thread, so it stays smooth even if JS is busy.Practical rule: animate
transformandopacity; usewill-changesparingly to hint layer promotion.Follow-ups they push on- What is the downside of promoting too many layers with will-change?
- What is the compositor thread and how is it separate from the main thread?
Red flag Overusing `will-change` on everything (memory blow-up, no benefit), or believing all CSS animations bypass the main thread.
source: web.dev — Animations and performance ↗ -
What is layout thrashing, and how do you avoid forced synchronous layout?
Layout thrashing is repeatedly interleaving DOM writes and layout-forcing reads in a loop, so the browser must recompute layout synchronously over and over.
Reading a property like
offsetHeight,getBoundingClientRect(), orscrollTopafter a style write forces the browser to flush pending layout immediately so the read is accurate — a forced synchronous layout.Fix: batch all reads first, then all writes. Libraries like FastDOM do this;
requestAnimationFramecan schedule the write phase.Follow-ups they push on- Which DOM properties force a synchronous layout when read?
- How does requestAnimationFrame help schedule reads vs writes?
Red flag Reading offsetWidth and then writing style in the same loop iteration, forcing a reflow each pass.
source: web.dev — Avoid large, complex layouts and layout thrashing ↗ -
What is the critical rendering path and how would you optimize it?
The critical rendering path is the sequence of steps the browser takes to turn HTML, CSS, and JS into pixels: build the DOM, build the CSSOM, combine into the render tree, lay out, paint, composite.
Optimizing it means getting the first meaningful paint sooner by reducing critical resources:
- Inline critical CSS, defer the rest; minimize render-blocking CSS.
- Adddefer/asyncto scripts so they do not block parsing.
- Preload key assets (<link rel="preload">), preconnect to origins.
- Minify and compress; reduce bytes and round-trips.Follow-ups they push on- How does inlining critical CSS help LCP?
- What is the tradeoff of inlining vs caching a separate CSS file?
Red flag Listing micro-optimizations without naming the blocking resources (CSS render-blocking, scripts parser-blocking) that actually delay first paint.
source: web.dev — Critical rendering path ↗
5.2 DOM, HTML & CSS 16
-
What is the difference between display:none and visibility:hidden?
display:noneremoves the element from the render tree entirely: it occupies no space, is not painted, and is skipped by most assistive tech. Toggling it triggers reflow.visibility:hiddenkeeps the element in layout — it still occupies its box and affects siblings — but is not painted (invisible). It is not interactive.A third option,
opacity:0, is fully painted and still interactive (clickable) and laid out; it just renders transparent.Follow-ups they push on- Which of the three is keyboard-focusable / clickable?
- Which triggers reflow when toggled vs only repaint?
Red flag Saying visibility:hidden removes the element from layout — it still occupies space. Confusing opacity:0 (still clickable) with display:none.
source: MDN — visibility ↗ -
How would you center a div both horizontally and vertically? Give more than one approach.
Flexbox (most common): on the parent,
display: flex; align-items: center; justify-content: center;Grid (terse): on the parent,
display: grid; place-items: center;Absolute + transform (no flex/grid): on the child,
position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%);The transform trick offsets by the element's own size (the 50% in translate is relative to the element), so it centers regardless of dimensions. Flexbox/grid are preferred for in-flow content; absolute centering suits overlays where the child is taken out of flow.
What a strong answer coversFlexbox:
align-items: center; justify-content: center;on the container.Grid:
place-items: center;— the shortest form.Absolute +
translate(-50%, -50%)centers without knowing the element's size.Prefer flex/grid for in-flow content; absolute centering for overlays/modals.
Follow-ups they push on- Why does translate(-50%, -50%) work without knowing the element's dimensions?
- What changes if the parent has a fixed height vs auto height?
Red flag Using `margin: auto` for vertical centering on a block (works horizontally, not vertically without flex), or forgetting the parent needs a height for flex centering to be visible.
source: MDN — Box alignment (centering) ↗ -
Explain the CSS box model and the box-sizing property.
Every element is a box with four areas, from inside out: content, padding, border, margin.
With the default
box-sizing: content-box, thewidthyou set applies to the content only; padding and border are added on top, so the rendered box is wider thanwidth.With
box-sizing: border-box,widthincludes content + padding + border, so the element stays the size you set. This is why many resets apply* { box-sizing: border-box; }.Follow-ups they push on- Why do margins collapse vertically between block elements?
- Does margin count toward the element's width in either box-sizing mode?
Red flag Forgetting that margin is always outside the box (never part of width), or not knowing border-box folds padding/border into the declared width.
source: MDN — The box model ↗ -
Why does semantic HTML matter? Give examples beyond <div> and <span>.
Semantic elements describe meaning, not just appearance, which benefits accessibility, SEO, and maintainability.
Elements like
<header>,<nav>,<main>,<article>,<section>,<aside>,<footer>create landmarks that screen readers and the accessibility tree expose, letting users jump between regions.<button>,<a>,<label>,<input>come with built-in keyboard behavior and roles.A
<div>with a click handler has none of that — you would have to re-add role, tabindex, and key handling manually.Follow-ups they push on- What do you lose by using <div onClick> instead of <button>?
- How do landmark elements help screen-reader navigation?
Red flag Treating semantics as purely cosmetic, or reinventing a button from a div without role/tabindex/keyboard support.
source: MDN — HTML: A good basis for accessibility ↗ -
What does flex: 1 actually mean? Break down flex-grow, flex-shrink, and flex-basis.
flexis shorthand for three properties:- flex-grow — how much a item grows to fill leftover free space, relative to siblings.
- flex-shrink — how much it shrinks when there isn't enough space.
- flex-basis — the item's starting size before grow/shrink (its 'ideal' main size).flex: 1expands toflex: 1 1 0%— grow 1, shrink 1, basis 0%. Because basis is 0, items size purely by their grow ratio, so equalflex: 1items become equal width regardless of content. Contrastflex: auto(1 1 auto), where content size is the starting point, so items differ by content length.What a strong answer coversflex: <grow> <shrink> <basis>;flex: 1=1 1 0%.flex-grow distributes free space; flex-shrink distributes overflow; flex-basis is the pre-grow size.
flex: 1on siblings gives equal sizes (basis 0);flex: auto(basis auto) sizes from content first.flex-basistakes priority overwidthfor the main-axis starting size.
Follow-ups they push on- What's the difference between flex: 1 and flex: auto?
- When does flex-shrink: 0 matter (preventing an item from collapsing)?
Red flag Thinking flex: 1 sets a width directly, or confusing flex: 1 (basis 0, equal sizes) with flex: auto (basis auto, content-driven sizes).
source: MDN — flex ↗ -
Why and when do vertical margins collapse between block elements?
Margin collapsing is when adjacent vertical margins combine into a single margin equal to the largest of them, rather than summing. It applies only to block-level boxes in normal flow along the block (vertical) axis — never horizontal margins.
Three cases: adjacent siblings (the bottom margin of one and top margin of the next collapse); a parent and its first/last child (if no border/padding/content separates them); and an empty block (its own top and bottom margins collapse).
It does not happen for flex/grid items, floated or absolutely-positioned elements, or when a border, padding, or
overflow: autoseparates the boxes. This trips people up when a child's margin unexpectedly pushes the parent.What a strong answer coversCollapsing takes the max of the two margins, not the sum — vertical only.
Happens between siblings, parent/first-or-last child, and within empty blocks.
Prevented by a border, padding,
overflowother than visible, or a BFC.Does not apply to flex/grid items, floats, or absolutely positioned boxes.
Follow-ups they push on- How does establishing a block formatting context (BFC) stop collapsing?
- Why does a child's top margin sometimes push the parent down?
Red flag Expecting margins to add up, or thinking collapsing applies to flex/grid items (it doesn't) or to horizontal margins (it doesn't).
source: MDN — Mastering margin collapsing ↗ -
What is the difference between rem, em, %, vw/vh, and px? When would you reach for each?
px is an absolute (device-independent) pixel — fixed, predictable, but ignores user font preferences.
em is relative to the current element's font-size (for most properties), so it compounds when nested. rem is relative to the root
<html>font-size — no compounding, which makes it the go-to for scalable, accessible typography and spacing.% is relative to the parent's corresponding dimension. vw/vh are 1% of the viewport's width/height, useful for full-screen sections.
Practical default:
remfor type and spacing (respects user zoom/root size),%/fr/vwfor fluid layout,pxfor hairline borders.What a strong answer coverspx: absolute and fixed; doesn't scale with user font settings.em: relative to the element's own font-size — compounds when nested.rem: relative to the root font-size — no compounding; best for accessible, scalable type.%is relative to the parent;vw/vhare 1% of viewport width/height.
Follow-ups they push on- Why can nested em values produce surprising sizes?
- Why is rem preferred over px for font sizes from an accessibility standpoint?
Red flag Confusing em (element-relative, compounds) with rem (root-relative), or using px for font sizes and breaking user zoom/font-size preferences.
source: MDN — CSS values and units (length) ↗ -
What is the difference between event.target and event.currentTarget on a bubbling event?
event.targetis the element where the event originated — the deepest node that was actually clicked/typed-in.event.currentTargetis the element whose listener is currently running — i.e. the element you calledaddEventListeneron.During bubbling,
targetstays constant as the event travels up, whilecurrentTargetchanges at each ancestor whose listener fires. In a delegated handler on a<ul>,currentTargetis the<ul>, andtargetis the specific<li>(or a child of it, which is why you often usetarget.closest('li')).Note: in an arrow function
thiswon't be the element, butevent.currentTargetalways is.What a strong answer coverstarget= where the event started (deepest element); constant through bubbling.currentTarget= the element whose listener is running now; changes per ancestor.In delegation,
currentTargetis the parent you bound to;targetis the actual descendant.currentTargetequalsthisin a normal function handler, but not in an arrow function.
Quick self-checkA click listener is on a <ul>. The user clicks a <span> inside an <li>. Inside the handler, what are target and currentTarget?
-
Backwards — currentTarget is the listening element (the <ul>).
-
Correct — target is the clicked <span>; currentTarget is the <ul> the listener is bound to.
-
target is the deepest clicked node (the <span>), not the <li>.
-
currentTarget is the bound element (<ul>), not the clicked node.
Follow-ups they push on- Why use target.closest('li') instead of target directly in a delegated handler?
- What is event.target inside a handler bound directly to the element itself?
Red flag Swapping the two: target is the origin, currentTarget is the listening element. In delegation, acting on target directly can grab a nested child instead of the row.
source: MDN — Event: currentTarget ↗ -
What is the difference between stopPropagation and preventDefault?
They are orthogonal.
preventDefault()cancels the browser's default action for the event — following a link, submitting a form, checking a checkbox — but the event still propagates to other listeners.stopPropagation()stops the event from traveling further through the capture/bubble phases to other elements, but does not cancel the default action. (stopImmediatePropagation()additionally prevents other listeners on the *same* element.)So: prevent the default behavior with
preventDefault; stop the event reaching parents withstopPropagation. Returningfalsefrom a jQuery handler does both, but in plain DOMreturn falseonly works in inlineon*attributes.What a strong answer coverspreventDefault()cancels the default browser action; propagation still happens.stopPropagation()halts travel to other elements; default action still happens.stopImmediatePropagation()also blocks other listeners on the same element.They're independent — you sometimes call both, sometimes one.
Quick self-checkA form submit handler calls only event.stopPropagation(). What happens?
-
Only propagation stops — the default submit still fires.
-
Correct — stopPropagation doesn't cancel the default submit; you'd need preventDefault for that.
-
stopPropagation does stop parents from receiving it.
-
The default submit action still occurs.
Follow-ups they push on- When would you call both on the same event?
- Why is calling stopPropagation broadly considered risky for delegation?
Red flag Believing stopPropagation also cancels the default action (it doesn't), or that preventDefault stops bubbling (it doesn't).
source: MDN — Event: preventDefault() ↗ -
How does CSS specificity work, and what wins between an ID selector and 10 classes?
Specificity is scored as a tuple (inline, IDs, classes/attributes/pseudo-classes, elements). Higher tuples win; ties are broken by source order (last wins).
An ID is (0,1,0,0). Ten classes is (0,0,10,0). The ID still wins because the ID column outranks the class column regardless of count — it is not a base-10 sum where 10 classes overflow into the ID column.
!importantoverrides normal specificity; inline styles outrank selectors. Use these sparingly.Follow-ups they push on- Where do !important and inline styles sit in the cascade?
- How do :where() and :is() affect specificity?
Red flag Treating specificity as a single base-10 number so '10 classes beat 1 ID' — columns do not carry over.
source: MDN — Specificity ↗ -
When would you use Flexbox versus CSS Grid?
Flexbox is for one-dimensional layout — a row or a column — where you distribute space along a single axis (nav bars, toolbars, centering, equal-height items in a row).
Grid is for two-dimensional layout — rows and columns together — where you place items into a defined grid (page layouts, card galleries, dashboards).
They compose: a grid cell can itself be a flex container. Reach for Grid when you care about both axes at once; Flexbox when content drives a single axis.
Follow-ups they push on- How do you center an element both horizontally and vertically with each?
- What does flex: 1 actually mean (flex-grow/shrink/basis)?
Red flag Calling Grid 'just for grids of images' or Flexbox 'two-dimensional'. The key distinction is 1D vs 2D.
source: MDN — Relationship of grid layout to other layout methods ↗ -
Explain CSS position values: static, relative, absolute, fixed, and sticky.
static— default; in normal flow, top/left ignored.relative— stays in flow but offset from its normal spot; becomes a positioning context for absolute children.absolute— removed from flow; positioned relative to the nearest positioned ancestor (else the initial containing block).fixed— removed from flow; positioned relative to the viewport, so it stays put on scroll.sticky— a hybrid: behaves likerelativeuntil it crosses a scroll threshold, then sticks likefixedwithin its container.Follow-ups they push on- What makes an ancestor a 'positioned' ancestor for absolute children?
- Why might position:sticky silently not work (overflow on an ancestor)?
Red flag Saying absolute is relative to the viewport (that is fixed), or that sticky is independent of its containing block.
source: MDN — position ↗ -
What is the difference between the HTML attribute and the DOM property (e.g. input value)?
The attribute is what is written in the HTML source; the property is the live value on the DOM object. They are linked at parse time but can diverge.
For an
<input value="hi">:getAttribute("value")returns the original"hi"(the default), whileinputEl.valuereflects what the user has currently typed. Editing the box changes the property, not the attribute.Some attributes are reflected (
id,className), others are not symmetric (value,checked). This is why React tracksvalueas state.Follow-ups they push on- Why does setAttribute('value', ...) not update what the user sees after they have typed?
- How does this relate to controlled vs uncontrolled inputs in React?
Red flag Assuming attribute and property always stay in sync. For value/checked the attribute is just the initial default.
source: MDN — Attributes ↗ -
How would you efficiently insert 1,000 DOM nodes without causing 1,000 reflows?
Build the nodes off the live DOM and insert once, so layout is recomputed a single time.
Use a
DocumentFragment:const frag = document.createDocumentFragment();for (const item of items) { const li = document.createElement("li"); li.textContent = item; frag.appendChild(li); }list.appendChild(frag);Appending to the fragment does not touch the rendered tree; the single
appendChild(frag)inserts all children in one operation. AvoidinnerHTML +=in a loop (re-parses everything each time) and avoid appending one-by-one to the live list.Follow-ups they push on- Why is innerHTML += in a loop both slow and unsafe?
- When would you use virtualization instead of inserting all 1,000 nodes?
Red flag Appending each node directly to the live DOM in the loop, or using innerHTML += which re-parses the whole list every iteration.
source: MDN — DocumentFragment ↗ -
What is a block formatting context (BFC), and name two ways to create one. Why is it useful?
A block formatting context is a self-contained region of layout where block boxes lay out and floats are managed independently of the outside. Inside a BFC, vertical margins don't collapse with elements outside it, and the BFC contains its floated children.
Ways to create one:
overflowother thanvisible(e.g.overflow: hidden/auto),display: flow-root(the purpose-built, side-effect-free option), being a flex/grid item,display: inline-block, or floating/absolute positioning.Classic uses: clearing floats (a floated child no longer overflows its parent's height), and stopping margin collapse between a parent and child.
display: flow-rootis the modern, intention-revealing way to do both.What a strong answer coversA BFC is an isolated layout region; floats and margins inside don't leak out.
Create with
display: flow-root,overflow≠ visible, flex/grid item, float, or absolute.Contains floats (no parent collapse) and blocks external margin collapsing.
display: flow-rootis the clean, side-effect-free way to establish one.
Follow-ups they push on- Why was `overflow: hidden` historically used to clear floats?
- What advantage does display: flow-root have over the overflow hack?
Red flag Using `overflow: hidden` to clear floats and accidentally clipping content or scrollbars; `display: flow-root` avoids those side effects.
source: MDN — Block formatting context ↗ -
What is a stacking context, and why might a higher z-index element still appear behind a lower one?
z-indexonly orders elements within the same stacking context. A stacking context is a self-contained layer: once formed, its children are painted as a unit, and theirz-indexvalues cannot escape it.So if element A (z-index: 9999) lives inside a parent that forms a stacking context with a low z-index, and element B (z-index: 1) is in a *sibling* context with a higher one, B paints on top — A's huge z-index is meaningless across contexts.
New stacking contexts are created by more than
position + z-index:opacity < 1,transform,filter,will-change,isolation: isolate, and being a flex/grid child with a z-index, among others. This is the usual cause of 'my z-index isn't working'.What a strong answer coversz-indexis only comparable within one stacking context, never across them.Once an ancestor forms a context, a child's z-index is trapped inside it.
Contexts are created by
position+z-indexbut alsoopacity < 1,transform,filter,will-change,isolation: isolate.A z-index:9999 child of a low context loses to a z-index:1 element in a higher-ranked sibling context.
Quick self-checkWhich of these does NOT, by itself, create a new stacking context?
-
Any opacity below 1 creates a stacking context.
-
A non-none transform creates a stacking context.
-
Correct — static positioning alone never forms a stacking context (and z-index is ignored on it).
-
This property exists precisely to create a stacking context.
Follow-ups they push on- Name three properties besides position that create a stacking context.
- How does `isolation: isolate` help contain z-index without side effects?
Red flag Assuming z-index is globally comparable. A larger z-index loses if its element sits in a lower-ranked ancestor stacking context.
source: MDN — Stacking context ↗
5.3 JavaScript that matters for the frontend 18
-
What is the difference between == and ===, and name a coercion gotcha.
===is strict equality: no type coercion — different types are never equal.==is loose equality: it coerces operands to a common type first, which produces surprising results.Gotchas:
0 == ""istrue,0 == "0"istrue, but"" == "0"isfalse(not transitive).null == undefinedistrue, yetnull == 0isfalse.NaN === NaNisfalse.Rule: default to
===; the one common, intentional==isx == nullto catch bothnullandundefined.Follow-ups they push on- Why is NaN not equal to itself, and how do you test for it?
- What does the abstract equality algorithm do for object vs primitive comparisons?
Red flag Claiming == is just === plus 'minor type stuff', then getting tripped by the non-transitive empty-string/zero cases.
source: MDN — Equality comparisons and sameness ↗ -
What is the difference between null and undefined, and what does typeof return for each?
undefinedmeans a variable has been declared but not assigned, a missing function argument, a missing object property, or a function with noreturn. The engine produces it.nullis an intentional 'no value' that *you* assign to signal emptiness.The famous quirk:
typeof undefinedis"undefined", buttypeof nullis"object"— a long-standing bug kept for backward compatibility. They are loosely equal (null == undefinedistrue) but not strictly equal (null === undefinedisfalse).Use
x == nullto test for both at once, or??(nullish coalescing) which treats onlynull/undefinedas missing.What a strong answer coversundefined: engine-produced 'not assigned / missing'.null: developer-assigned 'intentionally empty'.typeof undefined === 'undefined';typeof null === 'object'(a historical bug).null == undefinedis true;null === undefinedis false.??treats only null/undefined as missing, unlike||which also catches 0/''/false.
Quick self-checkWhat does typeof null evaluate to?
-
There is no 'null' typeof string; this is a common wrong guess.
-
That's typeof undefined, not typeof null.
-
Correct — a historical bug kept for backward compatibility.
-
typeof never throws here; it simply returns 'object'.
Follow-ups they push on- Why does ?? differ from || for falsy values like 0 and ''?
- How do you reliably check that a value is null or undefined but not 0/''?
Red flag Expecting typeof null to be 'null' (it's 'object'), or using || where ?? is needed and accidentally treating 0/'' as missing.
source: MDN — null ↗ -
What does this print, and why? for (var i = 0; i < 3; i++) { setTimeout(() => console.log(i), 1); }
It prints
3,3,3.varis function-scoped, so all three callbacks close over the samei. ThesetTimeoutcallbacks run after the synchronous loop finishes, by which pointihas been incremented to3.Fixes: use
let(block-scoped — each iteration gets a fresh binding, printing0 1 2); or capture per-iteration with an IIFE(j => setTimeout(() => console.log(j), 1))(i).Follow-ups they push on- Change var to let — what prints now and why?
- How does the IIFE version create a separate closure per iteration?
Red flag Answering 0 1 2 for the var version. The classic mistake is forgetting var is shared and the timers fire after the loop.
source: lydiahallie/javascript-questions (Q2) ↗ -
What does this print, and why? let count = 0; const fns = []; for (let i = 0; i < 3; i++) { fns.push(() => i); } console.log(fns.map((f) => f()));
It logs
[0, 1, 2].With
let, the loop creates a fresh binding ofifor each iteration, so each arrow closes over a differentiholding that iteration's value. (countis a red herring — it's never touched.)If this used
varinstead, all three closures would share one function-scopedi, and after the loop finishediwould be3, so it would log[3, 3, 3]. This is the canonical demonstration of whyletfixed the classic loop-closure bug.What a strong answer coversletgives each iteration its own binding of the loop variable.Each closure captures its iteration's
i, so the result is[0, 1, 2].With
var(function-scoped, one shared binding) it would be[3, 3, 3].Closures capture variables (bindings), not snapshot values.
Quick self-checkWhat is logged?
-
Correct — `let` creates a fresh `i` binding each iteration, captured by each closure.
-
That's the `var` result; `let` doesn't share one binding.
-
The arrows return `i` as it was each iteration: 0, 1, 2 — not shifted by one.
-
Each closure returns a valid captured number, not undefined.
Follow-ups they push on- Rewrite this with var to get [3, 3, 3], then explain the fix.
- How does this relate to the setTimeout-in-a-loop classic?
Red flag Answering [3, 3, 3] for the `let` version — that's the `var` behavior. let creates a new binding per iteration.
source: MDN — Closures (creating closures in loops) ↗ -
What does this print, and why? const obj = { name: 'obj', greet() { setTimeout(function () { console.log(this.name); }, 0); }, }; obj.greet();
It logs
undefined(in a browser,thisis the global object, wherenameis''; in strict mode/modulesthisisundefinedand it would throw).The inner
functionpassed tosetTimeoutis a plain function called by the timer, not as a method ofobj. Itsthisis therefore notobj— implicit binding only happens forobj.method()call syntax. The timer invokes it as a bare function.Fixes: use an arrow function in the timeout (inherits
greet'sthis), captureconst self = this, or.bind(this). This is the single most commonthis-loss bug in callbacks.What a strong answer coversthisis set by the call site; the timer calls the callback as a plain function.Plain-function
thisis the global object (sloppy mode) orundefined(strict/module).An arrow function in setTimeout inherits the enclosing method's
this(=obj).Alternatives:
const self = thiscapture, or.bind(this).
Quick self-checkWhat logs (assume a non-strict browser global where name is '')?
-
Only true if the callback were an arrow function or bound; a plain function loses `this`.
-
Correct — the timer calls the plain function with global (or undefined) `this`, not `obj`.
-
In sloppy mode `this` is the global object, so no error — it reads the global `name`.
-
`this` defaults to the global object in sloppy mode, never null.
Follow-ups they push on- Rewrite greet so it logs 'obj'.
- Why does an arrow function fix this but a regular function doesn't?
Red flag Assuming the callback inherits `obj` as `this` because it's defined inside a method. Only the call site sets a normal function's `this`.
source: MDN — this (callbacks) ↗ -
What is the difference between call, apply, and bind?
All three set a function's
thisexplicitly; they differ in when it runs and how arguments are passed.call(thisArg, a, b)— invokes immediately, arguments passed individually.apply(thisArg, [a, b])— invokes immediately, arguments passed as an array. (Mnemonic: Apply = Array.)bind(thisArg, a)— does not invoke; returns a new function withthis(and any leading args) permanently fixed. You call that later. A bound function cannot be re-bound, andnewon it ignores the boundthis.With spread,
call(...args)covers most apply cases today.What a strong answer coverscall: invoke now, args listed individually.apply: invoke now, args as an array (Apply = Array).bind: returns a new permanently-bound function; doesn't invoke.A bound function's
thiscan't be overridden by a later call/bind.
Follow-ups they push on- Can you re-bind a function that's already bound?
- How does spread syntax make apply less necessary?
Red flag Mixing up apply (array) and call (list), or thinking bind invokes the function immediately — it returns a new one.
source: MDN — Function.prototype.bind() ↗ -
Implement a throttle function, and explain how it differs from debounce.
Throttle guarantees
fnruns at most once perwaitwindow, no matter how often it's called — good for scroll/resize/mousemove. Debounce waits until calls *stop* forwaitms, then fires once — good for search-as-you-type.function throttle(fn, wait) {let last = 0;return function (...args) {const now = Date.now();if (now - last >= wait) {last = now;fn.apply(this, args);}};}This is a leading-edge throttle: it fires immediately, then ignores calls until the window elapses. The timestamp lives in a closure, and
fn.apply(this, args)forwards context and arguments.What a strong answer coversThrottle: at most one call per time window (steady cadence under continuous events).
Debounce: fires only after calls go quiet for
waitms.Throttle suits scroll/resize; debounce suits typeahead/validation.
The closure holds the last-run timestamp; forward
this/argsvia apply.
Follow-ups they push on- Add a trailing-edge call so the final event isn't dropped.
- When would you choose throttle over debounce for a scroll handler?
Red flag Implementing debounce and calling it throttle (resetting a timer on each call is debounce). Also dropping the trailing call so the last event is lost.
source: GreatFrontend — Throttle ↗ -
What is the difference between a shallow copy and a deep copy, and how do you make each?
A shallow copy duplicates only the top level; nested objects/arrays are still shared references. So mutating a nested value affects both copies. Make one with
{...obj},Object.assign({}, obj), orarr.slice().A deep copy recursively clones every level, so the copy is fully independent. Modern way:
structuredClone(obj)(handles Dates, Maps, Sets, cyclic refs). The old hackJSON.parse(JSON.stringify(obj))works only for plain JSON-safe data — it drops functions,undefined, andSymbols, and turnsDateinto a string.Key point: spread is shallow, so a nested array inside a spread copy is still linked to the original.
What a strong answer coversShallow copy shares nested references; spread/
Object.assign/sliceare shallow.Deep copy clones every level into an independent structure.
structuredClone()is the modern deep-copy API (handles Dates/Maps/Sets/cycles).JSON.parse(JSON.stringify(x))loses functions,undefined, Symbols, and Dates.
Follow-ups they push on- Why does the spread operator not deep-copy nested arrays?
- What types does JSON.stringify silently drop or mangle?
Red flag Believing spread or Object.assign deep-copies — nested objects stay shared. Reaching for JSON round-trip on data containing Dates/functions/undefined.
source: MDN — Shallow copy / Deep copy (structuredClone) ↗ -
What is the difference between function declarations and function expressions with respect to hoisting?
A function declaration (
function foo() {}) is hoisted whole — both its name and body — so you can call it on a line *above* where it's written.A function expression (
const foo = function () {}or an arrow) follows variable hoisting rules. Withconst/let, the binding is hoisted but in the temporal dead zone, so calling it early throwsReferenceError. Withvar, the variable hoists asundefined, so calling it early throwsTypeError: foo is not a function(it'sundefined, not callable yet).So declarations are usable before their line; expressions are not, and the error you get depends on
varvslet/const.What a strong answer coversFunction declarations are fully hoisted (callable before their definition).
Function expressions follow the variable's hoisting: TDZ for
let/const,undefinedforvar.Calling a
var-assigned expression early →TypeError(not a function).Calling a
let/constexpression early →ReferenceError(TDZ).
Quick self-checkWhat happens? foo(); var foo = function () { return 1; };
-
The expression isn't assigned yet at the call site.
-
`var foo` is hoisted, so foo exists — it's just undefined.
-
Correct — `var foo` hoists as undefined; calling undefined throws a TypeError.
-
The code is syntactically valid; the error is at runtime.
Follow-ups they push on- What error do you get calling a var function expression before assignment, and why?
- Are named function expressions hoisted by their name? (No — only inside their own scope.)
Red flag Assuming all functions are hoisted. Only declarations are; expressions hoist per their variable's rules (TDZ or undefined).
source: MDN — Hoisting ↗ -
What does this print? const shape = { radius: 10, diameter() { return this.radius * 2; }, perimeter: () => 2 * Math.PI * this.radius }; console.log(shape.diameter()); console.log(shape.perimeter());
It prints
20and thenNaN.diameteris a regular method: called asshape.diameter(),thisisshape, sothis.radiusis10→20.perimeteris an arrow function: arrows do not get their ownthis; they use the lexically enclosingthis(here the module/global scope), whereradiusis undefined.2 * Math.PI * undefined→NaN.Follow-ups they push on- Rewrite perimeter so it works.
- Why are arrow functions a bad choice for object methods but a good choice for callbacks?
Red flag Assuming the arrow's `this` is the object. Arrows ignore the call site and bind `this` lexically.
source: lydiahallie/javascript-questions (Q3) ↗ -
What does this print? function sayHi() { console.log(name); console.log(age); var name = "Lydia"; let age = 21; } sayHi();
It logs
undefined, then throws aReferenceError.var nameis hoisted and initialized toundefined, so the first log readsundefined.let ageis hoisted too but not initialized — it sits in the temporal dead zone until its declaration runs. Accessing it before that line throwsReferenceError: Cannot access 'age' before initialization, so the second log never completes.Follow-ups they push on- What exactly is the temporal dead zone?
- How does hoisting differ for function declarations vs function expressions?
Red flag Saying both are undefined, or that let is 'not hoisted at all'. It is hoisted but uninitialized (TDZ).
source: lydiahallie/javascript-questions (Q1) ↗ -
What is a closure? Give a practical use case.
A closure is a function bundled with references to the variables from the scope where it was defined. The inner function keeps those variables alive even after the outer function returns.
Use cases: private state (a counter factory where the count is inaccessible from outside), partial application / currying, memoization caches, and stateful callbacks like the timer ID inside a
debounce.Example:
function makeCounter() { let n = 0; return () => ++n; }const c = makeCounter(); c(); // 1—nis private and persists.Follow-ups they push on- How do closures cause memory leaks if you are not careful?
- How does debounce use a closure to remember the timer ID?
Red flag Defining a closure only as 'a function inside a function' without mentioning that it captures and persists the enclosing variables.
source: MDN — Closures ↗ -
What is event delegation, and why attach one listener to a parent instead of many to children?
Event delegation exploits bubbling: instead of binding a listener to every child, you bind one to a common ancestor and inspect
event.targetto find which child triggered it.Benefits: fewer listeners (lower memory), and it automatically handles dynamically added children without rebinding.
Example:
list.addEventListener("click", (e) => { const li = e.target.closest("li"); if (li) handle(li.dataset.id); });Use
event.targetfor the actual origin andevent.currentTargetfor the element the listener is on.Follow-ups they push on- What is the difference between event.target and event.currentTarget?
- Which events do not bubble, and how do you delegate those (capture phase / focusin)?
Red flag Confusing target with currentTarget, or assuming every event bubbles (focus/blur do not; focusin/focusout do).
source: MDN — Event bubbling and delegation ↗ -
Implement a debounce function.
Debounce delays calling
fnuntilwaitms have passed since the last call; every new call resets the timer. The timer id lives in a closure.function debounce(fn, wait) {let t;return function (...args) {clearTimeout(t);t = setTimeout(() => fn.apply(this, args), wait);};}Using a normal function (not an arrow) for the wrapper preserves the caller's
this, andfn.apply(this, args)forwards both. Common in search-as-you-type and resize handlers.Follow-ups they push on- How does throttle differ from debounce?
- Add a leading-edge (immediate) option.
- Why must the wrapper forward `this` and `args`?
Red flag Hoisting the timer outside the returned function incorrectly (shared across instances), or dropping `this`/`args` so the debounced fn loses context.
source: GreatFrontend — Debounce ↗ -
What is the difference between a microtask and a macrotask, and which queue drains first?
After each macrotask (and after the current synchronous run-to-completion finishes), the event loop drains the entire microtask queue before taking the next macrotask or rendering.
Microtasks: Promise
.then/.catch/.finallycallbacks,queueMicrotask,MutationObserver. They run as soon as the stack is empty, ahead of any timer.Macrotasks (tasks):
setTimeout,setInterval, I/O, message events, UI events. One per loop turn.Consequence: a resolved Promise always runs before a
setTimeout(0). And an unbounded chain of microtasks can starve rendering and timers, because the loop won't move on until the microtask queue is empty.What a strong answer coversOrder each turn: run a macrotask → drain all microtasks → (maybe render) → next macrotask.
Microtasks: Promise callbacks,
queueMicrotask,MutationObserver.Macrotasks:
setTimeout/setInterval, I/O, UI/message events.Resolved Promise beats
setTimeout(0); runaway microtasks can starve render/timers.
Quick self-checkWhich of these schedules a MICROTASK?
-
A macrotask (task) — runs after the microtask queue drains.
-
Correct — `.then` callbacks are microtasks.
-
A repeating macrotask, not a microtask.
-
A render-phase callback (before paint), not a microtask.
Follow-ups they push on- Why can microtasks starve the UI but a queue of setTimeouts can't as easily?
- Where does requestAnimationFrame sit relative to micro/macrotasks?
Red flag Thinking setTimeout(0) runs before a resolved Promise. Microtasks always drain fully before the next macrotask.
source: MDN — In depth: Microtasks and the JavaScript runtime environment ↗ -
How is `this` determined at call time? Walk through the binding rules.
For a normal function,
thisdepends on how it is called, checked in priority order:1.
new Fn()—thisis the freshly created object.
2.fn.call/apply/bind(obj)—thisis the explicitobj.
3.obj.fn()—thisis the receiverobj(implicit binding).
4. Plainfn()—thisisundefinedin strict mode, else the global object.Arrow functions ignore all of the above: they capture
thislexically from where they were defined. That is why arrows are handy in callbacks but wrong as object methods.Follow-ups they push on- Why does passing obj.method as a callback lose `this`?
- What does bind return, and can you re-bind a bound function?
Red flag Saying `this` is fixed by where a function is defined (true only for arrows). For normal functions it is the call site.
source: MDN — this ↗ -
Explain the event loop, the call stack, and the difference between microtasks and macrotasks. What prints? console.log(1); setTimeout(() => console.log(2), 0); Promise.resolve().then(() => console.log(3)); console.log(4);
It prints
1, 4, 3, 2.Synchronous code runs first on the call stack:
1, then4.When the stack is empty, the event loop drains the entire microtask queue before any macrotask.
Promise.thenis a microtask →3.setTimeoutis a macrotask →2, runs last.So: sync (
1,4) → all microtasks (3) → next macrotask (2).Follow-ups they push on- Where do queueMicrotask, MutationObserver, and requestAnimationFrame fit?
- Why can a runaway chain of microtasks starve rendering and timers?
Red flag Predicting `1 4 2 3`. The trap is thinking setTimeout(0) beats a resolved Promise — microtasks always drain first.
source: MDN — In depth: Microtasks and the JavaScript runtime environment ↗ -
How does prototypal inheritance work? What is the difference between __proto__ and prototype?
Every object has an internal link (
[[Prototype]], exposed as__proto__) to another object. Property lookups walk this prototype chain until found or it hitsnull.prototypeis a property on constructor functions: when you donew Fn(), the new object's__proto__is set toFn.prototype. So instances delegate toFn.prototypefor shared methods.Mnemonic:
prototypelives on the constructor;__proto__(better:Object.getPrototypeOf) lives on instances and points at the constructor'sprototype.Follow-ups they push on- How do ES6 classes map onto prototypes under the hood?
- Why put methods on the prototype instead of in the constructor?
Red flag Mixing up `prototype` (on the constructor) and `__proto__` (on the instance), or thinking class syntax is not prototype-based — it is sugar.
source: MDN — Inheritance and the prototype chain ↗
5.4 Browser networking & app architecture 17
-
What is the same-origin policy, and what problem does CORS solve?
The same-origin policy stops a page on origin A (scheme + host + port) from reading responses from origin B by default — it limits how a document loaded from one origin can interact with a resource from another, which protects user credentials.
CORS is the server's controlled opt-in: the server returns headers like
Access-Control-Allow-Origintelling the browser it is allowed to expose the response to that origin. CORS does not turn off security — it lets a server selectively relax the same-origin policy for trusted callers.Follow-ups they push on- What exactly counts as 'same origin'?
- Is CORS enforced by the browser or the server?
Red flag Saying CORS is a thing the client enables to bypass security. It is a server opt-in; the browser enforces it.
source: MDN — Cross-Origin Resource Sharing (CORS) ↗ -
What is tree-shaking, and what does your code need to do for it to work?
Tree-shaking is dead-code elimination at the module level: the bundler keeps only the exports you actually import and drops the rest, shrinking the bundle.
It relies on ES modules' static structure —
import/exportare statically analyzable, so the bundler can trace which exports are used. CommonJS (require) is dynamic and resists shaking.For it to work well: use ESM, import named members (not the whole namespace), avoid modules with side effects at import time, and mark packages
"sideEffects": falseinpackage.jsonso the bundler can safely prune. A stray top-level side effect can force a whole module to be kept.What a strong answer coversTree-shaking removes unused exports to reduce bundle size.
Requires static ESM
import/export; CommonJSrequireis too dynamic.Side-effectful modules can't be safely dropped;
"sideEffects": falsesignals safety.Import named members, not
import * as everything.
Follow-ups they push on- Why can CommonJS modules not be tree-shaken reliably?
- What does the package.json "sideEffects" field do?
Red flag Assuming any unused import is automatically dropped. Side effects, CommonJS, or namespace imports can defeat tree-shaking.
source: MDN — Tree shaking ↗ -
What is code-splitting and lazy loading, and how do they improve load performance?
Code-splitting breaks one large bundle into smaller chunks that can be loaded on demand instead of all upfront. Lazy loading is fetching a chunk only when it's actually needed — typically via the dynamic
import()expression, which returns a Promise and tells the bundler to emit a separate chunk.The payoff is a smaller initial bundle: less JS to download, parse, and execute before the page is interactive, which improves load time and INP. Common split points: per-route (load a route's code on navigation) and per-component (a heavy modal/chart loaded on first interaction).
React pairs
React.lazy(() => import('./X'))with<Suspense>for a fallback while the chunk loads.What a strong answer coversCode-splitting = multiple chunks; lazy loading = fetch a chunk on demand.
Dynamic
import()returns a Promise and creates a separate bundle chunk.Shrinks the initial bundle → faster parse/execute → better TTI/INP.
Split by route and by heavy on-interaction components.
Follow-ups they push on- How does React.lazy + Suspense work together?
- What's the risk of splitting too aggressively (many tiny chunks / waterfalls)?
Red flag Lazy-loading everything (request waterfalls, layout shift on load), or splitting code that's needed for first paint and delaying it.
source: MDN — JavaScript modules (dynamic import) ↗ -
What causes Cumulative Layout Shift (CLS), and how do you prevent it?
CLS measures unexpected movement of visible content during loading — content jumping as late-arriving elements push things around. Good is ≤ 0.1 at the 75th percentile.
Common causes: images/videos/ads without reserved space; web fonts swapping in and reflowing text (FOUT); content injected above existing content; and animating layout properties.
Fixes: always set
width/height(oraspect-ratio) on media so the browser reserves the box; reserve space for ads/embeds; usefont-display: optional/swapplus size-matched fallbacks to minimize font reflow; and never insert content above what the user is viewing unless in response to an interaction.What a strong answer coversCLS = sum of unexpected layout shifts; target ≤ 0.1 (p75).
Top cause: media without dimensions — set
width/heightoraspect-ratio.Reserve space for ads/embeds and avoid injecting content above the fold.
Tame font swap (FOUT) with
font-displayand metric-matched fallbacks.
Follow-ups they push on- Why does specifying width and height on an <img> prevent shift even before it loads?
- How can web fonts cause layout shift, and how do you reduce it?
Red flag Omitting image dimensions (relying on CSS alone) so the browser can't reserve space, or inserting banners above current content after load.
source: web.dev — Cumulative Layout Shift (CLS) ↗ -
How does HTTP caching work for assets? Explain Cache-Control, ETags, and cache busting.
Cache-Control is the primary header.
max-age=Nlets the browser use a cached copy without revalidating for N seconds;no-cachemeans 'cache it but revalidate before use';no-storemeans never cache;immutablepromises the file won't change.ETags enable conditional revalidation: the server sends a content hash; the browser later sends
If-None-Match, and the server returns a tiny304 Not Modifiedif unchanged — saving the payload but not the round-trip.Cache busting combines both worlds: give bundled assets a content hash in the filename (
app.a1b2c3.js) and serve them withmax-age=31536000, immutable. When content changes, the filename changes, so you cache forever yet always serve fresh files. Keep the HTML entry point short-lived.What a strong answer coversCache-Control: max-ageskips revalidation;no-cacherevalidates;no-storenever caches.ETag +
If-None-Match→304 Not Modifiedavoids re-downloading unchanged bytes.Cache busting: content-hashed filenames served
immutablelong-lived.Hash the assets, keep the HTML short-lived so new asset URLs are discovered.
Follow-ups they push on- Why is no-cache not the same as no-store?
- Why hash filenames instead of just lowering max-age?
Red flag Thinking `no-cache` means 'don't cache' (it means revalidate), or setting long max-age on un-hashed filenames so users get stale files.
source: MDN — HTTP caching ↗ -
What is the virtual DOM, and is it actually faster than direct DOM manipulation?
The virtual DOM is an in-memory tree of lightweight JS objects describing the UI. On a state change, the framework builds a new tree, diffs it against the previous one (reconciliation), and applies the minimal set of real DOM mutations.
It is not magically faster than hand-optimized direct DOM writes — diffing has its own cost. Its value is a declarative model: you describe the target UI and let the framework batch updates and avoid redundant reflows, which is faster than naive re-rendering and far easier to reason about than manual surgery.
Follow-ups they push on- Why do React lists need stable keys during reconciliation?
- How do fine-grained reactive frameworks (Solid/Svelte) avoid a VDOM entirely?
Red flag Asserting the virtual DOM is always faster than direct manipulation. The real win is the declarative programming model plus batched updates.
source: React docs — Preserving and Resetting State (reconciliation) ↗ -
What problem do bundlers and transpilers solve? Distinguish Webpack/Vite from Babel.
A bundler (Webpack, Vite, esbuild, Rollup) builds a dependency graph from your modules and produces a few optimized files — handling code-splitting, tree-shaking, asset imports, and minification. It solves 'too many modules and too many requests' and lets the browser load less.
A transpiler (Babel, the TS compiler, SWC) converts source into a form browsers/runtimes accept: modern JS → older JS, JSX →
createElementcalls, TypeScript → JS.They complement each other: a bundler usually runs a transpiler step. Vite additionally serves native ES modules in dev for instant startup.
Follow-ups they push on- What is tree-shaking and what does it require to work (ESM, side-effect-free)?
- Why is Vite's dev server fast compared to a classic Webpack dev build?
Red flag Treating bundler and transpiler as synonyms. Babel transforms syntax; Webpack/Vite assemble and optimize the module graph.
source: Vite — Why Vite ↗ -
What are the Core Web Vitals, and what does each measure?
Three field metrics for real-user experience, judged at the 75th percentile:
- LCP (Largest Contentful Paint) — loading; time for the largest content element to render. Good ≤ 2.5s.
- INP (Interaction to Next Paint) — responsiveness; the latency of interactions. Good ≤ 200ms. INP became a stable Core Web Vital in 2024, replacing FID.
- CLS (Cumulative Layout Shift) — visual stability; how much content unexpectedly shifts. Good ≤ 0.1.Typical fixes: optimize the LCP image / preload it; cut long tasks for INP; reserve space (width/height, aspect-ratio) for CLS.
Follow-ups they push on- Why did INP replace FID?
- What causes layout shift and how do you prevent it (dimensions, font swap)?
Red flag Citing FID as a current Core Web Vital (it was replaced by INP in 2024), or mixing up which metric covers loading vs responsiveness vs stability.
source: web.dev — Web Vitals ↗ -
What are the essentials of web accessibility (a11y) a frontend engineer must get right?
Start with semantic HTML — native
<button>,<a>,<label>, headings, and landmark elements give you roles, focus, and keyboard behavior for free.Key practices: meaningful
alttext on images (emptyalt=""for decorative ones); every form control associated with a<label>; full keyboard navigation with a visible focus indicator and logical tab order; sufficient color contrast; and ARIA only to fill gaps semantics cannot cover (custom widgets) — never to paper over a non-semantic<div>.Test with keyboard-only, a screen reader, and automated tools (axe/Lighthouse).
Follow-ups they push on- What does 'the first rule of ARIA' (don't use ARIA if a native element exists) mean?
- How do you make a custom dropdown keyboard-accessible?
Red flag Reaching for ARIA first instead of semantic HTML, removing focus outlines without a replacement, or treating alt text as optional.
source: MDN — What is accessibility? ↗ -
Compare cookies, localStorage, and sessionStorage for storing data in the browser.
Cookies (~4KB) are sent to the server with every matching request. Best for auth/session tokens, ideally
HttpOnly(JS cannot read them, mitigating XSS theft),Secure, andSameSite.localStorage (~5–10MB) is JS-only, persists until cleared, and is not sent to the server. Good for non-sensitive client state. Vulnerable to XSS, so never store tokens that must stay secret.
sessionStorage is like localStorage but scoped to a single tab and cleared when it closes.
Follow-ups they push on- Why store auth tokens in HttpOnly cookies rather than localStorage?
- What does SameSite do for CSRF protection?
Red flag Recommending localStorage for auth tokens (readable by any XSS), or thinking localStorage is sent to the server like cookies.
source: MDN — Web Storage API ↗ -
What is the difference between fetch and XMLHttpRequest, and does fetch reject on a 404?
fetchis the modern Promise-based API;XMLHttpRequestis the older event/callback-based one.fetchis cleaner, streams responses, and integrates withAbortControllerfor cancellation.Key gotcha:
fetchonly rejects on network failure, not on HTTP error status. A404or500still resolves — you must checkresponse.ok(orresponse.status) yourself and throw if it is false.response.json()returns a Promise, so you await it twice (the response, then the body).Follow-ups they push on- How do you cancel a fetch request?
- Does fetch send cookies by default cross-origin (credentials)?
Red flag Assuming a 404 lands in the catch block. It does not — fetch resolves; only network errors reject. Forgetting to check response.ok.
source: MDN — Using the Fetch API ↗ -
What is hydration in SSR, and why can it be costly? What problems does it cause?
Hydration is the client-side step where a framework takes server-rendered HTML and attaches event listeners and reconstructs component state, making the static markup interactive. The server sends visible HTML fast (good first paint), then the browser must download the JS, re-run the components, and 'wire up' the existing DOM.
It's costly because you effectively render twice — once on the server, once on the client — and the page can look ready but not respond to clicks until hydration finishes (the 'uncanny valley' / poor INP).
Mitigations: less client JS, partial/progressive hydration, islands architecture (hydrate only interactive bits), streaming SSR, and server components that never ship to the client.
What a strong answer coversHydration = attaching listeners/state to server-rendered HTML to make it interactive.
Work is duplicated: render on server, then re-render/wire-up on client.
Page can appear ready but be unresponsive until hydration completes (hurts INP).
Fixes: islands, partial/progressive hydration, streaming, server components.
Follow-ups they push on- How does an islands architecture (e.g. Astro) reduce hydration cost?
- What is the 'uncanny valley' of a hydrating page?
Red flag Thinking SSR alone makes a page interactive. The HTML is visible immediately, but interactivity waits for hydration.
source: web.dev — Rendering on the web (hydration) ↗ -
What is XSS, what are the main types, and how do you defend against it?
Cross-site scripting (XSS) is injecting attacker-controlled script that runs in a victim's page with the site's privileges (reading cookies, DOM, making requests as the user).
Types: stored (malicious input saved server-side and served to others), reflected (script bounced off a request like a search param), and DOM-based (client JS writes untrusted data into the DOM, e.g. via
innerHTML).Defenses: contextual output encoding/escaping (treat data as data, not markup); avoid
innerHTML/dangerouslySetInnerHTMLwith untrusted input — usetextContent; sanitize rich HTML with a vetted library (DOMPurify); set a strong Content-Security-Policy; and mark session cookiesHttpOnlyso injected JS can't read them.What a strong answer coversXSS runs attacker script in the user's session context.
Three types: stored, reflected, DOM-based.
Primary defense: contextual output encoding; prefer
textContentoverinnerHTML.Layer with CSP, HTML sanitization (DOMPurify), and HttpOnly cookies.
Follow-ups they push on- Why does an HttpOnly cookie limit the damage of an XSS?
- How does a Content-Security-Policy mitigate XSS?
Red flag Treating input validation as sufficient. The core fix is output encoding for the right context; CSP and sanitization are defense-in-depth, not a single switch.
source: OWASP — Cross Site Scripting (XSS) ↗ -
What is CSRF, and how is it different from XSS? How do you defend against it?
CSRF (cross-site request forgery) tricks a logged-in user's browser into sending an unwanted state-changing request to your site. It exploits the fact that browsers attach cookies automatically, so a forged request from another site rides the victim's session.
Key difference from XSS: XSS is a *code injection* (attacker runs script in your page); CSRF is a *request forgery* that needs no script on your page — it abuses ambient cookie auth. XSS can defeat most CSRF defenses, so fixing XSS comes first.
Defenses:
SameSitecookies (Lax/Strict) so cookies aren't sent on cross-site requests; anti-CSRF tokens (a per-session secret the attacker can't read); and checkingOrigin/Referer. Avoid using GET for state changes.What a strong answer coversCSRF: forged state-changing request riding the victim's auto-sent cookies.
XSS injects script; CSRF forges a request and needs no script on your page.
Defenses:
SameSitecookies, anti-CSRF tokens, Origin/Referer checks.Never perform state changes on GET; XSS can bypass CSRF tokens, so fix XSS too.
Follow-ups they push on- How does SameSite=Lax block a typical CSRF attack?
- Why can an XSS vulnerability defeat anti-CSRF tokens?
Red flag Conflating CSRF with XSS, or thinking a CSRF token alone is enough when an XSS hole can simply read it.
source: OWASP — Cross Site Request Forgery (CSRF) ↗ -
How would you optimize the Largest Contentful Paint (LCP) of a page?
LCP marks when the largest in-viewport element (often a hero image or headline block) renders; good is ≤ 2.5s at p75. Optimize the four phases of its timeline:
- TTFB: fast server / CDN, cache HTML, reduce redirects.
- Resource load delay: make the LCP image discoverable early — put it in the markup (not JS-injected),fetchpriority="high",<link rel="preload">, and *don't* lazy-load it.
- Resource load time: serve a right-sized, modern-format (AVIF/WebP), compressed image over a fast connection.
- Render delay: cut render-blocking CSS/JS so the element can paint.The single biggest lever is usually ensuring the LCP image is requested as early as possible and not deferred.
What a strong answer coversLCP = render time of the largest viewport element; good ≤ 2.5s (p75).
Break it into TTFB, load delay, load time, render delay and attack each.
Don't lazy-load the LCP image; make it discoverable early (
fetchpriority, preload, in-markup).Serve right-sized modern-format images; reduce render-blocking resources.
Follow-ups they push on- Why is lazy-loading the hero image an anti-pattern for LCP?
- How does fetchpriority="high" change the request ordering?
Red flag Lazy-loading or JS-injecting the hero image (the preload scanner can't find it early), or optimizing total page weight while ignoring the LCP element's own request timing.
source: web.dev — Optimize Largest Contentful Paint ↗ -
What is a CORS preflight request, and what triggers one versus a 'simple' request?
A preflight is an automatic
OPTIONSrequest the browser sends before the real request to ask the server whether the actual call is allowed. It carriesOrigin,Access-Control-Request-Method, andAccess-Control-Request-Headers.It is triggered by non-simple requests: methods other than GET/HEAD/POST, custom headers, or a
Content-Typeoutsideapplication/x-www-form-urlencoded,multipart/form-data, ortext/plain(e.g.application/json).A simple request skips preflight — but the server still must return
Access-Control-Allow-Originfor the script to read the response.Follow-ups they push on- Why does sending Content-Type: application/json trigger a preflight?
- How can Access-Control-Max-Age reduce preflight overhead?
Red flag Thinking simple requests need no CORS headers — they still need Access-Control-Allow-Origin to be readable. Forgetting application/json forces a preflight.
source: MDN — Preflight request ↗ -
Compare SPA, MPA, SSR, and SSG, and the tradeoffs of each.
MPA: server sends a full HTML page per navigation. Simple, great for content sites; full reloads between pages.
SPA: one HTML shell, JS renders routes client-side. Fast in-app navigation; weak initial load and SEO, needs JS to render anything.
SSR: server renders HTML per request, then hydrates on the client. Good first paint and SEO with dynamic data; higher server cost and TTFB.
SSG: HTML built at deploy time, served from a CDN. Fastest and cheapest; only fits content that does not change per request (or use ISR to revalidate).
Follow-ups they push on- What is hydration and why can it be costly?
- Where does ISR (incremental static regeneration) fit between SSR and SSG?
Red flag Conflating SSR with SSG (request-time vs build-time), or claiming SPAs are inherently bad for SEO without mentioning SSR/prerendering as the fix.
source: web.dev — Rendering on the web ↗
06 Senior Cross-Cutting 164 Q's
6.1 System design fundamentals 14
-
Explain the CAP theorem and how it actually informs a design decision.
CAP says that when a network partition happens, a distributed system can keep only two of Consistency, Availability, Partition tolerance — and since partitions are unavoidable in real networks, the real choice is C vs A during a partition.
CP (consistency over availability): on a partition, refuse or block requests rather than serve stale/conflicting data — pick this when correctness is non-negotiable (a bank balance, inventory you can oversell). AP (availability over consistency): keep serving on both sides of the partition and reconcile later (eventual consistency) — pick this when staleness is tolerable and uptime matters more (a social feed, a shopping cart, DNS).
The senior point: CAP only bites *during* a partition; the rest of the time you get both. And it is a spectrum — many stores let you tune consistency per request (e.g. quorum reads/writes), so you choose CP or AP per use case, not per company.
What a strong answer coversPartitions are inevitable, so the live tradeoff is Consistency vs Availability during a partition.
CP: refuse/block on partition to avoid stale data — banking, inventory, anything correctness-critical.
AP: stay up and reconcile later (eventual consistency) — feeds, carts, DNS.
CAP only constrains you during a partition; normally you get C and A both.
Often tunable per request (quorum reads/writes), so the choice is per use case, not absolute.
Quick self-checkDuring a network partition, an 'AP' system chooses to:
-
Correct — AP favors availability, accepting temporary inconsistency that is resolved after the partition heals.
-
That is CP behavior — sacrificing availability to avoid divergence.
-
Not a CAP option; the theorem is about C vs A, not total shutdown.
-
Impossible during a partition — that is precisely what CAP rules out.
Follow-ups they push on- Give a concrete system you'd build CP and one you'd build AP, and why.
- How do quorum reads/writes let you tune where you sit on the spectrum?
- What does 'eventual consistency' actually promise the client?
Red flag Stating you 'pick two of three' as a permanent architecture choice — partition tolerance is mandatory, so the decision is C-vs-A only when a partition occurs, and it can be tuned per request.
source: System Design Primer — CAP theorem ↗ -
Design a URL shortener (TinyURL / bit.ly). Walk me through it.
Clarify scope first: read-heavy (~100:1 reads:writes), so optimize the redirect path. Estimate: ~100M new URLs/day -> ~1.2K writes/s, ~120K reads/s; 7-char base62 = 62^7 ~= 3.5T codes, plenty for years. Storage ~500 bytes/row * 100M/day -> ~36TB over 2 years.
Core: a key-gen service maps short->long. Two strategies: (a) base62-encode a globally unique counter (e.g. a range-allocator / Snowflake-style ID) — no collisions, but reveals volume; (b) hash the long URL (MD5/SHA) and take a prefix, then collision-check. Store mapping in a KV store / sharded SQL; reads go cache-first (Redis, LRU on hot links) then DB.
Wrap up: use
301only if you do not need per-click analytics (browser caches it), else302; shard by hash of short code; push click analytics to a queue (Kafka) for async aggregation.Follow-ups they push on- How do you guarantee globally unique codes across shards without a single counter bottleneck?
- 301 vs 302 — which do you pick and what do you lose with each?
- How would custom aliases and link expiry change the design?
Red flag Jumping straight to a schema before clarifying the read/write ratio and scale. Also picking 301 while still wanting click analytics — the cached redirect never hits your server again.
source: ByteByteGo — Design A URL Shortener ↗ -
Why and how would you introduce a message queue between services? What does it buy you?
A queue (SQS, RabbitMQ) or a log (Kafka) decouples a producer from a consumer: the producer drops a message and moves on, the consumer processes it on its own schedule. That buys you three things — async (the user-facing request returns immediately while slow work happens in the background), buffering (a traffic spike fills the queue instead of overwhelming the consumer), and resilience (if the consumer is down, messages wait instead of being lost).
Use it for work that does not need a synchronous answer: sending email, generating thumbnails, fanning out notifications, ingesting events. You also gain independent scaling (add consumers to drain a backlog) and smoothing of bursty load.
The tradeoffs you must name: added operational complexity, eventual rather than immediate results, and the need to handle idempotency because most queues guarantee at-least-once delivery (the same message can arrive twice).
What a strong answer coversDecouples producer/consumer: async work, buffering of spikes, resilience when a consumer is down.
Use it for fire-and-forget work: email, thumbnails, notification fanout, event ingestion.
Enables independent scaling — add consumers to drain a backlog.
Most queues are at-least-once, so consumers must be idempotent (dedupe on a key).
Cost: more moving parts, eventual results, and ordering is not free (often per-partition only).
Follow-ups they push on- Why must queue consumers usually be idempotent?
- What's the difference between a queue (SQS/RabbitMQ) and a log (Kafka)?
- How does a dead-letter queue help, and when do messages land there?
Red flag Assuming exactly-once delivery and writing a non-idempotent consumer — at-least-once redelivery then double-charges, double-sends, or double-processes on the inevitable retry.
source: AWS — What is message queuing? ↗ -
Design a typeahead / search autocomplete service.
Clarify: top-k suggestions per prefix, ranked by popularity, with very low latency (every keystroke fires a request) and read-heavy load. Two halves — serving and data-gathering.
Serving: precompute the top-k completions for each prefix so a query is a single lookup, not a scan. A trie with the top-k cached at each node answers a prefix in O(prefix length); cache hot tries/results in Redis at the edge. Debounce on the client and cap suggestions so you do not hammer the backend.
Data-gathering (offline): aggregate query logs to count frequencies, then rebuild/update the trie periodically (e.g. via a batch job) rather than on every search — autocomplete tolerates being slightly stale. Wrap up: shard the trie by prefix range, discuss freshness vs cost of rebuild cadence, and personalization/spell-correction as extensions.
Follow-ups they push on- Why precompute top-k per prefix instead of querying at request time?
- How do you keep the suggestions fresh without rebuilding the trie on every query?
- How would you shard the trie across nodes?
Red flag Querying the database for matching terms on every keystroke and sorting at request time — that does not survive the read volume; the win is precomputing top-k per prefix offline and serving from a cached trie.
source: ByteByteGo — Design A Search Autocomplete System ↗ -
What is consistent hashing, and what specific problem does it solve that modulo hashing does not?
With naive
hash(key) % Nsharding, changing the node countNchanges the modulus, so almost every key remaps to a different node — adding or removing one cache/storage node reshuffles the entire keyspace and cold-starts everything.Consistent hashing maps both nodes and keys onto the same hash ring (0…2^m). A key is owned by the next node clockwise. Now adding or removing a node only remaps the keys between that node and its neighbor — roughly 1/N of keys, not all of them.
The refinement is virtual nodes: place each physical node at many points on the ring so load spreads evenly and removing a node redistributes its keys across many others instead of dumping them all on one neighbor. This is the standard partitioning scheme for distributed caches, Cassandra, and DynamoDB-style stores.
What a strong answer covershash(key) % Nremaps nearly all keys when N changes — catastrophic for a cache.Consistent hashing puts nodes + keys on a ring; a key goes to the next node clockwise.
Adding/removing a node only remaps ~1/N of keys (those between it and its neighbor).
Virtual nodes spread each physical node across many ring points for even load + smooth rebalancing.
It's the backbone of distributed caches, Cassandra, and DynamoDB-style partitioning.
Quick self-checkYou add one node to a cluster of N. Roughly what fraction of keys remap under consistent hashing vs `hash % N`?
-
Correct — consistent hashing moves only the keys near the new node; modulo changes the divisor and reshuffles almost everything.
-
Wrong for consistent hashing — its entire purpose is to bound key movement to ~1/N.
-
Wrong for `hash % N` — changing N changes the modulus, remapping nearly every key.
-
Adding capacity must move some keys onto the new node; zero movement is impossible.
Follow-ups they push on- Why do virtual nodes improve load balance and rebalancing?
- Roughly what fraction of keys move when you add one node to a ring of N?
- How does this connect to designing a distributed cache?
Red flag Saying consistent hashing 'avoids collisions' — it is about minimizing key movement when the node set changes, not about hash collisions; without virtual nodes load can still skew badly.
source: System Design Primer — Consistent hashing / sharding ↗ -
When and how do you add a cache to a read-heavy system, and what are the gotchas?
Add a cache when reads dominate, the same data is read far more often than it changes, and the database is the bottleneck. The most common pattern is cache-aside (lazy loading): the app reads the cache first; on a miss it reads the database, populates the cache with a TTL, and returns. Writes update the database and invalidate or update the cached entry. Alternatives are read-through/write-through (the cache layer itself loads/writes the DB) and write-back (write cache now, flush to DB async — fast but risks loss).
The gotchas are where seniority shows. Cache invalidation is the hard problem — stale data after a write if you forget to evict. Cache stampede / thundering herd: a hot key expires and thousands of requests hit the DB at once — mitigate with request coalescing, a short lock, or staggered TTLs. Cold start after a flush hammers the DB. And caching is for tolerable-staleness data — never cache something that must be strongly consistent (a bank balance) without care.
What a strong answer coversAdd a cache when reads ≫ writes, data is reused, and the DB is the bottleneck.
Cache-aside: read cache → miss → read DB → populate with TTL; writes invalidate the entry.
Alternatives: read-through/write-through (cache fronts the DB) and write-back (async flush, risks loss).
Invalidation is the hard part — stale reads after a write if you forget to evict.
Guard against stampede (hot key expiry → DB flood): coalescing, locks, staggered TTLs.
Quick self-checkIn the cache-aside pattern, what happens on a cache miss?
-
Correct — the application lazily loads on a miss and stores the result so subsequent reads hit the cache.
-
A miss is normal, not an error; the app should fall through to the source of truth.
-
That's read-through caching; in cache-aside the application does the DB read.
-
A miss is a read path, not a write; you populate the cache, you don't write the DB.
Follow-ups they push on- Why is cache invalidation famously the hard part of caching?
- How do you prevent a cache stampede when a popular key expires?
- What data should you NOT cache, and why?
Red flag Caching without an invalidation/TTL strategy — writes update the DB but leave stale entries in the cache, so users keep reading old data until the entry happens to expire.
source: System Design Primer — Caching ↗ -
Design a distributed rate limiter for a public API.
Clarify: client-side vs server-side (server-side), what dimensions to limit (per-user, per-IP, per-endpoint, global), and the action on limit (drop, queue, return
429withRetry-After). Pick an algorithm and justify it: token bucket (allows bursts, simple, most common), leaky bucket (smooths output), fixed window (cheap but boundary spikes), sliding-window log (accurate, memory-heavy), sliding-window counter (good accuracy/cost tradeoff).For a distributed fleet, counters must be shared: keep them in a central store like Redis, keyed by
userId:window, incremented atomically (e.g. a Lua script /INCR+EXPIRE) so the read-modify-write is race-free. Put the limiter at the edge / API gateway so rejected traffic never reaches your services.Discuss tradeoffs: local in-memory counters are fast but let bursts through across nodes; Redis adds a network hop and a single point to scale; allow a small over-limit margin to tolerate Redis latency.
Follow-ups they push on- Token bucket vs sliding-window counter — when do you prefer each?
- How do you keep the counter consistent across many API servers?
- What happens if Redis goes down — fail open or fail closed?
Red flag Using a non-atomic get-then-set on the counter, which races under concurrency and lets requests slip past the limit. Also putting the limiter behind the app instead of at the gateway.
source: ByteByteGo — Design A Rate Limiter ↗ -
Design a social media news feed (e.g. the Facebook/Twitter timeline).
Clarify: feed of posts from people you follow, ranked (recency or relevance), heavy read load. The core decision is fanout-on-write (push) vs fanout-on-read (pull).
Fanout-on-write: when a user posts, push the post id into every follower's precomputed feed cache. Reads are O(1) and fast — great for most users. But it explodes for celebrities with millions of followers (the hot-key / fanout problem).
Fanout-on-read: build the feed at request time by pulling recent posts from everyone the user follows. No write amplification, but reads are expensive and slow.
The standard answer is a hybrid: push for normal accounts, pull for a small set of high-follower accounts, then merge at read time. Cache assembled feeds (Redis), store posts in a sharded store, and rank with a separate scoring service.
Follow-ups they push on- How do you handle a celebrity with 50M followers under fanout-on-write?
- Where does ranking/ML scoring fit — write time or read time?
- How do you keep the feed cache from growing unbounded?
Red flag Committing to pure fanout-on-write without acknowledging the celebrity hot-key problem, or pure fanout-on-read and ignoring read latency at scale.
source: ByteByteGo — Design A News Feed System ↗ -
Design a distributed cache (like a multi-node Redis/Memcached layer).
Clarify: read-heavy lookups, low latency, data too large for one node's RAM, so partition across nodes. The key technique is consistent hashing: map both nodes and keys onto a hash ring so that adding/removing a node only remaps ~1/N of keys instead of remapping everything (which a naive
hash(key) % Nwould do).Discuss replication for availability (replica per shard, read from replicas), an eviction policy (LRU/LFU) since memory is bounded, and the write policy: write-through (write cache + DB together, consistent but slower) vs write-back (write cache, flush DB async, fast but risks loss) vs cache-aside (app reads cache, on miss reads DB and populates).
Wrap up: name failure modes — cache stampede on a hot key expiring (mitigate with request coalescing / a short lock), and the thundering herd on cold start.
Follow-ups they push on- Why consistent hashing instead of modulo hashing?
- How do you prevent a cache stampede when a hot key expires?
- Cache-aside vs write-through — what consistency do you give up?
Red flag Proposing `hash(key) % N` for sharding — adding one node reshuffles almost every key and cold-starts the whole cache.
source: ByteByteGo — Distributed Cache (System Design Interview) ↗ -
How do you do back-of-the-envelope estimation? Estimate QPS and storage for a service with 100M daily active users.
The point is order-of-magnitude reasoning, not precision. Start from DAU and an action rate. Say each of 100M users does ~10 reads/day -> 1B reads/day. Divide by ~86,400 s/day (~10^5) -> ~12K reads/s average; multiply by a peak factor of ~2-3x -> ~30K peak QPS.
Storage: rows/day * bytes/row * retention. If you store 1M new items/day at ~1KB each, that is ~1GB/day, ~365GB/year, ~1TB over 3 years — round freely.
Keep a few anchors memorized: ~10^5 seconds/day, reads usually dwarf writes (often 100:1), a memory read is ~ns, SSD ~µs, network round-trip cross-region ~tens of ms. State your assumptions out loud and round to clean powers of ten so the interviewer can follow.
Follow-ups they push on- What read:write ratio did you assume and why?
- How does the storage number change the database choice?
- What peak-to-average factor is reasonable, and why?
Red flag Reaching for a calculator and false precision. The interviewer wants to see assumptions stated and powers-of-ten arithmetic, not 31,536,000 seconds.
source: System Design Primer — Back-of-the-envelope ↗ -
Walk me through the 4-step framework you use to attack any system design interview.
(1) Understand the problem & scope: ask clarifying questions, separate functional from non-functional requirements (scale, latency, consistency, availability), and do capacity estimates. Pin down what is in and out of scope before drawing anything.
(2) Propose a high-level design and get buy-in: sketch the major boxes — clients, API/gateway, services, datastores, cache, queue — and the data flow. Confirm the interviewer agrees before going deep.
(3) Deep dive: pick the 1-2 components the interviewer cares about (the data model, the sharding strategy, the hot path) and go deep — algorithms, schema, partitioning, the actual bottleneck.
(4) Wrap up: name bottlenecks, single points of failure, and tradeoffs; mention what you would monitor and how you would scale the next 10x. The discipline is to drive the conversation, not silently draw.
Follow-ups they push on- How do you decide which component to deep-dive on?
- What non-functional requirements do you always ask about?
Red flag Skipping step 1 and diving into a database schema before clarifying scale, latency, and consistency needs — the single most common reason candidates fail the round.
source: ByteByteGo — A framework for system design interviews ↗ -
How do you identify bottlenecks and single points of failure in a design, and how do you remove them?
Trace the request path and ask, at each hop, what happens if this one component dies or saturates. A single load balancer, a single primary database, a single cache node, or a single region are classic SPOFs.
Remove SPOFs with redundancy + failover: run the load balancer in an active-passive pair, replicate the DB (primary + replicas, automatic promotion), spread services across multiple availability zones, and use health checks so traffic routes away from dead instances.
For bottlenecks, find the component nearest its capacity ceiling: stateless app tier scales horizontally behind the LB; a write-bound DB needs sharding or a queue to absorb bursts; a read-bound DB needs replicas + a cache. The senior move is to quantify it (this shard does X writes/s, the limit is Y) rather than hand-wave 'add more servers'.
Follow-ups they push on- How does making services stateless help you scale horizontally?
- How do you decide between adding read replicas vs sharding?
Red flag Treating 'add a load balancer' as the whole answer while the load balancer itself remains a single point of failure, or scaling a stateful service horizontally without externalizing session state.
source: System Design Primer — Availability patterns ↗ -
Design a web crawler that can crawl the public web.
Clarify: scale (billions of pages), politeness (respect robots.txt and per-host rate limits), freshness, and what you extract. Core loop: a URL frontier (a queue, partitioned by host so one host's pages go to one worker for politeness) feeds a fleet of fetchers; fetched HTML goes to a parser that extracts links, which are de-duplicated and fed back into the frontier.
Key components: a DNS cache (DNS resolution is a hidden bottleneck), a seen-URL set (Bloom filter / hash store) to avoid re-crawling, and content de-duplication (hash or Simhash of page content to skip near-duplicates). Store raw pages in object storage / a distributed file store.
Wrap up: politeness is the subtle part — partition the frontier by domain and apply a per-host crawl delay so you do not hammer one site; add priority queues so important pages get crawled sooner.
Follow-ups they push on- How do you avoid crawling the same URL (or near-duplicate content) twice?
- How do you stay polite to a single host while still being massively parallel?
- Why is DNS a bottleneck and how do you mitigate it?
Red flag Forgetting politeness/robots.txt and a de-dup mechanism — an interviewer reads that as someone who would get the crawler IP-banned and stuck in cycles.
source: ByteByteGo — Design A Web Crawler ↗ -
How do you choose between SQL and NoSQL for a system-design problem?
Drive it from access patterns and requirements, not preference. Reach for SQL (Postgres/MySQL) when you need strong consistency and multi-row transactions (ACID), rich ad-hoc queries and joins, and a stable relational schema — payments, orders, anything where correctness beats raw write throughput.
Reach for NoSQL when you need massive horizontal write scale, a flexible/evolving schema, or a specific access shape: a wide-column store (Cassandra/DynamoDB) for huge write volume and known key lookups, a document store (MongoDB) for nested aggregates, a KV store (Redis) for caching, a graph DB for relationship-heavy traversals.
The senior framing is the tradeoff: most NoSQL stores trade joins and strong consistency for partition tolerance and horizontal scale, and you must model the table around the query up front. State your access pattern, then justify the store.
What a strong answer coversChoose from access patterns + consistency needs, never from familiarity.
SQL: ACID transactions, joins, ad-hoc queries, stable relational schema (orders, payments).
NoSQL: horizontal write scale, flexible schema, query-shaped models (Cassandra/DynamoDB, Mongo, Redis).
NoSQL usually trades joins + strong consistency for scale — you model around the query first.
It is not all-or-nothing: polyglot persistence — SQL for the core, Redis for cache, a search index alongside.
Quick self-checkA payments service needs multi-row transactions and strong consistency. Best default store?
-
Correct — ACID transactions, joins, and strong consistency are exactly what relational stores guarantee.
-
Optimized for huge write throughput with eventual consistency — weak fit for cross-row transactional correctness.
-
Great as a cache, but not a durable system-of-record for transactional money movement.
-
Flexible schema, but you would be fighting it for multi-document ACID; not the default for payments.
Follow-ups they push on- Which store fits a write-heavy event log with known key lookups?
- What do you give up when you pick a wide-column store over Postgres?
- When would you run both a SQL store and a NoSQL store in the same system?
Red flag Declaring 'NoSQL because it scales' with no access pattern stated — many NoSQL stores need the schema modeled around the exact query, and you lose joins/transactions you may actually need.
source: System Design Primer — SQL or NoSQL ↗
6.2.1 Containers (Docker) 13
-
What is the difference between a Docker image and a container?
An image is the blueprint — an immutable, read-only stack of layers (filesystem + metadata like the default command) built from a Dockerfile. A container is a running (or stopped) instance of an image: Docker adds a thin writable layer on top of the read-only image layers and gives it an isolated process, network, and mount namespace.
The analogy: image is to container as a class is to an object, or a program on disk is to a process. You can spin up many containers from one image; each gets its own writable layer, so changes inside one container do not affect the image or the other containers.
Follow-ups they push on- What happens to data written inside a container when it is removed?
- Why are image layers read-only and the container layer writable?
Red flag Saying data persists in the image after a container writes to it — writes land in the container's ephemeral writable layer and vanish when the container is removed unless you mount a volume.
source: Docker docs — Images and layers ↗ -
What is the difference between Docker's default bridge network and a user-defined bridge network?
Both use the
bridgedriver, but a user-defined bridge adds the feature you almost always want: built-in DNS-based service discovery. Containers on the same user-defined network can reach each other by container name (http://api:3000), because Docker runs an embedded DNS resolver for that network.On the default
bridgenetwork, name resolution is not provided — containers can only reach each other by IP (or the legacy, deprecated--link), which is fragile because IPs change. User-defined networks also give you better isolation (only containers you attach can talk) and let you attach/detach containers on the fly.The practical takeaway: for any multi-container app, create a user-defined bridge (which is exactly what docker-compose does automatically) so services find each other by name rather than chasing IP addresses.
What a strong answer coversUser-defined bridge networks give automatic DNS — reach containers by name.
The default bridge has no name resolution (IP only, or deprecated
--link).User-defined networks add isolation — only attached containers can communicate.
Compose creates a user-defined network for you, which is why services resolve each other by service name.
Prefer user-defined bridges for any multi-container app; avoid relying on the default bridge.
Follow-ups they push on- Why is reaching containers by IP on the default bridge fragile?
- How does docker-compose use this under the hood?
- What does the `host` network driver change about all this?
Red flag Expecting container-name DNS resolution to work on the default `bridge` network — it doesn't; you must create a user-defined network (or use compose) to get name-based service discovery.
source: Docker docs — Networking overview ↗ -
What is a container registry, and what is the danger of deploying images tagged `:latest`?
A registry (Docker Hub, GHCR, ECR) is the remote store for images: you
pushbuilt images to it and nodespullthem at deploy time. An image is addressed byregistry/repository:tagplus an immutable content digest (sha256:...).The
:latesttag is the trap. It is just a mutable label, not a guarantee of newness — it points to whatever was last pushed with that tag, and it can be overwritten. So 'deploy:latest' is non-deterministic: two nodes pulling at different times can run different code, you can't tell which build is in production, and rollbacks are ambiguous. It also undermines caching (Docker may skip re-pulling a tag it already has, so you can silently run a stale image).The fix: deploy immutable, specific tags (a version or git SHA, e.g.
:1.4.2or:sha-abc123), or pin by digest. Reserve:latestfor casual local use only.What a strong answer coversA registry stores images; nodes
pullbyrepo:tagplus an immutablesha256digest.:latestis a mutable pointer, not 'the newest' — it can be overwritten and means different things over time.Deploying
:latestis non-deterministic: nodes can run different builds; rollbacks are ambiguous.Pin to a version or git SHA tag (or the digest) so a deploy is reproducible and traceable.
It also defeats reliable cache invalidation — you can silently keep running a stale image.
Quick self-checkWhat does the `:latest` tag actually guarantee about an image?
-
Correct — `latest` is just a tag like any other; it can point to an old or arbitrary build.
-
Only if someone re-tags every new build as latest; the tag itself enforces nothing.
-
Backwards — the immutable identifier is the sha256 digest, not the `latest` tag.
-
Pull behavior depends on pull policy/cache, not the tag name; Docker may reuse a cached `latest`.
Follow-ups they push on- Why is pinning by digest the strongest guarantee of running an exact image?
- How does `:latest` make a rollback ambiguous?
- What naming scheme would you use for production image tags?
Red flag Shipping `:latest` to production — it is mutable, so different nodes can run different code and you lose the ability to say exactly which build is live or roll back to a known-good one.
source: Docker docs — Push and pull / registries ↗ -
What is the difference between `COPY` and `ADD` in a Dockerfile, and which should you default to?
Both copy files into the image, but
ADDhas two extra, surprising behaviors: it can fetch a remote URL, and it auto-extracts local tar archives into the destination.COPYdoes exactly one thing — copy local files/directories from the build context — with no magic.The guidance (and Docker's own best practice) is to default to
COPYbecause it is explicit and predictable. ReserveADDfor the one case it is genuinely good at: copying-and-extracting a local tarball in a single step. For fetching remote files, prefer an explicitRUN curl/wget(or better,ADD's checksum options) so the intent and caching are clear.The trick the interviewer is checking: candidates who use
ADD https://...casually may not realize it bypasses the clarity ofCOPYand can silently auto-extract archives, leading to surprising image contents.What a strong answer coversCOPYcopies local build-context files only — no surprises.ADDalso fetches remote URLs and auto-extracts local tar archives.Default to
COPYfor predictability (Docker's own best-practice guidance).Use
ADDonly for its niche win: copy-and-extract a local tarball in one step.For remote downloads prefer explicit
RUN curl/wgetso caching and intent are clear.
Follow-ups they push on- What surprising thing happens if you `ADD` a local `.tar.gz` file?
- Why is `RUN curl` often preferred over `ADD <url>` for remote files?
- When is `ADD` genuinely the right choice?
Red flag Using `ADD` everywhere as a synonym for `COPY` — its auto-extraction of tar archives and URL fetching are silent, surprising behaviors; default to `COPY` and reach for `ADD` only deliberately.
source: Docker docs — Dockerfile reference (ADD / COPY) ↗ -
Why does the order of instructions in a Dockerfile matter? How does layer caching work?
Each Dockerfile instruction creates a layer. On rebuild, Docker reuses a cached layer as long as that instruction and everything it depends on are unchanged; the first instruction that changes invalidates that layer and every layer after it.
So you order from least-frequently-changing to most-frequently-changing. The classic example for a Node app:
COPY package.jsonthenRUN npm installBEFORECOPY . .. Dependencies change rarely, so the expensivenpm installlayer stays cached across most builds; only the cheap source-copy layer rebuilds when you edit code. If youCOPY . .first, every source edit busts the cache and reinstalls all dependencies.Follow-ups they push on- Where would you put `COPY package.json` vs `COPY . .` and why?
- How does a `.dockerignore` file interact with build caching?
Red flag Copying the whole source tree before installing dependencies — every code change then invalidates the dependency-install layer and forces a slow full reinstall.
source: Docker docs — Building best practices ↗ -
What is a `.dockerignore` file and why does it matter for both build speed and security?
.dockerignorelists paths excluded from the build context — the set of files the Docker daemon receives before building. Excludingnode_modules,.git, build output, and local env files makes the context smaller, so builds start faster and the cache is less likely to bust on irrelevant changes.The security angle: without it, a
COPY . .can sweep secrets (.env,.aws/, private keys,.githistory) straight into an image layer, where they persist even if a later layer deletes them. So.dockerignoreboth speeds up builds and keeps secrets out of the image.Follow-ups they push on- Why does deleting a secret in a later layer not actually remove it from the image?
- What belongs in a typical `.dockerignore`?
Red flag Believing that a `RUN rm secret` later in the Dockerfile removes the secret — layers are additive, so the file still lives in the earlier layer and can be extracted from the image history.
source: Docker docs — Building best practices (.dockerignore) ↗ -
When would you use docker-compose, and what problem does it solve?
docker-compose defines and runs a multi-container app from a single declarative YAML file. Instead of starting each container with a long
docker runand wiring up networks/volumes by hand, you describe the services (app, db, cache), their images/build contexts, ports, env, volumes, and dependencies, thendocker compose upbrings the whole stack up on a shared network where services reach each other by service name.Its sweet spot is local development and CI — reproducing a realistic multi-service environment (e.g. an API + Postgres + Redis) with one command. It is not an orchestrator; for production scheduling, self-healing, and scaling across many machines you reach for Kubernetes.
Follow-ups they push on- How do services in a compose file discover each other?
- Why is compose not a substitute for Kubernetes in production?
Red flag Pitching docker-compose as a production orchestration tool — it does not give you multi-node scheduling, self-healing, or rolling updates across a cluster.
source: Docker docs — Docker Compose overview ↗ -
What is the difference between a Docker volume and a bind mount, and when do you use each?
Both persist data outside the container's ephemeral writable layer, but they differ in who owns the storage. A named volume is managed by Docker in its own storage area (
/var/lib/docker/volumes/...); you reference it by name, Docker handles the location, and it is the portable, production-friendly default — great for databases and app data that must outlive a container.A bind mount maps a specific host path straight into the container. It is tied to the host's directory layout, so it is ideal for local development (mount your source code so edits show up live) but brittle and host-coupled for production.
Rule of thumb: volumes for data Docker should manage and that must survive container removal; bind mounts for sharing host files into a container during development. A third option,
tmpfs, keeps data in memory only — for secrets/scratch that should never hit disk.What a strong answer coversBoth survive the container's ephemeral writable layer; the difference is who owns the storage.
Named volume: Docker-managed, portable, the production default (databases, persistent app data).
Bind mount: a specific host path into the container — perfect for live-reloading source in local dev.
Bind mounts are host-coupled and brittle for production; volumes abstract the location away.
tmpfsmounts live in memory only — for scratch/secret data that must never touch disk.
Quick self-checkYou want a Postgres container's data to survive container recreation and stay portable across hosts. Use:
-
Correct — Docker manages the storage and location, so the data persists and the setup is portable.
-
Works on that host, but couples the container to a specific host path — not portable.
-
That layer is deleted with the container — data does not survive recreation.
-
tmpfs is in-memory and vanishes on stop — the opposite of durable persistence.
Follow-ups they push on- Why is a bind mount a poor choice for production data persistence?
- Where does a named volume actually live, and why does that make it portable?
- When would you reach for a tmpfs mount?
Red flag Relying on a bind mount in production — it couples the container to the host's exact directory layout, so the same image behaves differently (or breaks) on another host; use a named volume so Docker owns the storage.
source: Docker docs — Volumes ↗ -
Your container starts and immediately exits with code 0, and you don't know why. How do you debug it?
Exit code 0 means the main process finished successfully — a container lives exactly as long as its PID 1 runs, so if the command completes, the container stops. This is usually a misconception, not a bug: the image's
CMD/ENTRYPOINTran a one-shot command (or a process that daemonized into the background) instead of a long-running foreground process.Debug it:
docker ps -ato confirm the exit code,docker logs <container>to see what it printed, anddocker inspect <container>for the actual command and config. Then check whetherCMDruns a foreground process — a common trap is starting a server that forks into the background, so PID 1 returns and the container exits.Fix: make the entrypoint run a long-lived foreground process (e.g.
nginx -g 'daemon off;', or run the app directly rather than via a launcher that backgrounds it). For interactive debugging, override the entrypoint:docker run -it --entrypoint sh <image>.What a strong answer coversA container runs only as long as its PID 1; exit 0 = the main command completed normally.
Usual cause:
CMDran a one-shot command, or a server daemonized into the background so PID 1 returned.Inspect with
docker ps -a(exit code),docker logs, anddocker inspect(the actual command).Fix: run the process in the foreground (e.g.
nginx -g 'daemon off;').Drop into the image to poke around:
docker run -it --entrypoint sh <image>.
Follow-ups they push on- Why does a server that forks into the background cause the container to exit?
- How do you get a shell inside an image whose entrypoint exits immediately?
- How is exit code 0 different in meaning from 137 or 1?
Red flag Assuming a clean exit code 0 means something crashed — it means the foreground process finished; the real fix is running a long-lived foreground process as PID 1, not adding restart policies.
source: Docker docs — Run and manage containers ↗ -
Write a multi-stage Dockerfile for a Node app and explain why multi-stage builds matter.
A multi-stage build uses multiple
FROMstatements: a heavy build stage compiles/installs, then a slim runtime stage copies only the final artifacts. The build toolchain (compilers, dev dependencies) never ships in the final image, so it is smaller and has a smaller attack surface.FROM node:20 AS buildWORKDIR /appCOPY package*.json ./RUN npm ciCOPY . .RUN npm run buildFROM node:20-slimWORKDIR /appCOPY --from=build /app/dist ./distCOPY --from=build /app/node_modules ./node_modulesUSER nodeEXPOSE 3000CMD ["node", "dist/server.js"]The
COPY --from=buildpulls only built output from the earlier stage; the final image starts from a slim base and runs as the non-rootnodeuser.Follow-ups they push on- Why run as a non-root user in the final stage?
- How would you get an even smaller image (distroless / alpine)?
Red flag Shipping the full build image with dev dependencies and toolchain, or running as root in the final stage — bigger image, larger attack surface, and a container that can do more damage if compromised.
source: Docker docs — Multi-stage builds ↗ -
What is the difference between `CMD` and `ENTRYPOINT` in a Dockerfile?
Both define what runs when the container starts, but they compose differently.
ENTRYPOINTsets the fixed executable;CMDsets default arguments that are easy to override atdocker runtime.With
ENTRYPOINT ["python", "app.py"]the container always runs that; anything you pass todocker runis appended as args. With onlyCMD ["python", "app.py"], passing a command todocker runreplaces it entirely. A common pattern isENTRYPOINTfor the binary plusCMDfor default flags, sodocker run imageuses the defaults anddocker run image --other-flagoverrides just the flags.Prefer the exec form (JSON array) over the shell form so signals like
SIGTERMreach your process directly for clean shutdown.Follow-ups they push on- Why does the exec form matter for graceful shutdown / signal handling?
- How do `ENTRYPOINT` and `CMD` combine when both are present?
Red flag Using the shell form (`CMD node server.js`) so the app runs as a child of `/bin/sh`, which swallows `SIGTERM` — the container then gets SIGKILLed on stop instead of shutting down gracefully.
source: Docker docs — Dockerfile reference (CMD / ENTRYPOINT) ↗ -
Your Docker image is 1.2GB and builds take 10 minutes on every code change. How do you debug and fix it?
Two separate problems: image size and build time.
Size: run
docker history <image>to see which layers are fat. Usual culprits are a heavy base image (use-slim/-alpine/distroless), build toolchain shipped in the runtime image (fix with a multi-stage build copying only artifacts), and dev dependencies (npm ci --omit=dev). Combine relatedRUNsteps and clean package caches in the same layer so the cleanup actually shrinks the layer.Build time on every change: this is almost always cache invalidation from instruction order. Copy and install dependencies before copying source, add a
.dockerignoreso unrelated files do not bust the context, and enable BuildKit so independent stages build in parallel. After reordering, only the source layer rebuilds on a code edit, dropping the loop from minutes to seconds.Follow-ups they push on- Which tool shows you per-layer size, and what do you look for?
- Why does cleaning a cache in a separate `RUN` not reduce image size?
Red flag Adding `RUN rm -rf /var/cache/...` as a new layer after the install layer — additive layers mean the bytes still count; the cleanup must happen in the same `RUN` as the install.
source: Docker docs — Building best practices ↗ -
How do containers achieve isolation? What kernel features make a container different from a VM?
A container is just a regular Linux process that the kernel isolates using two features: namespaces and cgroups. Namespaces scope *what a process can see* — separate PID, network, mount, user, and hostname namespaces make the process believe it has its own process tree, network stack, and filesystem. cgroups scope *what it can use* — CPU, memory, and I/O limits. Together they give the illusion of a private machine while everything shares one host kernel.
That shared kernel is the key contrast with a VM: a VM runs a full guest OS with its own kernel on top of a hypervisor, so it is heavier (GBs, slow boot) but more strongly isolated. A container shares the host kernel, so it is lightweight (MBs, sub-second start) but the isolation is weaker — a kernel exploit can cross the boundary.
This is why containers pack densely and start fast, and why you don't run untrusted multi-tenant workloads on bare containers without extra sandboxing.
What a strong answer coversA container is a host process isolated by namespaces (what it can see) + cgroups (what it can use).
Namespaces: PID, network, mount, user, UTS — each process gets its own view of the system.
cgroups bound CPU/memory/IO so one container can't starve the others.
Containers share the host kernel (light, fast); VMs run a full guest OS + hypervisor (heavy, stronger isolation).
Weaker container isolation is why untrusted multi-tenant workloads need extra sandboxing (gVisor, microVMs).
Quick self-checkWhich pair of Linux kernel features primarily provides container isolation?
-
Correct — namespaces isolate what a process sees; cgroups limit what it can consume.
-
That describes a VM, not a container — containers have neither.
-
chroot only scopes the filesystem root; it is far short of full container isolation.
-
seccomp hardens syscalls and TLS is unrelated; neither provides the core view/resource isolation.
Follow-ups they push on- What do namespaces isolate vs what cgroups limit?
- Why does sharing the host kernel make containers faster but less isolated than VMs?
- When would you still prefer a VM (or microVM) over a plain container?
Red flag Describing a container as a 'lightweight VM' — there is no guest OS or hypervisor; it is a host process with kernel-enforced isolation, which is exactly why the isolation boundary is weaker than a VM's.
source: Docker docs — What is a container? ↗
6.2.2 Orchestration (Kubernetes) 13
-
What are the differences between a Service of type ClusterIP, NodePort, and LoadBalancer?
They form a ladder of increasing external exposure, and each builds on the previous.
ClusterIP (the default) gives the Service a stable virtual IP reachable only inside the cluster — perfect for service-to-service traffic that should never be public. NodePort opens a fixed high port (30000–32767) on every node, so external traffic to
nodeIP:nodePortreaches the Service; it builds on ClusterIP and is mostly a dev/debug or building-block mechanism, not a polished production front door. LoadBalancer provisions an external cloud load balancer (an AWS NLB/ALB, a GCP LB) that fronts the Service with a single external IP — the production way to expose one Service to the internet.The senior nuance: one LoadBalancer per Service gets expensive, so for many HTTP services you front them with a single Ingress (L7 routing/TLS) backed by one load balancer instead of a LoadBalancer Service each.
What a strong answer coversClusterIP (default): internal-only stable virtual IP — service-to-service traffic.
NodePort: opens a fixed port on every node; builds on ClusterIP, mainly dev/building-block.
LoadBalancer: provisions a cloud load balancer with an external IP — production single-service exposure.
Each type is a superset of the previous (LoadBalancer → NodePort → ClusterIP under the hood).
Many HTTP services? Use one Ingress instead of a LoadBalancer per Service to save cost.
Quick self-checkYou need internal-only communication between two microservices in the cluster. Which Service type?
-
Correct — it gives a stable in-cluster virtual IP with no external exposure, exactly right for service-to-service traffic.
-
Opens a port on every node to the outside world — unnecessary exposure for purely internal traffic.
-
Provisions an external cloud load balancer — overkill and externally exposed for internal-only traffic.
-
Just maps the Service to an external DNS name; not how you connect two in-cluster services.
Follow-ups they push on- Why would you front many services with an Ingress instead of a LoadBalancer each?
- What range do NodePorts fall in, and why isn't NodePort a great production front door?
- How does a LoadBalancer Service actually get its external IP?
Red flag Reaching for a LoadBalancer Service per microservice — each provisions (and bills for) a separate cloud load balancer; route many HTTP services through a single Ingress instead.
source: Kubernetes docs — Service (publishing types) ↗ -
Explain the core Kubernetes objects: Pod, Deployment, Service, and Ingress. How do they relate?
A Pod is the smallest deployable unit — one or more containers sharing a network namespace and storage. Pods are ephemeral; you rarely create them directly.
A Deployment is the controller you actually use: you declare a desired replica count and a pod template, and it manages a ReplicaSet to keep that many pods running, replacing crashed ones and handling rolling updates.
A Service gives that fluid set of pods a single stable virtual IP and DNS name, load-balancing across the matching pods (selected by labels) so callers do not chase changing pod IPs.
Ingress sits in front of Services to route external HTTP(S) traffic — host/path routing and TLS termination — to the right Service. So: Ingress -> Service -> Pods, with the Deployment keeping the pods alive underneath.
Follow-ups they push on- How does a Service know which pods to send traffic to?
- What is the difference between a Service of type ClusterIP, NodePort, and LoadBalancer?
Red flag Conflating a Service with an Ingress — a Service does L4 load-balancing inside the cluster, Ingress does L7 HTTP routing and TLS at the edge.
source: Kubernetes docs — Concepts ↗ -
What is a namespace in Kubernetes, and what problems does it actually solve (and not solve)?
A namespace is a virtual cluster-within-a-cluster: a scope for naming and a boundary for applying policy. It lets you partition one physical cluster among teams or environments (
team-a,staging) so names don't collide and you can attach ResourceQuotas (cap CPU/memory per namespace), RBAC (who can do what, where), and NetworkPolicies per slice.What it is good for: organization, quota, and access control on a shared cluster. What it is not: a hard security/isolation boundary. By default, pods in different namespaces can still reach each other over the network — namespaces alone do not isolate traffic; you need NetworkPolicies for that. And some objects are cluster-scoped (nodes, PersistentVolumes, namespaces themselves), so they live outside any namespace.
The senior point: namespaces are an organizational and policy primitive, not a substitute for multi-tenancy isolation between untrusted parties.
What a strong answer coversA namespace scopes names and is the unit for ResourceQuota, RBAC, and NetworkPolicy.
Great for partitioning a shared cluster by team or environment.
Not a network isolation boundary — cross-namespace pod traffic is allowed by default.
Use NetworkPolicies to actually restrict traffic between namespaces.
Some objects are cluster-scoped (nodes, PVs, namespaces) and aren't namespaced.
Quick self-checkBy default, can a pod in namespace `a` reach a pod in namespace `b` over the network?
-
Correct — namespaces scope names and policy objects, but pod-to-pod traffic is open across them unless restricted.
-
Wrong — that isolation requires explicit NetworkPolicies, not namespaces alone.
-
Deployment membership has nothing to do with cross-namespace network reachability.
-
Regular pods can reach across namespaces by default; this is simply incorrect.
Follow-ups they push on- Why don't namespaces stop pods in different namespaces from talking to each other?
- What do you add to get real network isolation between namespaces?
- Name a couple of resources that are cluster-scoped, not namespaced.
Red flag Treating namespaces as a security boundary for untrusted tenants — without NetworkPolicies (and often stronger isolation), pods across namespaces can still reach each other on the network.
source: Kubernetes docs — Namespaces ↗ -
What is the difference between a ConfigMap and a Secret? Is a Secret actually encrypted?
Both inject configuration into pods (as env vars or mounted files) and both keep config out of the image. The difference is intent: ConfigMaps hold non-sensitive config (feature flags, URLs); Secrets hold sensitive values (passwords, tokens, keys).
The gotcha: a Secret is only base64-encoded, not encrypted — base64 is trivially reversible, so anyone who can read the Secret object sees the value. To actually protect Secrets you must enable encryption-at-rest for etcd, lock down access with RBAC, and avoid committing Secret manifests to git. Many teams go further with an external secret store (Vault, cloud secret managers) and pull values in at runtime.
Follow-ups they push on- What two things must you configure to make Secrets meaningfully secure?
- Why is putting a Secret YAML in git dangerous even though it 'looks encoded'?
Red flag Claiming a Kubernetes Secret is encrypted by default — it is base64, which is encoding, not encryption. Without encryption-at-rest + RBAC it offers essentially no confidentiality.
source: Kubernetes docs — Secrets ↗ -
What is a StatefulSet, and how is it different from a Deployment? When do you need one?
A Deployment treats its pods as interchangeable, fungible replicas — random names, no stable identity, no per-pod storage. That is exactly right for stateless app servers.
A StatefulSet gives each pod a stable, sticky identity: a stable ordinal name (
db-0,db-1), stable network identity (a headless Service gives each a predictable DNS name), and its own persistent volume that survives reschedule and follows the pod. Pods are created/scaled/terminated in order (0, 1, 2 …), which matters for clustered systems that need a known startup/teardown sequence.You need a StatefulSet for stateful, clustered workloads where identity matters: databases, Kafka, ZooKeeper, Elasticsearch — anything where pod
db-0must keep beingdb-0with the same data. For stateless web/API tiers, always use a Deployment. The senior caveat: running databases in-cluster at all is a real decision; many teams prefer a managed database over a StatefulSet.What a strong answer coversDeployment pods are fungible (random names, shared/no per-pod storage) — for stateless apps.
StatefulSet gives each pod a stable ordinal identity (
db-0), stable DNS, and its own PVC.Pods come up / scale / terminate in order, which clustered systems rely on.
Use it for databases, Kafka, ZooKeeper, Elasticsearch — workloads where identity + data stick to the pod.
Caveat: consider a managed database instead of running stateful systems in-cluster.
Quick self-checkWhich workload genuinely requires a StatefulSet rather than a Deployment?
-
Correct — stable ordinal identity, stable DNS, and per-pod persistent storage are exactly what a StatefulSet provides.
-
Pods are interchangeable with no per-pod state — a Deployment is the right (and simpler) choice.
-
That is a Job, not a long-running StatefulSet.
-
That is a CronJob; it needs no stable pod identity or storage.
Follow-ups they push on- Why does a database need stable identity and per-pod storage that a web server doesn't?
- What role does the headless Service play for a StatefulSet?
- When would you avoid a StatefulSet and use a managed service instead?
Red flag Running a stateful, clustered system (a database, Kafka) under a plain Deployment — pods get random identities and can share/lose storage, so a rescheduled pod comes back as a different node with the wrong (or no) data.
source: Kubernetes docs — StatefulSets ↗ -
How do you control which node a pod lands on? Explain taints/tolerations vs node affinity.
Two mechanisms that work from opposite directions. Node affinity (and the simpler
nodeSelector) is a pod-side attraction: the pod says 'schedule me on nodes with labelgpu=true'. It can be hard (requiredDuringScheduling) or soft/preferred.Taints and tolerations are a node-side repulsion: you taint a node (
kubectl taint nodes node1 gpu=true:NoSchedule) so it repels all pods by default, and only pods that carry a matching toleration are allowed on. So a taint reserves a node; a toleration is a pod's permission slip to land on a tainted node.The key distinction: affinity *attracts* a pod toward nodes; a taint *repels* pods away from a node unless they tolerate it — and a toleration alone does not *force* a pod onto that node (you pair it with affinity for that). Use taints to dedicate expensive/special nodes (GPU, spot) and affinity to steer pods toward the right hardware; add pod anti-affinity to spread replicas across nodes/zones for HA.
What a strong answer coversNode affinity / nodeSelector: pod-side *attraction* toward nodes with matching labels.
Taints: node-side *repulsion* — a tainted node rejects pods unless they tolerate the taint.
Tolerations: a pod's permission to schedule onto a tainted node (but doesn't force it there).
Combine: taint dedicates a node (GPU/spot), affinity steers the right pods to it.
Pod anti-affinity spreads replicas across nodes/zones for availability.
Follow-ups they push on- Why doesn't a toleration alone guarantee a pod runs on the tainted node?
- How would you dedicate GPU nodes so only ML workloads land there?
- How does pod anti-affinity improve availability?
Red flag Assuming a toleration *attracts* a pod to a tainted node — a toleration only lets the pod tolerate the taint; to actually steer it there you also need node affinity/nodeSelector.
source: Kubernetes docs — Taints and Tolerations ↗ -
Why do you set both a readiness probe and a preStop hook + terminationGracePeriod for zero-downtime shutdown?
When a pod is deleted (a rolling update, a scale-down), two things happen in parallel, which is the source of the race: Kubernetes sends the container
SIGTERM, and it (asynchronously) removes the pod from Service endpoints. Because endpoint removal propagates through kube-proxy/iptables with a small delay, the load balancer can keep sending new requests to a pod that has already started shutting down — causing dropped connections mid-rollout.The fix is to give that propagation time to win the race. A
preStophook that sleeps a few seconds delays the actual shutdown so in-flight endpoint removal completes before the app stops accepting connections. TheterminationGracePeriodSecondsmust be long enough to cover the preStop sleep plus the app draining in-flight requests after SIGTERM, before Kubernetes escalates to SIGKILL. Readiness probes handle the *startup* side (no traffic until ready); preStop + grace period handle the *shutdown* side.The app must also handle SIGTERM to stop accepting new work and finish in-flight requests — otherwise it gets SIGKILLed and drops connections regardless.
What a strong answer coversOn pod deletion, SIGTERM and endpoint removal happen in parallel — that's the race.
Endpoint removal propagates with a delay, so traffic can still arrive at a terminating pod.
A
preStopsleep delays shutdown until endpoint removal propagates (drains the LB).terminationGracePeriodSecondsmust cover preStop + in-flight drain before SIGKILL.The app must catch SIGTERM and finish in-flight requests, or it gets force-killed.
Follow-ups they push on- Why can a pod still receive traffic after it gets SIGTERM?
- What happens if the grace period is shorter than your preStop + drain time?
- Why isn't a readiness probe alone enough for graceful shutdown?
Red flag Relying on SIGTERM handling alone and skipping the preStop delay — endpoint removal hasn't propagated yet, so the load balancer keeps routing new requests to the dying pod and connections drop mid-rollout.
source: Kubernetes docs — Pod Lifecycle (termination) ↗ -
What is the difference between a liveness probe and a readiness probe? What breaks if you confuse them?
A liveness probe answers 'is this container healthy?' If it fails, the kubelet restarts the container. A readiness probe answers 'can this pod take traffic right now?' If it fails, the pod is pulled out of the Service's endpoints but is NOT restarted.
Use readiness for slow startup or temporary unavailability (warming a cache, waiting on a dependency); use liveness only for unrecoverable hangs.
The classic mistake: pointing a liveness probe at a deep health check that also depends on a database. When the DB hiccups, every pod fails liveness and gets restarted simultaneously — turning a transient blip into a full self-inflicted outage. There is also a startupProbe for slow-booting apps so liveness does not kill them before they finish starting.
Follow-ups they push on- Why should a liveness probe usually NOT check downstream dependencies?
- When would you add a startupProbe?
Red flag Using a liveness probe that depends on a database or downstream service — a transient outage then triggers a restart storm across all pods, amplifying the incident instead of riding it out.
source: Kubernetes docs — Configure Liveness, Readiness and Startup Probes ↗ -
How does a rolling update work in a Deployment, and how do you roll back a bad release?
When you change a Deployment's pod template, the Deployment controller creates a new ReplicaSet and shifts pods gradually: it scales the new ReplicaSet up and the old one down, governed by
maxSurge(how many extra pods above desired during the update) andmaxUnavailable(how many can be missing). With readiness probes in place, traffic only moves to new pods once they report ready, so there is no downtime.Kubernetes keeps the old ReplicaSets around, so rollback is just
kubectl rollout undo deployment/<name>— it scales the previous ReplicaSet back up. You watch progress withkubectl rollout status. TunemaxSurge/maxUnavailableto trade rollout speed against capacity headroom.Follow-ups they push on- What do maxSurge and maxUnavailable control?
- Why does a rolling update need readiness probes to be safe?
- How is a rolling update different from blue-green or canary?
Red flag Rolling out without readiness probes — Kubernetes considers a pod 'available' as soon as the container starts and sends it traffic before the app can actually serve, causing a wave of errors mid-rollout.
source: Kubernetes docs — Performing a Rolling Update ↗ -
A pod is stuck in CrashLoopBackOff. Walk me through how you debug it.
CrashLoopBackOff means the container keeps starting and exiting, and Kubernetes is backing off between restarts. Work the evidence:
kubectl describe pod <pod>— read the Events and the last container state (exit code, OOMKilled, reason).kubectl logs <pod> --previous— the logs from the crashed instance (current logs may be empty because it just restarted).Common causes: the app crashes on startup (bad config / missing env var / unreachable dependency — visible in logs); exit code 137 / OOMKilled means it exceeded its memory limit (raise the limit or fix the leak); a failing liveness probe restarting a healthy-but-slow app (add a startupProbe); or a bad image/command. Fix the root cause rather than just bumping restart limits.
Follow-ups they push on- Why use `kubectl logs --previous` here?
- What does exit code 137 tell you?
Red flag Reading only `kubectl logs <pod>` (which shows the freshly restarted container, often empty) instead of `--previous`, and missing that an OOMKill or a too-aggressive liveness probe is the actual cause.
source: Kubernetes docs — Debug Running Pods ↗ -
What is the difference between resource requests and limits, and how do they affect scheduling and stability?
A request is the amount of CPU/memory a container is guaranteed; the scheduler uses requests to decide which node a pod fits on. A limit is the hard ceiling the container may not exceed.
The behaviors differ by resource. Exceed a memory limit and the container is OOMKilled. Exceed a CPU limit and the container is throttled (slowed), not killed. If you set no requests, the scheduler packs pods blindly and nodes get oversubscribed; if requests are far below real usage, you overcommit and nodes thrash. The senior point is the QoS class: pods with requests == limits are Guaranteed and evicted last under node memory pressure; pods with no requests/limits are BestEffort and evicted first.
Follow-ups they push on- What happens when a container exceeds its CPU limit vs its memory limit?
- How do requests and limits determine a pod's QoS class and eviction order?
Red flag Setting limits without requests (or omitting both) — the scheduler cannot reason about capacity, leading to oversubscribed nodes and BestEffort pods that are the first to be evicted under pressure.
source: Kubernetes docs — Resource Management for Pods and Containers ↗ -
Walk me through what happens, end to end, when you run `kubectl apply -f deployment.yaml`.
kubectlsends the manifest to the API server, which authenticates, authorizes (RBAC), runs admission controllers, and persists the desired state to etcd. Nothing is running yet — you have only recorded intent.Controllers then reconcile. The Deployment controller sees a new Deployment and creates a ReplicaSet; the ReplicaSet controller creates Pod objects to reach the desired replica count. The scheduler watches for unscheduled pods and binds each to a suitable node based on requests, affinity, and taints. On each chosen node, the kubelet sees a pod assigned to it, pulls the image, and starts the container via the container runtime, reporting status back to the API server.
The whole system is a declarative control loop: you state the desired state, and independent controllers continuously drive the actual state toward it.
Follow-ups they push on- Which component decides which node a pod runs on?
- Why is this described as a reconciliation/control loop rather than imperative execution?
Red flag Describing it as imperative ('kubectl starts the container') — kubectl only records desired state; controllers and the kubelet asynchronously reconcile reality toward it.
source: Kubernetes docs — Kubernetes Components ↗ -
How does the Horizontal Pod Autoscaler work, and why does it need resource requests set?
The HPA is a control loop (default every 15s) that scales a Deployment's replica count up or down to keep an observed metric near a target. The classic case: target 50% average CPU. It reads current per-pod usage from the metrics server and applies roughly
desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric).The catch interviewers probe: CPU/memory targets are expressed as a percentage of the pod's resource request. If you set no CPU request, there is no denominator, so the HPA cannot compute utilization and will not scale on CPU. So requests are a prerequisite, not optional.
Discuss the rest: HPA changes *replica count* (horizontal), distinct from the Vertical Pod Autoscaler which resizes a pod; it can scale on custom/external metrics (queue depth, RPS) not just CPU; and you add a stabilization window to prevent flapping (rapid scale up/down thrash) on noisy metrics.
What a strong answer coversHPA control loop adjusts replica count to keep a metric near target:
ceil(replicas × current/target).CPU/memory targets are a percentage of the pod's request — no request means no denominator, no scaling.
Horizontal (more pods) vs Vertical Pod Autoscaler (bigger pods) — different tools.
Can scale on custom/external metrics (queue depth, RPS), not just CPU.
A stabilization window prevents flapping on noisy/bursty metrics.
Follow-ups they push on- Why does an HPA on CPU silently do nothing if you forgot to set CPU requests?
- When would you scale on a custom metric like queue length instead of CPU?
- How is HPA different from the cluster autoscaler?
Red flag Configuring an HPA on CPU but omitting CPU resource requests — utilization is computed relative to the request, so with no request the HPA has nothing to divide by and never scales.
source: Kubernetes docs — Horizontal Pod Autoscaling ↗
6.2.3 CI/CD 12
-
What is the difference between continuous integration, continuous delivery, and continuous deployment?
Continuous integration (CI): developers merge to a shared branch frequently, and every push automatically builds and runs the test suite, so integration problems surface in minutes, not at a big-bang merge.
Continuous delivery (CD): every change that passes CI is automatically built into a deployable, release-ready artifact and pushed through environments up to a staging gate — but the final push to production is a manual button.
Continuous deployment: the same pipeline, with the manual gate removed — every change that passes all automated checks goes straight to production, no human in the loop. The distinction people get wrong is delivery (human approves the prod release) vs deployment (fully automated to prod).
Follow-ups they push on- Where exactly is the manual gate in delivery vs deployment?
- What must be true about your test suite to safely do continuous deployment?
Red flag Using 'continuous delivery' and 'continuous deployment' interchangeably — the difference is whether a human approves the production release.
source: GitHub docs — About continuous integration ↗ -
Write a basic GitHub Actions workflow that runs tests on every pull request. Explain the trigger, jobs, and steps.
A workflow is YAML in
.github/workflows/. The top-levelonsets the trigger,jobsare units that run on a runner, and each job hassteps.name: CIon:pull_request:branches: [main]jobs:test:runs-on: ubuntu-lateststeps:- uses: actions/checkout@v4- uses: actions/setup-node@v4with:node-version: 20- run: npm ci- run: npm teston: pull_requesttriggers on every PR tomain; the singletestjob runs on a fresh Ubuntu runner; steps check out the code, set up Node, install deps deterministically withnpm ci, and run the suite. Jobs run in parallel by default;needs:makes one wait on another.Follow-ups they push on- How do you make a deploy job run only after the test job passes?
- Why `npm ci` instead of `npm install` in CI?
Red flag Forgetting `actions/checkout` (the runner starts empty, so the build has no source), or using `npm install` instead of `npm ci` so the lockfile is not respected and builds become non-reproducible.
source: GitHub docs — Writing workflows / quickstart ↗ -
Explain the typical stages of a CI/CD pipeline: build, test, deploy. What runs where?
Build: compile/transpile, install dependencies, and produce a versioned, immutable artifact (a binary, a bundle, or — most commonly — a container image) pushed to a registry. The key principle is build once and promote that same artifact through every environment.
Test: run fast unit tests first (fail early), then integration tests, then optionally end-to-end tests, plus quality and security scans (lint, SAST, dependency/vulnerability scan). Order from cheapest/fastest to slowest so the pipeline fails fast.
Deploy: ship the already-built artifact to staging, run smoke tests, then promote to production with a rollout strategy (rolling/blue-green/canary) and health checks that can trigger automatic rollback. Building a fresh artifact per environment is the anti-pattern — you would no longer be testing what you ship.
Follow-ups they push on- Why build the artifact once and promote it rather than rebuilding per environment?
- Why run unit tests before integration and e2e tests?
Red flag Rebuilding the artifact separately for staging and production — you then deploy something you never actually tested, defeating the point of the pipeline.
source: GitHub docs — About continuous deployment ↗ -
Why and how do you cache dependencies in CI? What's the difference between caching and an artifact?
CI runners start clean every run, so without caching you re-download every dependency on each build — slow and wasteful. A dependency cache restores files like
node_modules/~/.npmkeyed on a hash of the lockfile (package-lock.json): a cache *hit* restores them in seconds; a cache *miss* (lockfile changed) rebuilds and saves a fresh cache. In GitHub Actions thesetup-*actions can do this with onecache:line, or you useactions/cachedirectly.The distinction interviewers want: a cache is a build-time optimization — it is keyed, can be evicted, and you must never *depend* on it existing (a miss must still produce a correct build). An artifact is an *output* you deliberately persist — the built binary/image/test report you pass between jobs or download later. Cache = speed, may vanish; artifact = a result you must keep.
Key the cache carefully: too broad and you serve stale deps; too narrow and you never hit it. Hashing the lockfile is the sweet spot.
What a strong answer coversRunners are ephemeral; caching avoids re-downloading deps every run.
Key the cache on a lockfile hash — hit restores fast, miss rebuilds and re-saves.
Cache = build-time speedup, evictable, must never be *required* for correctness.
Artifact = a deliberate output you persist (binary/image/report) and pass between jobs.
Bad cache keys cause stale dependencies (too broad) or constant misses (too narrow).
Quick self-checkWhat is the right cache key for a Node project's `node_modules` cache?
-
Correct — the cache stays valid exactly as long as the locked dependency set is unchanged.
-
Branch name doesn't change when deps change, so you'd serve stale modules after a lockfile update.
-
Every commit gets a new key, so you almost never hit the cache — defeating the purpose.
-
Never invalidates, so it serves stale dependencies forever after the first save.
Follow-ups they push on- Why must your build still succeed on a cache miss?
- What goes wrong if your cache key is the branch name instead of the lockfile hash?
- When would you use an artifact instead of a cache?
Red flag Treating a cache like an artifact and depending on it being present, or keying it too loosely so a stale `node_modules` is restored after the lockfile changed — leading to 'works in CI but with old deps' bugs.
source: GitHub docs — Caching dependencies to speed up workflows ↗ -
How do you run the same CI job across multiple language versions or OSes efficiently?
Use a build matrix. Instead of copy-pasting a near-identical job per Node version or OS, you declare a matrix and CI fans out one job per combination automatically, running them in parallel. In GitHub Actions:
strategy:matrix:node: [18, 20, 22]os: [ubuntu-latest, windows-latest]That single job definition expands to 6 parallel jobs (3 versions × 2 OSes), each on its own runner. You can
include/excludespecific combinations and setfail-fast(cancel the rest on first failure) on or off depending on whether you want full results.The value is coverage without duplication: test the support matrix you promise users, catch a version-specific break early, and keep the workflow DRY. The tradeoff is runner minutes — a wide matrix multiplies cost, so test the combinations that matter, not every permutation.
What a strong answer coversA matrix fans one job definition out into one parallel job per combination.
matrix: { node: [...], os: [...] }expands to the cross-product, each on its own runner.include/excludetune specific combos;fail-fastcontrols cancel-on-first-failure.Gives coverage of your support matrix without duplicating job YAML.
Cost grows with the cross-product — test combinations that matter, not every permutation.
Follow-ups they push on- What does `fail-fast: false` change about a matrix run?
- How would you exclude one specific version/OS combination?
- What's the cost tradeoff of a very wide matrix?
Red flag Duplicating an entire job per version/OS instead of using a matrix — it's verbose, drifts out of sync, and you forget to update one copy; the matrix keeps all combinations defined in one place.
source: GitHub docs — Running variations of jobs in a workflow (matrix) ↗ -
Why is a fast CI feedback loop so important, and how do you keep a pipeline fast as it grows?
The whole point of CI is fast feedback on whether a change is safe. A pipeline that takes 40 minutes breaks the developer's flow — they context-switch, stack up un-merged PRs, and start ignoring or working around the signal. Speed is what keeps CI trustworthy and keeps people integrating frequently.
Keep it fast as it grows: parallelize (split the test suite across runners / use a matrix), fail fast by ordering cheap checks first (lint and unit tests before slow e2e), cache dependencies and build outputs, and only run what changed for large monorepos (path filters / affected-project detection). Build the artifact once and promote it rather than rebuilding per stage.
The senior framing: treat pipeline duration as a product metric you budget and watch — when a stage gets slow, profile it like you would slow code. A flaky or slow pipeline is a tax on every single merge.
What a strong answer coversCI exists for fast feedback; a slow pipeline breaks flow and erodes trust in the signal.
Parallelize test suites and use matrices to spread work across runners.
Fail fast: cheap checks (lint, unit) before slow ones (integration, e2e).
Cache deps/build outputs and only run what changed in big monorepos.
Treat pipeline duration as a tracked metric — profile a slow stage like slow code.
Follow-ups they push on- Why does ordering fast tests before slow ones matter even at the same total cost?
- How does 'only test what changed' work in a monorepo?
- What's the cost of letting a pipeline creep to 40 minutes?
Red flag Letting pipeline time creep unbounded — once feedback takes tens of minutes, developers batch changes and stop trusting CI, which defeats the purpose of continuous integration entirely.
source: GitHub docs — About continuous integration ↗ -
A deploy to production succeeds but the app is broken; rolling back code didn't fix it. How do you reason about the failure and prevent it?
First separate the layers: a 'green' deploy only means the *pipeline* succeeded, not that the app *works*. If rolling back the code didn't fix it, the breakage is almost certainly not in the code artifact — look at the things that aren't versioned with the image: a database migration that already ran (and is irreversible), a changed config/feature flag, a new infra/secret value, or a dependency/external service.
The migration case is the classic trap: code rolls back instantly, but a schema change (dropped column, altered type) does not, so old code now hits an incompatible schema. The discipline is backward-compatible, expand-then-contract migrations — deploy schema changes that both old and new code can run against, ship code, then remove the old shape in a later release — so rolling back code is always safe.
Prevention: add post-deploy smoke tests/health checks that gate the rollout (so a broken deploy auto-rolls-back before users see it), decouple migrations from code deploys, use feature flags to separate 'deployed' from 'released', and ensure rollbacks are actually tested, not assumed.
What a strong answer coversA green pipeline ≠ a working app — 'success' is about the deploy, not behavior.
If code rollback didn't help, the cause is unversioned state: migrations, config, flags, secrets, deps.
Irreversible DB migrations are the classic trap — code reverts, schema doesn't.
Fix with expand-then-contract backward-compatible migrations so rollback is always safe.
Prevent with post-deploy smoke tests that gate/auto-rollback, plus feature flags to separate deploy from release.
Follow-ups they push on- Why doesn't rolling back code fix a forward database migration?
- What does an expand-then-contract migration look like in practice?
- How do feature flags let you separate 'deployed' from 'released'?
Red flag Assuming a code rollback always restores a known-good state — irreversible schema migrations and out-of-band config changes aren't part of the artifact, so the rollback leaves old code running against a changed world.
source: GitHub docs — About continuous deployment ↗ -
Why is trunk-based development paired with feature flags so common in CI/CD, and what problem does it solve over long-lived branches?
Long-lived feature branches drift away from
mainfor days or weeks, so when they finally merge you get merge hell — big, painful, conflict-ridden integrations exactly when you can least afford surprises. That defeats the 'continuous' in continuous integration, whose whole premise is integrating *frequently* so problems surface in small, cheap increments.Trunk-based development has everyone commit small changes to
main(or very short-lived branches merged within a day), keeping the branch always releasable. The obvious tension: how do you merge unfinished work without shipping it? Feature flags — you merge the code behind an off-by-default flag, so it's integrated and tested continuously but invisible to users until you flip it on. This also decouples deploy from release: deploying code and exposing a feature become separate decisions, enabling canary/gradual rollouts and instant kill-switches.Senior framing: small frequent merges + flags keep integration cheap and continuous and make release a runtime toggle rather than a deployment event — at the cost of flag hygiene (you must clean up stale flags).
What a strong answer coversLong-lived branches drift from
main→ painful big-bang merges that defeat continuous integration.Trunk-based: small frequent commits to
main, kept always releasable.Feature flags let you merge unfinished work off-by-default — integrated and tested, not yet exposed.
Flags decouple deploy from release: shipping code and turning a feature on are separate decisions.
Enables canary/gradual rollout + instant kill-switch; cost is flag hygiene (remove stale flags).
Follow-ups they push on- How do feature flags let you merge incomplete work to main safely?
- What does 'decoupling deploy from release' buy you operationally?
- What's the maintenance cost of feature flags over time?
Red flag Sitting on a long-lived branch 'until the feature is done' — it diverges from main and turns into a high-risk merge; the CI premise is to integrate small changes continuously, using flags to hide the unfinished parts.
source: GitHub docs — About continuous integration ↗ -
How do you handle secrets (API keys, deploy credentials) in a CI/CD pipeline?
Never hardcode secrets in source, the workflow file, or build logs. Inject them at runtime from a secret store: GitHub Actions encrypted secrets / environments, or an external manager like HashiCorp Vault, AWS Secrets Manager, or a cloud key vault. The CI system makes them available as masked env vars so they do not print in logs.
Stronger still: prefer short-lived, scoped credentials over long-lived static keys — for cloud deploys, use OIDC so the workflow exchanges its identity token for temporary cloud credentials, eliminating stored long-lived keys entirely. Scope secrets to the environment that needs them and gate production secrets behind required reviewers. And remember a secret echoed into a log or committed to git is compromised forever — rotate it.
Follow-ups they push on- Why is OIDC-based short-lived credential exchange better than a stored static cloud key?
- What do you do the moment a secret leaks into a build log?
Red flag Putting credentials in the repo or in plain workflow env, or echoing a secret in a debug step — once it lands in git history or a log it must be treated as permanently compromised and rotated.
source: GitHub docs — Using secrets in GitHub Actions ↗ -
Compare blue-green and canary deployment strategies. When would you choose each?
Blue-green runs two full environments: blue (current) serves all traffic while green (new) is deployed and verified, then you flip traffic to green at once. Rollback is instant — flip back to blue. Cost: you run double the infrastructure during the cutover, and a bad release hits 100% of users the moment you switch.
Canary releases the new version to a small slice of traffic (say 5%), watches error rates and latency, then gradually ramps to 100%. It limits blast radius and catches problems with real traffic before everyone is exposed, but it is more complex (traffic splitting, automated metric analysis) and the rollout is slower.
Pick blue-green when you want a clean, instant, all-or-nothing switch and can afford duplicate capacity; pick canary when blast-radius control matters and you have the observability to judge a partial rollout.
Follow-ups they push on- What does each strategy give you for rollback?
- What observability do you need to run a canary safely?
Red flag Calling a deployment a 'canary' when there is no automated metric analysis gating the ramp — without watching error/latency on the small slice, you have just slowed down a full rollout, not limited blast radius.
source: AWS — Blue/Green vs Canary deployment strategies ↗ -
Your CI build passes locally but fails intermittently in the pipeline. How do you approach a flaky build?
Flakiness almost always comes from hidden non-determinism. Hunt the usual sources: tests that depend on execution order or shared mutable state; reliance on real time/timezone, random seeds, or wall-clock sleeps instead of waiting on a condition; tests hitting real networks/external services; and concurrency races. The 'works locally' clue points at environment differences — different dependency versions, missing lockfile pinning, or fewer CPUs on the runner exposing a race.
Approach: make it reproducible (run the suite repeatedly, randomize order, run in a clean container matching CI), then isolate the offending test and fix the root cause. Pin dependencies with a lockfile and
npm ci, mock external calls, and replace sleeps with explicit waits. Blanket auto-retry hides flakes and erodes trust in the suite — fix, do not paper over.Follow-ups they push on- Why does 'passes locally' point you toward environment/ordering differences?
- Why is blindly retrying failed tests a bad long-term fix?
Red flag Slapping an automatic retry on the whole suite so red turns green — the underlying race or shared-state bug stays, and the team stops trusting CI failures.
source: GitHub docs — Continuous integration concepts ↗ -
What is a deployment gate / required approval, and where do manual gates belong in a pipeline?
A gate is a condition that must pass before a stage proceeds — automated (tests green, security scan clean, smoke checks pass) or manual (a required human approval). In GitHub Actions you implement this with environments that have required reviewers and optionally a wait timer or branch restrictions; a job targeting that environment pauses until approved.
Where gates belong: automated quality gates everywhere (fail fast on tests/lint/scans), and a manual approval only at the boundary you actually want a human to own — typically the promotion to production. That manual prod gate is exactly the line between continuous *delivery* (human approves prod) and continuous *deployment* (no gate). You also gate to protect the production *secrets/credentials*, which are scoped to that environment and unlocked only after approval.
The senior framing: minimize manual gates (they create bottlenecks and false confidence) and lean on strong automated checks; reserve human approval for genuinely high-risk, irreversible promotions.
What a strong answer coversA gate blocks a stage until a condition passes — automated (tests/scans) or manual (approval).
GitHub Actions: environments with required reviewers / wait timer pause a job until approved.
Put automated gates everywhere (fail fast); reserve manual approval for the prod promotion.
That manual prod gate is the line between continuous delivery and continuous deployment.
Environment gates also protect prod secrets, unlocked only after the gate passes.
Follow-ups they push on- How does a required-reviewer environment gate relate to delivery vs deployment?
- Why can too many manual gates be worse than fewer, stronger automated ones?
- How does gating an environment also protect production credentials?
Red flag Gating every stage with manual approvals 'to be safe' — it creates bottlenecks and rubber-stamp approvals; strong automated gates plus a single human gate at prod promotion is the better pattern.
source: GitHub docs — Using environments for deployment ↗
6.2.4 Infrastructure as Code (Terraform) 13
-
What is the difference between `terraform plan` and `terraform apply`?
planis a dry run: Terraform refreshes state, compares your desired configuration against the current state, and prints the exact set of actions it would take — what gets created, updated in place, replaced (destroy+create), or destroyed — without changing anything. It is your review-before-you-touch-prod safety check, and you can save it to a file.applyexecutes those changes against the real providers and then writes the new state. If you pass a saved plan file, apply runs exactly that plan with no surprises; without one, apply shows the plan again and asks for confirmation. The senior habit is to always read the plan output (especially anything marked for replacement/destruction) before approving an apply.Follow-ups they push on- What does it mean when a plan shows a resource will be replaced rather than updated in place?
- Why apply a saved plan file in automation?
Red flag Running `apply -auto-approve` in CI without reviewing the plan — you can silently destroy and recreate a stateful resource (like a database) that a config change forced to be replaced.
source: Terraform docs — terraform plan / apply ↗ -
What is the difference between a Terraform provider and a resource?
A provider is a plugin that teaches Terraform how to talk to a specific platform's API —
aws,google,azurerm,cloudflare,kubernetes. You configure it once (region, credentials), and it exposes the set of resource and data-source types for that platform.A resource is a single managed object you declare —
resource "aws_s3_bucket" "assets" { ... }describes one bucket. The provider knows how to create, read, update, and delete that resource type via the platform's API. So: the provider is the integration layer; resources are the things you actually provision through it. A data source is the read-only sibling — it looks up existing infrastructure without managing it.Follow-ups they push on- How is a data source different from a resource?
- Can one Terraform config use multiple providers at once?
Red flag Confusing a resource with a data source — a resource is created and managed by Terraform; a data source only reads existing infrastructure and never creates anything.
source: Terraform docs — Providers ↗ -
What is the Terraform state file, and why does it matter so much?
State is Terraform's record (
terraform.tfstate, JSON) mapping each resource in your config to the real-world object it created — IDs, attributes, and metadata. Terraform needs it to know what it already manages, so on the nextplanit can diff your desired config against reality and compute the minimal set of changes.Without state, Terraform could not tell the difference between 'create a new resource' and 'this resource already exists, just update it', and it would have no way to know what to destroy. State also caches attribute values and tracks dependencies. Because it can contain sensitive values (passwords, keys) in plaintext, it must be protected — which leads straight into remote state.
Follow-ups they push on- Why can't Terraform just query the cloud provider instead of keeping state?
- Why is committing tfstate to a git repo dangerous?
Red flag Treating state as a disposable cache or committing it to git — it can hold secrets in plaintext, and a lost/corrupt state file orphans real infrastructure that Terraform no longer recognizes.
source: Terraform docs — State ↗ -
What are input variables, outputs, and locals in Terraform, and how do they differ?
They're the three ways data flows through a config. Input variables (
variable) are the parameters a module accepts from its caller — the public 'function arguments' (region, instance size), set via.tfvars, CLI flags, or env vars, and typed/validated. Outputs (output) are the values a module exposes back to its caller or the CLI — the 'return values' (a created VPC's ID, a load balancer's DNS name) that other modules consume. Locals (locals) are named intermediate expressions used *inside* a config to avoid repetition — computed once, referenced aslocal.name, never settable from outside.The mental model: variables are inputs (caller → module), outputs are results (module → caller), locals are private helpers (internal only). This is exactly what makes a module a clean interface: callers only touch its variables and outputs, never its internals.
A practical note: mark sensitive variables/outputs
sensitive = trueso Terraform redacts them in plan/apply logs.What a strong answer coversVariables: a module's input parameters (caller → module), typed and validatable.
Outputs: values a module returns (module → caller / CLI), consumed by other modules.
Locals: private named expressions, computed once, used internally to avoid repetition.
Together, variables + outputs form a module's clean public interface; locals stay internal.
Use
sensitive = trueto redact secret variables/outputs from logs.
Follow-ups they push on- Why can't a local be set from outside the module?
- How does one module consume another module's output?
- When would you mark a variable or output `sensitive`?
Red flag Confusing locals with variables — a local is a computed internal helper that callers can't override, while a variable is the external input; using a local where you needed a configurable input makes the module non-parameterizable.
source: Terraform docs — Variables and outputs ↗ -
How does Terraform decide the order to create resources? What are implicit vs explicit dependencies?
Terraform builds a dependency graph from your config and creates/updates/destroys resources in the order that graph implies, parallelizing wherever there's no dependency between resources. You rarely specify order yourself.
Implicit dependencies are inferred from references: if a security group rule uses
aws_vpc.main.id, Terraform knows the VPC must exist first, because the rule reads an attribute of the VPC. This is the idiomatic, preferred way — wire resources together by referencing each other's attributes and the ordering falls out automatically (and correctly, including on destroy, which runs in reverse).Explicit dependencies use
depends_onto force an ordering Terraform can't infer — typically when there's a *hidden* relationship not expressed through a reference (e.g. an app needs an IAM policy attached before it runs, but doesn't reference the attachment's attributes). Usedepends_onsparingly; over-using it usually means you should have referenced the attribute instead.What a strong answer coversTerraform builds a dependency graph and parallelizes independent resources automatically.
Implicit deps: inferred from attribute references (
aws_vpc.main.id) — the idiomatic way.Referencing attributes gets ordering right for create *and* destroy (reverse order) for free.
Explicit deps (
depends_on): force an order for a hidden relationship not expressed by a reference.Use
depends_onsparingly — usually a missing attribute reference is the real fix.
Follow-ups they push on- Why is an implicit dependency via attribute reference preferred over `depends_on`?
- Give an example where `depends_on` is genuinely necessary.
- How does the graph handle destroy ordering?
Red flag Sprinkling `depends_on` everywhere to 'be safe' — it serializes resources that could run in parallel and hides the real relationships; reference the attribute you depend on and let Terraform infer the order.
source: Terraform docs — Resource dependencies ↗ -
What are Terraform modules and why do you use them?
A module is a reusable, parameterized bundle of Terraform resources — a directory with input variables, resources, and outputs. Instead of copy-pasting the same 200 lines to stand up a VPC or a service in dev, staging, and prod, you write it once as a module and call it three times with different inputs.
The payoff is DRY infrastructure, consistency (every environment provisions the same way), and an interface boundary: callers only deal with the module's variables and outputs, not its internals. Every Terraform config has an implicit root module; you compose it from child modules (your own, or versioned modules from the registry). The trap is over-abstracting too early — wrap something in a module once you actually have repetition, not speculatively.
Follow-ups they push on- How do you pass data in and out of a module?
- How do you pin a module to a specific version and why?
Red flag Over-modularizing on day one — wrapping a single-use resource in a deeply nested module hierarchy adds indirection without the reuse that justifies it.
source: Terraform docs — Modules ↗ -
Why is Infrastructure as Code better than clicking through a cloud console, and what is the difference between declarative and imperative IaC?
IaC makes infrastructure versioned, reviewable, and reproducible. Config lives in git, so changes go through pull requests and code review, you have an audit trail, you can roll back, and you can stand up an identical environment on demand instead of relying on someone remembering which buttons they clicked. It eliminates configuration drift and snowflake servers.
Declarative vs imperative: declarative (Terraform) means you describe the desired end state and the tool figures out the steps and the diff to get there — apply it twice and nothing extra happens (idempotent). Imperative (a shell/SDK script) means you spell out the steps to take, and re-running can double-create or fail because it does not reason about current state. Terraform is declarative, which is why
plancan show you precisely what will change before anything happens.Follow-ups they push on- Why does declarative IaC give you idempotency for free?
- How does putting infra in git change your change-management process?
Red flag Describing Terraform as a script that 'runs commands to build infra' — that is the imperative mental model; Terraform reconciles toward a declared end state and is idempotent.
source: Terraform docs — What is Terraform / intro ↗ -
What is the difference between `count` and `for_each` for creating multiple resources, and why does it matter for state?
Both create multiple instances of a resource, but they key the instances differently in state, and that's the whole game.
countproduces a list indexed by integer position —resource[0],resource[1].for_eachproduces a map keyed by a stable string —resource["web"],resource["db"].The trap with
count: because instances are positional, removing an item from the middle of the list shifts every later index, so Terraform thinks those resources changed identity and proposes to destroy-and-recreate them. Withfor_each, each instance is bound to its own key, so deleting one only affects that one — the rest stay put.Guidance: use
countfor N identical, order-independent copies (or a simple on/off toggle,count = var.enabled ? 1 : 0); usefor_eachwhenever you iterate over a set/map of distinct things (named buckets, subnets per AZ) so that adding or removing one doesn't churn the others.What a strong answer coverscount→ list indexed by integer position;for_each→ map keyed by a stable string.Removing a middle
countelement shifts later indices, forcing destroy/recreate of unrelated resources.for_eachbinds each instance to its key, so add/remove touches only that instance.Use
countfor N identical copies or an on/off toggle (count = enabled ? 1 : 0).Use
for_eachfor a set/map of distinct named things (buckets, subnets per AZ).
Quick self-checkYou manage 5 distinct named S3 buckets and sometimes remove one from the middle. Which is safer?
-
Correct — each bucket is bound to its key, so removing one doesn't disturb the others' state addresses.
-
Removing a middle element shifts later indices, causing Terraform to recreate unrelated buckets.
-
Works but is not DRY and defeats the purpose of iterating; for_each is the idiomatic safe choice.
-
Same positional-index problem as any count, plus you can't give them distinct names cleanly.
Follow-ups they push on- Why does deleting the first of three `count` resources recreate the other two?
- When is `count` still the right choice over `for_each`?
- How do you reference a specific instance under each approach?
Red flag Using `count` over a list of distinct named resources — removing or reordering an element shifts every later index, so Terraform destroys and recreates resources you never intended to touch; `for_each` keyed by name avoids the churn.
source: Terraform docs — The for_each meta-argument ↗ -
Why is `terraform destroy` (or an accidental resource replacement) so dangerous, and how do you guard against it?
Terraform faithfully executes the declared end state — including deletion. The danger is that a config change can force a replace (destroy + create) of a resource you assumed would update in place: changing an attribute marked 'ForceNew' (an EC2 instance's AMI, a database's engine, a subnet) makes Terraform plan to destroy the old object and create a new one. On a stateful resource like a production database, that's data loss executed by a routine-looking apply.
Guards, layered: (1) read the plan — anything showing
-/+ destroy and then createor# forces replacementis a red flag, never-auto-approveblindly. (2) Addlifecycle { prevent_destroy = true }on critical resources so Terraform errors out rather than destroying them. (3) Usecreate_before_destroywhere a replacement is acceptable but downtime isn't. (4) Take backups / enable deletion protection on the cloud side as a last line. (5) For stateful data stores, often manage them outside the same Terraform lifecycle as ephemeral compute.The trick being tested: knowing that 'update' can silently mean 'replace', and that the plan output is your safety check.
What a strong answer coversA config change to a ForceNew attribute makes Terraform destroy + recreate — potential data loss.
The plan shows it as
-/+/# forces replacement— that's your red flag to stop.lifecycle { prevent_destroy = true }makes Terraform refuse to destroy critical resources.create_before_destroyavoids downtime when a replace is genuinely acceptable.Layer cloud-side deletion protection / backups; manage stateful stores apart from ephemeral compute.
Follow-ups they push on- How do you tell from a plan that a resource will be replaced rather than updated in place?
- What does `prevent_destroy` actually do when a destroy is attempted?
- Why separate a production database's lifecycle from your app's Terraform?
Red flag Approving a plan without noticing a `# forces replacement` on a stateful resource — Terraform will dutifully destroy the production database and create a fresh empty one, and `apply` doesn't ask 'are you sure this is a DB?'.
source: Terraform docs — The lifecycle meta-argument ↗ -
What is remote state and state locking, and what problem do they solve on a team?
Local state lives on one engineer's laptop — useless for a team and easy to lose. Remote state stores the state file in a shared backend (S3, Azure Blob, GCS, Terraform Cloud) so everyone reads and writes the same source of truth, and sensitive state is not scattered across machines.
State locking prevents two people from running
applyagainst the same state at the same time. Backends acquire a lock (e.g. S3 with a DynamoDB lock table, or native locking in Terraform Cloud) for the duration of the operation; a second concurrent apply is blocked until the lock releases. Without locking, two simultaneous applies interleave writes and corrupt the state file, leaving Terraform's view inconsistent with reality.Follow-ups they push on- What corrupts the state if two engineers apply at the same time without a lock?
- How do you implement locking with an S3 backend?
Red flag Using a shared remote backend without locking — concurrent applies race on the state file and corrupt it, after which plans no longer match reality.
source: Terraform docs — Backends and remote state ↗ -
What is configuration drift, and how do you detect and reconcile it in Terraform?
Drift is when the real infrastructure no longer matches what Terraform's state/config says — typically because someone made a change by hand in the cloud console ('ClickOps') outside Terraform.
Detection:
terraform planrefreshes state against the provider and shows the divergence as changes it wants to make; aplanthat proposes changes you did not author is drift. Reconcile in one of two directions: bring the real resource back in line by re-applying your config, or, if the manual change is desirable, update the Terraform config to match (and apply). For resources created outside Terraform,terraform importbrings them under management.The durable fix is process: make Terraform the single source of truth, restrict console write access, and run plan in CI on a schedule to catch drift early.
Follow-ups they push on- How does a scheduled `plan` in CI help you catch drift?
- When would you update the config to match reality instead of reverting reality?
Red flag Letting people make changes in the cloud console alongside Terraform — the next apply silently reverts their manual fix (or vice versa), and the two views of reality keep fighting.
source: Terraform docs — Manage resource drift ↗ -
How do you bring an existing, manually-created cloud resource under Terraform management?
You import it — Terraform's state knows nothing about resources it didn't create, so you have to tell it. The two-part move: (1) write a matching
resourceblock in your config for the existing object, then (2) bring it into state, either with the CLIterraform import <resource_address> <real_id>or, in modern Terraform, animportblock that does it as part ofplan/apply(and can even generate config).The critical detail interviewers probe: importing only updates state, it does not write your configuration. If your hand-written resource block doesn't match the real object's settings, the very next
planwill propose changes to 'fix' the real resource back to your (incomplete) config. So after importing you runplanand iterate on the config until the plan is clean (no changes) — that confirms config, state, and reality all agree.This is also how you remediate drift / ClickOps: adopt the orphaned resource instead of destroying and recreating it.
What a strong answer coversTerraform ignores anything it didn't create — you must import existing resources into state.
Two steps: write a matching
resourceblock, thenterraform import(or animport {}block).Import updates state only — it does not generate or fix your config.
Iterate until
planshows no changes, proving config + state + reality agree.It's the safe way to adopt ClickOps/orphaned resources without destroy-and-recreate.
Quick self-checkAfter `terraform import` of an existing bucket, the next `plan` wants to modify it. Why?
-
Correct — import never writes config, so any mismatch shows up as a proposed change until you align the block.
-
Import does neither — it just records the existing resource in state.
-
It can, precisely via import; this is incorrect.
-
A normal import doesn't corrupt state; the diff comes from a config/real-world mismatch.
Follow-ups they push on- Why does a fresh import often produce a plan that wants to change the resource?
- What's the difference between the CLI `import` command and an `import` block?
- How does import help you fix drift without recreating infrastructure?
Red flag Running `terraform import` and assuming you're done — import only writes state, not config, so a mismatched resource block makes the next apply try to 'correct' the real resource; you must get a clean plan first.
source: Terraform docs — Import existing resources ↗ -
How do you manage multiple environments (dev / staging / prod) in Terraform, and why are workspaces often the wrong tool?
The common patterns: separate state per environment with a shared module. You write the infrastructure once as a module, then have a thin per-environment root config (
environments/prod,environments/staging) that calls the module with different variables (instance sizes, counts) and, crucially, its own backend/state file. This isolates blast radius — a badapplyin staging can't touch prod's state.Terraform workspaces let one config switch between multiple state files (
default,dev,prod) without copying code. They're tempting for environments but are usually the wrong fit: they share the same backend and code, it's easy to runapplyagainst the wrong workspace by accident (no separate credentials/approval boundary), and they don't capture genuinely different configs well. They're better suited to short-lived, near-identical parallel copies (e.g. per-feature-branch ephemeral envs).Senior answer: isolate prod with its own state, backend, and credentials; use modules for DRY; reserve workspaces for ephemeral, structurally-identical environments.
What a strong answer coversDefault pattern: one shared module + thin per-env root configs with separate state/backends.
Separate state per env isolates blast radius — staging mistakes can't corrupt prod.
Workspaces swap state files on one config/backend — convenient but no real isolation boundary.
Workspace risk: applying to the wrong environment with no separate credentials/approval.
Use workspaces for ephemeral, identical envs; use separate state+backend for dev/staging/prod.
Follow-ups they push on- Why does sharing a backend across environments via workspaces increase risk?
- How do modules keep multi-environment configs DRY?
- When are workspaces genuinely the right tool?
Red flag Using a single workspace-switched config for prod and staging — one fat-fingered `terraform workspace select` and an `apply` hits the wrong environment, with no separate backend or credential boundary to stop it.
source: Terraform docs — Workspaces ↗
6.2.5 Cloud fundamentals 12
-
What is the difference between a region and an availability zone, and how do you use them for high availability?
A region is a geographic area (e.g. us-east-1). Inside each region are multiple availability zones (AZs) — physically separate data centers with independent power, cooling, and networking, connected by high-bandwidth, low-latency links (single-digit ms).
For high availability, spread your workload across multiple AZs in a region: if one AZ loses power, the others keep serving, and a load balancer routes around the failed zone. That protects against a data-center-level failure with negligible latency cost. Going multi-region adds protection against a whole-region outage and lets you serve users closer to them, but it is far more complex (cross-region replication, data consistency, higher latency between regions). The pragmatic default is multi-AZ within one region; reach for multi-region when you genuinely need regional fault tolerance or global low latency.
Follow-ups they push on- Why is multi-AZ the common HA default rather than multi-region?
- What new problems does going multi-region introduce?
Red flag Confusing the two, or running everything in a single AZ and calling it 'in the cloud so it's highly available' — one AZ failure then takes the whole service down.
source: AWS — Regions and Availability Zones ↗ -
Walk me through the core cloud compute, storage, and networking primitives and when you'd reach for each.
Compute: VMs (EC2-style — full control, you manage the OS), containers (ECS/EKS — packaged apps, orchestrated), and serverless functions (Lambda — event-driven, no servers to manage, scales to zero). Move up that ladder as you want less operational overhead and more elasticity.
Storage: object storage (S3 — cheap, durable, infinite-scale blobs: images, backups, static assets), block storage (EBS — a virtual disk attached to one VM, for databases/filesystems), and file storage (EFS/NFS — a shared filesystem across many machines). Match the access pattern: blobs over HTTP -> object; a disk for one instance -> block; shared POSIX filesystem -> file.
Networking: a VPC is your isolated private network; subnets segment it (public vs private); security groups are instance-level firewalls; and a load balancer spreads traffic across instances. The skill is mapping a workload to the cheapest primitive that fits its access and durability needs.
Follow-ups they push on- When would you pick object storage over block storage?
- When does serverless make sense vs a long-running container?
Red flag Reaching for a full VM you have to patch and babysit when a managed/serverless option fits, or using a database on object storage (wrong access pattern) instead of block storage.
source: AWS — Types of cloud computing / core services ↗ -
What is the cloud shared responsibility model, and why does it matter?
Security is split between the provider and you. The provider is responsible for security OF the cloud — the physical data centers, hardware, the hypervisor, and the managed-service infrastructure. You are responsible for security IN the cloud — your data, IAM users and permissions, network config (security groups, public/private subnets), OS patching on VMs you run, and application-level security.
The line shifts with the service tier: with a raw VM you patch the OS; with a managed database the provider patches it but you still own access control and your data; with serverless even more moves to the provider, but IAM and data are always yours. It matters because most cloud breaches are customer-side misconfigurations — a public S3 bucket or an over-permissive IAM policy — not the provider being hacked.
Follow-ups they push on- How does the responsibility line move between a self-managed VM and a managed service?
- Whose fault is a publicly exposed storage bucket under this model?
Red flag Assuming 'the cloud provider handles security' end to end — IAM, data, and network configuration are always the customer's responsibility, and that is where most breaches actually happen.
source: AWS — Shared Responsibility Model ↗ -
What is the difference between vertical and horizontal scaling in the cloud, and which does the cloud make easy?
Vertical scaling (scale up) means giving one instance more resources — a bigger CPU/RAM tier. It is simple and needs no app changes, but you hit a hardware ceiling, usually need a restart/downtime to resize, and the single box is still a single point of failure.
Horizontal scaling (scale out) means adding more instances behind a load balancer. It scales effectively without limit and improves availability (lose one node, the rest serve), which is exactly what cloud auto-scaling groups automate — add instances when load rises, remove them when it falls. The catch is the app must be stateless (or externalize session state to a shared store like Redis) so any instance can handle any request. The cloud's elasticity is built around horizontal scaling; that is why 'make services stateless' is such a load-bearing design rule.
Follow-ups they push on- Why does horizontal scaling require stateless services?
- What does an auto-scaling group buy you over manually resizing an instance?
Red flag Trying to scale a stateful, session-on-the-box service horizontally — requests landing on a different instance lose the session, so you are forced back into sticky sessions or a single big vertical box.
source: AWS — Auto Scaling / scaling concepts ↗ -
What is the difference between authentication and authorization in cloud IAM, and how do roles fit in?
Authentication answers 'who are you?' — proving identity (a user signing in, a service presenting credentials or a token). Authorization answers 'what are you allowed to do?' — evaluating policies to decide whether that proven identity may perform an action on a resource. Authn comes first; authz comes after. They're distinct: a correctly authenticated user can still be denied an action.
In cloud IAM, policies are the authorization rules (allow/deny on actions + resources), attached to identities. An IAM role is an identity with policies but no permanent credentials — instead, a trusted principal (an EC2 instance, a Lambda, another account, a federated user) assumes the role and receives temporary, auto-rotating credentials. That's why roles are the best-practice way to grant permissions to services: no long-lived access keys to leak.
So: authn = identity, authz = permissions (policies), and roles = a way to hand out scoped, temporary permissions to whoever/whatever assumes them.
What a strong answer coversAuthentication = prove who you are; authorization = what you're allowed to do (policies).
Authn happens first; an authenticated identity can still be denied by authorization.
Policies encode authorization (allow/deny on actions + resources).
An IAM role has no permanent credentials — principals assume it for temporary ones.
Roles are best practice for services (EC2/Lambda): no long-lived keys to leak.
Quick self-checkAn EC2 instance needs to read one S3 bucket. The best-practice way to grant this is:
-
Correct — the instance assumes the role and gets temporary, scoped credentials with no static keys to leak.
-
Static keys on the box are exactly what roles exist to avoid — they leak and don't rotate.
-
Violates least privilege — a compromised instance would then have full account access.
-
Exposes the data to the whole internet; a classic catastrophic misconfiguration.
Follow-ups they push on- Why are IAM roles with temporary credentials safer than static access keys for a service?
- Can an authenticated identity ever be denied? Why?
- What does it mean for a principal to 'assume' a role?
Red flag Conflating authentication with authorization — proving identity (authn) does not grant any permission; access is still decided by the policies evaluated at the authorization step.
source: AWS — IAM identities (roles) / how IAM works ↗ -
What is object storage (like S3), and why is it not a filesystem or a database?
Object storage stores data as objects — a blob of bytes plus metadata and a unique key — in a flat namespace (a bucket), accessed over HTTP APIs (
GET/PUT), not a mounted disk. It's built for massive scale, very high durability (S3 famously targets eleven 9s by replicating across devices/AZs), and cheap capacity. Ideal for images, video, backups, logs, static website assets, and data-lake files.Why it's not a filesystem: there are no real directories (the '/' in a key is cosmetic — it's a flat key space), you can't do partial in-place edits efficiently (you generally replace the whole object), and there's no POSIX file locking or low-latency random byte access like a block device. Why it's not a database: no transactions, no rich queries/joins, no secondary indexes — it's a key→blob store, not a query engine.
The skill is matching the access pattern: whole-blob read/write over HTTP, write-once-read-many, durability over mutability → object storage. Mutable structured records you query → a database. A disk for an OS/DB → block storage.
What a strong answer coversObjects = blob + metadata + key in a flat bucket namespace, accessed via HTTP APIs.
Built for scale, extreme durability (S3 ~11 nines), and low cost — images, backups, logs, assets.
Not a filesystem: no real directories, no efficient partial edits, no POSIX locking/random access.
Not a database: no transactions, joins, or queries — it's key→blob.
Match access pattern: whole-blob, write-once-read-many → object storage.
Quick self-checkWhich workload is the BEST fit for object storage like S3?
-
Correct — whole-blob, write-once-read-many, durability-focused: exactly object storage's sweet spot.
-
Needs ACID transactions and queries — that's a database, not object storage.
-
A VM needs low-latency block storage it can mount as a disk, not HTTP object access.
-
That's an in-memory KV store (Redis); object storage isn't a low-latency cache.
Follow-ups they push on- Why is the '/' in an S3 key not a real directory?
- When would block storage be the right choice over object storage?
- What makes object storage so durable?
Red flag Using object storage as a database or a mutable filesystem — there are no transactions/queries and no efficient in-place edits, so a workload needing those will be slow, awkward, or incorrect.
source: AWS — What is object storage? (S3) ↗ -
Compare the IaaS, PaaS, and SaaS service models. Who manages what at each level?
It's a ladder of how much the provider manages vs you. IaaS (raw VMs, networking, storage — EC2) gives you the infrastructure; you still manage the OS, runtime, and app. Most control, most operational burden. PaaS (App Engine, Heroku, managed databases) hands you a platform — you push code and the provider runs the OS, runtime, scaling, and patching; you manage only your app and data. SaaS (Gmail, Salesforce) is finished software you just use; the provider manages essentially everything, you manage only your data and configuration.
The through-line is the shared responsibility line moving up as you go IaaS → PaaS → SaaS: you trade control and flexibility for less operational work. (Serverless/FaaS sits near PaaS — even the runtime instance is abstracted, scaling to zero.)
The senior framing: pick the highest level that still meets your control/customization needs, so you don't waste engineering effort managing layers a provider would handle for free.
What a strong answer coversIaaS (EC2): provider runs hardware/virtualization; you run OS, runtime, app — most control.
PaaS (App Engine, managed DBs): push code; provider runs OS/runtime/scaling/patching.
SaaS (Gmail, Salesforce): finished software; you manage only your data and config.
The responsibility line moves up IaaS → PaaS → SaaS: less control, less ops burden.
Pick the highest level that still meets your control needs to minimize wasted ops effort.
Quick self-checkOn a managed PaaS, which layer are YOU still responsible for?
-
Correct — PaaS runs the OS, runtime, scaling, and patching; your code and data remain yours.
-
That's the provider's job on PaaS — it's a core reason to choose it.
-
The platform handles scaling for you on PaaS.
-
Always the provider's responsibility at every cloud service level.
Follow-ups they push on- Where does serverless / FaaS sit on this ladder?
- What do you give up moving from IaaS to PaaS?
- How does this map onto the shared responsibility model?
Red flag Defaulting to IaaS and hand-managing OS/runtime/scaling when a PaaS would handle it — you pay in engineering time for control you don't actually need.
source: AWS — Types of cloud computing (IaaS/PaaS/SaaS) ↗ -
How do you control and reason about cloud cost? What's the difference between on-demand, reserved, and spot pricing?
Cloud's elasticity cuts both ways: pay-per-use is great until idle or oversized resources quietly bleed money. The compute pricing tiers trade flexibility for cost: on-demand is full price, no commitment — for spiky or unpredictable workloads; reserved instances / savings plans commit to 1–3 years for a big discount — for steady, predictable baseline load; spot uses spare capacity at up to ~90% off but can be reclaimed with little notice — for fault-tolerant, interruptible work (batch jobs, CI, stateless workers that can be killed and rescheduled).
The broader cost levers: right-size (most instances are over-provisioned), auto-scale so you pay for what you use and scale to zero where possible (serverless), watch egress/data-transfer (a sneaky cost), set lifecycle policies to tier cold data to cheaper storage, and tag resources so you can attribute spend. Set budgets and alerts so surprises page you, not finance.
Senior framing: match the pricing model to the workload's tolerance for interruption and predictability — steady baseline on reserved, bursts on on-demand, interruptible bulk on spot.
What a strong answer coversOn-demand: full price, no commitment — spiky/unpredictable workloads.
Reserved / savings plans: 1–3yr commit for big discount — steady baseline load.
Spot: up to ~90% off spare capacity but reclaimable anytime — fault-tolerant, interruptible work.
Levers: right-size, auto-scale/scale-to-zero, watch egress, tier cold data, tag for attribution.
Set budgets + alerts so cost surprises page engineers early.
Follow-ups they push on- What kind of workload is safe to run on spot instances, and what isn't?
- Why is data egress an easy cost to overlook?
- How does auto-scaling change your cost profile vs a fixed fleet?
Red flag Running interruptible bulk work on full-price on-demand (or worse, putting a stateful production service on spot) — the first wastes ~90% of the spend, the second gets reclaimed out from under you with little warning.
source: AWS — EC2 instance purchasing options (on-demand/reserved/spot) ↗ -
What does it mean for an architecture to be 'cloud-native', and why design for failure?
Cloud-native means building for the cloud's actual characteristics rather than lifting a fixed on-prem server into a VM. Core ideas: treat servers as cattle, not pets (instances are disposable and replaceable, not hand-tended); make services stateless so they scale horizontally and any instance can handle any request; externalize state to managed stores; automate provisioning with IaC; and design for failure — assume any instance, AZ, or dependency can die at any moment.
Why design for failure: at cloud scale, hardware *will* fail constantly — it's a statistical certainty, not an edge case. So you build in redundancy (multi-AZ), health checks and auto-replacement (a dead instance is terminated and a new one launched automatically), retries with backoff and circuit breakers for flaky dependencies, and graceful degradation. The famous expression of this is Netflix's Chaos Monkey, which kills production instances on purpose to prove the system survives.
Senior framing: the cloud doesn't give you reliability for free — it gives you the *primitives* (multiple AZs, auto-scaling, managed failover) and you must architect to use them.
What a strong answer coversCloud-native = build for the cloud's traits, not a lifted-and-shifted pet server.
Cattle not pets: instances are disposable, replaced automatically, never hand-tended.
Stateless services + externalized state enable horizontal scaling and easy replacement.
Design for failure: at scale hardware *will* fail — redundancy, health checks, retries, circuit breakers.
The cloud gives primitives (multi-AZ, auto-scale, failover); you must architect to use them.
Follow-ups they push on- What does 'cattle not pets' mean for how you operate servers?
- Why is statelessness a prerequisite for treating instances as disposable?
- What is a circuit breaker protecting you from?
Red flag Lifting an on-prem 'pet' server into a single cloud VM and calling it cloud-native — without statelessness, redundancy, and automated replacement, you've just moved a single point of failure into someone else's data center.
source: AWS — Reliability pillar (Well-Architected Framework) ↗ -
An EC2 instance in a private subnet can't reach the internet to pull package updates. How do you diagnose and fix it?
A private subnet by definition has no route to an internet gateway, so instances there can't make outbound internet calls directly — that's the intended design, not a bug. The fix for *outbound-only* access is a NAT gateway: place it in a public subnet, and add a route in the private subnet's route table sending
0.0.0.0/0to the NAT gateway. The NAT allows egress (and the return traffic for connections it initiated) but blocks unsolicited inbound — so the instance can pull updates while staying unreachable from the internet.Work the diagnosis like a checklist down the path: (1) the private subnet's route table — is there a
0.0.0.0/0 → nat-...route? (2) the NAT gateway itself — is it in a *public* subnet that routes to an internet gateway? (3) security group outbound rules — egress allowed? (4) NACL — does the subnet's stateless ACL allow both the outbound request and the inbound return traffic? (5) DNS resolution working?The senior tell: knowing that a NAT gateway (not an internet gateway) is the correct egress mechanism for private subnets, and checking the stateless NACL return-traffic rule that bites people.
What a strong answer coversPrivate subnet = no internet-gateway route by design; direct outbound fails as intended.
Fix outbound-only access with a NAT gateway in a public subnet + a
0.0.0.0/0 → NATprivate route.NAT allows egress + return traffic but blocks unsolicited inbound — instance stays private.
Diagnose down the path: route table → NAT placement → SG egress → NACL (return traffic!) → DNS.
Stateless NACLs must explicitly allow the inbound return traffic, a common silent culprit.
Follow-ups they push on- Why a NAT gateway rather than an internet gateway for a private-subnet instance?
- Why must the NAT gateway itself live in a public subnet?
- Which stateless rule on a NACL commonly breaks return traffic?
Red flag Attaching an internet gateway route to the private subnet 'to fix it' — that makes the subnet public and the instance internet-reachable, defeating the security design; the correct egress path is a NAT gateway.
source: AWS — NAT gateways ↗ -
Explain the principle of least privilege in cloud IAM, with a concrete example.
Least privilege means every identity (user, role, service) gets exactly the permissions it needs to do its job and nothing more. The smaller the granted permission set, the smaller the blast radius if those credentials leak or the service is compromised.
Concrete example: a Lambda that only reads from one bucket should have a policy granting
s3:GetObjectscoped to that specific bucket's ARN — nots3:*on*. Wildcards likeAction: */Resource: *are the classic violation. In practice: prefer roles with temporary credentials over long-lived access keys, scope policies to specific actions and resource ARNs, start from deny and add only what is needed, and review/trim permissions over time. Pair it with separation of duties so no single role can both deploy and exfiltrate.Follow-ups they push on- Why prefer IAM roles with temporary credentials over static access keys?
- How do you discover and trim over-broad permissions after the fact?
Red flag Granting broad wildcard policies (`s3:*` on `*`) 'to get it working' and never tightening them — one leaked key then has the run of the whole account.
source: AWS — IAM security best practices (least privilege) ↗ -
Why might a company choose managed cloud services over self-hosting, and what are the tradeoffs?
Managed services (RDS instead of running your own Postgres, EKS instead of bootstrapping Kubernetes) shift operational burden to the provider: patching, backups, failover, scaling, and HA come built in, so a small team ships faster and pages less. You trade money and some control for time and reliability.
The tradeoffs: higher direct cost, less control over versions/tuning/internals, and vendor lock-in (managed offerings differ across clouds, raising switching cost). Self-hosting gives maximum control and can be cheaper at very large, steady scale, but you now own the on-call, the upgrades, and the failure modes. The senior answer weighs team size, scale, and how differentiating the capability is: do not burn your scarce engineers running undifferentiated infrastructure a managed service handles well.
Follow-ups they push on- How does vendor lock-in factor into choosing a managed service?
- At what scale might self-hosting actually become the cheaper choice?
Red flag Defaulting to self-hosting core infrastructure 'to save money' on a small team — the hidden cost is the engineering time and on-call burden of operating it, which usually dwarfs the managed-service bill.
source: AWS — What is managed services / cloud value ↗
6.2.6 Networking 13
-
What is a TLS/SSL certificate, who issues it, and how does a browser decide to trust it?
A certificate binds a public key to an identity (a domain name) and is signed by a Certificate Authority (CA). When you connect, the server presents its certificate during the TLS handshake; the browser verifies the CA's signature, walks the chain of trust up to a root CA that's pre-installed in the OS/browser trust store, and checks the cert isn't expired, matches the hostname, and hasn't been revoked. If all that holds, the browser trusts that it's really talking to that domain (this is the authentication part of TLS).
The chain matters: a root CA signs intermediate CAs, which sign your leaf certificate, so the server sends leaf + intermediates and the browser anchors trust at the root it already trusts. A self-signed cert isn't signed by a trusted CA, so browsers warn — fine for internal/dev, not for public sites. Today most public certs come from Let's Encrypt (free, automated via ACME) and are short-lived, renewed automatically.
Senior tell: trust is anchored in pre-installed root CAs and verified via the signature chain — not 'the browser checks the cert with the website'.
What a strong answer coversA cert binds a public key to a domain identity, signed by a Certificate Authority.
Browser verifies the chain of trust up to a root CA in its pre-installed trust store.
It also checks expiry, hostname match, and revocation before trusting.
Chain: root CA → intermediate(s) → leaf; the server sends leaf + intermediates.
Self-signed = not CA-trusted (browser warns); public certs now mostly Let's Encrypt via ACME.
Follow-ups they push on- Why does a self-signed certificate trigger a browser warning?
- What is the chain of trust, and where is it anchored?
- Why are automated, short-lived certs (Let's Encrypt/ACME) now the norm?
Red flag Thinking the browser validates a certificate by checking with the website itself — trust is anchored in pre-installed root CAs and verified through the signature chain; the site never gets to vouch for its own identity.
source: Cloudflare — What is an SSL certificate? ↗ -
What is a reverse proxy, and how is it different from a forward proxy and a load balancer?
A reverse proxy (e.g. nginx) sits in front of your servers and faces clients: clients connect to it, and it forwards requests to backends. It centralizes TLS termination, caching, compression, request routing, and security (it hides backend topology and can absorb attacks). The client doesn't know which backend served it.
A forward proxy sits in front of clients and faces the internet — it proxies outbound requests on behalf of users (corporate egress filtering, anonymity, caching outbound). So the two differ by which side they represent: reverse proxy works for the servers, forward proxy works for the clients.
A load balancer is a specific job — distributing traffic across multiple backends — that a reverse proxy often performs, but a reverse proxy also does TLS, caching, and routing beyond just balancing. In practice nginx is frequently both reverse proxy and load balancer.
Follow-ups they push on- Which side does each proxy represent — the client or the server?
- Is every reverse proxy a load balancer? Is every load balancer a reverse proxy?
Red flag Mixing up forward and reverse proxies — a forward proxy acts on behalf of the client (outbound), a reverse proxy acts on behalf of the server (inbound).
source: Cloudflare — What is a reverse proxy? ↗ -
What does a CDN do, and how does it speed up content delivery?
A CDN is a globally distributed network of edge servers that cache copies of your content close to users. When a user requests an asset, they're served from the nearest edge location instead of a round trip to a single origin — turning a 100-300ms origin fetch into a 5-20ms edge cache hit and slashing latency for users far from your origin.
Beyond latency, a CDN offloads traffic from your origin (the origin only serves cache misses, so it handles far less load and survives traffic spikes), and the edge often adds TLS termination, compression, and DDoS protection. It's ideal for static and cacheable content — images, CSS/JS, video, downloads. Dynamic, per-user responses are harder to cache, though edge compute and smart cache keys help. Cache invalidation (knowing when to purge stale content) is the recurring hard part.
Follow-ups they push on- What kinds of content cache well on a CDN, and what doesn't?
- How does a CDN reduce load on your origin, not just latency?
Red flag Thinking a CDN only helps latency — it also massively offloads the origin (origin only serves cache misses), which is often the bigger win during traffic spikes.
source: Cloudflare — What is a CDN? ↗ -
What is a VPC, and what's the difference between a public and a private subnet?
A VPC (Virtual Private Cloud) is your own logically isolated network inside the cloud, with a private IP range you control. You carve it into subnets, each living in one availability zone.
The public/private distinction is about reachability from the internet, controlled by routing. A public subnet has a route to an internet gateway, so resources there can have public IPs and be reached from the internet — that's where you put load balancers and bastion hosts. A private subnet has no direct internet route inbound; that's where you put app servers and databases so they can't be reached directly. Private-subnet resources still make outbound calls (e.g. to pull updates) through a NAT gateway, which allows egress but not unsolicited inbound. The pattern: internet -> load balancer in a public subnet -> app/database in private subnets.
Follow-ups they push on- How does a private-subnet instance reach the internet for outbound updates?
- Where would you place a database and why?
Red flag Putting databases in a public subnet for convenience — they become directly reachable from the internet; databases belong in private subnets behind a load balancer or bastion.
source: AWS — VPC subnets (public/private) ↗ -
What is the difference between TCP and UDP, and when would you choose UDP?
TCP is connection-oriented and reliable: a handshake sets up the connection, then it guarantees ordered, complete, error-checked delivery, retransmitting lost packets and applying flow/congestion control. That reliability costs overhead and latency — the handshake, acks, and head-of-line blocking when a lost packet stalls everything behind it. It's the default for anything that must arrive intact: HTTP, database connections, file transfer.
UDP is connectionless and 'fire-and-forget': no handshake, no ordering, no retransmission, no congestion control — just send datagrams. It's leaner and lower-latency, but the application must tolerate (or itself handle) loss and reordering. Choose UDP when timeliness beats completeness: live video/voice (a dropped frame is better than a stalled stream), real-time gaming, and DNS (one small request/response where setting up a TCP connection would be wasteful).
The modern twist: QUIC/HTTP-3 runs over UDP and rebuilds reliability/ordering in userspace to dodge TCP's head-of-line blocking — UDP as a foundation, not a compromise.
What a strong answer coversTCP: connection + handshake, reliable, ordered, congestion-controlled — HTTP, DBs, file transfer.
UDP: connectionless, no ordering/retransmission — lean and low-latency, app handles loss.
Choose UDP when timeliness beats completeness: live video/voice, gaming, DNS.
TCP's reliability adds latency (handshake, acks, head-of-line blocking on loss).
QUIC/HTTP-3 runs on UDP and re-adds reliability in userspace to avoid TCP head-of-line blocking.
Quick self-checkFor a live voice/video call, why is UDP often preferred over TCP?
-
Correct — real-time media values timeliness over completeness; a momentary glitch beats a frozen stream.
-
Backwards — that's TCP; UDP guarantees none of those.
-
UDP provides no encryption; that's a separate concern (e.g. DTLS/SRTP).
-
TCP can carry anything; the reason to avoid it here is latency from retransmission, not capability.
Follow-ups they push on- Why does a single lost packet in TCP stall everything behind it (head-of-line blocking)?
- Why does DNS traditionally use UDP for a typical query?
- How does QUIC get reliability while running over UDP?
Red flag Calling UDP 'unreliable so never use it' — for latency-sensitive, loss-tolerant traffic (voice, video, gaming, DNS) UDP is the correct choice, and modern protocols like QUIC build on it deliberately.
source: Cloudflare — What is the difference between TCP and UDP? ↗ -
Explain how DNS resolution works end to end, and what the common record types do.
DNS turns a name into an IP through a hierarchy of caches and authoritative servers. The browser/OS cache is checked first; on a miss, a recursive resolver (your ISP's or e.g. 8.8.8.8) does the legwork: it asks a root server (which points to the right TLD), then the TLD server for
.com(which points to the domain's authoritative nameserver), then the authoritative nameserver, which returns the actual record. Results are cached at each level for the record's TTL, so most lookups never travel the full chain.Common records: A (name → IPv4) and AAAA (→ IPv6); CNAME (alias one name to another name); MX (mail servers); TXT (arbitrary text — SPF/DKIM, domain verification); NS (delegates a zone to nameservers). TTL is the lever for caching vs agility: a long TTL means fewer lookups but slow propagation when you change records; a short TTL flips that — which is why you lower TTL *before* a planned migration.
Senior tell: knowing the resolver, not the browser, walks root→TLD→authoritative, and that DNS is a globally cached, eventually-consistent system (TTL governs staleness).
What a strong answer coversCaches first (browser/OS), then a recursive resolver walks root → TLD → authoritative.
Each level caches the answer for its TTL, so most lookups short-circuit.
A/AAAA = name→IPv4/IPv6; CNAME = alias to another name; MX = mail; TXT = SPF/verification; NS = delegation.
TTL trades caching vs agility — lower it before a migration so changes propagate fast.
DNS is globally cached and eventually consistent; stale answers persist until TTL expires.
Quick self-checkYou're migrating to a new server IP next week and want DNS to cut over quickly. What do you do first?
-
Correct — a short TTL means caches expire quickly, so the new IP propagates fast at cutover.
-
Backwards — a high TTL makes stale answers linger longer, slowing propagation.
-
That causes an outage (no resolution) rather than a smooth cutover.
-
Record type doesn't control propagation speed; TTL does.
Follow-ups they push on- Why lower a record's TTL before a planned IP migration?
- What's the difference between an A record and a CNAME?
- Who actually walks the root→TLD→authoritative chain — the browser or the resolver?
Red flag Changing a DNS record and expecting it to take effect instantly — old answers stay cached until the TTL expires, so propagation is governed by TTL; you lower TTL ahead of time for fast cutovers.
source: Cloudflare — What is DNS? / DNS records ↗ -
What is the difference between an L4 and an L7 load balancer, and when would you use each?
An L4 (transport-layer) load balancer routes by IP and TCP/UDP port without looking at the payload. It is fast, low-overhead, protocol-agnostic, and preserves the connection — ideal for raw TCP/UDP, very high throughput, low latency, or when you need a static IP (e.g. AWS NLB).
An L7 (application-layer) load balancer understands HTTP: it can route on hostname, URL path, headers, and cookies, terminate TLS, do sticky sessions, and apply content-based rules (e.g. send
/apito one service,/staticto another). That intelligence costs a bit more processing (e.g. AWS ALB). Rule of thumb: web/HTTP traffic that needs path/host routing or TLS termination -> L7; non-HTTP, ultra-high-throughput, or static-IP needs -> L4.Follow-ups they push on- Which layer can do path-based routing, and why can't the other?
- Where does TLS termination happen in each case?
Red flag Claiming an L4 load balancer can route by URL path or host header — it never inspects the HTTP payload, so content-based routing requires L7.
source: Cloudflare — What is load balancing? ↗ -
What load balancing algorithms exist, and how do sticky sessions and health checks fit in?
Common algorithms: round-robin (rotate through backends — simple, assumes roughly equal requests/servers); least-connections (send to the backend with the fewest active connections — better when request durations vary widely); weighted variants (bias toward bigger instances); and hash-based (e.g. hash the client IP or a key to pin a client consistently to one backend).
Health checks are what make a load balancer safe: it periodically probes each backend and stops routing to any that fail, so a dead or unhealthy instance is automatically taken out of rotation and traffic flows only to healthy ones. Without them the LB cheerfully sends traffic into a black hole.
Sticky sessions (session affinity) pin a given client to the same backend (via a cookie or IP hash) so server-local session state keeps working. It's a crutch: it undermines even load distribution and breaks when that backend dies. The senior view is to avoid the need for it by making services stateless and externalizing session state to a shared store (Redis), so any backend can serve any request and the LB is free to balance optimally.
What a strong answer coversAlgorithms: round-robin, least-connections (varied request durations), weighted, hash-based.
Health checks remove failed backends from rotation — without them the LB routes into dead servers.
Sticky sessions pin a client to one backend so server-local state works — but it's a crutch.
Stickiness undermines even balancing and breaks when the pinned backend dies.
Better: stateless services + shared session store (Redis) so any backend serves any request.
Follow-ups they push on- When is least-connections a better choice than round-robin?
- Why do sticky sessions undermine even load distribution?
- How does externalizing session state let you drop sticky sessions?
Red flag Relying on sticky sessions to hold user state on a specific server — when that server is removed (scale-in, failure, deploy) the session is lost; externalize session state so any backend can serve the request.
source: Cloudflare — What is load balancing? ↗ -
What is HTTP/2 (and HTTP/3), and what problems did they solve over HTTP/1.1?
HTTP/1.1 sends one request/response per connection at a time; a slow response blocks the ones behind it on that connection (head-of-line blocking), so browsers worked around it by opening many parallel connections — wasteful and still limited.
HTTP/2 introduced multiplexing: many requests/responses share one TCP connection as interleaved streams, so a slow response no longer blocks others *at the HTTP layer*. It also added header compression (HPACK) and stream prioritization. But because it still rides on a single TCP connection, a lost TCP segment stalls *all* streams — TCP-level head-of-line blocking remained.
HTTP/3 fixes that by moving off TCP onto QUIC (over UDP): streams are independent at the transport layer, so one lost packet only stalls its own stream, not the others. QUIC also folds the transport + TLS handshake together for faster (often 0-1 RTT) connection setup and better mobility across networks. The arc: HTTP/2 solved app-layer HOL blocking, HTTP/3 solved transport-layer HOL blocking.
What a strong answer coversHTTP/1.1: one in-flight request per connection → app-layer head-of-line blocking, many parallel sockets.
HTTP/2: multiplexes many streams on one TCP connection + header compression (HPACK).
HTTP/2's flaw: still one TCP connection, so a lost segment causes transport-level HOL blocking.
HTTP/3 runs on QUIC over UDP — independent streams, so one lost packet stalls only its stream.
QUIC also merges transport+TLS handshake for faster (0-1 RTT) setup and connection migration.
Quick self-checkWhat was the key transport change in HTTP/3 versus HTTP/2?
-
Correct — moving off TCP onto QUIC makes a lost packet affect only its own stream.
-
The opposite — HTTP/3 keeps multiplexing; it changed the transport, not the multiplexing model.
-
QUIC bakes in TLS; HTTP/3 is encrypted by design, not faster from dropping it.
-
HTTP semantics are unchanged; only the underlying transport differs.
Follow-ups they push on- Why does HTTP/2 still suffer head-of-line blocking despite multiplexing?
- How does running over QUIC/UDP let HTTP/3 avoid that?
- What did header compression (HPACK) buy HTTP/2?
Red flag Claiming HTTP/2 eliminated head-of-line blocking entirely — it removed it at the HTTP layer, but a single lost TCP packet still stalls all streams; only HTTP/3 over QUIC removes the transport-level HOL blocking.
source: Cloudflare — What is HTTP/3? ↗ -
Users intermittently get 502/504 errors from your service behind a load balancer. How do you debug it?
First decode the codes: a 502 Bad Gateway means the load balancer got an invalid/empty response from a backend; a 504 Gateway Timeout means the backend didn't respond within the LB's timeout. Both point *downstream of the LB* — the LB is reachable, so the problem is the backends or the path to them, and 'intermittent' suggests some backends or some requests, not all.
Work it methodically: (1) check backend health in the LB — are some targets failing health checks and flapping in/out of rotation? (2) backend logs/metrics — crashes, restarts, OOM, or slow endpoints (504 often = a slow query or a downstream dependency timing out). (3) timeout mismatch — a classic 502 cause is the LB's idle/keep-alive timeout being *longer* than the backend's, so the backend closes a connection the LB still tries to reuse; align them (backend keep-alive ≥ LB idle timeout). (4) capacity — are 5xx spikes correlated with load (saturated backends, exhausted connection pools, thread starvation)? (5) recent deploys/config changes.
Senior tell: distinguishing 502 (bad/empty upstream response, often the keep-alive timeout mismatch) from 504 (upstream too slow), and following the request path from LB → backend → that backend's dependencies.
What a strong answer covers502 = LB got an invalid/empty response from a backend; 504 = backend timed out — both downstream of the LB.
Check backend health checks — flapping targets cause intermittent failures.
Inspect backend logs/metrics: crashes, OOM, slow endpoints, dependency timeouts (typical 504).
Classic 502: LB idle timeout > backend keep-alive, so the backend drops a connection the LB reuses — align them.
Correlate 5xx with load (saturated pools/threads) and recent deploys.
Follow-ups they push on- What's the difference in meaning between a 502 and a 504?
- How does a keep-alive/idle timeout mismatch produce intermittent 502s?
- Why does 'intermittent' point you toward specific backends or load rather than the LB itself?
Red flag Blaming the load balancer for 502/504s — both codes mean the LB reached the backend but the backend gave a bad or slow response; the fix is almost always in the backends, their dependencies, or a timeout mismatch, not the LB config alone.
source: Cloudflare — What is a 502 Bad Gateway error? ↗ -
Walk me through what happens, step by step, when a user types a URL and hits Enter.
DNS resolution first: the browser checks its cache, then the OS, then a recursive resolver, which walks root -> TLD -> authoritative nameserver to get the IP (often a CDN edge or load balancer IP).
Then the TCP connection (handshake) to that IP on port 443, and a TLS handshake to negotiate keys and verify the server's certificate so the channel is encrypted. The browser sends the HTTP request; it likely lands on a CDN edge or load balancer, which either serves cached content or forwards to an origin server. The server (behind a reverse proxy / LB, possibly hitting app servers, caches, and databases) returns the response. Finally the browser parses HTML, fetches sub-resources (CSS/JS/images, often from the CDN), and renders. This question is a checklist of the whole stack — name DNS, TCP, TLS, LB/reverse proxy, CDN, and origin.
Follow-ups they push on- Where does the CDN fit, and what does it save you?
- What does the TLS handshake actually establish before any HTTP is sent?
Red flag Skipping straight to 'the server returns HTML' and omitting DNS, the TLS handshake, and the load-balancer/CDN hops — the interviewer is probing breadth across the whole networking stack.
source: Cloudflare — What is DNS? ↗ -
How does the TLS handshake work, and what does HTTPS actually give you?
HTTPS = HTTP over TLS. TLS provides three things: encryption (eavesdroppers can't read traffic), authentication (you're talking to the real server, via its certificate), and integrity (tampering is detected).
The handshake establishes a shared session key without ever sending it in the clear. The client sends a ClientHello (supported versions/ciphers); the server responds with its choice plus its certificate. The client validates the certificate against a trusted Certificate Authority chain (this is the authentication step). They then perform a key exchange — modern TLS 1.3 uses ephemeral Diffie-Hellman so each session gets a fresh key (forward secrecy: stealing the server's key later can't decrypt past traffic). Once the shared symmetric key is derived, the rest of the session uses fast symmetric encryption. TLS 1.3 also trimmed the handshake to one round trip. The asymmetric crypto is only used to bootstrap the symmetric key.
Follow-ups they push on- What is forward secrecy and why does ephemeral key exchange give it to you?
- Why switch from asymmetric to symmetric encryption after the handshake?
Red flag Saying the whole session is encrypted with the server's public/private key pair — asymmetric crypto only bootstraps a shared symmetric key; the bulk traffic uses fast symmetric encryption.
source: Cloudflare — What happens in a TLS handshake? ↗ -
What is the difference between a security group and a network ACL, and how do they implement defense in depth?
Both are virtual firewalls in a VPC but operate at different scopes and behave differently. A security group is attached to an instance/ENI and is stateful: if you allow an inbound request, the response is automatically allowed back out — you only write the rules you care about, and you can only specify allow rules.
A network ACL is attached to a subnet and is stateless: it evaluates inbound and outbound traffic independently (so you must allow the return traffic explicitly), it supports both allow and deny rules, and rules are evaluated in numbered order. So security groups guard individual resources, NACLs guard the whole subnet boundary.
Using both is defense in depth: the NACL is a coarse subnet-level gate (e.g. block a bad IP range for everything in the subnet) and the security group is the fine-grained per-instance control. An attacker has to get past both layers.
Follow-ups they push on- Why does a stateless NACL require you to allow return traffic explicitly?
- Why run both instead of relying on just the security group?
Red flag Treating a security group as stateless and adding redundant outbound rules for return traffic, or assuming a NACL is stateful and forgetting to allow the response, which silently drops connections.
source: AWS — Compare security groups and network ACLs ↗
6.2.7 Observability 12
-
Define SLI, SLO, SLA, and error budget — how do they relate?
An SLI (Service Level *Indicator*) is a measured quantity of service health — e.g. the proportion of HTTP requests that succeed under 300ms.
An SLO (Service Level *Objective*) is the internal target for an SLI over a window — e.g. 99.9% of requests succeed over 28 days. It is what you *aim* for.
An SLA (Service Level *Agreement*) is a contract with customers that includes consequences (refunds, penalties) if you miss it. SLAs are looser than SLOs so you have headroom before you owe anyone money.
The error budget is
1 − SLO— the allowed amount of unreliability (0.1% for a 99.9% SLO). It turns reliability into a currency: while budget remains you can ship fast and take risks; when it is exhausted you freeze risky launches and prioritize stability. It is the mechanism that lets dev and ops stop arguing about pace versus reliability.What a strong answer coversSLI = the measurement; SLO = the internal target on that measurement; SLA = the externally-promised, consequence-bearing version.
Set the SLO tighter than the SLA so you get warning before breaching the contract.
Error budget = 1 − SLO — the explicit, spendable allowance of failure over the window.
100% is the wrong reliability target: it is impossibly expensive and leaves no budget to ship features.
When the budget is spent, the policy is to halt risky releases until reliability recovers.
Quick self-checkYour SLO is 99.9% success over 28 days. What is the error budget?
-
Correct — the error budget is 1 − SLO, so 100% − 99.9% = 0.1% of requests can fail before the objective is breached.
-
Backwards — 99.9% is the success *target*, not the allowed failure.
-
Wrong on two counts: the SLO is an internal target (not the contract), and it explicitly tolerates 0.1% failure.
-
The error budget is derived purely from the SLO; the SLA's penalties are a separate, looser commitment.
Follow-ups they push on- Why is targeting 100% availability the wrong goal?
- What should happen operationally when the error budget is fully consumed?
- Why is an SLA usually looser than the corresponding SLO?
Red flag Conflating SLO and SLA, or setting them equal — the SLA must be looser than the SLO, and the error budget only makes sense as the gap below the SLO target.
source: Google SRE Book — Service Level Objectives ↗ -
How do Prometheus and Grafana divide responsibilities in a typical stack?
Prometheus is the time-series database and collector: it *pulls* (scrapes) metrics from instrumented targets, stores them, and evaluates alerting rules. Querying is done with PromQL.
Grafana is the visualization/dashboard layer: it queries Prometheus (and many other sources) and renders graphs, tables, and alerts for humans.
The one-liner: Prometheus collects and stores the numbers; Grafana makes them legible. They are complementary, not competitors — you commonly run both together.
Follow-ups they push on- Why does Prometheus prefer a pull model over push?
- Where does Alertmanager fit relative to Prometheus?
Red flag Thinking Grafana stores metrics — it is a query/visualization front-end over data sources, not a TSDB.
source: Grafana — Prometheus data source ↗ -
What are the three pillars of observability, and what question does each one answer?
Logs are timestamped, discrete records — the narrative of *what happened* on one service. Best for forensic, after-the-fact debugging of a specific event.
Metrics are aggregated numbers over time (counters, gauges, histograms) — they answer *how much / how often / is the trend bad?* Cheap to store, great for dashboards and alerting thresholds.
Traces follow a single request across service boundaries — they answer *where did the time go / which hop failed?* in a distributed system.
The strong answer ties them together: a metric alert tells you something is wrong, a trace localizes which service, and logs from that service explain why.
Follow-ups they push on- Which pillar is most expensive to store at scale, and why?
- How do you correlate a log line with the trace it belongs to?
Red flag Treating the three as interchangeable, or claiming logs alone give you observability — logs do not show cross-service latency the way traces do.
source: Sematext — Three Pillars of Observability ↗ -
What's the difference between a counter, a gauge, and a histogram in Prometheus, and when do you use each?
A counter only ever increases (or resets to zero on restart): total requests served, total errors. You don't read its raw value — you apply
rate()to get per-second throughput. A counter answers 'how many, cumulatively?'A gauge goes up and down: current memory usage, in-flight requests, queue depth, temperature. You read it directly; it answers 'what is the value *right now*?'
A histogram samples observations into configurable buckets (e.g. request durations) so you can compute quantiles like p95/p99 with
histogram_quantile(). It answers 'what does the *distribution* look like?' — essential for latency, where the mean lies.Pick by the question: cumulative count → counter; point-in-time level → gauge; distribution/percentiles → histogram.
What a strong answer coversCounter = monotonically increasing; query with
rate(), never read raw (it resets on restart).Gauge = a value that rises and falls; read directly for current state.
Histogram = bucketed observations enabling quantiles (p95/p99) via
histogram_quantile().Latency belongs in a histogram, not a gauge or an average — the tail is what hurts users.
Quick self-checkYou want p99 request latency on a dashboard. Which metric type do you instrument?
-
Correct — histograms bucket observations so `histogram_quantile()` can compute p95/p99 latency.
-
A counter only tracks a monotonically increasing total (e.g. request count), not a distribution.
-
A gauge captures a single current value, not the spread needed for a percentile.
-
You cannot reconstruct a latency distribution from a plain counter — percentiles require bucketed observations.
Follow-ups they push on- Why do you apply `rate()` to a counter instead of reading its value?
- What's the difference between a Prometheus histogram and a summary?
Red flag Using a gauge for an ever-growing total (so a restart silently resets it and breaks your dashboards), or averaging latency instead of using a histogram for percentiles.
source: Prometheus — Metric types ↗ -
What problem does OpenTelemetry solve?
OpenTelemetry (OTel) is a vendor-neutral standard — APIs, SDKs, and the Collector — for generating and exporting traces, metrics, and logs.
The problem it solves: before OTel, each backend (Datadog, Jaeger, New Relic, Prometheus) had its own agent and instrumentation library, so switching vendors meant re-instrumenting your code. With OTel you instrument *once* against a common API, then point the Collector at whatever backend you choose — no code change to switch or fan out to several.
It is now a CNCF project and the de-facto wire format (OTLP) for telemetry.
Follow-ups they push on- What does the OTel Collector do that an in-process SDK exporter doesn't?
- How does context propagation let a trace span multiple services?
Red flag Calling OpenTelemetry a 'monitoring tool' or a backend — it generates and ships telemetry; it does not store or visualize it (that's Prometheus, Grafana, Jaeger, etc.).
source: OpenTelemetry — What is OpenTelemetry? ↗ -
What are the four golden signals, and why is each one worth alerting on?
Google SRE's four golden signals for a user-facing system are latency, traffic, errors, and saturation.
Latency — how long requests take; crucially, track *successful* and *failed* latency separately, since a fast error can hide a problem. Traffic — demand on the system (requests/sec, transactions/sec). Errors — the rate of failed requests, including the sneaky ones that return 200 but are wrong. Saturation — how 'full' the most constrained resource is (memory, I/O, CPU), the leading indicator of imminent degradation.
If you can only instrument four things, these give you the broadest coverage of user-visible health. RED is essentially the request-side subset (rate/errors/duration); saturation adds the resource-pressure dimension.
What a strong answer coversThe four: latency, traffic, errors, saturation — broad coverage from a minimal set.
Measure latency of failures separately from successes — a fast 500 skews the average and masks the outage.
Saturation is a *leading* indicator: it warns before latency and errors blow up.
RED (Rate/Errors/Duration) maps onto the request-facing three; saturation is the resource lens (the S in USE-style thinking).
Follow-ups they push on- Why must you separate the latency of failed requests from successful ones?
- How do the golden signals overlap with the RED method?
Red flag Folding failed-request latency into your overall latency metric — a flood of instant errors makes p50 latency look great while users are seeing failures.
source: Google SRE Book — Monitoring Distributed Systems (Golden Signals) ↗ -
Why is structured logging preferred over plain-text logs, and what is a correlation/trace ID for?
Structured logging emits each log as machine-parseable key/value data (typically JSON) —
{"level":"error","user_id":42,"latency_ms":910}— instead of a free-text sentence. The payoff: you can index, filter, and aggregate on fields (level=error AND service=checkout) in a log platform, rather than writing fragile regexes against prose.A correlation ID (a.k.a. request/trace ID) is a unique identifier generated at the edge and propagated through every service and log line for a single request. It lets you reconstruct the entire path of one request across many services by filtering on one value — turning scattered log lines into a coherent story, and linking logs to the matching distributed trace.
Together they make logs queryable *and* joinable, which is what makes them useful at scale.
What a strong answer coversStructured logs are key/value (JSON) — indexable and filterable on fields, not parsed from prose.
A correlation/trace ID is generated at the edge and propagated so all log lines for one request share it.
Filtering on the correlation ID reconstructs one request's journey across every service it touched.
It also bridges logs and traces — the same ID ties a log line to its span in a distributed trace.
Follow-ups they push on- How does a correlation ID get propagated across an async message queue?
- Why does free-text logging become unmanageable in a microservices fleet?
Red flag Logging unstructured prose (or, worse, logging secrets/PII into those fields) — it forces brittle text parsing and can leak sensitive data into the log store.
source: OpenTelemetry — Logs / Correlation ↗ -
What is tail-based sampling in distributed tracing, and why use it over head-based sampling?
Tracing every request at full volume is too expensive to store, so you sample. The question is *when* you decide.
Head-based sampling decides at the *start* of a trace — e.g. keep 1% of requests, chosen randomly at the root. It is cheap and simple, but blind: it might throw away the slow or errored traces, which are exactly the ones you want.
Tail-based sampling buffers the spans of a trace and decides *after* it completes, so it can keep traces based on outcome — every error, every request over 1s, plus a baseline sample of normal ones. You get the interesting traces without storing everything.
The tradeoff: tail-based needs to buffer complete traces (memory/coordination in the Collector) and is operationally heavier, but it captures the long tail that head-based sampling probabilistically discards.
What a strong answer coversSampling exists because storing 100% of traces is prohibitively expensive at scale.
Head-based decides at trace start (cheap, stateless) but can discard the slow/errored traces you most need.
Tail-based decides after the trace finishes, so it can retain all errors and high-latency traces.
Tail-based costs more: it must buffer whole traces and coordinate spans before deciding.
Follow-ups they push on- Why can't head-based sampling preferentially keep error traces?
- What infrastructure does the OTel Collector need to do tail-based sampling?
Red flag Using uniform head-based sampling and then being surprised that the rare production error has no trace — the random sample almost never captured it.
source: OpenTelemetry — Sampling ↗ -
What makes a good alert? Why do teams end up with alert fatigue, and how do you fix it?
A good alert is actionable, urgent, and user-impacting — it pages a human only when something needs a human to intervene *now*. The SRE guidance is to alert on symptoms (users are seeing errors / latency, the SLO is burning) rather than causes (CPU is at 80%), because a high CPU that isn't hurting anyone is not worth waking someone.
Alert fatigue sets in when too many alerts fire — noisy thresholds, alerts on causes that self-heal, duplicate pages for one incident — so on-call engineers start ignoring them, and the real page gets lost in the noise.
Fixes: alert on SLO burn rate rather than raw thresholds; route non-urgent signals to a dashboard or ticket instead of a page; deduplicate and group related alerts; and ruthlessly delete or tune any alert that consistently fires without requiring action. Every page should be reviewed: was it actionable?
What a strong answer coversPage only on symptoms users feel (errors, latency, SLO burn) — not on causes that may be harmless.
Every page must be actionable and urgent; if no human action is needed now, it shouldn't page.
Alert fatigue comes from noisy/duplicate/self-healing alerts; people then ignore the real one.
Fix it with burn-rate alerts, deduplication/grouping, ticket-not-page routing, and pruning useless alerts.
Follow-ups they push on- What is multi-window, multi-burn-rate alerting and why is it better than a static threshold?
- Why is paging on high CPU usually a bad idea?
Red flag Alerting on every resource metric (cause-based alerting) — it buries the few symptom-based pages that actually matter and trains on-call to dismiss notifications.
source: Google SRE Workbook — Alerting on SLOs ↗ -
What's the difference between monitoring and observability?
Monitoring watches for *known* failure modes: you decide in advance what to measure, set thresholds, and alert when a line is crossed. It answers questions you predicted.
Observability is the property of a system that lets you ask *new* questions about its internal state from the outside, without shipping new code — to debug failures you did not anticipate.
The relationship: monitoring is a subset of what observable systems enable. You still need both — monitoring catches the predictable, observability lets you investigate the unknown-unknowns in complex distributed systems.
Follow-ups they push on- What property of your telemetry makes a system observable rather than just monitored?
- Why do microservices raise the bar for observability versus a monolith?
Red flag Saying observability is 'just monitoring with more dashboards' — the distinction is exploring unknown-unknowns versus alerting on known thresholds.
source: TechTarget — The 3 pillars of observability ↗ -
Your Prometheus storage is exploding after a deploy. What's the most likely cause and the fix?
Almost always a high-cardinality label. Each unique combination of label values is a separate time series; adding an unbounded label like
user_id,request_id,email, or a raw URL with IDs multiplies series count explosively.Fix: drop the offending label, or replace it with a bounded one. Use
http_method, status-code *class* (2xx/5xx),route*template* (/users/:id, not/users/8123), andservice— values with a small, fixed set.If you genuinely need per-user detail, that belongs in logs or traces (high cardinality there is fine), not in metric labels.
Follow-ups they push on- Why is high cardinality cheap in tracing but catastrophic in metrics?
- How would you find which metric is the culprit?
Red flag Putting unbounded identifiers (user IDs, request IDs, timestamps) into metric labels — the classic cardinality blow-up.
source: Sematext — Three Pillars of Observability (cardinality) ↗ -
What are the RED and USE methods, and when would you use each?
RED (Rate, Errors, Duration) is request-centric — for services/endpoints, you watch request rate, error rate, and latency distribution. It answers 'is this service healthy from the caller's view?'
USE (Utilization, Saturation, Errors) is resource-centric — for every resource (CPU, disk, network, memory) you watch how busy it is, how much work is queued, and its error count. It answers 'is this machine/resource a bottleneck?'
Use RED for your request-serving services and USE for the infrastructure underneath them; they are complementary lenses.
Follow-ups they push on- Why is a latency *percentile* (p99) more useful than a mean for the D in RED?
- What are the four golden signals and how do they relate to RED?
Red flag Alerting on averages instead of percentiles — a healthy mean hides a brutal p99 tail.
source: Grafana — RED method ↗
6.2.8 Deployment strategies 12
-
Compare blue-green, canary, and rolling deployments — define each and give the tradeoff.
Blue-green: run two full environments; blue serves prod while green gets the new version, then flip all traffic at once. Tradeoff: instant rollback (flip back), but you pay for double infrastructure.
Canary: release to a small slice of traffic/users first, watch metrics, then ramp up. Tradeoff: limits blast radius and catches real-world bugs early, but needs good monitoring and automated rollback, and the rollout is slower.
Rolling: replace instances in batches in place until all run the new version. Tradeoff: no extra infrastructure and simple, but both versions run simultaneously during the roll, rollback is slower, and bugs surface gradually.
Choice comes down to risk tolerance, infra budget, and how fast you need to recover.
Follow-ups they push on- Which strategies require your two versions to be backward/forward compatible at the same time?
- How does a canary differ from a rolling deploy mechanically?
Red flag Confusing canary with rolling — canary targets a *traffic/user* slice and is metric-gated; rolling replaces *instances* batch by batch regardless of who they serve.
source: Unleash — Comparing deployment strategies ↗ -
What's the difference between continuous delivery and continuous deployment?
Both build on continuous integration (merge and test small changes frequently) and keep
mainin an always-releasable state. The difference is the last step.Continuous delivery: every change that passes the pipeline is *ready* to deploy, but the actual push to production is a manual decision — a human clicks the button. You can release any time, on demand.
Continuous deployment: there is no manual gate — every change that passes all automated checks deploys to production automatically. It demands very strong test coverage, automated rollback, and good observability, because nothing stops a bad change but the pipeline itself.
So: continuous *delivery* makes release a one-click choice; continuous *deployment* removes the click. Many teams do CD-delivery and reserve full auto-deploy for services where they trust their safety nets.
What a strong answer coversBoth rest on CI and an always-releasable
main.Continuous delivery = always *ready* to ship, but a human triggers the production release.
Continuous deployment = every passing change auto-ships to prod, with no manual gate.
Continuous deployment requires strong automated tests, rollback, and observability to be safe.
Quick self-checkWhat is the single distinguishing feature of continuous *deployment* versus continuous *delivery*?
-
Correct — continuous deployment removes the human approval step that continuous delivery keeps.
-
That's continuous integration, which both delivery and deployment share.
-
True of both — it's a precondition, not the distinguishing feature.
-
Flags are useful in both and aren't what separates the two practices.
Follow-ups they push on- What safety nets must be in place before you trust full continuous deployment?
- Where do feature flags fit in a continuous-deployment pipeline?
Red flag Using 'CD' loosely — interviewers care that you distinguish *delivery* (manual release trigger) from *deployment* (fully automatic), and know the latter's higher safety-net bar.
source: Atlassian — Continuous integration vs delivery vs deployment ↗ -
What is a deployment rollback, and why is 'roll forward' often preferred in practice?
Rollback restores the previous known-good version after a bad deploy. With blue-green it is a traffic flip; with rolling it means re-deploying the old image batch by batch.
Many mature teams prefer roll forward — ship a fix as a new deploy — because rollback can be unsafe when the bad version already wrote incompatible data or ran a forward-only migration. You cannot 'un-migrate' easily, and an old binary against a new schema can corrupt things.
Strong answer: keep deploys small and frequent so the diff to fix or revert is tiny, make migrations backward-compatible so rollback stays an option, and automate whichever path you choose.
Follow-ups they push on- When is rollback strictly impossible?
- How do small, frequent deploys make both rollback and roll-forward safer?
Red flag Assuming rollback is always safe — irreversible migrations or data written by the new version can make rolling back worse than rolling forward.
source: Google SRE Book — Release Engineering ↗ -
Why should deployments be automated and repeatable rather than a manual checklist?
Manual deploys are slow, error-prone, and unrepeatable — the same human running the same steps will eventually skip one under pressure, and the process lives in one person's head. Automation makes the deploy deterministic and self-documenting: the pipeline *is* the runbook.
The SRE principle is that releases should be hermetic and reproducible — build from a known, version-controlled source with pinned tools so the same inputs always produce the same artifact, independent of the machine running the build. Combined with automated tests as gates, this lets you deploy frequently and safely, and makes rollback a known, rehearsed action rather than improvisation during an incident.
Frequent small automated deploys also shrink each change's blast radius — easier to test, easier to bisect, easier to revert.
What a strong answer coversManual steps are non-repeatable and fail under pressure; automation makes deploys deterministic.
Builds should be hermetic/reproducible — pinned source and tools, same inputs → same artifact.
Automated test gates let you deploy frequently and safely, with rehearsed rollback.
Small, frequent, automated releases shrink each deploy's blast radius.
Follow-ups they push on- What does a 'hermetic build' mean and why does it aid reproducibility?
- How does deploy frequency relate to the size of each change's risk?
Red flag Relying on a manual, tribal-knowledge deploy checklist — it doesn't scale, drifts from reality, and turns every release into a risk that only one person can run.
source: Google SRE Book — Release Engineering ↗ -
Your canary shows no errors and gets promoted to 100%, then production falls over an hour later. What likely went wrong?
The canary passed because the failure mode wasn't *visible at canary scale or duration*. The usual suspects:
- A slow resource leak (memory, file descriptors, connection-pool exhaustion) that only crosses the limit after an hour of uptime — the short bake never reached it.
- A load/scale effect: at 1% traffic a new query or lock was fine; at 100% it saturates the database or a downstream dependency that the small canary never pressured.
- Cold→warm transitions: caches were warm on the old fleet but the new version's cache was cold under full load, or a thundering-herd on cutover.
- Time/cron-triggered behavior (a batch job, TTL expiry) that simply hadn't fired during the canary window.Response: roll back (or roll forward a fix), then fix the *process* — longer bake time, load-aware canary analysis, and dependency/saturation metrics, not just error rate.
What a strong answer coversCanaries miss bugs that need time (leaks, scheduled jobs) or scale (DB/lock saturation at full traffic) to manifest.
1% traffic gives no statistical power for rare paths and no pressure on shared downstreams.
Cold caches / thundering herd on full cutover can sink a version that looked fine warm.
Fix the process: longer bake time, watch saturation and dependencies, not error rate alone.
Follow-ups they push on- How would a memory leak escape a 15-minute canary but kill the fleet in an hour?
- Why can a query be fine at 1% traffic and lethal at 100%?
Red flag Trusting a short, low-traffic canary as proof of safety — error-rate-only, brief canaries are blind to leaks, scale effects, and time-triggered behavior.
source: Google SRE Workbook — Canarying Releases ↗ -
Why does any zero-downtime deploy require old and new versions to be compatible, and what breaks if they aren't?
Rolling, canary, and blue-green (during the flip) all have a window where both versions serve traffic simultaneously against shared state — the same database, the same message formats, the same caches and clients. If the versions aren't mutually compatible, that window corrupts data or throws errors.
Concrete breakages: the new version writes a message/field the old version can't parse (or vice versa); the new schema drops or renames a column the old code still reads; a client gets a v2 response from the new instance then a v1 from the old one on the next request. Rollback is the same problem in reverse — the old version must tolerate data the new version already wrote.
The discipline is backward and forward compatibility via expand/contract: change in additive, tolerant steps (add before you read, deploy before you require, contract only after everything is upgraded) so any two adjacent versions can coexist.
What a strong answer coversEvery zero-downtime strategy has a window where N and N+1 run together on shared state.
Incompatibility there means corrupted data or runtime errors, not a clean failure.
Rollback needs the same property in reverse: old code must tolerate data new code wrote.
The cure is expand/contract / parallel change — additive, tolerant steps so adjacent versions coexist.
Follow-ups they push on- How does expand/contract make a column rename safe across a rolling deploy?
- Why does message-queue schema evolution need both forward and backward compatibility?
Red flag Assuming a deploy is atomic — for the duration of any rolling/canary/blue-green cutover two versions coexist, so a breaking change to schema or wire format corrupts the in-flight overlap.
source: Martin Fowler — ParallelChange (expand/contract) ↗ -
Your service uses blue-green deploys. A migration adds a NOT NULL column. Why is this dangerous, and how do you ship it safely?
During the flip (and any rollback) both the old and new code may run against the *same* database. An old instance does not know about the new column; if it is
NOT NULLwith no default, the old code's inserts fail. A destructive migration also makes rollback impossible.Safe approach is expand/contract (a.k.a. parallel change):
1. Expand: add the column as nullable / with a default — old and new code both work.
2. Deploy code that writes (and backfills) the new column.
3. Backfill existing rows.
4. Contract: only after all code uses it, add theNOT NULLconstraint and drop old paths.The rule: schema changes must be backward-compatible with the version still running.
Follow-ups they push on- How does the same problem bite a rolling deploy?
- Why should you never rename a column in a single migration?
Red flag Coupling a destructive/forward-only schema change to the same release as the code that needs it — it breaks the still-running old version and blocks rollback.
source: Martin Fowler — ParallelChange (expand/contract) ↗ -
When would you choose blue-green over canary, and vice versa?
Blue-green suits big-bang releases where you want an instant, all-or-nothing cutover and the cleanest possible rollback — e.g. a major version where running both versions side by side for long is undesirable, and you can afford the duplicate environment.
Canary suits fast-evolving services where you want to validate a change against *real* production traffic before full exposure, and where a small percentage of affected users is an acceptable way to catch regressions monitoring can detect.
Real answer mentions constraints: canary needs solid metrics + automated rollback; blue-green needs budget for two environments and a story for shared state (DB, caches).
Follow-ups they push on- What makes automated rollback feasible for a canary but trickier for blue-green?
- How do feature flags let you decouple deploy from release entirely?
Red flag Recommending canary without acknowledging it is useless without good observability to decide promote-vs-roll-back.
source: TechTarget — canary vs blue/green vs rolling ↗ -
What's the difference between a deployment and a release, and why does the distinction matter?
Deploy = getting new code running in production. Release = exposing that behavior to users. Feature flags let you separate the two: you can deploy dark code that is off, then flip it on (release) independently — and turn it off without a redeploy.
Why it matters: it shrinks risk. Deploys become routine and frequent; releases become a business decision (flag on for 5%, then 50%, then all). Rollback of a feature is a config flip, not a redeploy. It also enables trunk-based development — unfinished work hides behind a flag instead of a long-lived branch.
Follow-ups they push on- What is the operational cost of accumulating stale feature flags?
- How do flags enable canary-style releases without canary infrastructure?
Red flag Conflating deploy and release — assuming code is live to users the instant it is deployed, when a flag may gate it.
source: Martin Fowler — Feature Toggles ↗ -
How does Kubernetes implement a rolling update, and what knobs control its safety?
A Deployment's
RollingUpdatestrategy spins up new-version Pods and tears down old ones gradually, governed by two knobs:-
maxUnavailable— how many Pods below the desired count you tolerate during the roll (availability floor).
-maxSurge— how many extra Pods above desired you allow (capacity ceiling).Kubernetes only routes traffic to a Pod once its readiness probe passes, so a broken new version that never becomes ready stalls the rollout instead of taking traffic.
kubectl rollout undoreverts to the prior ReplicaSet.For canary/blue-green you layer in a service mesh or progressive-delivery controller (Argo Rollouts, Flagger) — vanilla Deployments only do rolling.
Follow-ups they push on- Why is a correct readiness probe essential for a safe rolling update?
- What does maxSurge=0, maxUnavailable=0 do — and why is it a deadlock?
Red flag Forgetting readiness probes — without them Kubernetes sends traffic to Pods that are up but not actually ready to serve.
source: Kubernetes — Rolling updates ↗ -
What's the operational cost of feature flags, and how do you keep them from becoming tech debt?
Flags decouple deploy from release and are powerful, but each one adds a branch to your code's runtime behavior. Costs: combinatorial explosion (N flags = 2^N possible states you can't all test), stale flags that linger long after a rollout completes and confuse readers, and the risk of a flag becoming a permanent, undocumented config knob.
The fix is treating flags as short-lived by default: a release toggle exists only to ramp a feature, and you delete it (and its dead branch) the moment the feature is 100% rolled out. Distinguish flag *kinds* — release toggles are transient; ops/kill-switches and permissioning toggles are long-lived and managed differently. Track flags in a registry with an owner and an expiry, and add cleanup to the definition of done.
What a strong answer coversEach flag doubles the runtime state space — 2^N combinations quickly become untestable.
Stale release toggles are tech debt: dead branches that mislead readers and rot.
Categorize toggles — release (short-lived) vs ops/kill-switch and permissioning (long-lived) — and manage each differently.
Give every flag an owner and expiry; deleting the flag is part of finishing the feature.
Follow-ups they push on- Why are short-lived release toggles managed differently from long-lived kill-switches?
- How does an unbounded set of flags undermine your test strategy?
Red flag Leaving release toggles in the code after the feature is fully rolled out — they accumulate into untested, confusing dead branches and a combinatorial test nightmare.
source: Martin Fowler — Feature Toggles (Managing toggles) ↗ -
What metrics actually decide whether to promote or roll back a canary?
A canary is only as good as the signal you judge it by. The decision should be automated and metric-gated, comparing the canary against the baseline (the current stable version) over the same window — not against historical numbers, since traffic shifts.
Watch the user-facing signals: error rate, latency percentiles (p95/p99, not the mean), and request success/throughput, plus key business metrics where relevant (checkout completion, sign-ups). Saturation of the canary's resources is a secondary guard. If any guardrail metric on the canary is statistically worse than baseline beyond a threshold, auto-roll-back; otherwise ramp traffic up in stages.
The pitfalls to design around: too short a bake time (a slow leak or a cache that hasn't warmed won't show yet), too little canary traffic (no statistical power), and comparing against the wrong baseline.
What a strong answer coversCompare the canary against the concurrent baseline, over the same window — not against historical data.
Gate on user-facing signals: error rate, latency percentiles, success/throughput, plus business KPIs.
Automate the verdict: breach a guardrail → auto-roll-back; otherwise ramp up in stages.
Give it enough bake time and traffic volume for the signal to be statistically meaningful.
Follow-ups they push on- Why compare against a concurrent baseline rather than yesterday's numbers?
- What kind of bug would a 10-minute canary with 1% traffic still miss?
Red flag Promoting a canary too quickly or on too little traffic — slow leaks, cold caches, and rare-path errors don't surface in a short, low-volume bake, so the 'green' canary ships a latent bug.
source: Google SRE Workbook — Canarying Releases ↗
6.3 Security fundamentals 15
-
What is the root cause shared by all injection attacks, and why is parameterization the fix?
Every injection flaw — SQL, OS command, LDAP, NoSQL, XPath, even XSS — has the same root cause: untrusted data is interpreted as code because data and instructions travel on the same channel. The interpreter can't tell which bytes you meant as a value and which as syntax, so attacker input rewrites the command's structure.
Parameterization fixes this by *separating the channels*: the query/command template (the code) is sent and compiled independently of the parameters (the data), so user input is bound as a literal value and can never change the parsed structure.
SELECT * FROM users WHERE id = ?with a bound parameter treats'; DROP TABLEas a harmless string.This is why the generalized defense is 'keep code and data separate' — prepared statements for SQL, argument arrays (not shell strings) for OS commands, and context-aware encoding for output. Escaping/blocklisting is a fragile fallback, not the primary control.
What a strong answer coversRoot cause of all injection: untrusted data is parsed as code because they share one channel.
Parameterization sends template and data separately, so input binds as a literal and can't alter structure.
Generalizes beyond SQL: arg arrays for OS commands, parameterized APIs for LDAP/NoSQL, encoding for output (XSS).
Escaping/blocklists are fallbacks, not the fix — they miss encodings and edge cases.
Quick self-checkWhy do parameterized queries prevent SQL injection?
-
Correct — separating the code channel from the data channel is exactly what stops input from being parsed as SQL.
-
Parameterization doesn't strip anything; it binds input as data. Character-stripping is the fragile escaping approach.
-
Encryption is unrelated; injection is about how the query is parsed, not whether it's encrypted in transit.
-
Least privilege is useful defense-in-depth but doesn't prevent the injection itself.
Follow-ups they push on- Why is OS command injection still possible even with a 'parameterized' shell call if you pass a single string?
- How does XSS fit the same 'data interpreted as code' model?
Red flag Thinking injection is a SQL-specific problem solved by a SQL-specific trick — it's a universal code/data-confusion flaw, and the universal fix is separating the two, not escaping characters.
source: OWASP — Injection Prevention Cheat Sheet ↗ -
What's the difference between authentication and authorization, and why must both be enforced server-side?
Authentication (authn) is *who are you?* — verifying identity (password, token, passkey). Authorization (authz) is *what are you allowed to do?* — checking that the verified identity has permission for this action/resource. Authn always comes first; authz decides what that authenticated identity may access.
Both must be enforced server-side because the client is fully under the attacker's control: hiding a button, disabling a form field, or checking a role in JavaScript stops only honest users. An attacker just crafts the HTTP request directly (curl, Burp), bypassing every front-end check. The browser is a convenience layer, never a trust boundary.
So the server must, on every request, verify the credential *and* re-check that this identity is permitted — front-end checks are UX, not security.
What a strong answer coversAuthn = who you are (verify identity); authz = what you may do (verify permission). Authn precedes authz.
The client is attacker-controlled — any check in JS/HTML can be bypassed by crafting the raw request.
Enforce both on the server, every request; front-end checks are UX, not a trust boundary.
Skipping the server-side authz re-check is exactly the Broken Access Control (#1) failure.
Quick self-checkAn admin-only button is hidden in the UI for non-admins, but the /admin/delete endpoint has no server-side role check. What's true?
-
Correct — hiding UI is not authorization; the server must enforce the role check on the request itself.
-
Hiding the button stops only honest users; an attacker crafts the HTTP request directly.
-
Authentication proves identity but doesn't check permission — an authenticated non-admin still gets through.
-
This is broken access control (missing authorization), not script injection.
Follow-ups they push on- How can an attacker bypass a front-end-only role check?
- Where do authentication failures (A07) differ from access-control failures (A01)?
Red flag Enforcing access control only in the UI (hidden buttons, disabled fields) — the server must re-verify, since the client can forge any request directly.
source: OWASP — Authorization Cheat Sheet ↗ -
What is SQL injection, and what is the *one* correct defense?
SQL injection is when untrusted input is concatenated into a query so the attacker can change its structure — e.g.
' OR '1'='1to bypass a login, or'; DROP TABLE users;--to destroy data.The primary defense is parameterized queries / prepared statements: the SQL text and the data travel on separate channels, so input is always treated as a value, never as code. ORMs do this for you when used correctly.
Defense in depth adds least-privilege DB accounts and allow-list input validation — but escaping by hand is error-prone and not the real fix. The principle (separate code from data) generalizes to *all* injection: OS command, LDAP, NoSQL, etc.
Follow-ups they push on- Why is manual escaping or a blocklist of bad characters not a reliable defense?
- How does an ORM still let you write injectable queries?
Red flag Saying 'sanitize/escape the input' as the primary fix — parameterization is the answer; ad-hoc escaping misses cases.
source: OWASP — SQL Injection Prevention Cheat Sheet ↗ -
What is defense in depth, and why isn't input validation alone enough to stop XSS?
Defense in depth is layering independent controls so that no single failure is fatal — if one layer is bypassed, another still stands. No control is perfect, so you don't bet everything on one.
For XSS, input validation alone is insufficient because the danger depends on output context, not the input. A string that's harmless in an HTML body can break out inside a
<script>block, an HTML attribute, a URL, or a CSS context — and validation at the input boundary can't know where the value will eventually be rendered. Worse, data arrives from many sources (DB, other services) that never passed your input filter.So you layer: context-aware output encoding at the point of rendering (the primary defense), a strict Content-Security-Policy as a backstop that limits what injected script can do,
HttpOnlycookies so stolen script can't read the session token, and input validation as one more (not the only) layer.What a strong answer coversDefense in depth = independent layers; a single bypass shouldn't compromise the system.
XSS safety depends on output context (HTML body vs attribute vs JS vs URL), which input validation can't anticipate.
Primary defense is context-aware output encoding at render time; CSP is the backstop.
Data also enters from sources that never hit your input filter (DB, other services), so input validation alone is incomplete.
Follow-ups they push on- Why does the same string need different encoding in an HTML attribute vs a JavaScript context?
- What does a Content-Security-Policy actually restrict?
Red flag Treating input validation as the complete XSS fix — encoding must happen at output based on context, and CSP/HttpOnly provide the additional layers that catch what slips through.
source: OWASP — Cross Site Scripting Prevention Cheat Sheet ↗ -
What is Security Misconfiguration (OWASP A02:2025), and give concrete examples.
Security Misconfiguration is risk introduced by how systems are set up rather than by code flaws — and it climbed to A02 in the 2025 Top 10, reflecting how common it is across the increasingly complex, configurable stacks we run.
Concrete examples: default or unchanged credentials; verbose error pages or stack traces leaking internals in production; unnecessary features/ports/services left enabled; an S3 bucket or admin console open to the public; missing security headers (HSTS, CSP); directory listing on; debug mode on in prod; overly permissive CORS.
The defense is a repeatable, hardened baseline: minimal install (remove what you don't use), secure defaults, infrastructure-as-code so every environment is configured identically and reviewably, automated configuration scanning, and segregated environments. It overlaps tightly with least privilege and supply-chain hygiene.
What a strong answer coversRisk from setup, not code — defaults, exposed services, leaked errors, missing headers.
Rose to A02 in 2025 because modern stacks have huge configurable surface area.
Examples: default creds, public buckets, debug mode in prod, verbose stack traces, permissive CORS.
Fix with a hardened, minimal, repeatable baseline (IaC + config scanning + identical environments).
Follow-ups they push on- Why does Infrastructure-as-Code reduce misconfiguration risk?
- Why are verbose production error messages a security problem, not just a UX one?
Red flag Treating misconfiguration as a one-time setup task — config drifts across environments and over time; without IaC and scanning, prod quietly diverges into an insecure state.
source: OWASP Top 10:2025 — A02 Security Misconfiguration ↗ -
What is encoding (Base64), and why is it not a security control?
Encoding transforms data into another representation for safe transport or storage — Base64, URL-encoding, hex. It's a fully reversible, keyless, public algorithm: anyone can decode it with no secret. Its purpose is *compatibility* (e.g. putting binary in a text/JSON field), not secrecy.
That's the trap: Base64 *looks* scrambled, so people mistake it for protection. But
dXNlcjpwYXNzdecodes touser:passin one trivial step — it provides zero confidentiality.The three are distinct: encoding = reversible, no key, for compatibility; encryption = reversible *with a key*, for confidentiality; hashing = one-way, no key, for integrity/verification. Anytime someone says 'we Base64 the password before sending,' that's a misunderstanding — over HTTP it's plaintext; you need TLS (encryption) for confidentiality.
What a strong answer coversEncoding is reversible and keyless — its job is transport/compatibility, not secrecy.
Base64 'looks' encrypted but decodes in one public step → zero confidentiality.
Distinguish the trio: encoding (no key, compat) vs encryption (key, confidentiality) vs hashing (one-way, integrity).
Base64-ing a credential adds no protection; only TLS/encryption provides confidentiality on the wire.
Quick self-checkWhich statement about Base64 encoding is correct?
-
Correct — anyone can decode Base64 without a secret; it's not a security control.
-
Base64 uses no key and is trivially reversible — that's encryption, which Base64 is not.
-
That describes hashing; Base64 is fully reversible.
-
Base64-stored passwords are effectively plaintext; passwords must be hashed with a slow algorithm.
Follow-ups they push on- Where is Base64 a legitimate, correct choice?
- Why is 'we Base64-encode the API key in the header' not securing anything?
Red flag Mistaking Base64 (or any encoding) for encryption — it's a reversible public transform with no key and provides no confidentiality whatsoever.
source: OWASP — Cryptographic Storage Cheat Sheet ↗ -
How do XSS and CSRF differ, and how do you defend against each?
XSS injects attacker-controlled JavaScript that runs in the victim's browser in *your* site's origin. Defenses: context-aware output encoding, a strict Content-Security-Policy, sanitize any HTML you must render, and
HttpOnlycookies so stolen script cannot read the session token.CSRF tricks an already-authenticated browser into firing an unwanted state-changing request (the browser auto-attaches the cookie). Defenses: anti-CSRF tokens,
SameSitecookies, and verifying theOrigin/Refererheader.The crisp distinction: XSS abuses the site's trust in user input; CSRF abuses the site's trust in the user's authenticated session.
Follow-ups they push on- Why does SameSite=Lax mitigate most CSRF?
- Why don't CSRF tokens help against XSS?
Red flag Claiming CSRF tokens stop XSS — if you have XSS, the attacker's script can just read the CSRF token and forge a valid request.
source: OWASP — Cross Site Request Forgery (CSRF) ↗ -
Hashing vs encryption — what's the difference, and which do you use for passwords?
Encryption is reversible: with the key you can recover the plaintext. Use it for data you must read back — data in transit (TLS), secrets at rest.
Hashing is one-way: you cannot invert it; you can only re-hash a candidate and compare. Use it when you only ever need to *verify*, never recover — exactly the password case.
So passwords are hashed, not encrypted — if you can decrypt them, so can an attacker who steals your key. And not just any hash: use a slow, memory-hard password hash with a per-password salt.
Follow-ups they push on- Where does encoding (Base64) fit — is it a security control?
- What's the difference between hashing and an HMAC/keyed hash?
Red flag Saying you 'encrypt passwords' — that is the wrong primitive; passwords should be hashed with a dedicated password hash so they are non-recoverable.
source: OWASP — Password Storage Cheat Sheet ↗ -
A login endpoint returns 'user not found' for unknown emails and 'wrong password' for known ones. What's wrong?
It is a user-enumeration vulnerability. The two distinct messages let an attacker probe which emails are registered, building a target list for credential stuffing, phishing, or password spraying.
Fix: return a single generic message — 'invalid email or password' — for both cases, and keep the response *timing* uniform (still run a dummy password hash when the user doesn't exist) so the attacker can't distinguish via latency either. The same care applies to signup ('email already in use') and password-reset flows.
This ties to OWASP A07 (Authentication Failures).
Follow-ups they push on- How could an attacker still enumerate users via response timing, and how do you prevent that?
- How does this interact with the password-reset 'we sent an email if it exists' pattern?
Red flag Fixing only the message text but leaving a timing side-channel (fast 'not found' vs slow bcrypt compare) that still leaks which accounts exist.
source: OWASP — Authentication Cheat Sheet (account enumeration) ↗ -
What sits at #1 of the OWASP Top 10:2025, and name a couple of categories that are new or changed this edition.
A01: Broken Access Control is #1 — and in the 2025 edition it now absorbs Server-Side Request Forgery (SSRF). It means users acting outside their intended permissions: missing authorization checks, IDOR, privilege escalation.
What is new/notable in 2025:
- A03: Software Supply Chain Failures is new and surged into the top 3 — broadened from the old 'Vulnerable and Outdated Components' to the whole dependency/build ecosystem.
- A10: Mishandling of Exceptional Conditions is brand new — improper error handling, failing open, logic errors on abnormal input.
- The full order: A01 Broken Access Control, A02 Security Misconfiguration, A03 Software Supply Chain Failures, A04 Cryptographic Failures, A05 Injection, A06 Insecure Design, A07 Authentication Failures, A08 Software/Data Integrity Failures, A09 Logging & Alerting Failures, A10 Mishandling of Exceptional Conditions.Follow-ups they push on- Why did SSRF get folded into Broken Access Control?
- What does 'failing open' mean under A10, and why is it dangerous?
Red flag Quoting the 2021 list as current (e.g. putting Injection at #3 or naming 'Vulnerable and Outdated Components') — in 2025 Injection is A05 and supply-chain is its own A03.
source: OWASP Top 10:2025 ↗ -
What is an IDOR, and why does Broken Access Control sit at #1 of the OWASP Top 10:2025?
An IDOR (Insecure Direct Object Reference) is the canonical Broken Access Control bug: an endpoint exposes a reference to an object —
/api/orders/1043— and the server returns it based on the URL alone, without checking that *this* user is allowed to see *that* object. Change1043to1044and you read someone else's order.It's #1 in OWASP Top 10:2025 (as it was in 2021) because authorization is per-request, per-object logic that's easy to forget on some path, hard for scanners to find, and devastating when wrong — it's the most commonly found weakness. The 2025 edition also folded SSRF into this category.
The fix is to enforce authorization server-side on every request, checking ownership/role against the authenticated identity — never trusting a client-supplied ID, never relying on the object reference being unguessable, and denying by default. Using unpredictable IDs (UUIDs) is hardening, not a substitute for the check.
What a strong answer coversIDOR: the server returns an object from a client-supplied reference without verifying the user is authorized for it.
#1 because authz is per-object, easy to miss, hard to scan for, and catastrophic — the most prevalent weakness.
Enforce authorization server-side on every request, deny by default, check ownership against the session identity.
Unguessable IDs (UUIDs) are hardening — not a replacement for the access-control check; SSRF now lives in this category (2025).
Follow-ups they push on- Why is switching from sequential IDs to UUIDs not a real fix for IDOR?
- Why was SSRF moved under Broken Access Control in 2025?
Red flag Relying on 'unguessable' object IDs or hiding the endpoint instead of performing a real per-request authorization check — security by obscurity, not access control.
source: OWASP Top 10:2025 — A01 Broken Access Control ↗ -
Why store a session token in an HttpOnly, Secure, SameSite cookie rather than localStorage?
localStorageis fully readable by any JavaScript on the page — so a single XSS flaw lets an attacker's script exfiltrate the token instantly. A cookie markedHttpOnlyis invisible to JavaScript: even with XSS, the script can't read the token to steal it.The other flags close the remaining gaps:
Securesends the cookie only over HTTPS (no plaintext interception), andSameSite(Lax/Strict) stops the browser from auto-attaching it on cross-site requests, which mitigates CSRF — the attack that cookie-based auth otherwise invites.The tradeoff: HttpOnly cookies are auto-sent by the browser, so you take on CSRF risk and must defend it (SameSite + anti-CSRF tokens). localStorage avoids CSRF but trades it for far worse XSS token theft. The consensus is HttpOnly cookies with CSRF defenses, because XSS token exfiltration is the more dangerous failure.
What a strong answer coverslocalStorage is readable by any JS — one XSS = instant token theft.
HttpOnly hides the cookie from JavaScript, so XSS can't read/exfiltrate it.
Secure = HTTPS-only; SameSite blocks cross-site auto-send, mitigating CSRF.
Cookies trade XSS-theft risk for CSRF risk — so pair them with SameSite + anti-CSRF tokens.
Follow-ups they push on- If HttpOnly cookies are auto-sent, what new attack do you now have to defend, and how?
- Can an attacker with XSS still abuse an HttpOnly session cookie even without reading it?
Red flag Storing JWTs/session tokens in localStorage 'for convenience' — it's directly readable by any injected script, turning any XSS into full account takeover.
source: OWASP — Session Management Cheat Sheet ↗ -
Why is SHA-256 a bad choice for storing passwords, and what's the salt for?
General-purpose hashes like SHA-256 are *designed to be fast* — which is exactly wrong for passwords. An attacker with the hash file can compute billions of guesses per second on a GPU.
Use a slow, memory-hard password hash: OWASP recommends Argon2id (then scrypt; bcrypt only for legacy). Their tunable work factor keeps verification fast for you but brute force expensive for attackers.
The salt is a unique random value per password, stored alongside the hash. It ensures two users with the same password get different hashes and defeats precomputed rainbow tables — the attacker must crack each password individually. (A site-wide secret pepper can be layered on top.)
Follow-ups they push on- Why does a salt have to be unique per password rather than one site-wide value?
- What is a pepper and how does it differ from a salt?
Red flag Using a fast hash (MD5/SHA-1/SHA-256) for passwords, or reusing one salt for everyone — both leave you open to rainbow-table and GPU attacks.
source: OWASP — Password Storage Cheat Sheet ↗ -
What is the principle of least privilege, and how does it apply to secrets management?
Least privilege: every user, service, and credential gets only the permissions it needs to do its job — no more, no less. It shrinks the blast radius when something is compromised.
Applied to secrets:
- Don't hardcode secrets in source or commit them to git; store them in a secrets manager (Vault, AWS/GCP Secrets Manager) or injected env vars.
- Scope each secret narrowly — a service's DB credential can touch only its own schema, not everything.
- Rotate secrets, and prefer short-lived/dynamic credentials over long-lived static keys.
- Audit access so a leaked key is detectable and revocable.This maps to OWASP A02 (Security Misconfiguration) and A08 (Integrity Failures).
Follow-ups they push on- Why is a leaked secret in git history not fixed by just deleting the file in a new commit?
- How do short-lived/dynamic credentials reduce risk versus static keys?
Red flag Granting broad, permanent admin credentials 'to keep things simple' — it maximizes blast radius and violates least privilege.
source: OWASP — Secrets Management Cheat Sheet ↗ -
What is Software Supply Chain risk (OWASP A03:2025), and how do you reduce it?
Your app is mostly code you didn't write — third-party packages, their transitive deps, base images, and the build/CI pipeline itself. A03:2025 covers compromises anywhere in that chain: a malicious or vulnerable dependency, a typosquatted package, a poisoned build step, or a tampered artifact.
Mitigations:
- Pin and lock dependencies (lockfiles, hashes) so builds are reproducible.
- Scan deps for known CVEs (SCA tools) and patch promptly.
- Generate an SBOM so you know what you ship.
- Verify provenance / sign artifacts (e.g. Sigstore) and protect CI credentials.
- Minimize and pin base images.This was broadened in 2025 from the older 'Vulnerable and Outdated Components' to the whole ecosystem.
Follow-ups they push on- What is an SBOM and why did regulators start requiring it?
- How would a typosquatted npm package actually compromise you?
Red flag Treating supply-chain security as just 'keep dependencies updated' — it also covers the build pipeline, artifact provenance, and transitive deps.
source: OWASP Top 10:2025 — Introduction (A03 Software Supply Chain Failures) ↗
6.4 Testing 12
-
Define unit, integration, and end-to-end tests — what does each actually verify?
Unit tests exercise the smallest testable piece — one function/class — in isolation, with collaborators faked. They verify *this unit's logic is correct*. Fast and deterministic.
Integration tests verify that units talk to a real collaborator correctly — your code against an actual database, queue, or HTTP API. They catch interface/wiring bugs a unit test mocks away.
End-to-end tests drive the fully assembled system the way a user would (through the UI or public API) and verify a whole journey works. Slowest, most realistic, most brittle.
The trade is realism vs. speed/stability: unit = fast + narrow, e2e = realistic + fragile.
Follow-ups they push on- Why can a suite of all-green unit tests still let a broken feature ship?
- What's the difference between an integration test and a component test?
Red flag Calling a test that mocks the database an 'integration test' — if every dependency is faked it is still a unit test.
source: Martin Fowler — The Practical Test Pyramid ↗ -
What is the Arrange-Act-Assert pattern, and what makes a test maintainable?
Arrange-Act-Assert (AAA) structures a test into three clear phases: Arrange the inputs and preconditions, Act by invoking the one thing under test, then Assert on the outcome. Keeping these visually separate makes a test read as a tiny spec of the behavior.
Maintainable tests share a few traits: they test one behavior (so a failure points at one cause), assert on observable behavior rather than implementation detail (so a refactor doesn't break them), are deterministic and isolated (no shared state, no order dependence), and have descriptive names that state the scenario and expected result. A good test is also fast.
The through-line: a test should fail for exactly one reason and tell you what that reason is. Tests are production code — DRY-ish helpers are fine, but readability beats cleverness.
What a strong answer coversAAA: Arrange preconditions → Act on the unit → Assert the outcome; keep the phases visibly separate.
Test one behavior per test so a failure localizes to a single cause.
Assert on observable behavior, not internals, so refactors don't break green tests.
Be deterministic, isolated, and descriptively named — a test should fail for exactly one reason.
Follow-ups they push on- Why does asserting on private implementation detail make tests brittle?
- What's the 'one assertion per test' guideline really getting at?
Red flag Writing tests that assert on internal calls/structure rather than observable behavior — they break on every refactor even when the behavior is unchanged, training people to delete tests.
source: Martin Fowler — Given-When-Then ↗ -
What is the test pyramid, and why more unit tests than end-to-end tests?
The pyramid is a guideline for the *shape* of your test suite: a wide base of fast, cheap unit tests; fewer integration tests in the middle; and a thin top of end-to-end tests through the whole system/UI.
Why that shape: as you go up, tests get slower, more brittle, and harder to pin a failure to a cause. Unit tests run in milliseconds and localize bugs precisely; e2e tests run for minutes, flake on timing, and only tell you *something* broke. So you push as much coverage as low as possible and reserve e2e for a few critical user journeys.
The inverted shape — mostly e2e — is the ice-cream cone anti-pattern: slow, flaky, expensive to maintain.
Follow-ups they push on- What does the 'ice-cream cone' look like and why is it painful?
- Where do contract tests fit in this picture?
Red flag Treating the pyramid as exact ratios or gospel rather than a heuristic — the real point is fast/cheap/localized at the bottom, slow/brittle at the top.
source: Martin Fowler — The Practical Test Pyramid ↗ -
Walk me through the TDD cycle. What does it actually buy you?
TDD is red-green-refactor:
1. Red — write a small failing test for the next bit of behavior.
2. Green — write the minimum code to make it pass.
3. Refactor — clean up the code (and tests) now that they are green, keeping the bar passing.Repeat in tiny increments. What it buys you: tests exist by construction (not bolted on later), the code is *designed to be testable* (so it tends toward decoupling and clear interfaces), and you get a fast feedback loop plus a regression safety net that lets you refactor fearlessly. It also forces you to define 'done' before coding.
Follow-ups they push on- Why is the refactor step the part people skip, and what happens when they do?
- When is strict TDD a poor fit?
Red flag Describing TDD as 'write tests after the code' — the whole point is the test comes *first* and drives the design.
source: Martin Fowler — Test Driven Development ↗ -
Should unit tests hit a real database? When is an in-memory or test-container DB the right call?
By definition, a unit test shouldn't touch a real DB — that makes it slow and non-deterministic. So you mock the data layer for unit tests. But mocking the DB means you never verify your *actual* SQL, migrations, or ORM mappings, and that's where real bugs hide.
So the pragmatic answer is layered: unit-test pure logic with the DB doubled, then write integration tests against a real database engine for the queries themselves. The mistake to avoid is using a *different* engine in tests than in production — e.g. SQLite or an in-memory fake standing in for Postgres. SQL dialects, constraint behavior, and types differ, so tests can pass against the fake and fail against prod (or vice versa).
Modern practice is Testcontainers: spin up the *real* database (same engine/version as prod) in a throwaway container for integration tests. You get fidelity without polluting a shared environment.
What a strong answer coversA true unit test doesn't hit a DB — mock the data layer for logic; it's slow/non-deterministic otherwise.
But mocks never validate real SQL, migrations, or ORM mappings — cover those with integration tests.
Don't substitute a different engine (SQLite for Postgres) — dialect/constraint differences make tests lie.
Use Testcontainers to run the real prod-version DB in a disposable container for integration tests.
Follow-ups they push on- Why can an in-memory SQLite stand-in for Postgres give false confidence?
- What belongs in a unit test vs an integration test for a repository class?
Red flag Testing against a different DB engine than production (in-memory fake for the real thing) — dialect and constraint mismatches let bugs pass tests and break in prod.
source: Testcontainers — Database integration testing ↗ -
What is mutation testing, and how does it reveal that high code coverage can be misleading?
Line/branch coverage tells you code *ran* during tests, not that anything was *checked*. Mutation testing measures the latter: a tool makes small deliberate changes (mutants) to your code — flip
>to>=, replace+with-, negate a condition, return null — then reruns your tests against each mutant.If a mutant makes a test fail, it's killed (good — your tests detected the change). If all tests still pass, the mutant survived — meaning your suite executed that code but never asserted anything that the change would break. The mutation score (killed / total) is a far better quality signal than coverage.
This exposes the assertion-free-coverage problem directly: you can have 100% line coverage and a low mutation score, because tests call the code but verify nothing meaningful. The cost is compute — running the suite once per mutant is expensive — so teams often run it on critical modules rather than the whole repo.
What a strong answer coversCoverage proves code executed; mutation testing proves your assertions actually catch changes.
It injects small bugs (mutants); a killed mutant = tests detected it, a survivor = a gap in assertions.
Mutation score (killed/total) is a stronger quality metric than line coverage.
Directly exposes assertion-free coverage: 100% lines but mutants survive = tests that check nothing.
Cost is high (rerun suite per mutant), so target critical modules rather than the whole codebase.
Quick self-checkA mutant 'survives' a mutation test run. What does that tell you?
-
Correct — a surviving mutant means the change wasn't detected, exposing weak or missing assertions.
-
Possible but not what 'survived' specifically means — survival is about tests running yet not detecting the change.
-
Mutants are throwaway experiments on a copy, not changes shipped to production.
-
Mutation testing is slow to run, but survival is a signal about assertion quality, not speed.
Follow-ups they push on- How can you have 100% line coverage and a 40% mutation score?
- What is an 'equivalent mutant' and why does it muddy the score?
Red flag Trusting coverage as a quality bar — mutation testing routinely shows high-coverage suites with surviving mutants, i.e. tests that run code without asserting on its behavior.
source: PIT (Pitest) — Mutation testing ↗ -
Your e2e suite takes 45 minutes and people skip it. How do you make the test strategy sustainable?
A 45-minute, ignored e2e suite is usually the ice-cream-cone anti-pattern: too much testing pushed up to the slow, brittle e2e layer. The fix is to rebalance toward the test pyramid — push coverage down to where it's fast and reliable.
Concretely: for each slow e2e test, ask what it really verifies and move that assertion to the lowest layer that can — pure logic to unit tests, service-boundary behavior to integration/contract tests, and reserve e2e for a handful of critical user journeys (login, checkout). Parallelize what remains across CI runners, and split the suite so fast tests gate every PR while the full e2e set runs on a schedule or pre-deploy.
Separately, hunt flakiness — a slow suite people skip is often also a flaky one they've stopped trusting. Quarantine and fix flaky tests rather than retrying. The goal is a fast feedback loop developers actually run, backed by a thin, stable e2e layer.
What a strong answer coversA bloated e2e suite is the ice-cream cone — rebalance toward the pyramid (fast, low-level tests).
Move each assertion to the lowest layer that can verify it; keep e2e for a few critical journeys only.
Parallelize and tier the suite: fast tests gate PRs, full e2e runs pre-deploy/scheduled.
Attack flakiness too — skipped suites are usually distrusted (flaky) ones; quarantine and fix, don't retry.
Follow-ups they push on- How do you decide which assertions can move down from e2e to unit/integration?
- Why is tiering the suite (PR gate vs nightly) better than running everything on every push?
Red flag Speeding up an ice-cream-cone suite by only adding retries and more parallelism — without rebalancing toward the pyramid you still have a slow, brittle suite developers route around.
source: Martin Fowler — The Practical Test Pyramid ↗ -
What's the difference between a mock and a stub, and when do you reach for each?
Both are test doubles that stand in for a real dependency, but they answer different questions.
A stub provides canned return values so the code under test can run — it is about *state*: 'when asked, return this'. You assert on the output your code produces.
A mock also has pre-programmed responses but additionally *verifies the interaction* — it is about *behavior*: 'was
sendEmailcalled once, with these args?'. You assert on the mock itself.Rule of thumb: stub queries (reads), mock commands (side effects you care happened). Over-mocking couples tests to implementation detail and makes refactoring painful.
Follow-ups they push on- What's the difference between a fake and a stub?
- Why can heavy mocking make tests pass while the real integration is broken?
Red flag Mocking everything, including pure logic — the test then asserts on internal calls and breaks on any refactor even when behavior is unchanged.
source: Martin Fowler — Mocks Aren't Stubs ↗ -
Why isn't 100% code coverage the goal? Can you have high coverage and still be poorly tested?
Coverage measures which lines *executed* during tests — not whether you *asserted* anything meaningful about them. You can hit 100% with tests that call code and check nothing, or that never exercise the edge cases and error paths that actually break in production.
Chasing 100% also has diminishing returns: the last few percent are often trivial getters or unreachable branches, and the effort is better spent elsewhere. Worse, it incentivizes shallow tests written to satisfy a number.
Better: treat coverage as a *diagnostic for gaps* (what is entirely untested?), aim for a sensible threshold, and judge quality by whether tests assert behavior and cover the risky paths — not by a single percentage.
Follow-ups they push on- What is mutation testing and how does it expose 'assertion-free' coverage?
- Which kinds of code genuinely don't need unit tests?
Red flag Treating a coverage percentage as a quality metric — high coverage with weak/absent assertions is theater.
source: Martin Fowler — Test Coverage ↗ -
A test in your CI passes locally but fails ~10% of the time in the pipeline. How do you approach it?
That is a flaky test — non-deterministic. First, do not 'fix' it by retrying or deleting; quarantine it so it stops eroding trust in the suite, then root-cause it.
Common causes to check:
- Async/timing: a fixed
sleepinstead of waiting on a real condition; race conditions.
- Shared state / test ordering: tests that leak state between runs or assume order.
- Time and randomness: realnow(), time zones, unseeded random.
- External dependencies / network that are slow or unavailable in CI.
- Resource contention in the parallel CI runner that doesn't happen locally.Fix the determinism (inject the clock, isolate state, wait on conditions, stub the network). Flaky tests are dangerous because people start ignoring red builds.
Follow-ups they push on- Why is auto-retrying flaky tests a trap?
- How would you reproduce a CI-only failure locally?
Red flag Masking flakiness with blanket retries — it hides real race conditions and trains the team to ignore failing tests.
source: Martin Fowler — Eradicating Non-Determinism in Tests ↗ -
What's the difference between sociable and solitary unit tests, and the 'London vs Detroit' (mockist vs classicist) schools?
A solitary unit test isolates the unit by replacing *all* its collaborators with test doubles; a sociable unit test lets the unit use its real collaborators (as long as they're fast and deterministic), testing them together.
This maps to two testing schools. The mockist / London school favors solitary tests with mocks for every dependency, verifying *interactions* — it gives precise failure localization and tests units in true isolation, but couples tests to the call structure, so refactors that preserve behavior can still break tests. The classicist / Detroit (Chicago) school favors sociable tests, mocking only awkward dependencies (network, clock, DB), and asserting on *resulting state* — tests are more refactor-resilient and catch integration bugs between collaborators, but a failure may implicate several units.
Neither is 'correct'; the tradeoff is isolation/precision vs. refactor-resilience/realism, and most teams blend them.
What a strong answer coversSolitary = all collaborators doubled; sociable = uses real collaborators where practical.
Mockist/London: mock everything, verify interactions — precise localization, but couples tests to call structure.
Classicist/Detroit: mock only awkward deps, assert on state — refactor-resilient, catches inter-unit bugs.
The tradeoff is isolation/precision vs. realism/refactor-resilience; teams usually mix both.
Follow-ups they push on- Why can a mockist test pass while the real integration is broken?
- Which approach makes a behavior-preserving refactor less likely to break tests, and why?
Red flag Treating one school as universally right — all-mockist suites become refactor-fragile interaction tests, while all-sociable suites can lose failure localization.
source: Martin Fowler — Unit Test (Solitary vs Sociable) ↗ -
What is a contract test, and what problem does it solve that unit and e2e tests don't?
When service A calls service B, A's unit tests stub B — but the stub encodes A's *assumption* of B's API, which silently rots when B changes. Full e2e tests catch the mismatch but are slow, flaky, and need every service deployed together.
Contract testing (e.g. consumer-driven contracts / Pact) fills the gap. The consumer (A) defines the requests it makes and the responses it expects as a contract; that contract is then verified against the provider (B) independently. If B's change would violate A's expectations, B's pipeline fails — *before* anything is deployed together.
The payoff: you get confidence that two services are compatible at their boundary with the speed and independence of unit tests — no shared environment, each side tested in its own pipeline. It's how you keep a microservices fleet integrable without a giant brittle e2e suite.
What a strong answer coversStubs of a remote service encode assumptions that drift as the provider changes — unit tests won't notice.
A contract captures the consumer's expected requests/responses and is verified against the provider separately.
It catches integration breakage before deploy, without a shared e2e environment.
Gives boundary-compatibility confidence with the speed/isolation of unit tests — key for microservices.
Follow-ups they push on- What does 'consumer-driven' add over the provider just publishing an OpenAPI spec?
- Why don't all-green unit tests on both services guarantee they integrate?
Red flag Assuming green unit tests on both sides mean the services integrate — the consumer's stub can diverge from the provider's real behavior, which only contract or integration tests catch.
source: Martin Fowler — Contract Testing ↗
6.5 Version control intricacies 12
-
Walk me through resolving a merge conflict. What is Git actually asking you to do?
A conflict happens when two branches changed the *same lines* (or one edited what the other deleted) and Git can't auto-pick a winner. It pauses and marks the file with
<<<<<<< HEAD(your side),=======, and>>>>>>> other-branch(incoming side).To resolve: open each conflicted file, decide the correct final content (it is rarely 'pick one blindly' — often you keep parts of both), delete the conflict markers, then
git addthe file to mark it resolved andgit commit(orgit rebase --continue).Good practice: understand *why* both sides changed it, run the tests after resolving, and keep branches short-lived so conflicts stay small.
git merge --abortbacks out if you want to start over.Follow-ups they push on- How does keeping PRs small reduce conflict pain?
- What is `git rerere` and when does it help?
Red flag Blindly accepting one side ('keep mine'/'keep theirs') to make the conflict go away — that silently drops the other side's legitimate change.
source: Atlassian — Merge conflicts ↗ -
What makes a good commit message and a good atomic commit, and why does it matter downstream?
An atomic commit captures one logical change — it does exactly one thing and leaves the codebase building/passing. A good message has a concise imperative summary line ('Add retry to S3 upload', ~50 chars), a blank line, then a body explaining the why (and any tradeoffs), not the *what* — the diff already shows what changed.
Why it matters is entirely downstream: clean atomic commits make
git bisectland on a tiny diff, makegit revertundo exactly one change without collateral, make code review comprehensible commit-by-commit, and makegit blame/log a usable history rather than a wall of 'misc fixes'. A commit that bundles a refactor, a feature, and a formatting sweep is impossible to bisect, revert, or review cleanly.The summary is for scanning
git log; the body is for the engineer (often future-you) who needs to understand *why* a line exists.What a strong answer coversAtomic = one logical change that builds/passes on its own.
Message = imperative summary line + blank line + body explaining why, not what.
Pays off in bisect (tiny diff), revert (no collateral), review (commit-by-commit), and blame/log.
Bundled commits (feature + refactor + reformat) are un-bisectable, un-revertable, and unreviewable.
Follow-ups they push on- Why explain the *why* in the body when the diff already shows the *what*?
- How does a clean commit history make `git revert` safer than a bundled one?
Red flag Bundling unrelated changes into one commit (and writing 'fixes'/'updates' as the message) — it destroys the downstream value of bisect, revert, blame, and review.
source: Git — Commit Guidelines (Pro Git book) ↗ -
What's the point of a pull request beyond merging code? Why squash-merge vs merge-commit vs rebase-merge?
A pull request is the collaboration unit, not just a merge button: it's where review, CI gates, discussion, and an audit trail of *why* a change was made all attach to a proposed change before it lands. The merge is the smallest part.
The three merge modes shape your
mainhistory differently. Merge commit preserves every commit on the branch plus a merge node — full history, butmaingets noisy with WIP commits. Squash-merge collapses the whole PR into one commit onmain— clean, atomic, one-PR-one-commit history that's easy to bisect/revert, at the cost of losing the branch's intermediate commits. Rebase-merge replays the branch's commits linearly ontomainwith no merge node — linear history that keeps individual commits, but rewrites their hashes.Many teams default to squash-merge for a tidy, revertable trunk; rebase-merge when individual commits are each meaningful; merge-commit when preserving exact branch topology matters.
What a strong answer coversA PR bundles review, CI gates, discussion, and audit trail — merging is its smallest function.
Merge commit: keeps all branch commits + a merge node — full history, noisier trunk.
Squash-merge: one commit per PR — clean, atomic, easy to bisect/revert; loses intermediate commits.
Rebase-merge: linear history keeping individual commits, but rewrites their hashes (no merge node).
Follow-ups they push on- Why does squash-merge make `git revert` of a whole feature trivial?
- When would preserving the branch's individual commits (rebase/merge) be worth the noise?
Red flag Thinking a PR is just a merge mechanism — its real value is the review/CI/discussion gate; and picking a merge strategy without considering how bisect/revert/readability of `main` are affected.
source: GitHub Docs — About merge methods on GitHub ↗ -
Rebase vs merge — what's the difference, and when should you NOT rebase?
Merge ties two branches together with a merge commit, preserving the true, non-linear history (and the context of when work diverged).
Rebase replays your commits on top of the target branch, producing a *linear* history as if you'd branched from the latest
main— cleaner log, no merge bubbles. But it rewrites commit hashes.The golden rule of rebasing: never rebase commits that exist outside your local repo / that others have based work on. Rewriting a shared/public branch changes its history out from under teammates, causing divergence and painful re-syncs. Rebase your *private* feature branch onto
mainbefore opening the PR; use merge for integrating shared branches.Follow-ups they push on- What does `git pull --rebase` do, and why might a team standardize on it?
- If you must change a pushed branch, what makes force-pushing 'safer'?
Red flag Rebasing a branch other people have already pulled — it rewrites shared history and forces everyone into messy recovery.
source: Atlassian — Merging vs Rebasing (the golden rule) ↗ -
What does git cherry-pick do, and what's a legitimate use case?
git cherry-pick <sha>applies the *changes introduced by one specific commit* onto your current branch, creating a new commit (new hash) with the same diff.Legitimate uses: backporting a hotfix from
mainonto a release/maintenance branch without dragging along everything else; recovering one commit from an abandoned branch; pulling a single fix forward.Use it sparingly: cherry-picking the same change into multiple branches duplicates commits, which can cause confusing 'phantom' conflicts later when the branches eventually merge. Prefer normal merge/rebase flow when you actually want all of a branch.
Follow-ups they push on- Why can repeated cherry-picks create duplicate-commit merge conflicts down the line?
- How is cherry-pick different from a partial merge?
Red flag Using cherry-pick as a routine integration strategy — it scatters duplicated commits and breaks the clean ancestry that merge/rebase preserve.
source: Atlassian — git cherry-pick ↗ -
A bug appeared somewhere in the last 200 commits. How do you find which commit introduced it?
Use
git bisect— a binary search over history. You mark a known-bad commit and a known-good one; Git checks out the midpoint, you test it and markgoodorbad, and it halves the range each step. Over ~200 commits that is roughly 8 tests instead of 200.If you can script the check (a test that exits non-zero on the bug),
git bisect run <script>automates the whole thing. When done,git bisect resetreturns you to where you started, and you have the exact offending SHA — thengit showit to understand the change.This is why small, atomic commits matter: bisect lands you on a tiny diff, not a 2,000-line mega-commit.
Follow-ups they push on- Why do large, mixed-purpose commits make bisect less useful?
- How does `git bisect run` automate the search?
Red flag Manually checking out commits at random instead of bisecting — it is O(n) guessing versus O(log n) binary search.
source: Git — git-bisect documentation ↗ -
What's the difference between git reset, git revert, and git checkout/restore?
They operate at different levels and have very different safety profiles.
git revert <sha>creates a *new* commit that undoes the changes of an earlier one — history is preserved and moves forward. It's the safe way to undo a commit that's already been pushed/shared, because it doesn't rewrite history.git resetmoves the current branch pointer to another commit, rewriting history.--softkeeps changes staged,--mixed(default) unstages them,--harddiscards working-tree changes too. Reset is for local, unpushed history — using it on shared history is the rebase-style hazard.git checkout/git restore(modern Git split checkout's jobs intoswitchfor branches andrestorefor files) operate on the working tree / specific files — discarding local file changes or restoring a file to a given version, without moving the branch.Rule of thumb: undo *public* history with
revert; rewrite *local* history withreset; restore *files/working tree* withrestore.What a strong answer coversrevert= new commit that undoes another; safe on pushed/shared history (no rewrite).reset= move the branch ref, rewriting history;--soft/--mixed/--harddiffer in what they keep. Local-only.restore/checkout= operate on files/working tree, not the branch pointer.Rule: revert public, reset local, restore files.
Quick self-checkA buggy commit is already pushed and others have pulled it. What's the safe way to undo it?
-
Correct — revert is non-destructive and safe for commits others already have.
-
This rewrites shared history out from under teammates — the exact hazard to avoid on a pushed branch.
-
That just moves HEAD into a detached state on your machine; it doesn't undo the commit for anyone.
-
restore touches working-tree files, not the published commit history everyone has pulled.
Follow-ups they push on- Why is revert the correct tool for undoing a commit on a shared branch?
- What exactly do --soft, --mixed, and --hard each preserve?
Red flag Using `git reset --hard` to undo a commit that's already pushed — it rewrites shared history (and `--hard` also destroys uncommitted work); use `revert` for anything public.
source: Atlassian — Resetting, checking out & reverting ↗ -
Compare trunk-based development, GitHub Flow, and Git Flow — when does each fit?
Trunk-based: everyone commits to (or merges tiny, short-lived branches into)
mainat least daily; unfinished work hides behind feature flags. Optimizes for continuous integration and fast delivery; demands strong tests and CI. The modern default for teams shipping continuously.GitHub Flow: one long-lived
mainplus short feature branches via pull request; merge and deploy on approval. A lightweight middle ground, great for web apps with continuous deployment.Git Flow: heavyweight model with long-lived
develop,release, andhotfixbranches alongsidemain. Suits versioned/installed software with scheduled releases — but for fast web delivery its long-lived branches cause painful merges and slow integration.The trend is toward trunk-based; Git Flow is increasingly considered overkill outside release-train products.
Follow-ups they push on- Why do long-lived branches hurt continuous integration?
- How do feature flags make trunk-based development possible?
Red flag Defaulting to Git Flow for a continuously-deployed web app — its long-lived branches fight CI and cause merge hell.
source: Atlassian — Trunk-based development ↗ -
When is force-pushing acceptable, and what makes --force-with-lease safer than --force?
Force-pushing is needed after you rewrite history on a branch (rebase, amend, interactive-rebase cleanup) — the remote ref no longer fast-forwards, so a normal push is rejected. It's acceptable on a branch you own that others aren't building on: typically your own feature/PR branch. It is *not* acceptable on shared branches like
main.Plain
git push --forceoverwrites the remote ref unconditionally — if a teammate pushed in the meantime, you silently destroy their commits.git push --force-with-leaseadds a safety check: it only overwrites if the remote is still at the commit you *last saw*. If someone else pushed since your last fetch, the lease check fails and the push is rejected, so you can't clobber work you didn't know about.So: rewrite only unshared history, and when you must force-push, use
--force-with-leaseso a surprise upstream change aborts the push instead of being overwritten.What a strong answer coversForce-push is required after history rewrites (rebase/amend); only OK on branches you own, never shared
main.--forceoverwrites the remote unconditionally — it can silently erase teammates' new commits.--force-with-leaseonly pushes if the remote still matches what you last fetched — else it aborts.The lease turns 'I might clobber unseen work' into a safe failure you can investigate.
Follow-ups they push on- How can --force-with-lease still bite you if a tool runs `git fetch` in the background?
- Why does an interactive rebase on a PR branch require a force-push at all?
Red flag Using plain `--force` on a branch others might have pushed to — it overwrites their commits with no warning; `--force-with-lease` aborts instead, so it should be the default.
source: Atlassian — git push (force pushing) ↗ -
You pushed a commit with a leaked API key. Is deleting the file in a new commit enough? How do you fix it?
No — a new commit that removes the file leaves the secret in history; anyone can
git log/git checkoutthe old commit and read it. The secret is effectively public the moment it was pushed.Correct response, in order:
1. Rotate/revoke the key immediately — assume it is already compromised. This is the only step that truly protects you.
2. Purge it from history (git filter-repo, or BFG Repo-Cleaner) and force-push, coordinating with the team since it rewrites shared history.
3. Add a pre-commit/secret-scanning hook and a.gitignoreso it can't recur.The key insight: history rewriting is cleanup, but rotation is the real fix — caches, forks, and clones may still hold the old blob.
Follow-ups they push on- Why is rotating the secret more important than scrubbing it from git?
- Why does rewriting history here require a coordinated force-push?
Red flag Thinking 'I deleted the file and committed, we're fine' — the secret persists in history and must be rotated regardless.
source: GitHub Docs — Removing sensitive data from a repository ↗ -
You ran a bad reset/rebase and 'lost' commits that aren't in any branch. How do you get them back?
Use
git reflog. The reflog records whereHEAD(and each branch ref) has pointed over time — every commit, checkout, reset, rebase, and merge — even commits no branch points at anymore. Areset --hardor a botched rebase doesn't delete the old commits; it just moves the ref, leaving the originals 'dangling' but still reachable via reflog.Recovery:
git reflogto find the SHA from *before* the bad operation (e.g.HEAD@{3}), thengit reset --hard <sha>to move the branch back, orgit checkout -b recover <sha>/git cherry-pick <sha>to salvage specific commits.The key insight: in Git, work you've committed is almost never truly lost — those objects survive until garbage collection (default ~30–90 days) and the reflog is the map to them. (Uncommitted working-tree changes, by contrast, *are* gone — reflog only tracks committed history.)
What a strong answer coversgit refloglogs every position of HEAD/branch refs — including commits no branch references.reset --hard/rebase move refs, leaving old commits dangling but recoverable, not deleted.Recover with
git reset --hard <sha>orgit checkout -b/cherry-pickthe SHA found in the reflog.Committed work survives until GC (~30–90 days); only uncommitted changes are truly unrecoverable.
Follow-ups they push on- Why can reflog recover a committed change but not uncommitted working-tree edits?
- How long do dangling commits survive before garbage collection removes them?
Red flag Panicking and re-doing work after a bad reset/rebase — the old commits are almost always recoverable via reflog; only uncommitted changes are genuinely lost.
source: Atlassian — git reflog ↗ -
What is an interactive rebase (squash/fixup/reword) for, and what's the risk?
git rebase -ilets you rewrite a series of your own commits before sharing them: reorder them, squash/fixupseveral WIP commits into one logical change, reword messages, edit a commit's content, or drop a commit. The point is to turn a messy local history ('wip', 'fix typo', 'oops') into a clean, reviewable sequence of atomic commits — which makes review,git bisect, andgit revertfar more useful later.The risk is the same golden rule of rebasing: it rewrites commit hashes, so you must only do it to commits that haven't been shared. Interactive-rebasing commits others have already based work on rewrites public history and forces everyone into painful re-syncs. Do it on your local feature branch before opening (or updating) the PR; never on shared
main.What a strong answer coversInteractive rebase curates your own unshared commits: squash/fixup, reorder, reword, edit, drop.
Goal: a clean, atomic, reviewable history — which makes bisect and revert more effective.
It rewrites hashes, so obey the golden rule: only on commits not yet shared.
Use it on your local feature branch pre-PR, never on shared
main.
Follow-ups they push on- How does `git commit --fixup` plus `rebase --autosquash` streamline cleanup?
- Why does squashing make `git bisect` and `git revert` more useful afterward?
Red flag Interactive-rebasing commits that are already pushed/shared — it rewrites public history (new hashes) and forces collaborators into messy recovery; keep it to local, unshared work.
source: Atlassian — Rewriting history (interactive rebase) ↗
6.6 Code quality 11
-
What does 'clean code' mean to you? Name a few concrete principles.
Clean code is code optimized for the *reader*, since code is read far more than it is written. Concrete principles:
- Intention-revealing names — a name should say what something is/does so you don't need a comment to explain it.
- Small, single-purpose functions — one level of abstraction, do one thing.
- DRY — don't duplicate knowledge; but don't abstract prematurely either.
- **Comments explain *why*, not *what* — the code shows what; comments justify non-obvious decisions.
- Consistent style** — let formatters/linters handle it so reviews focus on substance.The through-line: minimize the cognitive load on the next person (often future-you).
Follow-ups they push on- When does DRY go too far and create the wrong abstraction?
- Why is a comment that restates the code a smell?
Red flag Reciting buzzwords (DRY, SOLID) without the underlying goal — readability and changeability — or over-applying DRY into a tangled wrong abstraction.
source: Martin Fowler — Two Hard Things (naming) / CodeAsDocumentation ↗ -
What is refactoring, and when is the right time to do it?
Refactoring is changing the *internal structure* of code to make it easier to understand and cheaper to modify, without changing its observable behavior. The behavior-preserving part is what makes it safe — and why a solid test suite is its prerequisite.
When: not as a separate 'refactoring sprint' but continuously, woven into feature work. The pragmatic trigger is the rule of three / refactor-when-it-hurts — when you are about to add a feature and the existing design fights you, first refactor to make the change easy, then make the easy change. Plus the boy-scout rule: leave each file a little cleaner than you found it.
Follow-ups they push on- Why is refactoring without tests dangerous?
- What's the difference between refactoring and rewriting?
Red flag Calling any code change 'refactoring' even when it alters behavior — that conflation is how 'refactors' sneak in bugs and scope creep.
source: Martin Fowler — Refactoring ↗ -
What's the difference between a linter and a formatter, and why automate both?
A formatter (Prettier, gofmt, Black) rewrites code to a canonical *style* — indentation, quotes, line length. It is purely cosmetic and deterministic.
A linter (ESLint, Ruff, golangci-lint) analyzes code for *problems and smells* — unused variables, likely bugs, anti-patterns, sometimes security issues. It catches substance, not just style.
Automate both, ideally in pre-commit hooks and CI, because it removes whole categories of nit-picking from human review. When formatting and trivial issues are settled by tools, reviewers spend their attention on design and correctness — the things only humans can judge. It also keeps style consistent regardless of who wrote the code.
Follow-ups they push on- Why run these in CI even if developers have editor integration?
- How does auto-formatting reduce diff noise in code review?
Red flag Conflating the two, or relying on humans to enforce style in review — that wastes reviewer attention on what a tool should settle automatically.
source: Prettier — Prettier vs. Linters ↗ -
What does cyclomatic complexity measure, and why is high complexity a problem?
Cyclomatic complexity counts the number of independent paths through a piece of code — essentially one plus the number of decision points (
if,for,while,case,&&/||,?:). A straight-line function is 1; each branch adds a path.Why it matters: it correlates with how hard the code is to understand, test, and maintain. It's also a lower bound on the number of test cases needed to cover every path — a function with complexity 15 needs at least 15 paths exercised to test thoroughly, which is a strong hint it's doing too much. High complexity concentrates risk: the more tangled the branching, the more places a bug can hide.
Use it as a heuristic flag, not a hard law — a high score points you at a function worth simplifying (extract method, replace nested conditionals with guard clauses or polymorphism), but a naturally branchy dispatch can be legitimately high. Linters can fail a build over a threshold to keep it visible.
What a strong answer coversMeasures independent paths ≈ 1 + count of decision points (branches/loops/boolean operators).
Higher = harder to understand, test, maintain; it's a lower bound on test cases needed for path coverage.
A high score flags a function doing too much — a candidate for extract-method / guard clauses.
It's a heuristic, not gospel — some dispatch logic is legitimately branchy.
Quick self-checkWhat does a high cyclomatic complexity number most directly indicate?
-
Correct — complexity counts decision paths, and more paths mean more to understand and cover with tests.
-
Complexity is about control-flow paths, not execution speed; a branchy function can still be fast.
-
Line count isn't the metric — a long but straight-line function has complexity 1.
-
That's a different concern (coupling); cyclomatic complexity is internal branching, not dependencies.
Follow-ups they push on- Why is cyclomatic complexity a lower bound on the number of tests for full path coverage?
- Which refactorings most directly reduce a function's complexity score?
Red flag Treating a complexity threshold as an absolute rule — it's a signal to investigate, and gaming the number (splitting one clear function into confusing fragments) can hurt readability more than it helps.
source: NIST — Cyclomatic Complexity (Structured Testing) ↗ -
What makes a good code review, and what should reviewers actually look for?
A good review judges whether the change improves the overall health of the codebase — not whether it is perfect. Reviewers look for, roughly in priority order:
- Design: does the change belong here, fit the architecture, and not over-engineer?
- Correctness & edge cases: logic, error handling, concurrency, security.
- Tests: do they exist and actually exercise the behavior?
- Naming, clarity, comments: will the next reader understand it?
- Consistency with project conventions.Process matters too: keep PRs small (faster, deeper reviews), review promptly to unblock people, comment kindly and explain the 'why', and distinguish blocking issues from optional nits (label them). The goal is shared understanding and a healthier codebase, not gatekeeping.
Follow-ups they push on- Why are small PRs reviewed better than large ones?
- How do you give critical feedback without demoralizing the author?
Red flag Reviewing only for style/formatting (which a linter should catch) while rubber-stamping the design — the expensive bugs live in design and edge cases.
source: Google — Code Review Developer Guide (What to look for) ↗ -
The team wants to stop feature work for a 'big rewrite' to fix the messy codebase. What's your take?
Push back. Big-bang rewrites are notoriously risky: you spend months reproducing existing behavior (including the undocumented edge cases the old code quietly handles), ship no value during the freeze, and often discover the new system has its own mess by the time it's done — the famous 'second-system' trap. Meanwhile the business is frozen and a parallel old-vs-new maintenance burden appears.
The pragmatic alternative is incremental refactoring under a green test suite, often via the Strangler Fig pattern: build the new behavior around the edges of the old system, route traffic to it piece by piece, and retire the old parts gradually — delivering value continuously and keeping rollback cheap. Pay down debt where you're already working (boy-scout rule) and where it has the highest interest.
A rewrite is occasionally justified (the platform is truly dead, or constraints changed fundamentally), but the default answer is: refactor incrementally, keep shipping, and make the cost of debt visible so it's prioritized — not a heroic stop-the-world bet.
What a strong answer coversBig-bang rewrites freeze value delivery and must re-derive every undocumented edge case the old code handles.
They invite the second-system effect and a long old-vs-new dual-maintenance period.
Prefer incremental refactoring behind tests, e.g. the Strangler Fig — replace piece by piece, keep shipping.
Pay down debt where you already work and where interest is highest; make the cost visible.
Quick self-checkWhat's the strongest argument against a big-bang rewrite of a working legacy system?
-
Correct — the hidden, accreted behavior and the value freeze are exactly what makes rewrites overrun and underdeliver.
-
Language choice isn't the issue; the risk is in re-deriving behavior and the delivery freeze, regardless of language.
-
Backwards — incremental refactoring (e.g. Strangler Fig) is precisely the alternative to rewriting.
-
Overly absolute and not the core objection; the central risk is lost behavior and frozen delivery.
Follow-ups they push on- What is the Strangler Fig pattern and how does it de-risk replacing a legacy system?
- What rare conditions actually justify a full rewrite over incremental refactoring?
Red flag Defaulting to a stop-the-world rewrite — it usually overruns, loses hard-won edge-case behavior, ships nothing for months, and lands you with a new mess; incremental refactoring is the lower-risk path.
source: Martin Fowler — StranglerFigApplication ↗ -
How do you keep code review and quality gates from becoming a bottleneck that slows the team down?
The dominant lever is small changes. A small PR is reviewed faster, more thoroughly, and merges before it rots; large PRs sit for days, get rubber-stamped, and block their authors. Google's guidance is explicit that small CLs are central to fast, high-quality review.
Reduce the human cost by automating what doesn't need judgment: formatters and linters settle style, CI runs the tests, security/dependency scanners flag the obvious — so reviewers spend their limited attention on design and correctness, not whitespace. Set an SLA for review turnaround (review promptly so authors aren't blocked) and make review a first-class part of the day, not an interruption deferred indefinitely.
Also right-size the gate: not every change needs the same rigor, and 'don't let perfect be the enemy of good' — approve net improvements and file follow-ups. The goal is a fast, trustworthy pipeline, not maximal ceremony.
What a strong answer coversSmall PRs are the biggest lever — faster, deeper review; large PRs block authors and get rubber-stamped.
Automate the judgment-free checks (lint/format/tests/scanners) so humans review design and correctness.
Set a review-turnaround SLA and treat review as first-class work, not a deferred interruption.
Right-size rigor and approve net improvements with follow-ups — don't let perfect block good.
Follow-ups they push on- Why does a small PR get a higher-quality review than a large one?
- Which checks should never reach a human reviewer at all?
Red flag Trying to fix slow reviews by lowering standards or skipping review — the real fixes are smaller changes, automation of trivial checks, and a turnaround SLA, which speed things up *without* sacrificing quality.
source: Google — Code Review Developer Guide (Small CLs) ↗ -
What is technical debt, and how do you decide whether to pay it down?
Technical debt is the implied future cost of choosing an easy-now solution over a better-but-slower one — like financial debt, it accrues 'interest' as every future change in that area takes longer.
Fowler's quadrant is useful: debt can be *deliberate or inadvertent* and *prudent or reckless*. Deliberate-prudent debt ('we'll ship now and refactor next sprint, and we know the tradeoff') is a legitimate engineering decision; reckless debt ('what's layering?') is not.
Deciding to pay it down: prioritize debt in code you touch often (high interest) over dead corners; pay it down opportunistically as you work nearby (boy-scout rule) rather than via giant rewrites; and make the cost visible to stakeholders so it competes fairly with features.
Follow-ups they push on- Why is debt in rarely-touched code often fine to leave?
- How do you make tech debt visible to non-engineering stakeholders?
Red flag Treating all tech debt as equally urgent (or all of it as 'just bad code') — debt in hot paths costs far more than debt in stable, untouched code.
source: Martin Fowler — Technical Debt Quadrant ↗ -
Name a few code smells and explain what each one signals.
A code smell is a surface symptom that *hints* at a deeper design problem — not a bug, but a prompt to look closer. Common ones:
- Long method / large class: too many responsibilities; signals a need to extract functions/classes.
- Duplicated code: the same knowledge in many places — change one, miss the others (DRY violation).
- Long parameter list: often a missing object that should group related params.
- Feature envy: a method that mostly uses *another* object's data — behavior is in the wrong place.
- Shotgun surgery: one change forces edits across many files — poor cohesion.
- Primitive obsession / magic numbers: missing a domain type or named constant.The value is that smells give a shared vocabulary for review and point toward the right refactoring — but they are heuristics, not hard rules.
Follow-ups they push on- Why is a smell a *hint* rather than a definitive 'this is wrong'?
- Which refactoring addresses 'shotgun surgery'?
Red flag Treating every smell as a mandatory fix — sometimes the 'smelly' code is the pragmatic choice; smells prompt investigation, not reflexive rewrites.
source: Martin Fowler — CodeSmell ↗ -
What are coupling and cohesion, and why do we want low coupling and high cohesion?
Cohesion is how strongly the things *inside* a module belong together — high cohesion means a module has one clear, focused responsibility. Coupling is how dependent modules are on *each other's* internals — low coupling means modules interact through small, stable interfaces and can change independently.
We want high cohesion, low coupling because together they localize change. With high cohesion a single concern lives in one place (you know where to look, and the change stays contained). With low coupling, changing one module doesn't ripple into others. The opposite — low cohesion, high coupling — produces the shotgun surgery smell (one change forces edits everywhere) and fragile code where a tweak in module A mysteriously breaks module B.
This is the engine behind modularity, SOLID's single-responsibility and dependency-inversion principles, and why you depend on interfaces rather than concrete implementations.
What a strong answer coversCohesion = how well a module's internals belong together (want it high — one responsibility).
Coupling = how much modules depend on each other's internals (want it low — small stable interfaces).
Together they localize change: a concern lives in one place and edits don't ripple outward.
Low cohesion + high coupling → shotgun surgery and fragile, change-resistant code.
Follow-ups they push on- How does depending on an interface instead of a concrete class reduce coupling?
- Which code smell is the direct symptom of low cohesion across modules?
Red flag Optimizing one in isolation — e.g. splitting code into many tiny modules can lower per-module size while *raising* coupling (everything calls everything); you want both directions right together.
source: Martin Fowler — Reducing Coupling (Beck Design Rules) ↗ -
How do you give code-review feedback that improves the code without alienating the author?
Review the code, not the person, and assume competence — phrase comments about the change ('this query runs N+1') rather than the author ('you always...'). Explain the why behind a suggestion so it teaches rather than dictates, and prefer asking ('what happens if
itemsis empty here?') over commanding when you're unsure.Distinguish blocking issues from preferences: label optional suggestions explicitly (Google's convention is prefixing nits with
Nit:) so the author knows what must change versus what's taste. Don't let perfect block good — if a change improves the codebase's health overall, approve it even if it isn't exactly how you'd write it; file follow-ups for non-urgent improvements.Process courtesies matter too: review promptly to avoid blocking people, keep feedback respectful and specific, and recognize good work, not just problems. The goal is a healthier codebase and a team that *wants* their code reviewed.
What a strong answer coversCritique the code, not the person; assume competence and explain the why.
Label nits/optional suggestions vs blocking issues so the author knows what's required.
Don't gatekeep on perfection — approve net improvements; file follow-ups for the rest.
Review promptly and respectfully; the aim is codebase health *and* a team that welcomes review.
Follow-ups they push on- Why is prefixing optional comments with 'Nit:' valuable to the author?
- When should a reviewer approve a change that isn't exactly how they'd write it?
Red flag Blocking a net-positive change over personal style preferences, or phrasing feedback as commands/attacks — it breeds resentment and slows the team without improving the code.
source: Google — Code Review Developer Guide (How to comment) ↗
07 Building in the AI Age 80 Q's
7.1 Anatomy of a modern app 10
-
Where does your code actually run — client or server — and why does it matter for what you can put in it?
Client code is shipped to and runs in the user's browser — it's fully visible (anyone can open DevTools and read it) and editable, so it can keep no secrets and enforce no rules. Server code runs on a machine you control — invisible to the user — so it can hold credentials, reach the database, and enforce checks.
The practical rule: anything that must stay secret or be trusted (API keys, authorization, pricing, validation that counts) lives on the server. The client is for rendering and convenience-level checks only.
What a strong answer coversClient code runs in the user's browser — fully visible and editable, keeps no secrets.
Server code runs on a machine you control — invisible to the user, can hold credentials and enforce rules.
Secrets and trusted checks (auth, pricing, real validation) must be server-side.
Client-side checks are a UX nicety; the server must re-validate everything that matters.
Quick self-checkYou hardcode a third-party API key into your React component so the browser can call the API. What's the problem?
-
Wrong — compiled/bundled JS still ships to the browser and is readable in DevTools.
-
Correct — client code is fully visible; secrets belong on the server, which proxies the call.
-
Wrong — it works in both, which is exactly why it's dangerous; it leaks everywhere.
-
Wrong — that's not a real constraint; the issue is exposure, not length.
Follow-ups they push on- If you bundle an API key into your frontend JS, who can see it?
- Why is a 'disable the button' check in the browser not real security?
Red flag Putting an API key or secret in frontend code 'because it's just JavaScript' — it ships to every visitor and is trivially readable.
source: MDN — Server-side vs client-side code ↗ -
Name the three tiers of a typical web app and say what each is responsible for.
Three tiers: client/frontend (the browser — renders UI, handles interaction), server/backend (a machine you control — owns business logic, data access, and secrets), and database (persists state).
The key separation is trust: the browser is untrusted and public, so anything sensitive (DB credentials, API keys, authorization checks) lives on the server. The frontend asks the server for data; the server talks to the database.
Follow-ups they push on- Why can't the browser talk to the database directly?
- Where does an API sit in this picture?
Red flag Saying the frontend 'connects to the database' — it never does; it calls your server, which holds the credentials.
source: InterviewPrep — 3-Tier Architecture ↗ -
What is an API, in one sentence, and why does the frontend go through it instead of the database?
An API is a contract: a defined set of endpoints the server exposes so other code can request data or actions without knowing the internals.
The frontend goes through it because the API is the trust boundary. It can authenticate the caller, authorize the action, validate input, and hide DB credentials and schema. If the browser hit the DB directly, anyone could read its network traffic, steal the credentials, and run any query.
Follow-ups they push on- What does the API do that the database can't be trusted to do itself?
- What's the difference between an endpoint and a route?
Red flag Describing an API only as 'a URL' — the point is the contract and the trust boundary, not the address.
source: MDN — How does the web work? ↗ -
Trace what happens, end to end, when a user types a URL and hits Enter.
Walk the path out loud: browser parses the URL, DNS resolves the domain to an IP, the request reaches the host/server, the server/API runs logic and (if needed) queries the database, builds a response, sends it back, and the browser renders it.
Good signal is naming the layers in order and knowing DNS is a lookup, not the server itself. Bonus: mention HTTPS securing the connection along the way.
Follow-ups they push on- Where would caching help in that path?
- What's the difference between the host and the code running on it?
Red flag Skipping DNS, or thinking the domain name 'is' the server. DNS is the phone book that maps name to address.
source: MDN — How does the web work? ↗ -
What's the difference between 'build', 'deploy', and 'host'? People use them interchangeably.
They're three stages, not synonyms. Build compiles/bundles your source into shippable artifacts (the
dist/folder). Deploy is the act of pushing those built artifacts to a place that serves them. Host is the place itself — the always-on machine or platform serving the result.Mnemonic: build is a verb that produces files, deploy is a verb that moves them, host is the noun where they live.
Follow-ups they push on- Where does 'repo' and 'bundle' fit in the chain?
- What is CI/CD in one line?
Red flag Conflating build and deploy — you can build without deploying (a failed CI run) and redeploy the same build.
source: Vercel — Deployments overview ↗ -
What is an environment variable, and why should secrets never be committed to the repo?
An environment variable is config supplied to the program at runtime by its environment, not hardcoded in the source — things like the database URL or an API key. The same code reads different values in dev, preview, and prod.
Secrets stay out of the repo because git history is forever and repos get shared, cloned, and leaked. A committed key is compromised even after you 'delete' it — it's still in history. Secrets belong in the host's environment-variable/secret store.
Follow-ups they push on- You accidentally committed an API key. What do you do?
- Why use a `.env` file locally but not commit it?
Red flag Thinking deleting the line in a later commit fixes it — the secret is still in history and must be rotated.
source: The Twelve-Factor App — Config ↗ -
What's the difference between an endpoint and a route?
A route is the path pattern the server matches against an incoming request (
/users/:id). An endpoint is a specific addressable operation — usually a method + path together (GET /users/:idvsDELETE /users/:id) — that does one thing.In practice people use them loosely, but the useful distinction is: one route (path) can host several endpoints, one per HTTP method. The route is where the request lands; the endpoint is the exact action it triggers.
What a strong answer coversA route is the URL path pattern the server matches (
/users/:id).An endpoint is method + path together — a specific operation (
GET /users/:id).One route can back several endpoints, one per HTTP method (GET/POST/DELETE…).
:idis a path parameter — a placeholder filled by the actual request.
Follow-ups they push on- How does the same path serve a GET and a DELETE differently?
- What's a path parameter vs a query parameter?
Red flag Thinking a path alone fully identifies an operation — `/users/1` means nothing until you also know the method (read it? delete it?).
source: MDN — Routing (Server-side first steps) ↗ -
Why can't the browser talk to the database directly — what would go wrong?
To connect to a database you need its address and credentials, and the browser is a public, untrusted environment — anyone can read the page's network traffic and JavaScript. Shipping DB credentials to the browser means handing them to every visitor.
Even if you could, there'd be no enforcement layer: the database just runs whatever query it's given. The server sits in between precisely to authenticate the caller, authorize the action, validate input, and only then run a safe, scoped query. The DB stays on a private network the browser can't reach.
What a strong answer coversConnecting needs credentials; the browser is public, so those credentials would leak to everyone.
The database has no notion of *who* is asking — it just runs the query it's given.
The server is the enforcement layer: authenticate, authorize, validate, then query.
In real setups the DB lives on a private network the browser literally can't reach.
Follow-ups they push on- What does the server add that the database can't enforce itself?
- How is a DB connection string like an API key?
Red flag Imagining the database can 'just check permissions' itself — it executes queries; the trust/permission logic lives in your server code.
source: MDN — Server-side programming: first steps ↗ -
What's the difference between a request and a response, and what do status codes like 200, 404, and 500 tell you?
A request is what the client sends (method, URL, headers, optional body); a response is what the server sends back (a status code, headers, and usually a body). Every HTTP exchange is one request and one response.
Status codes are the response's one-glance summary: 2xx = success (200 OK), 3xx = redirect, 4xx = the client did something wrong (404 not found, 401/403 auth problems, 400 bad input), 5xx = the server broke (500 internal error). The first digit tells you whose 'fault' it is — 4xx is on the caller, 5xx is on the server.
What a strong answer coversRequest = client → server (method, URL, headers, body); response = server → client (status, headers, body).
2xx success, 3xx redirect, 4xx client error, 5xx server error.
404 = not found, 401/403 = not authenticated/authorized, 400 = bad request, 500 = server crashed.
The leading digit tells you where to look first: 4xx → the request; 5xx → the server logs.
Quick self-checkYour API returns 500 for a request. Where do you look first?
-
Wrong — that's the 4xx family; invalid input is a 400/422.
-
Correct — 5xx is the server's fault, so the stack trace/server logs are the place to start.
-
Wrong — DNS failure happens before any HTTP status is returned.
-
Wrong — 2xx is success; 500 is a server error.
Follow-ups they push on- If you see a 401 vs a 403, what's the difference?
- Why is a 500 your problem but a 404 might be the caller's?
Red flag Returning 200 for everything (including errors) and signalling failure only in the body — it breaks clients, caches, and monitoring that rely on the status code.
source: MDN — HTTP response status codes ↗ -
What is a runtime (Node, a browser, an edge runtime), and why does 'where it runs' change what your code can do?
A runtime is the environment that executes your code and decides which capabilities (APIs) are available. The same JavaScript behaves differently depending on the runtime: a browser runtime gives you the DOM,
fetch, andlocalStoragebut no filesystem; Node.js gives you the filesystem, network sockets, and process access but no DOM; an edge runtime is a stripped-down server runtime optimized to run close to users, often missing some Node APIs.So 'where it runs' is really 'which runtime', and the runtime is what gates what's possible —
fs.readFileworks in Node and crashes in a browser;document.querySelectorworks in a browser and is undefined in Node.What a strong answer coversA runtime is the execution environment that supplies the available APIs.
Browser: DOM,
fetch,localStorage; no filesystem or process access.Node.js: filesystem, sockets, env/process; no DOM.
Edge runtimes are lean server runtimes near the user — fast, but a subset of Node's APIs.
Follow-ups they push on- Why does `document` exist in the browser but not in Node?
- Why might a library work locally (Node) but fail when deployed to an edge runtime?
Red flag Assuming 'it's all JavaScript' means any code runs anywhere — runtime-specific APIs (fs, DOM) make code break when moved to the wrong environment.
source: MDN — JavaScript execution environments ↗
7.2 Frontend for backend devs 10
-
What do HTML, CSS, and JavaScript each do, and what is the DOM?
HTML is structure (the content and its meaning), CSS is style (how it looks), JavaScript is behavior (what happens when you interact). The DOM (Document Object Model) is the browser's live, in-memory tree representation of the HTML — JS reads and changes the DOM, and the browser re-renders.
The one-liner that lands: HTML is the skeleton, CSS is the skin, JS is the muscles, and the DOM is the object you manipulate to change any of it at runtime.
Follow-ups they push on- When JS changes the page, is it changing the HTML file or the DOM?
- What's the difference between the DOM and the source HTML?
Red flag Thinking JS edits the .html file. It edits the DOM — the in-memory tree — not the file on disk.
source: MDN — How does the web work? ↗ -
Why do component states like loading, empty, and error matter as much as the happy path?
Any component that fetches or depends on real data has more than one state: while the data is in flight (loading), when it arrives but is empty (empty — zero results), when the request fails (error), and finally the populated success state. Real users hit all four.
If you only build the success path, the component shows a blank or broken UI the moment data is slow, missing, or failing — exactly the moments users notice. Designing the loading skeleton, the empty message, and the error/retry up front is what separates a demo from a shippable feature, and it's why you list these states explicitly when prompting an AI to build the component.
What a strong answer coversData-driven components have ≥4 states: loading, empty, error, success.
Users hit the non-happy states constantly (slow networks, no results, failures).
Skipping them yields blank/broken UI at the worst moment — when something's already wrong.
Naming all states up front is what makes a component (and an AI prompt) production-grade.
Follow-ups they push on- What's the difference between an empty state and an error state?
- Why is a loading state about perceived performance, not just correctness?
Red flag Building only the populated success view — the component looks done in the demo but breaks on the first slow or failed request in production.
source: Anthropic — Prompt engineering overview ↗ -
What's the difference between props and state in a component?
Props are inputs passed in from the parent — read-only from the component's view, like function arguments. State is data the component owns and can change over time, which triggers a re-render when it does.
Framework-agnostic rule of thumb: if the data comes from above and the component shouldn't mutate it, it's a prop; if the component manages it and updates it (a toggle, a form field, a counter), it's state.
Follow-ups they push on- What is a component, in one line?
- If a parent and child both need the same value, where should it live?
Red flag Mutating props directly. Props flow down and are read-only; to change them you lift state to the parent.
source: React — Thinking in React ↗ -
A UI library, a meta-framework, a styling system, a component kit — what's the difference?
Different layers of the stack: a UI library (React/Vue/Svelte) gives you the component model. A meta-framework (Next/Astro/SvelteKit) wraps a UI library with routing, rendering modes, and a build pipeline. A styling system (Tailwind or plain CSS) decides how you apply styles. A component kit (shadcn/MUI) is pre-built, styled components you drop in.
They stack, not compete: e.g. Astro (meta-framework) + React (UI library) + Tailwind (styling) + shadcn (components).
Follow-ups they push on- Is Next.js a replacement for React?
- What does a bundler like Vite do?
Red flag Calling Next.js 'a JavaScript framework like React' — Next is built on React; they're different layers.
source: Astro — Why Astro? ↗ -
You're asking an AI to build a UI component. What makes a good frontend prompt?
Name four things: the component (what it is — 'a comment card'), its props (the data it takes in), its states (loading, empty, error, hover/disabled), and a visual reference (a screenshot, an existing component to match, or a design system).
Vague prompts ('make a nice form') produce generic output. Specifying props and states is what turns the model from guessing into building to a contract — the same discipline you'd use describing the component to a teammate.
Follow-ups they push on- Why list the empty and error states explicitly?
- How does giving an existing component as reference help?
Red flag Only describing the happy path — you get a component that breaks on empty/error data you forgot to mention.
source: Anthropic — Prompt engineering overview ↗ -
What is a component, and why break a UI into components at all?
A component is a reusable, self-contained piece of UI — markup plus its own logic and styling — that takes inputs (props) and renders a piece of the screen. A
Button, aCommentCard, aNavbarare all components.You break a UI into components for the same reasons you break code into functions: reuse (write the card once, render it 50 times), isolation (a bug in one is contained), and composition (build complex screens by nesting small pieces). It also maps cleanly to how you reason and how you prompt an AI — one component, one clear responsibility.
What a strong answer coversA component = reusable UI unit: markup + logic + style, driven by props.
Reuse: define once, render many times with different props.
Isolation and composition: small pieces nest into whole screens; bugs stay contained.
Mirrors functions — one component should have one clear responsibility.
Follow-ups they push on- How do you decide where one component ends and another begins?
- What's the downside of one giant component that does everything?
Red flag Building one massive component for a whole page — it becomes unreusable, hard to test, and a nightmare to change or describe to an AI.
source: React — Your First Component ↗ -
Explain client-side rendering vs server-side rendering, and what 'hydration' means.
CSR: the server sends a near-empty HTML shell plus a JS bundle; the browser builds the whole DOM in JS. Fast to deploy, but slower first paint and weaker SEO. SSR: the server renders real HTML up front so the user sees content immediately and crawlers get real markup.
Hydration is the step after SSR where the JS bundle loads and attaches event listeners to the already-rendered HTML, turning the static markup into an interactive app. The HTML the server sent and the HTML React expects must match, or you get a hydration mismatch.
Follow-ups they push on- Why does SSR help SEO?
- What causes a 'hydration mismatch' error?
Red flag Thinking SSR means 'no JavaScript.' SSR sends HTML first, then still hydrates with JS for interactivity.
source: GreatFrontend — Explain what React hydration is ↗ -
Why does a list of rendered items need a stable `key`, and what goes wrong if you use the array index?
When you render a list, the framework needs to know which rendered element corresponds to which data item across re-renders — that's what
keyprovides. A stable, unique key (an item'sid) lets it correctly match, reorder, insert, and remove elements while preserving each item's state.Using the array index breaks this when the list reorders, filters, or has items inserted/removed: the index→item mapping shifts, so the framework reuses the wrong DOM node and component state (a half-typed input, a checkbox) sticks to the wrong row. Index keys are only safe for a static, never-reordered list.
What a strong answer coverskeylets the framework match rendered elements to data items across renders.Use a stable, unique id from the data — not the array index.
Index keys break on reorder/insert/delete: state and DOM attach to the wrong item.
Index is acceptable only for a fixed list that never changes order or length.
Quick self-checkYou render a reorderable todo list using the array index as each item's `key`. After dragging an item to the top, the checkboxes appear checked on the wrong todos. Why?
-
Wrong — keys can be strings; the type isn't the issue.
-
Correct — index keys aren't stable across reorder, so state binds to position, not identity.
-
Wrong — per-item state works fine with stable keys.
-
Wrong — the framework re-renders on state change; the bug is the key choice, not a missing render.
Follow-ups they push on- Why does a half-typed input jump to the wrong row with index keys?
- Where should the key come from if your data has no id?
Red flag Reaching for the array index as the key by default — it silently corrupts state when the list is dynamic (the exact case keys exist for).
source: React — Rendering Lists (keys) ↗ -
When the data on screen changes, what makes the UI update? Contrast the imperative DOM approach with the declarative component approach.
Imperative (vanilla DOM): you change the data *and* manually issue the DOM edits —
el.textContent = count— keeping the screen in sync by hand. It works but every state change means hand-written update code, which is where bugs breed.Declarative (React/Vue/Svelte): you describe what the UI should look like *as a function of state*, and when state changes you just update the state — the framework figures out the minimal DOM changes and applies them. You stop writing 'how to update the screen' and only write 'what the screen is for this state'. That's the core mental shift for a backend dev moving to the frontend.
What a strong answer coversImperative: you manually mutate the DOM on every change — error-prone bookkeeping.
Declarative: UI = f(state); you update state, the framework re-derives and patches the DOM.
The win is removing hand-written sync code, the classic source of UI bugs.
You think in 'what the screen is', not 'what DOM operations to perform'.
Follow-ups they push on- What is a 're-render' in a declarative framework?
- Why is manually syncing the DOM so bug-prone at scale?
Red flag Trying to manually edit the DOM inside a React component — you fight the framework; instead change state and let it re-render.
source: React — Reacting to Input with State ↗ -
What does it mean to 'lift state up', and when do you do it?
Lifting state up means moving a piece of state out of a child component into the closest common parent, then passing it back down as props (plus a callback to change it). You do it when two or more components need to read or stay in sync with the same value.
The rule: state should live at the lowest common ancestor of everything that needs it. If a child owns state that a sibling also needs, neither can see the other's local state, so you hoist it to the parent that contains both. The parent becomes the single source of truth and hands it down.
What a strong answer coversMove shared state to the closest common parent, pass it down as props.
Do it when 2+ components must read or stay in sync with the same value.
The parent becomes the single source of truth; children get value + a change callback.
Keeps duplicate, drifting copies of the same state from existing.
Follow-ups they push on- If two sibling components both need a value, where does it live?
- What's the risk of each sibling keeping its own copy of the same state?
Red flag Duplicating the same state in two siblings and trying to keep them in sync manually — they drift; lift it to the shared parent instead.
source: React — Sharing State Between Components ↗
7.3 Backend for frontend devs 10
-
Sketch a REST API for a 'notes' resource. What does the full set of CRUD endpoints look like?
REST organizes the API around a resource (notes) and uses HTTP methods for the verbs. The standard set:
GET /notes(list),POST /notes(create),GET /notes/:id(read one),PUT/PATCH /notes/:id(update),DELETE /notes/:id(delete).The pattern that makes it 'RESTful': the noun lives in the URL (the resource) and the verb lives in the HTTP method — never
POST /createNoteorGET /deleteNote/1. The same path/notes/:idserves read, update, and delete by varying the method. That convention is why anyone can guess your API once they know the resource.What a strong answer coversResource in the URL (
/notes), action in the HTTP method.List
GET /notes, createPOST /notes, readGET /notes/:id, updatePUT/PATCH /notes/:id, deleteDELETE /notes/:id.Avoid verbs in the path (
/createNote,/deleteNote/1) — that's the anti-pattern.Predictable: knowing the resource lets a caller guess the endpoints.
Quick self-checkWhich is the RESTful way to delete the note with id 42?
-
Wrong — GET should be safe/read-only, and the verb shouldn't be in the path.
-
Wrong — verb-in-URL and wrong method; not RESTful.
-
Correct — the DELETE method on the resource URL; verb in the method, noun in the path.
-
Wrong — the specific resource should be addressed by URL: /notes/42.
Follow-ups they push on- Why is `POST /notes/123/delete` considered un-RESTful?
- How does this map back to CRUD and to SQL operations?
Red flag Putting the verb in the URL (`GET /getNotes`, `POST /deleteNote`) — it breaks the REST convention and the method↔CRUD mapping.
source: MDN — HTTP request methods ↗ -
What can server-side code do that browser code can't?
The server runs on a machine you control, so it can hold secrets (API keys, DB credentials) the user never sees, reach the database directly, touch the filesystem, and call other services with trusted credentials.
Browser code is shipped to and runs on the user's machine — it's fully visible and editable by anyone, so it can't be trusted to keep secrets or enforce rules. Any check that matters (auth, pricing, permissions) must happen server-side.
Follow-ups they push on- Why isn't a check in the frontend enough to secure an action?
- What's a runtime — Node, Python — in this context?
Red flag Putting an authorization check only in the frontend. The user can bypass it; the server must re-check everything.
source: MDN — Server-side programming: first steps ↗ -
Map CRUD to HTTP methods. Which methods are idempotent?
CRUD ↔ HTTP: Create → POST, Read → GET, Update → PUT/PATCH, Delete → DELETE.
Idempotent means calling it N times has the same effect as calling it once. GET, PUT, and DELETE are idempotent; POST is not (two POSTs create two records). PATCH is generally not guaranteed idempotent. This matters for retries: it's safe to retry a GET or PUT after a timeout, but retrying a POST may double-charge or double-create.
Follow-ups they push on- What's the difference between PUT and PATCH?
- Why does idempotency matter when a request times out?
Red flag Saying GET is idempotent because 'it doesn't change anything' — that's safety. Idempotency is about repeated calls having one effect (a correct GET is also safe, but the concepts differ).
source: InterviewBit — REST API Interview Questions ↗ -
What is an ORM, and why do people warn against hand-concatenating SQL strings?
An ORM (Object-Relational Mapper, e.g. Prisma) lets you work with database rows as objects in your language instead of writing raw SQL — it generates the SQL for you and maps results back to typed objects.
Hand-concatenating SQL from user input invites SQL injection: if you build
"SELECT * FROM users WHERE name = '" + input + "'", a crafted input can break out of the string and run arbitrary SQL. ORMs (and parameterized queries) bind values separately from the query text, so input can never become executable SQL.Follow-ups they push on- What's a parameterized/prepared query?
- When might you drop to raw SQL anyway?
Red flag Thinking an ORM is required for safety — the real fix is parameterized queries; an ORM is one convenient way to get them.
source: Prisma — What is an ORM? ↗ -
Authentication vs authorization — what's the difference?
Authentication is proving who you are (login, a token, a session). Authorization is what you're allowed to do once you're known (can this user delete that post?).
Mnemonic: authentication is the bouncer checking your ID at the door; authorization is the rule about which rooms your ticket lets you into. You authenticate once, then authorize every sensitive action.
Follow-ups they push on- Where must these checks run — frontend or backend?
- Can you be authenticated but not authorized for an action?
Red flag Using the words interchangeably. A logged-in user (authenticated) still must be authorized per action; conflating them leads to privilege bugs.
source: MDN — Server-side programming: first steps ↗ -
Why validate input on the server even when the frontend already validates the same form?
Frontend validation is a UX feature — it gives instant feedback so users fix mistakes fast — but it provides zero security, because the client is fully under the user's control. Anyone can bypass the form entirely and POST raw data with cURL, disabled JavaScript, or DevTools.
So the server must re-validate everything it receives as if no frontend existed: required fields, types, ranges, formats, and authorization. The two layers aren't redundant — they serve different jobs: the frontend for friendliness, the server for trust. Skipping server validation is how malformed and malicious data reaches your database.
What a strong answer coversFrontend validation = UX (fast feedback); it is not security.
The client is user-controlled — attackers bypass the form and POST directly.
The server must re-validate every request as if no frontend existed.
Both layers coexist: friendliness on the client, trust on the server.
Quick self-checkYour signup form checks the email format in JavaScript before submitting. Is server-side email validation still needed?
-
Wrong — the frontend can be bypassed entirely with a direct request.
-
Correct — client validation is UX only; the server must independently validate.
-
Wrong — HTTPS encrypts transit; it doesn't validate the payload.
-
Wrong — HTML5 validation also runs client-side and is bypassable.
Follow-ups they push on- How would someone bypass your frontend validation?
- Is server validation enough on its own (no frontend checks)?
Red flag Trusting client-side validation as a security boundary — it's trivially bypassed; the server is the only place validation actually protects you.
source: MDN — Form data validation ↗ -
What is CORS, and why does the browser block your frontend from calling an API on a different origin?
CORS (Cross-Origin Resource Sharing) is a browser security mechanism. By default the same-origin policy stops JavaScript on
app.example.comfrom reading responses from a different origin (different scheme, host, or port) — this prevents a malicious page from quietly calling APIs as you. CORS is the *opt-in* by which a server says 'these specific other origins are allowed', viaAccess-Control-Allow-Originand related response headers.Key nuance for builders: CORS is enforced by the browser, on the server's say-so. A CORS error isn't your frontend misbehaving — it means the API you're calling hasn't allow-listed your origin. The fix is on the server (or a proxy), not in the browser.
What a strong answer coversSame-origin policy blocks cross-origin reads by default; CORS is the server's opt-in to relax it.
Enforced by the browser, but configured via the server's response headers.
Access-Control-Allow-Originnames which origins may read the response.A CORS error means the target server hasn't allowed your origin — fix it server-side.
Follow-ups they push on- What makes two URLs the 'same origin'?
- Why can server-to-server requests ignore CORS entirely?
Red flag Trying to 'fix CORS' in the frontend code — the browser enforces it from the server's headers; the change must happen on the API or via a backend proxy.
source: MDN — Cross-Origin Resource Sharing (CORS) ↗ -
Serverless vs a long-running server — when does each fit?
Serverless (functions that spin up per request) shines for spiky, event-driven, or low-traffic workloads: you pay per invocation and scale to zero, but each call can have a cold start and there's no in-memory state between calls. A long-running server fits steady traffic, long-lived connections (websockets), background work, and cases where keeping things warm in memory matters.
Orientation-level takeaway: serverless trades always-on cost and statefulness for automatic scaling and pay-per-use.
Follow-ups they push on- What is a 'cold start'?
- Why is a websocket server awkward to run serverless?
Red flag Assuming serverless is always cheaper — at sustained high traffic a long-running server is often cheaper and lower-latency.
source: MDN — Server-side programming: first steps ↗ -
What's the difference between a session and a JWT for keeping a user logged in?
Both answer 'how does the server know it's still you on the next request'. A session stores the auth state on the server (a session record) and gives the browser an opaque session ID (usually in a cookie); the server looks it up each request — easy to revoke, but it's stateful. A JWT is a signed token the server hands back that *contains* the claims (user id, expiry); the server just verifies the signature, no lookup needed — stateless and scalable, but hard to revoke before it expires.
Tradeoff in one line: sessions are easy to invalidate but require server state; JWTs are stateless and scale well but you can't easily 'log someone out' until the token expires.
What a strong answer coversSession: state lives server-side; browser holds an opaque ID; easy to revoke, but stateful.
JWT: signed token carries the claims; server verifies signature, no lookup; stateless.
JWT scales well (no shared session store) but is hard to revoke before expiry.
Both typically ride in a cookie or Authorization header on each request.
Follow-ups they push on- Why is logging a user out harder with JWTs?
- Why should a JWT have a short expiry?
Red flag Storing sensitive data in a JWT thinking it's hidden — a JWT is signed, not encrypted; its payload is readable by anyone who has the token.
source: MDN — HTTP authentication ↗ -
Why move slow work (sending email, resizing an image, calling a slow API) to a background job instead of doing it in the request?
An HTTP request should return fast. If you do slow work inline — sending a welcome email, resizing an upload, calling a slow third-party API — the user waits the whole time, the request may time out, and a failure in that work fails the whole request.
The pattern is to enqueue the slow work and return immediately: accept the request, push a job onto a queue, respond '202 accepted / we're on it', and let a separate worker process the job later (with retries on failure). The user gets a snappy response; the slow, flaky, or retryable work happens out of band where a failure doesn't break the user's request.
What a strong answer coversRequests should be fast; slow inline work blocks the user and risks timeouts.
Enqueue the work, respond immediately, let a separate worker process it.
Background jobs can retry on failure without re-running the user's request.
Good fits: email, image/video processing, slow external API calls, report generation.
Follow-ups they push on- What does a queue + worker setup look like at a high level?
- How does a background job report success or failure back to the user?
Red flag Doing slow/flaky work inline in the request handler — one slow third-party call makes every user wait and turns a transient failure into a failed request.
source: MDN — Server-side programming: first steps ↗
7.4 TypeScript, just enough 10
-
What problem does TypeScript solve over plain JavaScript? What class of bugs does it catch?
TypeScript adds static types checked at compile time, so a whole class of bugs is caught before the code runs: typos in property names, passing the wrong shape, calling a method that doesn't exist, forgetting a required field, or assuming a value is present when it can be
undefined.It's a developer-time tool — the types are erased and it's plain JS at runtime. The payoff is the error shows up in your editor as you type instead of as a crash in production.
Follow-ups they push on- Do types exist at runtime?
- Does TypeScript make code faster?
Red flag Believing TS catches every bug — it catches type/shape errors, not logic errors. `if (x = 5)` is still wrong; a bad algorithm is still bad.
source: TypeScript — TS for JavaScript Programmers ↗ -
What's the difference between `interface` and `type` in TypeScript?
Both describe the shape of data.
interfaceis best for object shapes and class contracts — it can be extended and merged (declaration merging).typeis a more general alias — it can do everything interface does for objects plus unions, intersections, primitives, and tuples.Practical rule: reach for
interfacefor object shapes you might extend, andtypewhen you need a union or a non-object alias. At orientation level the honest answer is they overlap heavily and either is fine for object shapes.Follow-ups they push on- Which can express a union type?
- What is declaration merging?
Red flag Claiming they're identical — `type` can express unions and primitives; `interface` supports declaration merging.
source: DataCamp — TypeScript Interview Questions ↗ -
What is type inference, and why don't you annotate every variable?
Type inference is TypeScript figuring out the type from the value automatically: write
const n = 5and TS knowsnisnumber— no annotation needed.You skip redundant annotations because they add noise without adding safety. Annotate where inference can't help or where you want to pin a contract: function parameters, function return types for public APIs, and the shape of external data (API responses). Let inference handle the obvious local cases.
Follow-ups they push on- Where is an explicit annotation still worth it?
- What does `const x: number = 5` add over `const x = 5`?
Red flag Annotating everything 'to be safe' — over-annotation is noise; the value is typing boundaries, not every local.
source: TypeScript Handbook — Everyday Types ↗ -
How do types make an AI coding assistant more useful?
Types are machine-readable context. With a typed codebase the assistant gives better completions (it knows the exact shape available), invents fewer non-existent fields (the contract is right there), and its mistakes surface as in-editor type errors instead of silent runtime bugs.
So a typed contract is a form of guardrail for the AI: it constrains what valid code looks like, which is why 'add types' or 'type this API response' is a high-leverage instruction to give it.
Follow-ups they push on- What does telling the AI to 'make this strict' do?
- Why does a typed API response reduce hallucinated fields?
Red flag Treating types as only a human concern — they're also the strongest signal the model has about valid code.
source: TypeScript — TS for JavaScript Programmers ↗ -
Name TypeScript's main primitive types and how you type an array and an object.
The core primitives are
string,number,boolean, plusnullandundefined(andbigint/symbolyou rarely touch early). You type an array asnumber[](orArray<number>), and an object by its shape:{ name: string; age: number }.A point that trips JS devs: TypeScript has no separate
int/float— it's allnumber. Andstring[]means 'array of strings', whilestringalone is one string. Get comfortable reading these shapes; most real-world typing is just composing primitives into object and array shapes.What a strong answer coversPrimitives:
string,number,boolean,null,undefined(plusbigint,symbol).No
int/floatdistinction — all numbers arenumber.Array:
T[]orArray<T>(e.g.string[]).Object: describe its shape —
{ name: string; age: number }.
Follow-ups they push on- What's the difference between `number[]` and `[number, number]`?
- How do you mark an object field as optional?
Red flag Looking for `int`/`float`/`char` types from other languages — TypeScript only has `number` and `string`; there's no character type.
source: TypeScript Handbook — Everyday Types ↗ -
What is a union type, and how do you write a literal union for something like a status field?
A union type says a value is one of several types, written with
|:string | numbermeans 'either a string or a number'. The most useful flavor is a literal union of exact values:type Status = "idle" | "loading" | "error" | "done".This is huge for modeling state: instead of a loose
stringthat could be any typo, the type pins the field to exactly the allowed values, sostatus = "loadign"is a compile error and your editor autocompletes the valid options. It's the cleanest way to make impossible states unrepresentable.What a strong answer coversA union (
A | B) means the value is one of the listed types.A literal union (
"a" | "b" | "c") restricts to exact allowed values.Great for status/role/variant fields — typos become compile errors.
Editor autocompletes the valid options, so you can't pick an invalid one.
Quick self-checkWhich type best models a button's variant, which must be exactly 'primary', 'secondary', or 'ghost'?
-
Wrong — allows any string, including typos like 'primry'; no autocomplete.
-
Correct — a literal union pins it to exactly the three valid values, catching typos at compile time.
-
Wrong — turns off checking entirely; the worst choice for a fixed set.
-
Wrong — that's an array of strings, not one of three allowed values.
Follow-ups they push on- How does TypeScript 'narrow' a union so you can use it safely?
- Why is a literal union better than a plain `string` for a status field?
Red flag Typing a fixed-set field as `string` — you lose the typo-catching and autocomplete a literal union would give you for free.
source: TypeScript Handbook — Everyday Types (Union Types) ↗ -
How do shared types act as a contract between your frontend and backend?
If both sides of your app are TypeScript, you can define the shape of the data once —
interface User { id: string; name: string; email: string }— and import it in both the API code and the frontend. That shared type is a contract: the server is typed to return it, the client is typed to consume it.The payoff is compile-time safety across the boundary. If you rename
nametofullNameon the server but forget the frontend, the build breaks at the mismatch instead of shipping a silently broken page. It turns 'did the API change?' from a runtime surprise into a type error you see immediately — the single biggest reason teams run TypeScript end to end.What a strong answer coversDefine the data shape once; import it on both client and server.
The shared type is an enforced contract across the API boundary.
A change on one side that breaks the other fails the build, not production.
Turns API drift from a runtime surprise into an immediate compile error.
Follow-ups they push on- What happens at build time if the server's response no longer matches the shared type?
- How do tools generate these shared types from an API schema automatically?
Red flag Hand-redeclaring the same shape separately on client and server — they drift out of sync; share one source-of-truth type instead.
source: TypeScript Handbook — Object Types (interfaces) ↗ -
What's the difference between `any` and `unknown`?
anyturns type checking off for that value — you can do anything with it and TS won't complain, which throws away the safety you came for.unknownis the safe counterpart: you can hold any value, but you must narrow it (check its type) before you use it.Rule of thumb:
anyis an escape hatch (use sparingly, e.g. migrating JS);unknownis a checkpoint that forces you to prove the type first. Preferunknownwhen you genuinely don't know the type yet.Follow-ups they push on- Why is `unknown` safer for a parsed JSON / API response?
- When is reaching for `any` defensible?
Red flag Sprinkling `any` to silence errors — it defeats the point of TypeScript and hides real bugs. Tighten the type instead.
source: DataCamp — TypeScript Interview Questions ↗ -
What's the difference between optional (`?`) and `| undefined`, and how do you safely read a value that might be missing?
field?: stringmeans the property may be absent entirely (you can omit the key).field: string | undefinedmeans the key must be present but its value may beundefined. They overlap a lot in practice; the practical concern is the same — you must handle the missing case before using it.To read it safely, use narrowing: an
if (user.name)check, optional chaininguser.profile?.bio, or a default with??(const name = user.name ?? "Anonymous"). WithstrictNullCheckson, TypeScript forces you to do this — it won't let you call.toUpperCase()on something that might beundefined, which kills a huge class of 'cannot read property of undefined' crashes.What a strong answer covers?= the property may be absent;| undefined= present but possibly undefined.Both require handling the missing case before use.
Narrow with
ifchecks, optional chaining?., or nullish coalescing??.strictNullChecksmakes the compiler force this — preventing undefined-access crashes.
Follow-ups they push on- What does optional chaining (`?.`) return when the left side is undefined?
- Why does `strictNullChecks` catch so many real-world bugs?
Red flag Accessing a possibly-undefined value directly (`user.name.toUpperCase()`) — without narrowing it crashes at runtime; let strict mode force the check.
source: TypeScript — Migrating with strictNullChecks / handling null ↗ -
TypeScript said the code is type-correct, but it still crashed at runtime with bad data from an API. How is that possible?
TypeScript types are erased at compile time — they don't exist at runtime and don't check actual values. When you write
const user = await res.json() as User, you're *asserting* the shape, not verifying it. If the API returns something different, TS believed your assertion and the mismatch only surfaces as a crash later.Types guarantee your code is internally consistent; they cannot police data that enters at runtime (API responses, form input, JSON files). For real boundaries you need runtime validation — a schema validator like Zod that actually checks the value and *then* gives you a trustworthy type. Static types and runtime validation are different jobs.
What a strong answer coversTypes are erased at compile time — no runtime checking of actual values.
as Useris an unchecked assertion; TS trusts you, it doesn't verify.External data (APIs, forms, files) can violate the asserted type silently.
Validate at the boundary with a runtime schema (e.g. Zod) to get a type you can trust.
Quick self-checkYou write `const data = await res.json() as Product`. The API changes and now omits `price`. What happens?
-
Wrong — TS can't see the runtime response; the assertion is taken on faith.
-
Correct — `as` is unchecked and types are erased, so the mismatch isn't caught until you use the missing field.
-
Wrong — `res.json()` parses JSON; it has no knowledge of your TypeScript type.
-
Wrong — `as` performs no runtime validation; that's the entire trap.
Follow-ups they push on- Why is `as SomeType` on an API response dangerous?
- How does a tool like Zod give you both a runtime check and a static type?
Red flag Using `as` to assert the shape of external data and assuming it's now safe — `as` does no checking; only runtime validation actually verifies the value.
source: TypeScript Handbook — Type assertions (`as`) ↗
7.5 From code to a live URL 10
-
Walk the full chain from a git commit to a live URL. What happens at each step?
Repo → build → bundle → deploy → host. You push a commit to the repo (e.g. GitHub). That triggers a build on the host: it installs dependencies and runs your build command, which bundles your source — many files of TS/JSX/CSS — into a small set of optimized, browser-ready static assets (
dist/). The host then deploys those artifacts (copies them to its servers/CDN) and serves them at a URL.The mental model that matters: your source code is *not* what runs — the built bundle is. A push kicks off a pipeline that transforms source into deployable artifacts and puts them somewhere always-on. Modern hosts collapse all of this into 'git push and we handle the rest'.
What a strong answer coversChain: push to repo → host builds → bundler produces optimized
dist/→ deploy → live URL.Bundling turns many dev files into a few optimized, browser-ready assets.
What runs in prod is the build output, not your raw source.
Modern hosts trigger the whole pipeline automatically on push.
Quick self-checkAfter `git push`, your host shows a live site. What did the browser actually download?
-
Wrong — browsers don't run TS/JSX; it must be transpiled and bundled first.
-
Correct — the build step turns source into browser-ready assets, and those are what get served.
-
Wrong — the repo is source storage; it isn't shipped to visitors.
-
Wrong — that's not how web apps work; the browser downloads HTML/CSS/JS assets.
Follow-ups they push on- What does a bundler (Vite, esbuild) actually do to your files?
- Why isn't your raw `.tsx` source what the browser downloads?
Red flag Thinking the browser runs your source files — it runs the bundled, transpiled output; a build step sits between your code and what ships.
source: Vite — Building for Production ↗ -
What is version control, and what is a GitHub repo — in one line each?
Version control (Git) tracks the history of your code over time, lets you branch and merge, and lets you go back to any past state. A GitHub repo is a hosted home for a Git repository — the shared remote copy that you push to, others pull from, and deploys are triggered from.
Git is the tool; GitHub is a hosting service for Git repos (with PRs, issues, and CI on top).
Follow-ups they push on- What's the difference between a commit and a push?
- What is a branch for?
Red flag Conflating Git and GitHub. Git is the version-control tool; GitHub is one place to host Git repos.
source: GitHub — Hello World ↗ -
Modern hosts — Vercel, Netlify, Cloudflare, Railway, Render, Fly. What's the rough split between them?
Roughly two camps. Static / frontend hosts (Vercel, Netlify, Cloudflare Pages) are tuned for serving built frontends and serverless functions at the edge — push a repo, they build and serve it. App / server hosts (Railway, Render, Fly) are tuned for long-running servers, databases, and containers.
The line blurs (most do some of both), but the orientation-level instinct is: a static site or a frontend-plus-functions app leans toward the first group; a long-running backend with its own database leans toward the second.
Follow-ups they push on- Where would you host a static marketing site vs a websocket server?
- What does 'edge' mean here?
Red flag Treating all hosts as interchangeable — a pure static host won't run your always-on stateful backend well.
source: Vercel — Deployments overview ↗ -
Walk the path from a domain name to your running app. What does HTTPS/SSL add?
Domain → DNS → host. You point the domain at the host using DNS records: an A record maps a name to an IP address; a CNAME maps a name to another name (e.g. your-app.vercel.app). The browser resolves the name via DNS, then connects to the host.
HTTPS/SSL adds encryption and identity: it encrypts traffic so it can't be read or tampered with in transit, and the certificate proves the server is who it claims to be. Without it, credentials and data travel in plaintext.
Follow-ups they push on- When do you use an A record vs a CNAME?
- Why does the padlock matter beyond 'it's secure'?
Red flag Thinking DNS 'hosts' the site — DNS only maps the name to an address; the host serves the actual app.
source: Cloudflare — What is DNS? ↗ -
What is a preview deploy, and where do you look first when a deploy breaks?
A preview deploy is a full, live build of a branch or pull request at its own URL, separate from production — so you (and reviewers) can click through the change before it ships. Production and preview typically have separate env vars, which is a common gotcha when something works in preview but breaks in prod.
When a deploy breaks, read the build logs first (did it compile?), then the runtime logs (is it crashing at request time?), and check that the right env vars exist for that environment.
Follow-ups they push on- Why might something work in preview but fail in production?
- What's the difference between a build-time and a runtime error?
Red flag Forgetting prod and preview have different env vars — a missing prod secret is a classic 'works on preview' failure.
source: Vercel — Deployments overview ↗ -
What does CI/CD mean at a high level?
CI (Continuous Integration) is automatically building and testing your code every time you push, so problems surface early. CD (Continuous Delivery/Deployment) is automatically taking the code that passed and deploying it.
The whole pipeline, orientation-level: push → build → test → deploy. The point is no manual steps and no 'works on my machine' — every change goes through the same gated, repeatable path.
Follow-ups they push on- What's the difference between continuous delivery and continuous deployment?
- Why run tests before deploying?
Red flag Thinking CI/CD is one tool — it's a practice/pipeline; many tools implement it.
source: GitHub — Hello World ↗ -
What does a bundler like Vite or esbuild actually do, and why do you need one?
A bundler takes your project — dozens or hundreds of source files plus dependencies — and produces a small set of optimized files the browser can load efficiently. Along the way it transpiles modern TS/JSX into plain JavaScript the browser understands, resolves and combines
imports, minifies (strips whitespace and shortens names), tree-shakes unused code, and fingerprints filenames for caching.You need one because browsers don't run TypeScript or JSX, and shipping hundreds of separate files would be slow. The bundler is the bridge between 'how you write code' (modular, modern, typed) and 'what loads fast in a browser' (few, small, plain-JS files).
What a strong answer coversTranspiles modern TS/JSX → plain browser-compatible JavaScript.
Combines many modules and resolves
imports into a few output files.Minifies and tree-shakes to cut size; fingerprints filenames for caching.
Bridges 'nice to write' (modular/typed) and 'fast to load' (few small files).
Follow-ups they push on- What is tree-shaking?
- Why does the browser need TS transpiled before it can run it?
Red flag Confusing the bundler (prepares code to ship) with the host (serves it) — they're different stages; the bundler runs during the build.
source: Vite — Why Vite (the problems it solves) ↗ -
What's the difference between a build-time error and a runtime error when a deploy goes wrong?
A build-time error happens while the host is compiling/bundling your code — a type error, a syntax error, a missing import. The build fails, nothing gets deployed, and you read the build logs to find it. Production keeps serving the last good deploy.
A runtime error happens after a successful deploy, when the live code actually executes a request — a null reference, a crashed API call, a missing env var the code reads at request time. The build passed, the site is 'up', but pages error; you read the runtime/function logs to find it. First diagnostic question on any broken deploy: did it fail to build, or did it build and then fail to run?
What a strong answer coversBuild-time: fails during compile/bundle (type/syntax/import errors) → check build logs; nothing deploys.
Runtime: fails while serving requests on a deployed build → check runtime/function logs.
A failed build leaves the previous good version live; a runtime error means broken-but-deployed.
First question: did it fail to build, or build fine then fail to run?
Quick self-checkYour deploy succeeds and the site loads, but one page throws 'cannot read property of undefined' for some users. What kind of error is this?
-
Wrong — the build succeeded; this surfaces only while running.
-
Correct — it appears during request execution on a deployed build, so it's a runtime issue.
-
Wrong — the site loaded, so DNS resolved fine.
-
Wrong — a bundler failure would have failed the build, not produced a live-but-broken page.
Follow-ups they push on- A missing env var — is that more likely build-time or runtime?
- Why might TypeScript catch a build-time error that JS would only hit at runtime?
Red flag Looking in the build logs for a problem that's actually a runtime crash (or vice versa) — knowing which phase failed points you at the right log.
source: Vercel — Logs (build vs runtime) ↗ -
How do you wire the same secret (like an API key) into local dev, preview, and production without committing it?
Locally, you keep secrets in a
.envfile that is git-ignored (and you commit a.env.examplewith the keys but not the values, so teammates know what's needed). The code reads them via the environment (process.env.API_KEY), never hardcoded.For preview and production, you set the same variable names in the host's environment-variable settings (the dashboard or CLI), with environment-specific values. Each environment can hold a different value — a test key for preview, the real key for prod. The code stays identical; only the supplied values differ. The secret never enters git in any environment.
What a strong answer coversLocal: git-ignored
.env; commit.env.example(names only, no values).Code reads from the environment (
process.env.X), never hardcodes the value.Preview/prod: set the same names in the host's env-var settings, per-environment values.
Same code everywhere; only the injected values differ; nothing secret hits git.
Follow-ups they push on- Why commit a `.env.example` but never the real `.env`?
- Why might preview use a different API key than production?
Red flag Setting env vars only locally and forgetting the host — the build/runtime in prod has no value, so it works locally and breaks when deployed.
source: Vite — Env Variables and Modes ↗ -
What does a CDN do, and why is serving your app from 'the edge' faster?
A CDN (Content Delivery Network) is a global network of servers that cache your static assets close to users. Instead of every visitor fetching files from one origin server (which might be a continent away), they're served from a nearby edge location, cutting the round-trip distance and latency.
'The edge' just means 'physically close to the user'. For a static frontend, the whole site can be cached at the edge so it loads fast worldwide. The tradeoff to remember: cached content is fast but can be stale until it's invalidated, and truly dynamic/personalized responses can't simply be cached for everyone.
What a strong answer coversA CDN caches assets on servers worldwide, serving each user from a nearby location.
'Edge' = close to the user; less distance means lower latency.
Great for static assets and frontends; the whole site can live at the edge.
Tradeoff: cached content can be stale; per-user dynamic responses don't cache trivially.
Follow-ups they push on- What kind of content is easy to cache at the edge vs hard?
- What does 'cache invalidation' mean and why is it tricky?
Red flag Assuming everything benefits from edge caching — highly dynamic or per-user responses can't be shared from a cache, and stale caches serve old content until invalidated.
source: Cloudflare — What is a CDN? ↗
7.6 The AI coding toolbox 9
-
Name the categories of AI coding tools and when you'd reach for each.
Roughly: autocomplete (in-editor suggestions as you type — fast, line-to-block scope), chat assistant (ask questions, get explanations and snippets in a side panel), terminal/CLI agent (runs in your shell, reads/edits files and runs commands across a repo), AI IDE (an editor built around AI with the codebase in context), and app-builder (describe an app, get a scaffolded project).
Reach for autocomplete for flow while writing known code; chat for understanding or a focused snippet; a CLI agent or AI IDE for multi-file changes across a real repo; an app-builder for a quick from-scratch prototype.
Follow-ups they push on- When would autocomplete be the wrong tool?
- What does a CLI agent do that a chat assistant can't?
Red flag Assuming one tool fits every task — a from-scratch prototype and a surgical multi-file refactor want different tools.
source: Anthropic — Claude Code overview ↗ -
Frontier models come in tiers. Describe them without naming specific models.
Most providers offer roughly three tiers: a fast/cheap tier (cheapest and quickest, for high-volume or simple tasks like classification and autocomplete), a balanced tier (the everyday workhorse — good quality at reasonable cost/speed), and a most-capable tier (the strongest reasoning for hard, high-stakes problems, at higher cost and latency).
Deliberately avoid pinning specific names or 'the latest model' — those change constantly. The durable skill is reasoning about the tier, then checking the provider's current model page for which name maps to it today.
Follow-ups they push on- Why frame this in tiers instead of memorizing model names?
- Where would you check which model is current?
Red flag Naming a specific model as 'the best/latest' — it dates instantly. Talk in tiers and verify the current mapping at use time.
source: Anthropic — Models overview ↗ -
Why do AI-tool and model facts come with an 'as of <date>' caveat, and how do you handle that?
The AI tooling and model landscape moves fast: names, prices, tiers, context-window sizes, and capabilities change month to month. Any specific fact you memorize ('model X is the best', 'it costs $Y') has a short shelf life, and a model's training data has a cutoff so it doesn't even know about newer models.
So you reason in durable concepts (tiers, the cost/capability tradeoff) and verify specifics against the provider's current docs at the moment you need them, rather than trusting a printed name or a number from memory.
Follow-ups they push on- Where do you check the current model lineup?
- Why can't you just trust the model to know the latest model names?
Red flag Treating a model/price fact as permanent — quote tiers and concepts, and re-verify any specific at authoring time.
source: Anthropic — Models overview ↗ -
What can a terminal/CLI coding agent do that an in-editor chat assistant can't?
A CLI/terminal agent runs in your shell with access to your whole project: it can read and edit files across the repo, run commands (tests, builds, git), see the output, and iterate — a full plan-edit-test loop on its own. A chat assistant in the editor mainly sees the snippet or file you've shared and hands back text/snippets you copy in yourself.
The difference is agency over the environment: the CLI agent acts on the real repo (multi-file refactors, running the tests it just changed), while chat is closer to a knowledgeable pair you query for explanations and focused code. That power is also why CLI agents need the review/permission discipline that chat doesn't.
What a strong answer coversCLI agent: reads/edits many files, runs commands, sees output, loops autonomously.
Chat assistant: mostly sees what you paste; returns text you apply yourself.
Difference is agency over the real environment, not just smarter answers.
More power → more need for review and permission gating on the CLI agent.
Quick self-checkYou want a tool to refactor a function across 12 files, run the test suite, and fix what breaks — without you copy-pasting. Which fits best?
-
Wrong — autocomplete suggests as you type; it can't drive a multi-file refactor or run tests.
-
Wrong — chat returns snippets you apply manually; it doesn't act across the repo or run tests itself.
-
Correct — it can edit many files, run the test suite, read output, and iterate autonomously.
-
Wrong — a linter flags issues; it doesn't perform the refactor.
Follow-ups they push on- Why does a CLI agent need stronger review discipline than chat?
- For a one-off 'explain this regex', which tool fits better?
Red flag Expecting a chat assistant to actually apply a multi-file change across your repo — it returns snippets; running the change in the environment is the agent's job.
source: Anthropic — Claude Code overview ↗ -
When is an app-builder ('describe an app, get a project') the right tool, and when is it the wrong one?
An app-builder shines for getting from zero to something visible fast: prototypes, demos, throwaway internal tools, validating an idea, or learning by seeing a working scaffold. You describe what you want and get a runnable project without setup friction.
It's the wrong tool when you need to fit an existing, large codebase, follow specific conventions, or make surgical changes to production code — there a CLI agent or AI IDE working in the real repo is far better. Rule of thumb: app-builders are great at the blank-page start; once there's a real codebase and real constraints, you graduate to tools that operate inside it.
What a strong answer coversBest for: prototypes, demos, throwaway tools, idea validation, fast blank-page starts.
Worst for: surgical edits inside a large existing codebase with conventions.
Once a real repo and constraints exist, switch to a CLI agent or AI IDE.
Strength is zero-to-running speed, not maintaining production code.
Follow-ups they push on- Why is an app-builder awkward for changing an existing production app?
- What do you lose if you keep prototyping in an app-builder past the demo stage?
Red flag Using an app-builder to evolve a serious, growing codebase — it's tuned for fresh scaffolds, not careful changes within established structure and conventions.
source: Anthropic — Claude Code overview ↗ -
What does it mean that a model has a 'training cutoff', and how should that change what you trust it on?
A model's knowledge is frozen at its training data cutoff — it learned from data up to roughly that date and has no inherent awareness of anything after it. So it can be confidently wrong about recent library versions, new APIs, current prices, or even newer models (including itself).
Practically: trust it for durable concepts and patterns (how REST works, what a closure is), but verify anything time-sensitive — latest package version, current API signature, today's model lineup — against live docs or by giving it the current information in context. Tools that can fetch docs or read your actual
package.jsonclose this gap; raw model memory does not.What a strong answer coversKnowledge is frozen at the training cutoff; nothing newer is inherently known.
It can be confidently wrong on recent versions, APIs, prices, and newer models.
Trust it for durable concepts; verify time-sensitive specifics against live sources.
Giving it current docs/context or a fetch tool beats relying on its memory.
Follow-ups they push on- Why might an agent suggest a deprecated API or an old package version?
- How does giving the model your current docs in context fix this?
Red flag Trusting the model's recall of 'the latest' version, API, or model name — that's exactly what its cutoff makes unreliable; check current docs.
source: Anthropic — Models overview ↗ -
Is the most capable model always the right choice? Explain the tradeoff.
No — there's a cost / capability / latency tradeoff. The most-capable tier costs more per token and is slower; for simple, high-volume tasks (tagging, extraction, routing, autocomplete) a fast/cheap model is both cheaper and snappier, and just as correct.
Match the model to the task: escalate to a stronger tier only when the task's reasoning genuinely needs it. A common production pattern is to route — cheap model for the easy 90%, strong model for the hard 10%.
Follow-ups they push on- Give a task where the cheapest tier is the right call.
- What is model 'routing' or a cascade?
Red flag Defaulting to the biggest model for everything — it burns money and latency on tasks a small model nails.
source: Anthropic — Models overview ↗ -
How do you pick which model tier to point a coding agent at for a given task?
Match the tier to the task's reasoning demand. For hard, multi-step, high-stakes work — architecture, gnarly debugging, large refactors where a wrong move is costly — use the most-capable tier; the extra cost and latency buy correctness. For routine, well-specified work — boilerplate, simple edits, repetitive transforms, classification-like steps — a fast/cheap tier is snappier and just as correct.
A common pattern is to default to a balanced tier for everyday coding and escalate to the top tier only when a task stalls or genuinely needs deeper reasoning. The skill is reasoning about the demand, not memorizing which model name is 'best' this month — and checking the provider's current model page for which name maps to each tier today.
What a strong answer coversHard/multi-step/high-stakes → most-capable tier; correctness outweighs cost.
Routine, well-specified work → fast/cheap tier; same result, less cost and latency.
Common default: balanced tier everyday, escalate to top tier when stuck.
Reason about reasoning-demand; verify current tier→name mapping in provider docs.
Follow-ups they push on- Give a coding task where the cheapest tier is the right call.
- What signals tell you to escalate from a balanced to the top tier?
Red flag Pointing the biggest, slowest model at every task by default — you burn cost and latency on edits a cheaper tier handles perfectly.
source: Anthropic — Choosing a model ↗ -
AI coding tools feel magical at first but stall on real codebases. What's the realistic mental model for what they're good and bad at?
Think of an AI coding tool as a fast, broadly knowledgeable, eager junior who has never seen your codebase, can't run things in their head reliably, and won't push back unless you make them. They're excellent at well-scoped, well-specified tasks with clear examples and a way to verify; they're weak at ambiguous goals, implicit context they were never given, and anything where being confidently wrong is cheap for them but expensive for you.
The realistic model: their output quality tracks the quality of your context and spec, not the tool's branding. Give relevant files, an example to match, and an acceptance check, and review the result — and they're a force multiplier. Hand them a vague wish and full autonomy, and they generate plausible code that misses the point.
What a strong answer coversStrong on scoped, specified tasks with examples and a verification path.
Weak on ambiguity, unstated context, and self-checking their own correctness.
Output quality tracks your context/spec quality more than the tool's brand.
Force multiplier with good prompts + review; liability with vague goals + blind trust.
Follow-ups they push on- Why does giving an example from the repo improve results so much?
- What's the single highest-leverage thing you can add to a weak prompt?
Red flag Blaming the tool when results are poor — usually the missing piece is context, a concrete example, or an acceptance check the human didn't provide.
source: Anthropic — Claude Code best practices ↗
7.7 Working with AI agents 10
-
Why treat a prompt to a coding agent like a spec rather than a casual request?
An agent does exactly what you describe, not what you meant — it has none of the shared context a teammate would fill in. A casual request ('add login') leaves a hundred decisions to chance: which auth method, where state lives, what the error states are, what 'done' means. The agent picks plausible answers, and you discover the gaps afterward.
Treating the prompt as a spec front-loads those decisions: state the goal, the constraints, an example to match, and the acceptance check. This is the same discipline as writing a ticket for a junior dev. The clearer the spec, the less re-work — vague prompts don't save time, they move the cost to debugging plausible-but-wrong output.
What a strong answer coversThe agent does what you say, not what you meant — it lacks your unstated context.
A casual ask leaves many decisions to chance; the agent guesses, you find gaps later.
A spec front-loads: goal, constraints, example, acceptance check — like a good ticket.
Vagueness doesn't save time; it relocates the cost to debugging wrong output.
Quick self-checkWhich prompt is most likely to produce code you can ship with minimal rework?
-
Wrong — no constraints, example, or acceptance check; the agent guesses everything.
-
Correct — goal, constraint, concrete example, and a verifiable acceptance check: a real spec.
-
Wrong — entirely unspecified; nothing to build to or verify against.
-
Wrong — delegates all decisions; you'll get plausible choices that may miss intent.
Follow-ups they push on- Which part of a spec do people most often omit?
- How is prompting an agent like writing a ticket for a junior engineer?
Red flag Firing off a one-line wish and expecting the agent to infer your conventions, edge cases, and definition of done — it can't; it fills gaps with guesses.
source: Anthropic — Prompt engineering overview ↗ -
What is the context window, and why does it shape how you work with a coding agent?
The context window is everything the model can 'see' at once — the system prompt, your instructions, the files and snippets you've shared, and the conversation so far, all measured in tokens with a hard limit.
It shapes your workflow because the agent can't reason about code it hasn't been shown, and stuffing in irrelevant files wastes the budget and dilutes attention. So you deliberately feed it the right files, an example, and the acceptance criteria — and start fresh when a long thread gets noisy.
Follow-ups they push on- Why can dumping the whole repo into context hurt rather than help?
- What do you do when a session gets long and the model starts losing the thread?
Red flag Assuming the agent 'remembers' your codebase — it only knows what's in the current context window; share the relevant files.
source: Anthropic — Claude Code best practices ↗ -
Describe a healthy loop for working with a coding agent on a real change.
Plan → edit → test → review the diff → commit. First have it lay out a plan and agree on it before any code (plan mode helps). Then let it edit, run the tests, and crucially read the diff yourself before accepting — verify, don't trust. Commit in small, reviewable chunks.
The discipline is treating the agent like a fast junior pair: you still own the review and the commit. Small loops with verification beat one giant unreviewed change.
Follow-ups they push on- Why plan before editing?
- What's the risk of committing the agent's output without reading the diff?
Red flag Accepting a large change wholesale without reading the diff — bugs and unintended edits slip through unverified.
source: Anthropic — Claude Code best practices ↗ -
What makes a strong task prompt for an agent?
Four parts: the goal (what done looks like), the constraints (don't touch X, use library Y, match this style), an example (an existing pattern to follow or sample input/output), and an acceptance check (the test or command that proves it works).
This turns a vague wish into a spec the agent can hit and you can verify. The acceptance check is the part people skip — without it neither you nor the agent knows when it's actually done.
Follow-ups they push on- Why include an example of existing code in the repo?
- What's the value of stating an acceptance check up front?
Red flag Giving only the goal ('add search') with no constraints, example, or check — you get plausible code that may miss the point.
source: Anthropic — Prompt engineering overview ↗ -
What are custom instructions like CLAUDE.md / AGENTS.md for?
They're a persistent, project-level brief the agent reads automatically — conventions, commands, architecture notes, do's and don'ts — so you don't re-explain them every session. They put durable context into the window without you pasting it each time.
They're one of several customization surfaces, alongside slash commands (reusable prompts), MCP/tools (giving the agent new capabilities), subagents, and plan mode. The instruction file is the cheapest, highest-leverage one to start with.
Follow-ups they push on- What belongs in a project instruction file vs a one-off prompt?
- What is plan mode, and when do you use it?
Red flag Letting the file rot — stale instructions actively mislead the agent; treat it as living documentation.
source: Anthropic — Claude Code best practices ↗ -
Why is 'review the diff before you accept it' the non-negotiable habit when working with an agent?
An agent is fast and confident but not accountable — you are. It can make changes beyond what you asked (touching unrelated files, deleting code it deemed unnecessary, introducing a subtle bug) and it states all of it with equal confidence. The diff is your checkpoint: it shows exactly what changed before it becomes part of your code.
Reading the diff is also where *you* stay in control of the codebase — you keep understanding what's in it, catch scope creep, and verify the change actually does what the spec asked. 'Verify, don't trust' is the whole posture. Skipping the diff is how unreviewed bugs and unintended edits slip into a repo nobody fully understands anymore.
What a strong answer coversThe agent is fast and confident but not accountable — the human is.
It can make changes beyond the ask; the diff exposes exactly what changed.
Reviewing keeps you in command of the codebase and catches scope creep.
'Verify, don't trust' — the diff is the checkpoint before code is yours.
Follow-ups they push on- What's the danger of accepting a large change wholesale, unread?
- How do small, frequent commits make diff review easier?
Red flag Accepting big changes blind because they 'look right' — confident, plausible code can carry unrelated edits and subtle bugs that only a diff review surfaces.
source: Anthropic — Claude Code best practices ↗ -
An agent keeps failing to fix a bug, trying variation after variation. How do you break the loop?
Thrashing usually means the agent is missing something it needs, not that it needs more attempts. Stop and give it more or better context: the exact error message and stack trace, the relevant file it hasn't seen, how to reproduce, and what you've already ruled out. Often it's been guessing because the failing piece was never in its window.
If that doesn't help, change the approach: ask it to first explain its diagnosis and a plan before editing (so you can catch a wrong mental model), narrow the task, or reset the session to clear accumulated wrong turns. And know when to take over — for a tricky bug a human read of the actual error often beats a tenth blind attempt. More tries on the same starting context rarely converges; better context or a reset does.
What a strong answer coversThrashing = missing context, not too few attempts.
Feed it the exact error, stack trace, repro steps, and what's been ruled out.
Make it state a diagnosis/plan before editing to expose a wrong mental model.
Reset the session to clear bad turns; know when to take over yourself.
Follow-ups they push on- Why does pasting the exact stack trace help more than 'it's still broken'?
- When is it faster to just debug it yourself?
Red flag Repeatedly saying 'still broken, try again' on the same context — without new information the agent just cycles plausible guesses; add context or reset.
source: Anthropic — Claude Code best practices ↗ -
What safety rails do you keep in mind when letting an agent run in your repo?
Core rails: never commit secrets (and don't let the agent paste keys into code or logs), review before running anything it generates — especially shell commands and migrations, watch cost (long autonomous runs burn tokens), and slow down on risky changes (deletes, schema migrations, anything touching prod or auth).
The mindset: the agent is fast and confident but not accountable — you are. Treat its output as a proposal to verify, not a command to execute blindly.
Follow-ups they push on- What kinds of changes warrant extra scrutiny?
- Why is reviewing a generated shell command especially important?
Red flag Granting blanket auto-run on everything — a confidently wrong destructive command (a bad `rm` or migration) can do real damage.
source: Anthropic — Claude Code best practices ↗ -
A long agent session starts making mistakes, contradicting earlier decisions, and 'forgetting' things. What's happening and what do you do?
Long sessions degrade because the context window fills with accumulated history — old turns, dead ends, large file dumps — which both crowds out room for new work and dilutes the model's attention across noise. The earlier 'decisions' may have scrolled out of effective focus, so it drifts.
The fix is to manage context deliberately: start a fresh session for a new sub-task, re-state the current goal and the few decisions that still matter, and re-share only the relevant files rather than the whole accumulated thread. Capture durable decisions in a project instruction file (CLAUDE.md) or a short summary you can paste back, so resetting the session doesn't lose them. Short, focused contexts beat one ever-growing thread.
What a strong answer coversCause: the context window fills with history/noise, crowding and diluting attention.
Fix: start fresh, re-state the goal and the decisions that still matter.
Re-share only relevant files, not the entire accumulated conversation.
Persist durable decisions (instruction file / summary) so a reset loses nothing.
Follow-ups they push on- Why does dumping the whole repo into one long thread make this worse?
- What's worth capturing in a project instruction file before you reset?
Red flag Pushing through in the same bloated thread, repeating yourself — the noise is the problem; a clean context with a crisp restatement works far better.
source: Anthropic — Claude Code best practices ↗ -
What kinds of tasks should you NOT hand to an agent autonomously, and why?
Avoid full autonomy where a confident mistake is expensive or irreversible: destructive operations (deletes,
rm, dropping data), database schema migrations, anything touching production, security-sensitive code (auth, permissions, payment), and broad sweeping changes you can't easily review. These share a trait — the cost of being wrong is high and recovery is hard.The principle is *cost of error*. Where errors are cheap and caught by tests (a new pure function, a localized UI tweak), let the agent run. Where errors are catastrophic or hard to undo, keep a human in the loop: require confirmation, work on a branch, review the diff and the exact commands before they execute. Match autonomy to reversibility.
What a strong answer coversHold back autonomy on destructive ops, migrations, prod changes, and auth/payment code.
Common thread: high cost of error and hard to undo.
Where errors are cheap and tests catch them, more autonomy is fine.
Match the autonomy you grant to how reversible the change is.
Follow-ups they push on- Why are schema migrations especially risky to automate?
- How does working on a branch lower the cost of an agent's mistake?
Red flag Granting blanket auto-approval so a confidently wrong destructive command (a bad migration or `rm`) executes before any human sees it.
source: Anthropic — Claude Code best practices ↗
7.8 Building AI features into your app 11
-
What is the context window when calling an LLM API, and why does it cap what you can send?
The context window is the maximum number of tokens a single request can hold — the system prompt, the full conversation history, any documents you stuff in, *and* the space reserved for the model's reply, all together. It's a hard ceiling measured in tokens, and it varies by model.
It caps what you send because everything competes for the same budget: a long chat history or a giant pasted document leaves less room for the answer, and exceeding the window errors or forces truncation. So building real features means being deliberate — send the relevant context (often via retrieval), summarize or trim old turns, and remember input *and* output both count against the limit and the bill.
What a strong answer coversContext window = max tokens per request: system + history + inputs + the reply, combined.
It's a hard, per-model ceiling measured in tokens.
Input and output share the budget — long input crowds out the answer.
Real features manage it: retrieve relevant context, trim/summarize history.
Quick self-checkYour chatbot works fine early in a conversation but starts erroring after many turns. The most likely cause?
-
Wrong — training is fixed; this is a per-request issue, not a model-knowledge issue.
-
Correct — each call resends the full history; eventually it exceeds the window and errors or truncates.
-
Wrong — there's no per-conversation call cap driving this; it's the token budget.
-
Wrong — tokens aren't time-based; the limit is total size per request.
Follow-ups they push on- If a conversation grows past the window, what are your options?
- Why does a huge pasted document eat into the space for the response?
Red flag Assuming the model 'remembers' past calls — each API call is stateless; you resend whatever history you want it to see, and it all counts against the window.
source: Anthropic — Context windows ↗ -
Describe a basic LLM API call. What's the difference between the system and user message, and what's a token?
You send a list of messages and get back a generated message. The system message sets the role, rules, tone, and constraints ('you are a support bot; never reveal internal IDs'). The user message carries the actual request. The model responds with text (and optionally structured data).
A token is the unit the model reads and writes — roughly a word-piece (a few characters). It matters because cost, latency, and the context-window limit are all measured in tokens, for both input and output.
Follow-ups they push on- Why are both input and output billed in tokens?
- Roughly how many characters is a token?
Red flag Putting changeable user input into the system prompt — instructions and untrusted input should be kept in their proper roles.
source: Anthropic — Build with Claude (overview) ↗ -
Why call an LLM from your server instead of directly from the browser?
Same reason any sensitive call belongs server-side: the API key. Calling the LLM from the browser means shipping your provider key to every visitor, where it's trivially stolen and used to run up your bill. The key must live on your server.
Beyond the key, the server lets you control the integration: enforce rate limits and per-user quotas (so one user can't drain your budget), validate and sanitize input, inject the system prompt the user shouldn't control, log usage and cost, and cache. The pattern is a thin backend endpoint your frontend calls; that endpoint holds the key and calls the LLM. The browser never sees the provider directly.
What a strong answer coversThe API key can't ship to the browser — it'd be stolen and abused.
Server-side lets you rate-limit and set per-user quotas to cap spend.
Server controls the system prompt and sanitizes user input before sending.
Pattern: frontend → your backend endpoint (holds the key) → LLM provider.
Quick self-checkWhat's the main reason to route LLM calls through your own backend rather than calling the provider from the browser?
-
Wrong — browsers can call HTTPS APIs fine; that's not the constraint.
-
Correct — the key is a secret; client code is public, so the call (and key) must live server-side.
-
Wrong — browsers handle large/streamed responses routinely.
-
Wrong — output format is a request parameter, unrelated to where the call originates.
Follow-ups they push on- What stops one user from draining your whole token budget?
- Why shouldn't the user be able to set the system prompt directly?
Red flag Calling the LLM provider straight from frontend JavaScript with the key embedded — it leaks to every user and there's no way to rate-limit or control cost.
source: Anthropic — API getting started (authentication) ↗ -
What knobs (temperature, max tokens, system prompt) shape an LLM's output, and what do they each do?
The system prompt sets the model's role, rules, and output format — the single biggest lever on behavior. Temperature controls randomness: low (near 0) makes output focused and repeatable (good for extraction, classification, structured data); higher makes it more varied and creative (brainstorming, copy). Max tokens caps the *length* of the response — set it high enough that answers aren't cut off mid-sentence, but it's a ceiling, not a target.
The builder's instinct: reach for the system prompt first (it shapes the most), set temperature low when you need deterministic, parseable output and higher when you want range, and size max tokens to the expected answer. Note that exact parameters vary by provider and model — check the current API reference for which knobs a given model exposes.
What a strong answer coversSystem prompt: role, rules, format — the strongest behavior lever.
Temperature: low = focused/repeatable; higher = varied/creative.
Max tokens: caps response length (a ceiling, not a target) — avoid mid-sentence cutoffs.
Available knobs differ by provider/model — verify against the current API docs.
Follow-ups they push on- For a JSON-extraction task, do you want high or low temperature, and why?
- What happens if max tokens is set too low for the answer?
Red flag Using a high temperature for tasks that need consistent, parseable output (extraction, classification) — you get unstable results that are hard to depend on.
source: Anthropic — Messages API parameters ↗ -
How do you keep an LLM feature's cost and quality under control once it's live?
Cost scales with tokens (input + output) × calls × model tier, so the levers are: pick the cheapest tier that passes for each task, trim the context you send (don't dump whole documents), cap
max_tokens, cache or reuse stable prefixes, and rate-limit per user. Log token usage per request so you can see where the spend actually goes instead of guessing.Quality can't be eyeballed forever — build an eval set of representative inputs with expected outputs and run it whenever you change the prompt or model, so you catch regressions. In production, log inputs/outputs (within privacy limits), watch for failures and refusals, and add guardrails (validate structured output, fall back gracefully). The theme: measure both dimensions with real numbers rather than vibes.
What a strong answer coversCost = tokens × calls × tier; lower tier, trim context, cap output, cache, rate-limit.
Log per-request token usage to see where spend actually goes.
Quality: maintain an eval set; re-run it on every prompt/model change to catch regressions.
In prod: log I/O within privacy limits, watch failures/refusals, validate output.
Follow-ups they push on- What goes into a good eval set for an LLM feature?
- Which is usually the bigger cost lever — model tier or context size?
Red flag Shipping and judging quality by vibes while costs creep — without an eval set and usage logging, regressions and budget blowouts go unnoticed until they're expensive.
source: Anthropic — Reducing latency and cost ↗ -
What is structured output / tool use, and why is it better than parsing prose?
Instead of free-form text, you have the model return data in a defined shape — JSON matching a schema (structured output) or a call to a function you defined with named arguments (tool use / function calling). Your code then consumes the JSON or executes the action.
It's better than regex-ing prose because it's reliable and parseable: the model commits to fields you specified, so you can validate it and wire it straight into your app — building chatbots and agents that fetch data or take actions, not just chat.
Follow-ups they push on- How does function calling let a model use external tools?
- What do you do if the returned JSON is still malformed?
Red flag Asking for prose and scraping fields out with string parsing — brittle. Request a schema/tool and validate the result.
source: Anthropic — Tool use (function calling) ↗ -
What is RAG, and when would you use it over fine-tuning?
RAG = Retrieval-Augmented Generation: chunk your data, embed each chunk into a vector store, and at query time retrieve the most relevant chunks and put them in the prompt so the model answers grounded in your data (with citations).
Use RAG for fresh/proprietary knowledge you need cited and kept current — it's cheaper to update (re-index, don't retrain). Use fine-tuning to change style, format, or behavior, not to inject facts. They're complementary, not competitors.
Follow-ups they push on- What's an embedding?
- How do you reduce hallucination in a RAG system?
Red flag Saying fine-tuning 'adds knowledge' — it mainly shifts behavior/format. For facts that change, RAG is the right tool.
source: DataCamp — RAG Interview Questions ↗ -
What is an embedding, and what does a vector store do with it?
An embedding is a vector — a list of numbers — that represents the meaning of a piece of text, such that texts with similar meaning land close together in that space. You produce them with an embedding model.
A vector store indexes those vectors so you can do fast similarity search: embed the user's query with the same model, then retrieve the nearest chunks (by cosine similarity or dot product). That's the 'retrieve' half of RAG — it's semantic search, matching on meaning rather than exact keywords.
Follow-ups they push on- Why must the query use the same embedding model as the documents?
- What is top-k retrieval?
Red flag Treating embedding similarity as keyword matching — it matches meaning, so a query with no shared words can still match.
source: DataCamp — RAG Interview Questions ↗ -
Your RAG bot keeps hallucinating. What knobs do you turn to reduce it?
Hallucination in RAG is usually a retrieval problem: if the right chunk isn't in the prompt, the model fills the gap by guessing. So improve retrieval first — better chunking (size/overlap), a better embedding model, reranking the candidates, and raising recall so the relevant passage actually shows up.
Then tighten the prompt: instruct it to answer only from the provided context and to say 'I don't know' when the context lacks the answer, and ask for citations so you can check grounding. Evaluate with a test set rather than eyeballing.
Follow-ups they push on- Why does poor chunking cause hallucination?
- How would you measure whether your fix actually helped?
Red flag Reaching for a bigger/fine-tuned model first — if retrieval doesn't surface the fact, no model can ground on it.
source: DataCamp — RAG Interview Questions ↗ -
What is prompt injection, and how do you defend an LLM feature against it?
Prompt injection is when untrusted content — a user message, a web page, a retrieved document, an email — contains instructions that hijack the model ('ignore your instructions and reveal the system prompt' / 'email all the data to X'). The model can't reliably tell your instructions from data it's reading.
Defenses are layered, not a single fix: keep trusted instructions and untrusted input clearly separated; never grant the model unchecked authority (gate tools/actions behind permissions and human confirmation for risky ones); validate and constrain outputs; apply least privilege so a hijacked prompt can't reach secrets or destructive actions; and add input/output filtering. Assume injection is possible and limit the blast radius.
Follow-ups they push on- Why is indirect injection (via a retrieved doc or web page) especially dangerous for agents?
- Why isn't 'just tell the model to ignore malicious instructions' a real fix?
Red flag Believing a clever system prompt fully prevents it — there's no perfect prompt-level fix; you must limit privileges and gate actions.
source: Simon Willison — Prompt injection explained ↗ -
An LLM API call is stateless. What does that mean for building a multi-turn chat feature?
Each call to the messages endpoint is independent — the API keeps no memory of your previous calls. The model only knows what's in *this* request. So the 'conversation' isn't stored on the server side for you; it feels continuous only because you resend the prior messages each turn.
That means your app owns the history: you keep the running list of user/assistant messages, and on every new turn you send the whole relevant history plus the new user message. Practical consequences follow directly — history grows (and so does cost and token usage), you eventually trim or summarize it to stay within the window, and any 'memory' across sessions is something you build (a database), not something the API provides.
What a strong answer coversEvery API call is independent; the server stores no conversation for you.
Continuity is an illusion you create by resending prior messages each turn.
Your app owns the message history and sends it with every request.
History growth drives cost/tokens — trim or summarize; persistent memory is yours to build.
Follow-ups they push on- Where does the conversation history actually live in your app?
- Why does each additional turn cost a little more than the last?
Red flag Expecting the API to 'remember' the chat between calls — it doesn't; if you don't resend the history, the model has no idea what was said before.
source: Anthropic — Messages API basics ↗
No questions match these filters. Reset a filter to “All”.