6.4 ★ core [J][A] 12 interview Q's

Testing

The test pyramid and why; unit vs integration vs e2e; mocking/stubbing; TDD; and why 100% coverage isn't the goal.

A good test suite is a tradeoff, not a maximization: you want the most confidence per second of runtime and per hour of maintenance. The test pyramid is the shape that optimizes that tradeoff — many fast, cheap tests at the bottom, few slow, brittle ones at the top. Every choice in testing comes back to one question: for the confidence this test buys, what does it cost me in speed, flakiness, and maintenance?

Key vocabulary

Unit test: Tests one small piece (a function/class) in isolation, with its collaborators replaced by test doubles. Milliseconds to run; pinpoints the failure exactly.
Integration test: Tests that several real components work together — e.g. your code against a real database or HTTP client. Slower, but catches wiring and contract bugs units can't.
End-to-end (e2e): Drives the whole system as a user would (often through the UI/browser), exercising every layer. Highest confidence, but slow, flaky, and expensive to maintain.
Mock vs stub: A stub returns canned data to satisfy a dependency (state). A mock additionally asserts it was called a certain way (behavior). Both are test doubles that isolate the unit.
TDD: Test-Driven Development. Write a failing test first, write the minimum code to pass it, then refactor: the red-green-refactor loop. The test specifies the behavior before the code exists.
Flaky test: A test that passes or fails non-deterministically on the same code — usually from timing, ordering, shared state, or a real network. Flaky tests erode trust in the whole suite.
Code coverage: The percentage of code lines/branches executed by tests. A useful signal of what's untested, but a poor target — 100% executed is not 100% verified.

The test pyramid

FIG 1 · The test pyramid: many fast, precise unit tests at the base; a meaningful integration middle; a thin layer of slow, high-value e2e tests at the top.

Layer	Scope	Speed	Volume	When it fails you learn
Unit	one function/class, isolated	milliseconds	many	exactly which unit broke
Integration	several real components together	seconds	fewer	the wiring/contract between them broke
E2E	the whole system, like a user	minutes	fewest	something broke — now go find where

Push tests down the pyramid: many fast units, fewer integrations, fewest e2e.

The reasoning behind the shape: as you go up, tests get slower, flakier, and vaguer about what broke. A failing unit test names the broken function; a failing e2e test tells you the checkout is broken but not why. So you want most of your coverage where feedback is fast and precise (units), a meaningful middle layer that proves components actually talk to each other (integration), and a thin top layer of e2e tests covering only the few critical user journeys.

Mocking and stubbing — isolating the unit

A unit test must not depend on a real database, network, or clock — those make it slow and non-deterministic. Test doubles stand in for those collaborators. The distinction worth knowing:

Stub for state, mock for behavior

Consider testing a service that charges a card and then sends a receipt. We don't want to hit Stripe or send real email, so we substitute both, but we verify them differently:

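test("charges the card and emails a receipt", async () => {
  // Illustrative order (an assumption); `checkout` is the function under test, imported from application code
  const order = { email: "ada@example.com", total: 4200 };

  // STUB: returns canned data so the code under test can proceed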
  const payments = { charge: async () => ({ id: "ch_123", ok: true }) };

  // MOCK: a spy whose *calls* we will assert on
  const mailer = { sendReceipt: jest.fn() };

  await checkout({ payments, mailer }, order);

  // assert BEHAVIOR: the receipt email was actually triggered, once, correctly
  expect(mailer.sendReceipt).toHaveBeenCalledTimes(1);
  expect(mailer.sendReceipt).toHaveBeenCalledWith(order.email, "ch_123");
});

The payments double is a stub — we only need it to return a successful charge so the flow continues. The mailer is a mock — the test’s whole point is to assert it was called the right way. Stub when you care about the result; mock when you care about the interaction.

Test double	What it does	You assert on	Use when
Stub	returns canned data on call	the result (state)	the collaborator just needs to answer so the flow proceeds
Mock	records calls + lets you assert them	the interaction (behavior)	the whole point is that a side effect happened (an email sent, an event published)
Fake	a lightweight working implementation (e.g. an in-memory DB)	the result, produced by real logic	you want realistic behavior without the real dependency's cost

Stub = canned answer (state); mock = call verification (behavior); fake = a real-but-light implementation.

TDD and the coverage caveat

TDD is the red-green-refactor loop: write a failing test that specifies the desired behavior (red), write the minimum code to make it pass (green), then clean up with the safety net in place (refactor). Its real value isn’t the tests themselves — it’s that designing the test first forces you to define the behavior and the interface before you commit to an implementation.

   ┌─────────┐      ┌──────────┐      ┌────────────┐
   │  RED    │ ───▶ │  GREEN   │ ───▶ │  REFACTOR  │ ──┐
   │ failing │      │ minimum  │      │ clean up,  │   │
   │  test   │      │ to pass  │      │ tests stay │   │
   └─────────┘      └──────────┘      └────────────┘   │
        ▲                                              │
        └──────────────────────────────────────────────┘

Key points

Pyramid: many unit tests (fast, precise), fewer integration tests (real components together), fewest e2e tests (whole system, slow/flaky).
Push tests down: lower layers give faster, more precise feedback; an inverted “ice-cream cone” of mostly e2e tests is the anti-pattern.
Stub when you care about a returned value (state); mock when you care about the interaction (behavior); a fake is a light working implementation.
Mock only at boundaries (network/clock/third parties); over-mocking lets tests pass while the system is broken — that’s what integration tests catch.
TDD = red → green → refactor; writing the test first forces you to define behavior and interface up front.
Coverage is a signal of what’s untested, not a target — 100% executed ≠ 100% verified.

02 Curated reading

Martin Fowler — The Practical Test Pyramid
optional 25m — The definitive write-up of the pyramid.

03 Knowledge check

2 questions · pass ≥ 70%

01 · easy
The test pyramid recommends:
02 · medium
100% code coverage guarantees a well-tested, bug-free codebase.

04 Interview questions

What gets asked on this topic: for each question, how to approach it, the follow-ups, and the trap.

Commonly asked · mid · concept · very common
What is the test pyramid, and why more unit tests than end-to-end tests?
The pyramid is a guideline for the *shape* of your test suite: a wide base of fast, cheap unit tests; fewer integration tests in the middle; and a thin top of end-to-end tests through the whole system/UI.
Why that shape: as you go up, tests get slower, more brittle, and harder to pin a failure to a cause. Unit tests run in milliseconds and localize bugs precisely; e2e tests run for minutes, flake on timing, and only tell you *something* broke. So you push as much coverage as low as possible and reserve e2e for a few critical user journeys.
The inverted shape — mostly e2e — is the ice-cream cone anti-pattern: slow, flaky, expensive to maintain.
Follow-ups they push on
- What does the 'ice-cream cone' look like and why is it painful?
- Where do contract tests fit in this picture?
Red flag: Treating the pyramid as exact ratios or gospel rather than a heuristic. The real point is fast/cheap/localized at the bottom, slow/brittle at the top.
source: Martin Fowler, The Practical Test Pyramid
Commonly asked · junior · concept · very common
Define unit, integration, and end-to-end tests. What does each actually verify?
Unit tests exercise the smallest testable piece — one function/class — in isolation, with collaborators faked. They verify *this unit's logic is correct*. Fast and deterministic.
Integration tests verify that units talk to a real collaborator correctly — your code against an actual database, queue, or HTTP API. They catch interface/wiring bugs a unit test mocks away.
End-to-end tests drive the fully assembled system the way a user would (through the UI or public API) and verify a whole journey works. Slowest, most realistic, most brittle.
The trade is realism vs. speed/stability: unit = fast + narrow, e2e = realistic + fragile.
Follow-ups they push on
- Why can a suite of all-green unit tests still let a broken feature ship?
- What's the difference between an integration test and a component test?
Red flag: Calling a test that mocks the database an 'integration test'. If every dependency is faked it is still a unit test.
source: Martin Fowler, The Practical Test Pyramid
Commonly asked · senior · concept · common
What's the difference between a mock and a stub, and when do you reach for each?
Both are test doubles that stand in for a real dependency, but they answer different questions.
A stub provides canned return values so the code under test can run — it is about *state*: 'when asked, return this'. You assert on the output your code produces.
A mock also has pre-programmed responses but additionally *verifies the interaction* — it is about *behavior*: 'was sendEmail called once, with these args?'. You assert on the mock itself.
Rule of thumb: stub queries (reads), mock commands (side effects you care happened). Over-mocking couples tests to implementation detail and makes refactoring painful.
Follow-ups they push on
- What's the difference between a fake and a stub?
- Why can heavy mocking make tests pass while the real integration is broken?
Red flag: Mocking everything, including pure logic. The test then asserts on internal calls and breaks on any refactor even when behavior is unchanged.
source: Martin Fowler, Mocks Aren't Stubs
Commonly asked · mid · concept · common
Walk me through the TDD cycle. What does it actually buy you?
TDD is red-green-refactor:
1. Red — write a small failing test for the next bit of behavior.
2. Green — write the minimum code to make it pass.
3. Refactor: clean up the code (and tests) now that they are green, keeping the tests passing.
Repeat in tiny increments. What it buys you: tests exist by construction (not bolted on later), the code is *designed to be testable* (so it tends toward decoupling and clear interfaces), and you get a fast feedback loop plus a regression safety net that lets you refactor fearlessly. It also forces you to define 'done' before coding.
Follow-ups they push on
- Why is the refactor step the part people skip, and what happens when they do?
- When is strict TDD a poor fit?
Red flag: Describing TDD as 'write tests after the code'. The whole point is that the test comes *first* and drives the design.
source: Martin Fowler, Test Driven Development
Commonly asked · senior · trick · common
Why isn't 100% code coverage the goal? Can you have high coverage and still be poorly tested?
Coverage measures which lines *executed* during tests — not whether you *asserted* anything meaningful about them. You can hit 100% with tests that call code and check nothing, or that never exercise the edge cases and error paths that actually break in production.
Chasing 100% also has diminishing returns: the last few percent are often trivial getters or unreachable branches, and the effort is better spent elsewhere. Worse, it incentivizes shallow tests written to satisfy a number.
Better: treat coverage as a *diagnostic for gaps* (what is entirely untested?), aim for a sensible threshold, and judge quality by whether tests assert behavior and cover the risky paths — not by a single percentage.
Follow-ups they push on
- What is mutation testing and how does it expose 'assertion-free' coverage?
- Which kinds of code genuinely don't need unit tests?
Red flag: Treating a coverage percentage as a quality metric. High coverage with weak or absent assertions is theater.
source: Martin Fowler, Test Coverage
Commonly asked · senior · debug · common
A test passes locally but fails ~10% of the time in the CI pipeline. How do you approach it?
That is a flaky test — non-deterministic. First, do not 'fix' it by retrying or deleting; quarantine it so it stops eroding trust in the suite, then root-cause it.
Common causes to check:
- Async/timing: a fixed sleep instead of waiting on a real condition; race conditions.
- Shared state / test ordering: tests that leak state between runs or assume order.
- Time and randomness: real now(), time zones, unseeded random.
- External dependencies / network that are slow or unavailable in CI.
- Resource contention in the parallel CI runner that doesn't happen locally.
Fix the determinism (inject the clock, isolate state, wait on conditions, stub the network). Flaky tests are dangerous because people start ignoring red builds.
Follow-ups they push on
- Why is auto-retrying flaky tests a trap?
- How would you reproduce a CI-only failure locally?
Red flag: Masking flakiness with blanket retries. It hides real race conditions and trains the team to ignore failing tests.
source: Martin Fowler, Eradicating Non-Determinism in Tests
Commonly asked · junior · concept · common
What is the Arrange-Act-Assert pattern, and what makes a test maintainable?
Arrange-Act-Assert (AAA) structures a test into three clear phases: Arrange the inputs and preconditions, Act by invoking the one thing under test, then Assert on the outcome. Keeping these visually separate makes a test read as a tiny spec of the behavior.
Maintainable tests share a few traits: they test one behavior (so a failure points at one cause), assert on observable behavior rather than implementation detail (so a refactor doesn't break them), are deterministic and isolated (no shared state, no order dependence), and have descriptive names that state the scenario and expected result. A good test is also fast.
The through-line: a test should fail for exactly one reason and tell you what that reason is. Tests are production code — DRY-ish helpers are fine, but readability beats cleverness.
What a strong answer covers
- AAA: Arrange preconditions → Act on the unit → Assert the outcome; keep the phases visibly separate.
- Test one behavior per test so a failure localizes to a single cause.
- Assert on observable behavior, not internals, so refactors don't break green tests.
- Be deterministic, isolated, and descriptively named — a test should fail for exactly one reason.
Follow-ups they push on
- Why does asserting on private implementation detail make tests brittle?
- What's the 'one assertion per test' guideline really getting at?
Red flag: Writing tests that assert on internal calls/structure rather than observable behavior. They break on every refactor even when the behavior is unchanged, training people to delete tests.
source: Martin Fowler, Given-When-Then
Commonly asked · senior · concept · occasional
What's the difference between sociable and solitary unit tests, and the 'London vs Detroit' (mockist vs classicist) schools?
A solitary unit test isolates the unit by replacing *all* its collaborators with test doubles; a sociable unit test lets the unit use its real collaborators (as long as they're fast and deterministic), testing them together.
This maps to two testing schools. The mockist / London school favors solitary tests with mocks for every dependency, verifying *interactions* — it gives precise failure localization and tests units in true isolation, but couples tests to the call structure, so refactors that preserve behavior can still break tests. The classicist / Detroit (Chicago) school favors sociable tests, mocking only awkward dependencies (network, clock, DB), and asserting on *resulting state* — tests are more refactor-resilient and catch integration bugs between collaborators, but a failure may implicate several units.
Neither is 'correct'; the tradeoff is isolation/precision vs. refactor-resilience/realism, and most teams blend them.
What a strong answer covers
- Solitary = all collaborators doubled; sociable = uses real collaborators where practical.
- Mockist/London: mock everything, verify interactions — precise localization, but couples tests to call structure.
- Classicist/Detroit: mock only awkward deps, assert on state — refactor-resilient, catches inter-unit bugs.
- The tradeoff is isolation/precision vs. realism/refactor-resilience; teams usually mix both.
Follow-ups they push on
- Why can a mockist test pass while the real integration is broken?
- Which approach makes a behavior-preserving refactor less likely to break tests, and why?
Red flag: Treating one school as universally right. All-mockist suites become refactor-fragile interaction tests, while all-sociable suites can lose failure localization.
source: Martin Fowler, Unit Test (Solitary vs Sociable)
Commonly asked · senior · concept · occasional
What is a contract test, and what problem does it solve that unit and e2e tests don't?
When service A calls service B, A's unit tests stub B — but the stub encodes A's *assumption* of B's API, which silently rots when B changes. Full e2e tests catch the mismatch but are slow, flaky, and need every service deployed together.
Contract testing (e.g. consumer-driven contracts / Pact) fills the gap. The consumer (A) defines the requests it makes and the responses it expects as a contract; that contract is then verified against the provider (B) independently. If B's change would violate A's expectations, B's pipeline fails — *before* anything is deployed together.
The payoff: you get confidence that two services are compatible at their boundary with the speed and independence of unit tests — no shared environment, each side tested in its own pipeline. It's how you keep a microservices fleet integrable without a giant brittle e2e suite.
What a strong answer covers
- Stubs of a remote service encode assumptions that drift as the provider changes — unit tests won't notice.
- A contract captures the consumer's expected requests/responses and is verified against the provider separately.
- It catches integration breakage before deploy, without a shared e2e environment.
- Gives boundary-compatibility confidence with the speed/isolation of unit tests — key for microservices.
Follow-ups they push on
- What does 'consumer-driven' add over the provider just publishing an OpenAPI spec?
- Why don't all-green unit tests on both services guarantee they integrate?
Red flag: Assuming green unit tests on both sides mean the services integrate. The consumer's stub can diverge from the provider's real behavior, which only contract or integration tests catch.
source: Martin Fowler, Contract Testing
Commonly asked · senior · concept · occasional
Should unit tests hit a real database? When is an in-memory or test-container DB the right call?
By definition, a unit test shouldn't touch a real DB — that makes it slow and non-deterministic. So you mock the data layer for unit tests. But mocking the DB means you never verify your *actual* SQL, migrations, or ORM mappings, and that's where real bugs hide.
So the pragmatic answer is layered: unit-test pure logic with the DB doubled, then write integration tests against a real database engine for the queries themselves. The mistake to avoid is using a *different* engine in tests than in production — e.g. SQLite or an in-memory fake standing in for Postgres. SQL dialects, constraint behavior, and types differ, so tests can pass against the fake and fail against prod (or vice versa).
Modern practice is Testcontainers: spin up the *real* database (same engine/version as prod) in a throwaway container for integration tests. You get fidelity without polluting a shared environment.
What a strong answer covers
- A true unit test doesn't hit a DB — mock the data layer for logic; it's slow/non-deterministic otherwise.
- But mocks never validate real SQL, migrations, or ORM mappings — cover those with integration tests.
- Don't substitute a different engine (SQLite for Postgres) — dialect/constraint differences make tests lie.
- Use Testcontainers to run the real prod-version DB in a disposable container for integration tests.
Follow-ups they push on
- Why can an in-memory SQLite stand-in for Postgres give false confidence?
- What belongs in a unit test vs an integration test for a repository class?
Red flag: Testing against a different DB engine than production (an in-memory fake for the real thing). Dialect and constraint mismatches let bugs pass tests and break in prod.
source: Testcontainers, Database integration testing
Commonly asked · senior · concept · occasional
What is mutation testing, and how does it reveal that high code coverage can be misleading?
Line/branch coverage tells you code *ran* during tests, not that anything was *checked*. Mutation testing measures the latter: a tool makes small deliberate changes (mutants) to your code — flip > to >=, replace + with -, negate a condition, return null — then reruns your tests against each mutant.
If a mutant makes a test fail, it's killed (good — your tests detected the change). If all tests still pass, the mutant survived — meaning your suite executed that code but never asserted anything that the change would break. The mutation score (killed / total) is a far better quality signal than coverage.
This exposes the assertion-free-coverage problem directly: you can have 100% line coverage and a low mutation score, because tests call the code but verify nothing meaningful. The cost is compute — running the suite once per mutant is expensive — so teams often run it on critical modules rather than the whole repo.
What a strong answer covers
- Coverage proves code executed; mutation testing proves your assertions actually catch changes.
- It injects small bugs (mutants); a killed mutant = tests detected it, a survivor = a gap in assertions.
- Mutation score (killed/total) is a stronger quality metric than line coverage.
- Directly exposes assertion-free coverage: 100% lines but mutants survive = tests that check nothing.
- Cost is high (rerun suite per mutant), so target critical modules rather than the whole codebase.
Quick self-check
A mutant 'survives' a mutation test run. What does that tell you?
Follow-ups they push on
- How can you have 100% line coverage and a 40% mutation score?
- What is an 'equivalent mutant' and why does it muddy the score?
Red flag: Trusting coverage as a quality bar. Mutation testing routinely shows high-coverage suites with surviving mutants, i.e. tests that run code without asserting on its behavior.
source: PIT (Pitest), Mutation testing
Commonly asked · senior · design · common
Your e2e suite takes 45 minutes and people skip it. How do you make the test strategy sustainable?
A 45-minute, ignored e2e suite is usually the ice-cream-cone anti-pattern: too much testing pushed up to the slow, brittle e2e layer. The fix is to rebalance toward the test pyramid — push coverage down to where it's fast and reliable.
Concretely: for each slow e2e test, ask what it really verifies and move that assertion to the lowest layer that can — pure logic to unit tests, service-boundary behavior to integration/contract tests, and reserve e2e for a handful of critical user journeys (login, checkout). Parallelize what remains across CI runners, and split the suite so fast tests gate every PR while the full e2e set runs on a schedule or pre-deploy.
Separately, hunt flakiness — a slow suite people skip is often also a flaky one they've stopped trusting. Quarantine and fix flaky tests rather than retrying. The goal is a fast feedback loop developers actually run, backed by a thin, stable e2e layer.
What a strong answer covers
- A bloated e2e suite is the ice-cream cone — rebalance toward the pyramid (fast, low-level tests).
- Move each assertion to the lowest layer that can verify it; keep e2e for a few critical journeys only.
- Parallelize and tier the suite: fast tests gate PRs, full e2e runs pre-deploy/scheduled.
- Attack flakiness too — skipped suites are usually distrusted (flaky) ones; quarantine and fix, don't retry.
Follow-ups they push on
- How do you decide which assertions can move down from e2e to unit/integration?
- Why is tiering the suite (PR gate vs nightly) better than running everything on every push?
Red flag: Speeding up an ice-cream-cone suite by only adding retries and more parallelism. Without rebalancing toward the pyramid you still have a slow, brittle suite developers route around.
source: Martin Fowler, The Practical Test Pyramid