Testing
The test pyramid and why; unit vs integration vs e2e; mocking/stubbing; TDD; and why 100% coverage isn't the goal.
A good test suite is a tradeoff, not a maximization: you want the most confidence per second of runtime and per hour of maintenance. The test pyramid is the shape that optimizes that tradeoff — many fast, cheap tests at the bottom, few slow, brittle ones at the top. Every choice in testing comes back to one question: for the confidence this test buys, what does it cost me in speed, flakiness, and maintenance?
The test pyramid
| Layer | Scope | Speed | Volume | When it fails you learn |
|---|---|---|---|---|
| Unit | one function/class, isolated | milliseconds | many | exactly which unit broke |
| Integration | several real components together | seconds | fewer | the wiring/contract between them broke |
| E2E | the whole system, like a user | minutes | fewest | something broke — now go find where |
The reasoning behind the shape: as you go up, tests get slower, flakier, and vaguer about what broke. A failing unit test names the broken function; a failing e2e test tells you the checkout is broken but not why. So you want most of your coverage where feedback is fast and precise (units), a meaningful middle layer that proves components actually talk to each other (integration), and a thin top layer of e2e tests covering only the few critical user journeys.
Mocking and stubbing — isolating the unit
A unit test must not depend on a real database, network, or clock, because those make it slow and non-deterministic. Test doubles stand in for those collaborators. The distinction worth knowing is how each kind of double is used and what you assert on.
Consider testing a service that charges a card and then sends a receipt. We don’t want to hit Stripe or send real email, so we substitute both, but we verify them differently:
```js
// `checkout` is the function under test and `order` is a fixture with an
// `email` field; both are assumed to be defined elsewhere in the test file.
test("charges the card and emails a receipt", async () => {
  // STUB: returns canned data so the code under test can proceed
  const payments = { charge: async () => ({ id: "ch_123", ok: true }) };
  // MOCK: a spy whose *calls* we will assert on
  const mailer = { sendReceipt: jest.fn() };

  await checkout({ payments, mailer }, order);

  // assert BEHAVIOR: the receipt email was actually triggered, once, correctly
  expect(mailer.sendReceipt).toHaveBeenCalledTimes(1);
  expect(mailer.sendReceipt).toHaveBeenCalledWith(order.email, "ch_123");
});
```
The payments double is a stub: we only need it to return a successful charge so the flow continues. The mailer is a mock: the test’s whole point is to assert it was called the right way. Stub when you care about the result; mock when you care about the interaction.
| Test double | What it does | You assert on | Use when |
|---|---|---|---|
| Stub | returns canned data on call | the result (state) | the collaborator just needs to answer so the flow proceeds |
| Mock | records calls + lets you assert them | the interaction (behavior) | the whole point is that a side effect happened (an email sent, an event published) |
| Fake | a lightweight working implementation (e.g. an in-memory DB) | the resulting state, as you would with the real dependency | you want realistic behavior without the real dependency's cost |
TDD and the coverage caveat
TDD is the red-green-refactor loop: write a failing test that specifies the desired behavior (red), write the minimum code to make it pass (green), then clean up with the safety net in place (refactor). Its real value isn’t the tests themselves — it’s that designing the test first forces you to define the behavior and the interface before you commit to an implementation.
┌─────────┐      ┌──────────┐      ┌────────────┐
│   RED   │ ───▶ │  GREEN   │ ───▶ │  REFACTOR  │ ──┐
│ failing │      │ minimum  │      │ clean up,  │   │
│  test   │      │ to pass  │      │ tests stay │   │
└─────────┘      └──────────┘      └────────────┘   │
     ▲                                              │
     └──────────────────────────────────────────────┘
Knowledge check
1. (easy) The test pyramid recommends:
2. (medium) 100% code coverage guarantees a well-tested, bug-free codebase.
Interview questions

- What is the test pyramid, and why more unit tests than end-to-end tests?
The pyramid is a guideline for the *shape* of your test suite: a wide base of fast, cheap unit tests; fewer integration tests in the middle; and a thin top of end-to-end tests through the whole system/UI.
Why that shape: as you go up, tests get slower, more brittle, and harder to pin a failure to a cause. Unit tests run in milliseconds and localize bugs precisely; e2e tests run for minutes, flake on timing, and only tell you *something* broke. So you push as much coverage as low as possible and reserve e2e for a few critical user journeys.
The inverted shape — mostly e2e — is the ice-cream cone anti-pattern: slow, flaky, expensive to maintain.
Follow-ups they push on:
- What does the 'ice-cream cone' look like and why is it painful?
- Where do contract tests fit in this picture?
Red flag: Treating the pyramid as exact ratios or gospel rather than a heuristic. The real point is fast/cheap/localized at the bottom, slow/brittle at the top.
source: Ham Vocke (martinfowler.com), The Practical Test Pyramid

- Define unit, integration, and end-to-end tests. What does each actually verify?
Unit tests exercise the smallest testable piece — one function/class — in isolation, with collaborators faked. They verify *this unit's logic is correct*. Fast and deterministic.
Integration tests verify that units talk to a real collaborator correctly — your code against an actual database, queue, or HTTP API. They catch interface/wiring bugs a unit test mocks away.
End-to-end tests drive the fully assembled system the way a user would (through the UI or public API) and verify a whole journey works. Slowest, most realistic, most brittle.
The trade is realism vs. speed/stability: unit = fast + narrow, e2e = realistic + fragile.
Follow-ups they push on:
- Why can a suite of all-green unit tests still let a broken feature ship?
- What's the difference between an integration test and a component test?
Red flag: Calling a test that mocks the database an 'integration test'. If every dependency is faked, it is still a unit test.
source: Ham Vocke (martinfowler.com), The Practical Test Pyramid

- What's the difference between a mock and a stub, and when do you reach for each?
Both are test doubles that stand in for a real dependency, but they answer different questions.
A stub provides canned return values so the code under test can run — it is about *state*: 'when asked, return this'. You assert on the output your code produces.
A mock also has pre-programmed responses but additionally *verifies the interaction*. It is about *behavior*: 'was `sendEmail` called once, with these args?'. You assert on the mock itself.
Rule of thumb: stub queries (reads), mock commands (side effects you care happened). Over-mocking couples tests to implementation detail and makes refactoring painful.
Follow-ups they push on:
- What's the difference between a fake and a stub?
- Why can heavy mocking make tests pass while the real integration is broken?
Red flag: Mocking everything, including pure logic. The test then asserts on internal calls and breaks on any refactor even when behavior is unchanged.
source: Martin Fowler, Mocks Aren't Stubs

- Walk me through the TDD cycle. What does it actually buy you?
TDD is red-green-refactor:
1. Red: write a small failing test for the next bit of behavior.
2. Green: write the minimum code to make it pass.
3. Refactor: clean up the code (and tests) now that they are green, keeping the tests passing.
Repeat in tiny increments. What it buys you: tests exist by construction (not bolted on later), the code is *designed to be testable* (so it tends toward decoupling and clear interfaces), and you get a fast feedback loop plus a regression safety net that lets you refactor fearlessly. It also forces you to define 'done' before coding.
Follow-ups they push on:
- Why is the refactor step the part people skip, and what happens when they do?
- When is strict TDD a poor fit?
Red flag: Describing TDD as 'write tests after the code'. The whole point is that the test comes *first* and drives the design.
source: Martin Fowler, Test Driven Development

- Why isn't 100% code coverage the goal? Can you have high coverage and still be poorly tested?
Coverage measures which lines *executed* during tests — not whether you *asserted* anything meaningful about them. You can hit 100% with tests that call code and check nothing, or that never exercise the edge cases and error paths that actually break in production.
Chasing 100% also has diminishing returns: the last few percent are often trivial getters or unreachable branches, and the effort is better spent elsewhere. Worse, it incentivizes shallow tests written to satisfy a number.
Better: treat coverage as a *diagnostic for gaps* (what is entirely untested?), aim for a sensible threshold, and judge quality by whether tests assert behavior and cover the risky paths — not by a single percentage.
Follow-ups they push on:
- What is mutation testing and how does it expose 'assertion-free' coverage?
- Which kinds of code genuinely don't need unit tests?
Red flag: Treating a coverage percentage as a quality metric. High coverage with weak or absent assertions is theater.
source: Martin Fowler, Test Coverage

- A test in your CI passes locally but fails ~10% of the time in the pipeline. How do you approach it?
That is a flaky test — non-deterministic. First, do not 'fix' it by retrying or deleting; quarantine it so it stops eroding trust in the suite, then root-cause it.
Common causes to check:
- Async/timing: a fixed `sleep` instead of waiting on a real condition; race conditions.
- Shared state / test ordering: tests that leak state between runs or assume order.
- Time and randomness: real `now()`, time zones, unseeded random.
- External dependencies / network that are slow or unavailable in CI.
- Resource contention in the parallel CI runner that doesn't happen locally.
Fix the determinism (inject the clock, isolate state, wait on conditions, stub the network). Flaky tests are dangerous because people start ignoring red builds.
Follow-ups they push on:
- Why is auto-retrying flaky tests a trap?
- How would you reproduce a CI-only failure locally?
Red flag: Masking flakiness with blanket retries. It hides real race conditions and trains the team to ignore failing tests.
source: Martin Fowler, Eradicating Non-Determinism in Tests

- What is the Arrange-Act-Assert pattern, and what makes a test maintainable?
Arrange-Act-Assert (AAA) structures a test into three clear phases: Arrange the inputs and preconditions, Act by invoking the one thing under test, then Assert on the outcome. Keeping these visually separate makes a test read as a tiny spec of the behavior.
Maintainable tests share a few traits: they test one behavior (so a failure points at one cause), assert on observable behavior rather than implementation detail (so a refactor doesn't break them), are deterministic and isolated (no shared state, no order dependence), and have descriptive names that state the scenario and expected result. A good test is also fast.
The through-line: a test should fail for exactly one reason and tell you what that reason is. Tests are production code — DRY-ish helpers are fine, but readability beats cleverness.
What a strong answer covers:
- AAA: Arrange preconditions → Act on the unit → Assert the outcome; keep the phases visibly separate.
- Test one behavior per test so a failure localizes to a single cause.
- Assert on observable behavior, not internals, so refactors don't break green tests.
- Be deterministic, isolated, and descriptively named. A test should fail for exactly one reason.
Follow-ups they push on:
- Why does asserting on private implementation detail make tests brittle?
- What's the 'one assertion per test' guideline really getting at?
Red flag: Writing tests that assert on internal calls/structure rather than observable behavior. They break on every refactor even when the behavior is unchanged, training people to delete tests.
source: Martin Fowler, Given-When-Then

- What's the difference between sociable and solitary unit tests, and the 'London vs Detroit' (mockist vs classicist) schools?
A solitary unit test isolates the unit by replacing *all* its collaborators with test doubles; a sociable unit test lets the unit use its real collaborators (as long as they're fast and deterministic), testing them together.
This maps to two testing schools. The mockist / London school favors solitary tests with mocks for every dependency, verifying *interactions* — it gives precise failure localization and tests units in true isolation, but couples tests to the call structure, so refactors that preserve behavior can still break tests. The classicist / Detroit (Chicago) school favors sociable tests, mocking only awkward dependencies (network, clock, DB), and asserting on *resulting state* — tests are more refactor-resilient and catch integration bugs between collaborators, but a failure may implicate several units.
Neither is 'correct'; the tradeoff is isolation/precision vs. refactor-resilience/realism, and most teams blend them.
What a strong answer covers:
- Solitary = all collaborators doubled; sociable = uses real collaborators where practical.
- Mockist/London: mock everything, verify interactions. Precise localization, but couples tests to call structure.
- Classicist/Detroit: mock only awkward deps, assert on state. Refactor-resilient, catches inter-unit bugs.
- The tradeoff is isolation/precision vs. realism/refactor-resilience; teams usually mix both.
Follow-ups they push on:
- Why can a mockist test pass while the real integration is broken?
- Which approach makes a behavior-preserving refactor less likely to break tests, and why?
Red flag: Treating one school as universally right. All-mockist suites become refactor-fragile interaction tests, while all-sociable suites can lose failure localization.
source: Martin Fowler, Unit Test (Solitary vs Sociable)

- What is a contract test, and what problem does it solve that unit and e2e tests don't?
When service A calls service B, A's unit tests stub B — but the stub encodes A's *assumption* of B's API, which silently rots when B changes. Full e2e tests catch the mismatch but are slow, flaky, and need every service deployed together.
Contract testing (e.g. consumer-driven contracts / Pact) fills the gap. The consumer (A) defines the requests it makes and the responses it expects as a contract; that contract is then verified against the provider (B) independently. If B's change would violate A's expectations, B's pipeline fails — *before* anything is deployed together.
The payoff: you get confidence that two services are compatible at their boundary with the speed and independence of unit tests — no shared environment, each side tested in its own pipeline. It's how you keep a microservices fleet integrable without a giant brittle e2e suite.
What a strong answer covers:
- Stubs of a remote service encode assumptions that drift as the provider changes, and unit tests won't notice.
- A contract captures the consumer's expected requests/responses and is verified against the provider separately.
- It catches integration breakage before deploy, without a shared e2e environment.
- Gives boundary-compatibility confidence with the speed/isolation of unit tests, which is key for microservices.
Follow-ups they push on:
- What does 'consumer-driven' add over the provider just publishing an OpenAPI spec?
- Why don't all-green unit tests on both services guarantee they integrate?
Red flag: Assuming green unit tests on both sides mean the services integrate. The consumer's stub can diverge from the provider's real behavior, which only contract or integration tests catch.
source: Martin Fowler, Contract Testing

- Should unit tests hit a real database? When is an in-memory or test-container DB the right call?
By definition, a unit test shouldn't touch a real DB — that makes it slow and non-deterministic. So you mock the data layer for unit tests. But mocking the DB means you never verify your *actual* SQL, migrations, or ORM mappings, and that's where real bugs hide.
So the pragmatic answer is layered: unit-test pure logic with the DB doubled, then write integration tests against a real database engine for the queries themselves. The mistake to avoid is using a *different* engine in tests than in production — e.g. SQLite or an in-memory fake standing in for Postgres. SQL dialects, constraint behavior, and types differ, so tests can pass against the fake and fail against prod (or vice versa).
Modern practice is Testcontainers: spin up the *real* database (same engine/version as prod) in a throwaway container for integration tests. You get fidelity without polluting a shared environment.
What a strong answer covers:
- A true unit test doesn't hit a DB, which would make it slow and non-deterministic; mock the data layer and test the logic.
- But mocks never validate real SQL, migrations, or ORM mappings, so cover those with integration tests.
- Don't substitute a different engine (SQLite for Postgres): dialect/constraint differences make tests lie.
- Use Testcontainers to run the real prod-version DB in a disposable container for integration tests.
Follow-ups they push on:
- Why can an in-memory SQLite stand-in for Postgres give false confidence?
- What belongs in a unit test vs an integration test for a repository class?
Red flag: Testing against a different DB engine than production (in-memory fake for the real thing). Dialect and constraint mismatches let bugs pass tests and break in prod.
source: Testcontainers, Database integration testing

- What is mutation testing, and how does it reveal that high code coverage can be misleading?
Line/branch coverage tells you code *ran* during tests, not that anything was *checked*. Mutation testing measures the latter: a tool makes small deliberate changes (mutants) to your code (flip `>` to `>=`, replace `+` with `-`, negate a condition, return null), then reruns your tests against each mutant.
If a mutant makes a test fail, it's killed (good: your tests detected the change). If all tests still pass, the mutant survived, meaning your suite executed that code but never asserted anything that the change would break. The mutation score (killed / total) is a far better quality signal than coverage.
This exposes the assertion-free-coverage problem directly: you can have 100% line coverage and a low mutation score, because tests call the code but verify nothing meaningful. The cost is compute — running the suite once per mutant is expensive — so teams often run it on critical modules rather than the whole repo.
What a strong answer covers:
- Coverage proves code executed; mutation testing proves your assertions actually catch changes.
- It injects small bugs (mutants); a killed mutant = tests detected it, a survivor = a gap in assertions.
- Mutation score (killed/total) is a stronger quality metric than line coverage.
- Directly exposes assertion-free coverage: 100% lines but mutants survive = tests that check nothing.
- Cost is high (rerun suite per mutant), so target critical modules rather than the whole codebase.
Quick self-check: A mutant 'survives' a mutation test run. What does that tell you?
- Correct: a surviving mutant means the change wasn't detected, exposing weak or missing assertions.
- Possible, but not what 'survived' specifically means. Survival is about tests running yet not detecting the change.
- Mutants are throwaway experiments on a copy, not changes shipped to production.
- Mutation testing is slow to run, but survival is a signal about assertion quality, not speed.
Follow-ups they push on:
- How can you have 100% line coverage and a 40% mutation score?
- What is an 'equivalent mutant' and why does it muddy the score?
Red flag: Trusting coverage as a quality bar. Mutation testing routinely shows high-coverage suites with surviving mutants, i.e. tests that run code without asserting on its behavior.
source: PIT (Pitest), Mutation testing

- Your e2e suite takes 45 minutes and people skip it. How do you make the test strategy sustainable?
A 45-minute, ignored e2e suite is usually the ice-cream-cone anti-pattern: too much testing pushed up to the slow, brittle e2e layer. The fix is to rebalance toward the test pyramid — push coverage down to where it's fast and reliable.
Concretely: for each slow e2e test, ask what it really verifies and move that assertion to the lowest layer that can — pure logic to unit tests, service-boundary behavior to integration/contract tests, and reserve e2e for a handful of critical user journeys (login, checkout). Parallelize what remains across CI runners, and split the suite so fast tests gate every PR while the full e2e set runs on a schedule or pre-deploy.
Separately, hunt flakiness — a slow suite people skip is often also a flaky one they've stopped trusting. Quarantine and fix flaky tests rather than retrying. The goal is a fast feedback loop developers actually run, backed by a thin, stable e2e layer.
What a strong answer covers:
- A bloated e2e suite is the ice-cream cone; rebalance toward the pyramid (fast, low-level tests).
- Move each assertion to the lowest layer that can verify it; keep e2e for a few critical journeys only.
- Parallelize and tier the suite: fast tests gate PRs, full e2e runs pre-deploy/scheduled.
- Attack flakiness too. Skipped suites are usually distrusted (flaky) ones; quarantine and fix, don't retry.
Follow-ups they push on:
- How do you decide which assertions can move down from e2e to unit/integration?
- Why is tiering the suite (PR gate vs nightly) better than running everything on every push?
Red flag: Speeding up an ice-cream-cone suite by only adding retries and more parallelism. Without rebalancing toward the pyramid you still have a slow, brittle suite developers route around.
source: Ham Vocke (martinfowler.com), The Practical Test Pyramid