> cs·fundamentals
interview 0% 35m read
3.2 ★ core [J][I] 16 interview Q's

JOINs

All six JOIN types, the difference between INNER and LEFT from memory, and the classic gotcha where a WHERE filter silently turns a LEFT JOIN into an INNER one.

A JOIN combines rows from two tables by matching a condition (the ON clause). The only thing that ever changes between JOIN types is what happens to rows that don’t find a match — everything else is the same. Hold onto that and the six variants collapse into one idea with a few switches.

The six JOIN types

JoinKeepsReach for it whenWatch out
INNERrows matching in bothyou only care about pairs that existsilently hides unmatched rows
LEFTall left + right matches“all X, plus their Y if any”filtering the right table in WHERE
RIGHTall right + left matchesrarely — usually rewrite as a LEFTreadability; flip the table order instead
FULL OUTERall rows, both sidesreconciling two sets (diffs)MySQL lacks it — emulate with UNION
CROSSevery combinationgenerating combinations deliberately10k × 10k = 100M rows
SELFtable joined to itselfhierarchies (employee → manager)always alias both copies
The only real decision is how to treat non-matching rows.
LRINNERoverlap onlyLRLEFTall L + matchesLRRIGHTall R + matchesLRFULL OUTEReverythingCROSS = every L × every R (no ON) · SELF = L and R are the same tableThe only difference between any two diagrams: which non-matching rows survive.
The classic Venn view. Each pair of circles is the left table (L) and right table (R); the shaded region is what the JOIN keeps. INNER keeps only the overlap; LEFT adds all of L; RIGHT adds all of R; FULL keeps everything. The unshaded crescents are the rows a NULL-padded outer join preserves and an inner join discards.

INNER vs LEFT — the one you must know cold

customers ▸ orders

Given customers(id, name) and orders(id, customer_id), list every customer and how many orders they’ve placed — including customers with none. That word “including” is the tell: you need a LEFT JOIN.

SELECT  c.name,
        COUNT(o.id) AS order_count
FROM        customers c
LEFT JOIN   orders    o ON o.customer_id = c.id
GROUP BY    c.name
ORDER BY    order_count DESC;

An INNER JOIN here would silently drop every customer who never ordered — a classic off-by-a-whole-segment reporting bug. COUNT(o.id) (not COUNT(*)) correctly reports 0 for those customers, because o.id is NULL on the unmatched rows and COUNT ignores NULL.

Self joins

A self join is just a normal join where both sides happen to be the same table, distinguished by aliases. The canonical case is a hierarchy:

SELECT  e.name      AS employee,
        m.name      AS manager
FROM        employees e
LEFT JOIN   employees m ON m.id = e.manager_id;   -- LEFT keeps the CEO (no manager)

Against the seed, that returns all eight employees; only Grace Hopper has a NULL manager (she has no manager_id). Swap the LEFT for an INNER and Grace vanishes — the exact bug from the pitfall below, in miniature.

You can try all of these live in the SQL playground below once you’ve worked the objectives.

01 Learning objectives

0 / 7 done

02 Curated reading

03 SQL challenges

SQL challenges · ecommerce.db
0/4 solved

SQL-06List each order's id alongside the name of the customer who placed it.

04 Knowledge check

knowledge check5 questions · pass ≥ 70%
  1. 01easy

    An INNER JOIN returns:

  2. 02easy

    A CROSS JOIN of a 1,000-row and a 1,000-row table yields 1,000,000 rows.

  3. 03medium

    You LEFT JOIN customers→orders and add `WHERE orders.status='shipped'`. Unexpectedly, customers with no orders disappear. Why?

  4. 04medium

    To list employees alongside their manager (same table), you use a:

  5. 05hard

    table_a has 3 rows; for a given a, b has 0, 1, 2 matches respectively. How many rows does `a LEFT JOIN b` produce?

05 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • Amazon junior concept very common What's the difference between an INNER JOIN and a LEFT JOIN, and what's the classic LEFT JOIN bug?

    INNER JOIN keeps only rows that match in both tables; LEFT JOIN keeps all left rows, filling NULL where the right side has no match.

    The bug: a WHERE predicate on a *right-table* column silently turns a LEFT JOIN into an INNER JOIN, because NULL fails the filter and those unmatched rows vanish. Fix by moving the condition into the ON clause: LEFT JOIN orders o ON o.cust_id = c.id AND o.status = 'paid' keeps customers with no paid order.

    Red flag Saying LEFT JOIN 'returns more rows' instead of 'preserves unmatched left rows', and not catching the WHERE-vs-ON filter trap.

    source: DataLemur — SQL Interview Questions ↗
  • Amazon mid coding common Identify the top two highest-grossing products within each category in 2022, returning category, product, and total spend.

    Aggregate spend per (category, product), rank within each category, keep rank <= 2:

    WITH g AS (SELECT category, product, SUM(spend) AS total FROM product_spend WHERE EXTRACT(YEAR FROM tx_date) = 2022 GROUP BY category, product), r AS (SELECT *, RANK() OVER (PARTITION BY category ORDER BY total DESC) AS rk FROM g) SELECT category, product, total FROM r WHERE rk <= 2;

    This is the 'top-N-per-group' pattern: GROUP BY for the metric, a window RANK to rank within partitions.

    Red flag Using a global ORDER BY + LIMIT 2 (gives the top 2 overall, not per category), or referencing the window alias in WHERE instead of wrapping it in a CTE/subquery.

    source: DataLemur — Amazon 'Highest-Grossing Items' ↗
  • Commonly asked mid coding common Write a self-join to list each employee alongside their manager's name from an employees(id, name, manager_id) table.

    Join the table to itself with two aliases:

    SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;

    Use LEFT JOIN (not INNER) so the CEO, whose manager_id is NULL, still appears with a NULL manager. Aliases (e, m) are mandatory to disambiguate the two copies.

    Red flag Using INNER JOIN and silently dropping the top-level employee, or forgetting aliases so the columns are ambiguous.

    source: PG Exercises — JOINs ↗
  • Commonly asked mid coding occasional MySQL has no FULL OUTER JOIN. How do you emulate one?

    Take the union of a LEFT JOIN and a RIGHT JOIN:

    SELECT * FROM a LEFT JOIN b ON a.id = b.id UNION SELECT * FROM a RIGHT JOIN b ON a.id = b.id;

    The LEFT half gives all of a plus matches; the RIGHT half gives all of b plus matches; UNION (not UNION ALL) dedups the rows that matched on both sides.

    Red flag Using UNION ALL and double-counting matched rows, or assuming MySQL silently supports FULL OUTER JOIN.

    source: PostgreSQL docs — Joins (table expressions) ↗
  • Amazon mid coding occasional Find products that exist in Amazon's catalog but NOT in the partner's catalog (an anti-join).

    Three idiomatic ways; the LEFT JOIN … IS NULL anti-join is the workhorse:

    SELECT a.product FROM amazon a LEFT JOIN partner p ON a.product = p.product WHERE p.product IS NULL;

    Alternatives: NOT EXISTS (SELECT 1 FROM partner p WHERE p.product = a.product) (NULL-safe, often the planner's favourite) or EXCEPT. Prefer NOT EXISTS over NOT IN when the right column can be NULL.

    Red flag Using `NOT IN` with a nullable column (a single NULL makes the whole predicate return no rows), or forgetting the `IS NULL` filter in the LEFT-JOIN form.

    source: StrataScratch — Amazon 'Exclusive Amazon Products' ↗
  • Commonly asked mid concept common Why can a JOIN return more rows than either input table, and how do you avoid accidental row explosion?

    A join multiplies rows wherever the join key is not unique on the other side: if one customer has 3 orders, joining customers->orders yields 3 rows for that customer. A many-to-many join multiplies both sides — fan-out.

    This silently corrupts aggregates: SUM(amount) double-counts if you joined in a second one-to-many table first. Guard against it by joining on unique/PK columns, pre-aggregating one side in a CTE before joining, or checking the grain of every join.

    Red flag Blaming 'duplicate data' when the real cause is joining on a non-unique key, or summing a measure after a fan-out join and reporting inflated totals.

    source: StrataScratch — Amazon SQL Interview Questions ↗
  • Commonly asked senior concept occasional Explain the three physical join algorithms (nested loop, hash join, merge join) and when a planner picks each.

    Nested loop: for each outer row, probe the inner table — great when one side is tiny or there's an index on the inner join key; O(n*m) without an index.

    Hash join: build a hash table on the smaller input's key, probe with the larger — best for large, unindexed equality joins; needs memory and only does equi-joins.

    Merge join: sort both inputs on the key, then walk them in lockstep — wins when inputs are already sorted (e.g. from an index) or for range conditions.

    The planner chooses by estimated row counts and available indexes; you see them in EXPLAIN.

    Red flag Thinking the JOIN keyword maps to one fixed algorithm — the optimizer picks the physical operator based on stats and indexes.

    source: PostgreSQL docs — Planner / Optimizer (join methods) ↗
  • Meta senior coding occasional For each Friday, count the total likes a post received from the poster's friends, where the like happened after the post was created.

    Friendship is usually stored one-directional, so first symmetrize it with UNION ALL of (a,b) and (b,a). Then join posts to that friend list and to likes, requiring the liker to be a friend and like_ts > post_ts, and filter to Fridays:

    ... WHERE EXTRACT(DOW FROM like_date) = 5 AND like_ts > post_ts then GROUP BY like_date. Use COUNT(DISTINCT …) if a friend could like the same post twice.

    This is a Meta-style 'SQL as a tool for product reasoning' question: the schema modelling (bidirectional friendship, temporal ordering) is the real test.

    Red flag Treating friendship as already bidirectional and undercounting, or forgetting the `like_ts > post_ts` temporal guard.

    source: StrataScratch — Meta "Friday's Likes Count" ↗
  • Commonly asked junior concept occasional What does a CROSS JOIN do, and name a legitimate use for it.

    A CROSS JOIN produces the Cartesian product — every row of A paired with every row of B (n*m rows, no ON clause). 10k x 10k = 100M rows, so it's usually a bug from a missing join condition.

    Legit uses: generating a complete grid (every store x every day to fill gaps for a report), pairing each row against a small constants/calendar table, or building combinations. Often written deliberately as CROSS JOIN generate_series(...).

    Red flag Not recognizing that a comma-join with no WHERE join condition is effectively a CROSS JOIN that explodes row counts.

    source: PostgreSQL docs — Joined Tables (CROSS JOIN) ↗
  • ★ must-know Commonly asked junior concept very common Walk through INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOIN — what rows does each keep?

    All join two tables on a predicate; they differ in which unmatched rows survive:

    - INNER — only rows that match on both sides.
    - LEFT (OUTER) — all left rows; right columns are NULL where no match.
    - RIGHT (OUTER) — all right rows; left columns NULL where no match (a LEFT JOIN with the tables swapped).
    - FULL OUTER — all rows from both sides; NULL on whichever side is missing.
    - CROSS — every left row paired with every right row (Cartesian product, no ON).

    Mental model: INNER is the intersection, LEFT/RIGHT keep one side whole, FULL keeps the union, CROSS multiplies.

    What a strong answer covers
    • INNER = matches only; the unmatched rows on both sides are dropped.

    • LEFT keeps all left rows; RIGHT keeps all right rows (mirror images).

    • FULL OUTER keeps unmatched rows from both sides, padding the missing side with NULL.

    • CROSS produces n*m rows with no join condition.

    • RIGHT JOIN is rarely written by hand — people flip the tables and use LEFT for readability.

    Quick self-check

    `customers` has 10 rows; `orders` has 4 rows, all belonging to just 2 of those customers (one order each... wait, 4 orders across 2 customers). How many rows does `customers LEFT JOIN orders ON customers.id = orders.cust_id` return?

    Red flag Describing LEFT/RIGHT as 'returns more rows' rather than 'preserves the unmatched rows of one side', or thinking CROSS JOIN needs an ON clause.

    source: PostgreSQL docs — Joined Tables (join types) ↗
  • Commonly asked mid debug very common A LEFT JOIN with `WHERE right_table.col = 'x'` returns fewer rows than expected. What happened, and what's the fix?

    The WHERE on a right-table column silently demotes the LEFT JOIN to an INNER JOIN. Unmatched left rows have NULL in right_table.col, and NULL = 'x' is UNKNOWN, so WHERE discards exactly the rows the LEFT JOIN was meant to preserve.

    Fix: move the predicate into the ON clause — LEFT JOIN r ON r.fk = l.id AND r.col = 'x' — so it filters which right rows *match* without dropping unmatched left rows. The rule: conditions that should *preserve* the outer side go in ON; conditions that should *filter the final result* go in WHERE. (A WHERE right.col IS NULL is the deliberate exception — that's the anti-join idiom.)

    What a strong answer covers
    • A WHERE predicate on the null-able (right) side turns LEFT JOIN into INNER JOIN.

    • Cause: NULL = 'x' evaluates to UNKNOWN, so the padded unmatched rows are filtered out.

    • Fix: put the right-side condition in ON, not WHERE.

    • ON controls matching (preserves the outer side); WHERE filters the joined result.

    • Exception: WHERE right.col IS NULL is intentional — it's the anti-join pattern.

    Quick self-check

    You want every customer plus their 2024 orders (customers with no 2024 order should still appear). Which is correct?

    Red flag Putting a right-table filter in WHERE and not realizing you've turned an outer join into an inner join, losing the unmatched rows you wanted.

    source: PostgreSQL docs — Joined Tables (ON vs WHERE for outer joins) ↗
  • Commonly asked mid coding very common Find customers who have never placed an order — and explain three ways to write it.

    This is the canonical anti-join. Three idioms:

    1. LEFT JOIN / IS NULL: SELECT c.name FROM customers c LEFT JOIN orders o ON o.cust_id = c.id WHERE o.id IS NULL;
    2. NOT EXISTS (usually the planner's favourite, and NULL-safe): SELECT name FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
    3. NOT IN — works *only* if the subquery column can't be NULL: WHERE c.id NOT IN (SELECT cust_id FROM orders WHERE cust_id IS NOT NULL);

    Prefer NOT EXISTS for safety and performance; reach for LEFT JOIN … IS NULL when you also need columns from the joined table.

    What a strong answer covers
    • Anti-join = 'rows in A with no match in B'.

    • LEFT JOIN + WHERE matched_col IS NULL keeps only the unmatched left rows.

    • NOT EXISTS is NULL-safe and typically optimizes to an efficient anti-join.

    • NOT IN breaks (returns nothing) if the subquery yields a single NULL — guard with WHERE col IS NOT NULL.

    Red flag Using `NOT IN (SELECT cust_id FROM orders)` when `cust_id` can be NULL — one NULL makes the predicate UNKNOWN for every row and the query returns nothing.

    source: LeetCode 183 — Customers Who Never Order ↗
  • Commonly asked mid concept common What's the difference between joining in the ON clause versus filtering in WHERE for an INNER JOIN — does it matter?

    For an INNER JOIN, a predicate produces the same result whether you put it in ON or WHERE — both filter the matched set, and the optimizer treats them equivalently.

    For OUTER joins it matters enormously: an ON condition decides which rows *match* (unmatched outer rows are still kept and padded with NULL), while a WHERE condition filters the *final* result *after* the NULLs are added — which can erase the preserved rows. So the safe habit is: join keys and match conditions in ON; result-set filters in WHERE; and remember the distinction only collapses for INNER joins.

    What a strong answer covers
    • INNER JOIN: ON vs WHERE give identical results — equivalent to the optimizer.

    • OUTER JOIN: ON affects *matching* (preserves unmatched rows); WHERE filters *after* padding.

    • Best practice: put the relationship/keys in ON, post-join filters in WHERE.

    • The 'it doesn't matter' rule applies *only* to inner joins.

    Quick self-check

    For `a INNER JOIN b ON a.id = b.aid AND b.active = true` vs `a INNER JOIN b ON a.id = b.aid WHERE b.active = true`, the results are…

    Red flag Over-generalizing 'ON and WHERE are the same' from inner joins to outer joins, where they produce different result sets.

    source: Use The Index, Luke! — Join Operations ↗
  • Commonly asked senior debug common An index exists on the join column of one table but the JOIN is still slow. What index considerations apply to joins?

    For a nested-loop join the engine iterates the outer table and probes the inner table once per row, so the index that matters is on the inner table's join column — the side being looked up. If only the outer table's column is indexed, each probe still scans the inner table.

    Checklist: (1) index the inner/probed side's join key; (2) make the join columns the same type — an implicit cast (e.g. int vs varchar) makes the predicate non-sargable and skips the index; (3) for big unindexed equi-joins a hash join may be the right plan, not a fix; (4) read EXPLAIN to see whether it chose nested-loop vs hash and whether the index is actually used.

    What a strong answer covers
    • Nested-loop joins need the index on the inner (probed) table's join column.

    • Mismatched column types force an implicit cast -> non-sargable -> index ignored.

    • Both join columns should share a type and, ideally, collation.

    • A hash join on large unindexed inputs can be the correct plan, not a bug.

    • Use EXPLAIN to confirm which join algorithm and index the planner actually picked.

    Red flag Indexing only the driving (outer) table and expecting fast probes, or joining columns of different types and silently losing the index to an implicit cast.

    source: Use The Index, Luke! — Nested Loops / indexing joins ↗
  • Commonly asked senior debug common Why does summing a measure go wrong after joining two one-to-many tables, and how do you fix the double-counting?

    Joining a parent to two child tables (orders has many line_items *and* many payments) creates a Cartesian fan-out: each order's rows = items x payments. Now SUM(payment.amount) is multiplied by the number of line items, and SUM(item.qty) is multiplied by the number of payments — every total is inflated.

    Fix by pre-aggregating each child to the parent's grain in its own subquery/CTE before joining: WITH it AS (SELECT order_id, SUM(qty) q FROM items GROUP BY order_id), pm AS (SELECT order_id, SUM(amount) a FROM payments GROUP BY order_id) SELECT … FROM orders o LEFT JOIN it … LEFT JOIN pm …. Each child is now one row per order, so no fan-out. Always know the grain of each table you join.

    What a strong answer covers
    • Joining one parent to two one-to-many children multiplies rows (items x payments).

    • Aggregates over the fanned-out rows double/triple-count.

    • Fix: pre-aggregate each child to the parent grain in separate CTEs/subqueries, *then* join.

    • COUNT(DISTINCT …) can patch a single measure but doesn't fix multiple measures cleanly.

    • Track the 'grain' (one row per what?) at every join step.

    Red flag Joining several one-to-many tables in one flat query and trusting SUM — the totals are inflated by the cross-product of the child rows.

    source: StrataScratch — SQL JOIN Interview Questions ↗
  • Commonly asked mid trick occasional Using a USING clause or NATURAL JOIN instead of ON — what are they and why are they risky?

    JOIN … USING (col) joins on equally-named columns and merges them into one output column (so you write col, not a.col). NATURAL JOIN goes further and joins on all identically-named columns automatically, with no ON/USING at all.

    USING is fine and concise. NATURAL JOIN is dangerous: adding an unrelated same-named column later (a created_at or id on both tables) silently changes the join key and corrupts results with no error. Most style guides ban NATURAL JOIN and prefer an explicit ON (or USING) so the join condition is visible and stable against schema changes.

    What a strong answer covers
    • USING (col) joins on a shared column name and collapses it to a single output column.

    • NATURAL JOIN auto-joins on *every* commonly-named column — implicit and fragile.

    • A later schema change (new same-named column) silently alters a NATURAL JOIN's keys.

    • Prefer explicit ON; USING is acceptable, NATURAL JOIN is widely discouraged.

    Quick self-check

    Why do most style guides discourage NATURAL JOIN?

    Red flag Relying on NATURAL JOIN, then having a future migration add a same-named column that silently joins on it and quietly changes the result set.

    source: PostgreSQL docs — Joined Tables (USING and NATURAL) ↗