3.2 ★ core [J][I] 16 interview Q's

JOINs

All six JOIN types, the difference between INNER and LEFT from memory, and the classic gotcha where a WHERE filter silently turns a LEFT JOIN into an INNER one.

A JOIN combines rows from two tables by matching a condition (the ON clause). The only thing that ever changes between JOIN types is what happens to rows that don’t find a match — everything else is the same. Hold onto that and the six variants collapse into one idea with a few switches.

The six JOIN types

Join	Keeps	Reach for it when	Watch out
INNER	rows matching in both	you only care about pairs that exist	silently hides unmatched rows
LEFT	all left + right matches	“all X, plus their Y if any”	filtering the right table in `WHERE`
RIGHT	all right + left matches	rarely — usually rewrite as a LEFT	readability; flip the table order instead
FULL OUTER	all rows, both sides	reconciling two sets (diffs)	MySQL lacks it — emulate with UNION
CROSS	every combination	generating combinations deliberately	10k × 10k = 100M rows
SELF	table joined to itself	hierarchies (employee → manager)	always alias both copies

The only real decision is how to treat non-matching rows.

The classic Venn view. Each pair of circles is the left table (L) and right table (R); the shaded region is what the JOIN keeps. INNER keeps only the overlap; LEFT adds all of L; RIGHT adds all of R; FULL keeps everything. The unshaded crescents are the rows a NULL-padded outer join preserves and an inner join discards.

INNER vs LEFT — the one you must know cold

customers ▸ orders

Given customers(id, name) and orders(id, customer_id), list every customer and how many orders they’ve placed — including customers with none. That word “including” is the tell: you need a LEFT JOIN.

SELECT  c.name,
        COUNT(o.id) AS order_count
FROM        customers c
LEFT JOIN   orders    o ON o.customer_id = c.id
GROUP BY    c.name
ORDER BY    order_count DESC;

An INNER JOIN here would silently drop every customer who never ordered — a classic off-by-a-whole-segment reporting bug. COUNT(o.id) (not COUNT(*)) correctly reports 0 for those customers, because o.id is NULL on the unmatched rows and COUNT ignores NULL.

The gotcha that turns LEFT into INNER

Putting a condition on the right table in WHERE quietly destroys the outer join — the NULL rows you wanted are filtered right back out:

-- BROKEN: customers with no 'shipped' order vanish entirely
... LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'shipped'

-- FIXED: move the condition into the ON clause
... LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped'

Rule of thumb: filters on the optional side belong in ON, not WHERE.

Self joins

A self join is just a normal join where both sides happen to be the same table, distinguished by aliases. The canonical case is a hierarchy:

SELECT  e.name      AS employee,
        m.name      AS manager
FROM        employees e
LEFT JOIN   employees m ON m.id = e.manager_id;   -- LEFT keeps the CEO (no manager)

Against the seed, that returns all eight employees; only Grace Hopper has a NULL manager (she has no manager_id). Swap the LEFT for an INNER and Grace vanishes — the exact bug from the pitfall below, in miniature.

From the trenches

The “LEFT JOIN that silently became an INNER JOIN” is the single most common JOIN bug in production reporting code, and it’s nasty precisely because the query works — it just returns fewer rows than it should. A team ships a “customers and their recent orders” report as a LEFT JOIN, correctly including customers with zero orders. Months later someone adds a filter — WHERE o.status = 'shipped' — to scope it to fulfilled orders, dropping it in the WHERE clause where every other filter lives. That one line evaluates o.status on the NULL-padded rows, and since NULL = 'shipped' is never true, every zero-order customer is filtered straight back out. The “customers with no orders” the report was built to surface silently disappear. No error, no crash — just a churn dashboard that stops counting the very customers you most need to see. The fix is one keyword of placement: move the condition into ON (LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped'), where it filters during the join and leaves the unmatched left rows intact.

You can try all of these live in the SQL playground below once you’ve worked the objectives.

01 Learning objectives

0 / 7 done

02 Curated reading

Coding Horror — A Visual Explanation of SQL Joins
essential 8m — The classic Venn-diagram mental model.
PG Exercises — JOINs
essential 30m — Drill JOINs against a real schema.

03 SQL challenges

SQL challenges · ecommerce.db

0/4 solved

SQL-06List each order's id alongside the name of the customer who placed it.

04 Knowledge check

knowledge check5 questions · pass ≥ 70%

01easy
An INNER JOIN returns:
02easy
A CROSS JOIN of a 1,000-row and a 1,000-row table yields 1,000,000 rows.
03medium
You LEFT JOIN customers→orders and add `WHERE orders.status='shipped'`. Unexpectedly, customers with no orders disappear. Why?
04medium
To list employees alongside their manager (same table), you use a:
05hard
table_a has 3 rows; for a given a, b has 0, 1, 2 matches respectively. How many rows does `a LEFT JOIN b` produce?

05 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

Amazon junior concept very common What's the difference between an INNER JOIN and a LEFT JOIN, and what's the classic LEFT JOIN bug?
INNER JOIN keeps only rows that match in both tables; LEFT JOIN keeps all left rows, filling NULL where the right side has no match.
The bug: a WHERE predicate on a *right-table* column silently turns a LEFT JOIN into an INNER JOIN, because NULL fails the filter and those unmatched rows vanish. Fix by moving the condition into the ON clause: LEFT JOIN orders o ON o.cust_id = c.id AND o.status = 'paid' keeps customers with no paid order.
Follow-ups they push on
- Where do you put a filter on the *left* table — does it matter?
- Emulate a FULL OUTER JOIN in MySQL, which lacks it.
Red flag Saying LEFT JOIN 'returns more rows' instead of 'preserves unmatched left rows', and not catching the WHERE-vs-ON filter trap.
source: DataLemur — SQL Interview Questions ↗
Amazon mid coding common Identify the top two highest-grossing products within each category in 2022, returning category, product, and total spend.
Aggregate spend per (category, product), rank within each category, keep rank <= 2:
WITH g AS (SELECT category, product, SUM(spend) AS total FROM product_spend WHERE EXTRACT(YEAR FROM tx_date) = 2022 GROUP BY category, product), r AS (SELECT *, RANK() OVER (PARTITION BY category ORDER BY total DESC) AS rk FROM g) SELECT category, product, total FROM r WHERE rk <= 2;
This is the 'top-N-per-group' pattern: GROUP BY for the metric, a window RANK to rank within partitions.
Follow-ups they push on
- RANK vs DENSE_RANK vs ROW_NUMBER for breaking ties on equal spend?
- Why can't you filter on the window function in the same SELECT's WHERE?
Red flag Using a global ORDER BY + LIMIT 2 (gives the top 2 overall, not per category), or referencing the window alias in WHERE instead of wrapping it in a CTE/subquery.
source: DataLemur — Amazon 'Highest-Grossing Items' ↗
Commonly asked mid coding common Write a self-join to list each employee alongside their manager's name from an employees(id, name, manager_id) table.
Join the table to itself with two aliases:
SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;
Use LEFT JOIN (not INNER) so the CEO, whose manager_id is NULL, still appears with a NULL manager. Aliases (e, m) are mandatory to disambiguate the two copies.
Follow-ups they push on
- How would you go more than one level up (whole chain to the CEO)?
- Recursive CTE for an arbitrary-depth org chart?
Red flag Using INNER JOIN and silently dropping the top-level employee, or forgetting aliases so the columns are ambiguous.
source: PG Exercises — JOINs ↗
Commonly asked mid coding occasional MySQL has no FULL OUTER JOIN. How do you emulate one?
Take the union of a LEFT JOIN and a RIGHT JOIN:
SELECT * FROM a LEFT JOIN b ON a.id = b.id UNION SELECT * FROM a RIGHT JOIN b ON a.id = b.id;
The LEFT half gives all of a plus matches; the RIGHT half gives all of b plus matches; UNION (not UNION ALL) dedups the rows that matched on both sides.
Follow-ups they push on
- Why UNION and not UNION ALL here?
- How to find rows present in exactly one side (anti-join / symmetric difference)?
Red flag Using UNION ALL and double-counting matched rows, or assuming MySQL silently supports FULL OUTER JOIN.
source: PostgreSQL docs — Joins (table expressions) ↗
Amazon mid coding occasional Find products that exist in Amazon's catalog but NOT in the partner's catalog (an anti-join).
Three idiomatic ways; the LEFT JOIN … IS NULL anti-join is the workhorse:
SELECT a.product FROM amazon a LEFT JOIN partner p ON a.product = p.product WHERE p.product IS NULL;
Alternatives: NOT EXISTS (SELECT 1 FROM partner p WHERE p.product = a.product) (NULL-safe, often the planner's favourite) or EXCEPT. Prefer NOT EXISTS over NOT IN when the right column can be NULL.
Follow-ups they push on
- Why is NOT IN dangerous when the subquery may return a NULL?
- Performance: NOT EXISTS vs LEFT JOIN/IS NULL vs EXCEPT?
Red flag Using `NOT IN` with a nullable column (a single NULL makes the whole predicate return no rows), or forgetting the `IS NULL` filter in the LEFT-JOIN form.
source: StrataScratch — Amazon 'Exclusive Amazon Products' ↗
Commonly asked mid concept common Why can a JOIN return more rows than either input table, and how do you avoid accidental row explosion?
A join multiplies rows wherever the join key is not unique on the other side: if one customer has 3 orders, joining customers->orders yields 3 rows for that customer. A many-to-many join multiplies both sides — fan-out.
This silently corrupts aggregates: SUM(amount) double-counts if you joined in a second one-to-many table first. Guard against it by joining on unique/PK columns, pre-aggregating one side in a CTE before joining, or checking the grain of every join.
Follow-ups they push on
- What is the 'grain' of a result set and why track it?
- A CROSS JOIN of 10k x 10k rows — how many rows, and when is that intentional?
Red flag Blaming 'duplicate data' when the real cause is joining on a non-unique key, or summing a measure after a fan-out join and reporting inflated totals.
source: StrataScratch — Amazon SQL Interview Questions ↗
Commonly asked senior concept occasional Explain the three physical join algorithms (nested loop, hash join, merge join) and when a planner picks each.
Nested loop: for each outer row, probe the inner table — great when one side is tiny or there's an index on the inner join key; O(n*m) without an index.
Hash join: build a hash table on the smaller input's key, probe with the larger — best for large, unindexed equality joins; needs memory and only does equi-joins.
Merge join: sort both inputs on the key, then walk them in lockstep — wins when inputs are already sorted (e.g. from an index) or for range conditions.
The planner chooses by estimated row counts and available indexes; you see them in EXPLAIN.
Follow-ups they push on
- Why can't a hash join serve `a.x < b.y`?
- How does a missing index push a join from nested-loop to a costly hash join?
Red flag Thinking the JOIN keyword maps to one fixed algorithm — the optimizer picks the physical operator based on stats and indexes.
source: PostgreSQL docs — Planner / Optimizer (join methods) ↗
Meta senior coding occasional For each Friday, count the total likes a post received from the poster's friends, where the like happened after the post was created.
Friendship is usually stored one-directional, so first symmetrize it with UNION ALL of (a,b) and (b,a). Then join posts to that friend list and to likes, requiring the liker to be a friend and like_ts > post_ts, and filter to Fridays:
... WHERE EXTRACT(DOW FROM like_date) = 5 AND like_ts > post_ts then GROUP BY like_date. Use COUNT(DISTINCT …) if a friend could like the same post twice.
This is a Meta-style 'SQL as a tool for product reasoning' question: the schema modelling (bidirectional friendship, temporal ordering) is the real test.
Follow-ups they push on
- Why UNION ALL rather than UNION when symmetrizing friendships?
- How does the day-of-week number differ across MySQL/Postgres?
Red flag Treating friendship as already bidirectional and undercounting, or forgetting the `like_ts > post_ts` temporal guard.
source: StrataScratch — Meta "Friday's Likes Count" ↗
Commonly asked junior concept occasional What does a CROSS JOIN do, and name a legitimate use for it.
A CROSS JOIN produces the Cartesian product — every row of A paired with every row of B (n*m rows, no ON clause). 10k x 10k = 100M rows, so it's usually a bug from a missing join condition.
Legit uses: generating a complete grid (every store x every day to fill gaps for a report), pairing each row against a small constants/calendar table, or building combinations. Often written deliberately as CROSS JOIN generate_series(...).
Follow-ups they push on
- How does an unintended CROSS JOIN usually sneak in?
- Difference between CROSS JOIN and an INNER JOIN with `ON 1=1`?
Red flag Not recognizing that a comma-join with no WHERE join condition is effectively a CROSS JOIN that explodes row counts.
source: PostgreSQL docs — Joined Tables (CROSS JOIN) ↗
★ must-know Commonly asked junior concept very common Walk through INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOIN — what rows does each keep?
All join two tables on a predicate; they differ in which unmatched rows survive:
- INNER — only rows that match on both sides.
- LEFT (OUTER) — all left rows; right columns are NULL where no match.
- RIGHT (OUTER) — all right rows; left columns NULL where no match (a LEFT JOIN with the tables swapped).
- FULL OUTER — all rows from both sides; NULL on whichever side is missing.
- CROSS — every left row paired with every right row (Cartesian product, no ON).
Mental model: INNER is the intersection, LEFT/RIGHT keep one side whole, FULL keeps the union, CROSS multiplies.
What a strong answer covers
- INNER = matches only; the unmatched rows on both sides are dropped.
- LEFT keeps all left rows; RIGHT keeps all right rows (mirror images).
- FULL OUTER keeps unmatched rows from both sides, padding the missing side with NULL.
- CROSS produces n*m rows with no join condition.
- RIGHT JOIN is rarely written by hand — people flip the tables and use LEFT for readability.
Quick self-check
`customers` has 10 rows; `orders` has 4 rows, all belonging to just 2 of those customers (one order each... wait, 4 orders across 2 customers). How many rows does `customers LEFT JOIN orders ON customers.id = orders.cust_id` return?
Follow-ups they push on
- Which join would you use to find rows present in A but missing in B?
- How do you reproduce a FULL OUTER JOIN in MySQL, which lacks it?
Red flag Describing LEFT/RIGHT as 'returns more rows' rather than 'preserves the unmatched rows of one side', or thinking CROSS JOIN needs an ON clause.
source: PostgreSQL docs — Joined Tables (join types) ↗
Commonly asked mid debug very common A LEFT JOIN with `WHERE right_table.col = 'x'` returns fewer rows than expected. What happened, and what's the fix?
The WHERE on a right-table column silently demotes the LEFT JOIN to an INNER JOIN. Unmatched left rows have NULL in right_table.col, and NULL = 'x' is UNKNOWN, so WHERE discards exactly the rows the LEFT JOIN was meant to preserve.
Fix: move the predicate into the ON clause — LEFT JOIN r ON r.fk = l.id AND r.col = 'x' — so it filters which right rows *match* without dropping unmatched left rows. The rule: conditions that should *preserve* the outer side go in ON; conditions that should *filter the final result* go in WHERE. (A WHERE right.col IS NULL is the deliberate exception — that's the anti-join idiom.)
What a strong answer covers
- A WHERE predicate on the null-able (right) side turns LEFT JOIN into INNER JOIN.
- Cause: NULL = 'x' evaluates to UNKNOWN, so the padded unmatched rows are filtered out.
- Fix: put the right-side condition in ON, not WHERE.
- ON controls matching (preserves the outer side); WHERE filters the joined result.
- Exception: WHERE right.col IS NULL is intentional — it's the anti-join pattern.
Quick self-check
You want every customer plus their 2024 orders (customers with no 2024 order should still appear). Which is correct?
Follow-ups they push on
- Why is a predicate on the LEFT (preserved) table the same in ON or WHERE here?
- How does this differ for an INNER JOIN, where ON vs WHERE are interchangeable?
Red flag Putting a right-table filter in WHERE and not realizing you've turned an outer join into an inner join, losing the unmatched rows you wanted.
source: PostgreSQL docs — Joined Tables (ON vs WHERE for outer joins) ↗
Commonly asked mid coding very common Find customers who have never placed an order — and explain three ways to write it.
This is the canonical anti-join. Three idioms:
1. LEFT JOIN / IS NULL: SELECT c.name FROM customers c LEFT JOIN orders o ON o.cust_id = c.id WHERE o.id IS NULL;
2. NOT EXISTS (usually the planner's favourite, and NULL-safe): SELECT name FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
3. NOT IN — works *only* if the subquery column can't be NULL: WHERE c.id NOT IN (SELECT cust_id FROM orders WHERE cust_id IS NOT NULL);
Prefer NOT EXISTS for safety and performance; reach for LEFT JOIN … IS NULL when you also need columns from the joined table.
What a strong answer covers
- Anti-join = 'rows in A with no match in B'.
- LEFT JOIN + WHERE matched_col IS NULL keeps only the unmatched left rows.
- NOT EXISTS is NULL-safe and typically optimizes to an efficient anti-join.
- NOT IN breaks (returns nothing) if the subquery yields a single NULL — guard with WHERE col IS NOT NULL.
Follow-ups they push on
- Why is NOT EXISTS safer than NOT IN here?
- Which form lets you also return data from the orders table?
Red flag Using `NOT IN (SELECT cust_id FROM orders)` when `cust_id` can be NULL — one NULL makes the predicate UNKNOWN for every row and the query returns nothing.
source: LeetCode 183 — Customers Who Never Order ↗
Commonly asked mid concept common What's the difference between joining in the ON clause versus filtering in WHERE for an INNER JOIN — does it matter?
For an INNER JOIN, a predicate produces the same result whether you put it in ON or WHERE — both filter the matched set, and the optimizer treats them equivalently.
For OUTER joins it matters enormously: an ON condition decides which rows *match* (unmatched outer rows are still kept and padded with NULL), while a WHERE condition filters the *final* result *after* the NULLs are added — which can erase the preserved rows. So the safe habit is: join keys and match conditions in ON; result-set filters in WHERE; and remember the distinction only collapses for INNER joins.
What a strong answer covers
- INNER JOIN: ON vs WHERE give identical results — equivalent to the optimizer.
- OUTER JOIN: ON affects *matching* (preserves unmatched rows); WHERE filters *after* padding.
- Best practice: put the relationship/keys in ON, post-join filters in WHERE.
- The 'it doesn't matter' rule applies *only* to inner joins.
Quick self-check
For `a INNER JOIN b ON a.id = b.aid AND b.active = true` vs `a INNER JOIN b ON a.id = b.aid WHERE b.active = true`, the results are…
Follow-ups they push on
- Show a case where moving a predicate from WHERE to ON changes a LEFT JOIN's output.
- Does the optimizer reorder ON vs WHERE predicates for an inner join?
Red flag Over-generalizing 'ON and WHERE are the same' from inner joins to outer joins, where they produce different result sets.
source: Use The Index, Luke! — Join Operations ↗
Commonly asked senior debug common An index exists on the join column of one table but the JOIN is still slow. What index considerations apply to joins?
For a nested-loop join the engine iterates the outer table and probes the inner table once per row, so the index that matters is on the inner table's join column — the side being looked up. If only the outer table's column is indexed, each probe still scans the inner table.
Checklist: (1) index the inner/probed side's join key; (2) make the join columns the same type — an implicit cast (e.g. int vs varchar) makes the predicate non-sargable and skips the index; (3) for big unindexed equi-joins a hash join may be the right plan, not a fix; (4) read EXPLAIN to see whether it chose nested-loop vs hash and whether the index is actually used.
What a strong answer covers
- Nested-loop joins need the index on the inner (probed) table's join column.
- Mismatched column types force an implicit cast -> non-sargable -> index ignored.
- Both join columns should share a type and, ideally, collation.
- A hash join on large unindexed inputs can be the correct plan, not a bug.
- Use EXPLAIN to confirm which join algorithm and index the planner actually picked.
Follow-ups they push on
- Which table's column should carry the index in a nested-loop join?
- How does an int-vs-varchar join key defeat an index?
Red flag Indexing only the driving (outer) table and expecting fast probes, or joining columns of different types and silently losing the index to an implicit cast.
source: Use The Index, Luke! — Nested Loops / indexing joins ↗
Commonly asked senior debug common Why does summing a measure go wrong after joining two one-to-many tables, and how do you fix the double-counting?
Joining a parent to two child tables (orders has many line_items *and* many payments) creates a Cartesian fan-out: each order's rows = items x payments. Now SUM(payment.amount) is multiplied by the number of line items, and SUM(item.qty) is multiplied by the number of payments — every total is inflated.
Fix by pre-aggregating each child to the parent's grain in its own subquery/CTE before joining: WITH it AS (SELECT order_id, SUM(qty) q FROM items GROUP BY order_id), pm AS (SELECT order_id, SUM(amount) a FROM payments GROUP BY order_id) SELECT … FROM orders o LEFT JOIN it … LEFT JOIN pm …. Each child is now one row per order, so no fan-out. Always know the grain of each table you join.
What a strong answer covers
- Joining one parent to two one-to-many children multiplies rows (items x payments).
- Aggregates over the fanned-out rows double/triple-count.
- Fix: pre-aggregate each child to the parent grain in separate CTEs/subqueries, *then* join.
- COUNT(DISTINCT …) can patch a single measure but doesn't fix multiple measures cleanly.
- Track the 'grain' (one row per what?) at every join step.
Follow-ups they push on
- Why doesn't COUNT(DISTINCT) fully solve it when you need two sums?
- What is the 'grain' of a result set and how do you reason about it?
Red flag Joining several one-to-many tables in one flat query and trusting SUM — the totals are inflated by the cross-product of the child rows.
source: StrataScratch — SQL JOIN Interview Questions ↗
Commonly asked mid trick occasional Using a USING clause or NATURAL JOIN instead of ON — what are they and why are they risky?
JOIN … USING (col) joins on equally-named columns and merges them into one output column (so you write col, not a.col). NATURAL JOIN goes further and joins on all identically-named columns automatically, with no ON/USING at all.
USING is fine and concise. NATURAL JOIN is dangerous: adding an unrelated same-named column later (a created_at or id on both tables) silently changes the join key and corrupts results with no error. Most style guides ban NATURAL JOIN and prefer an explicit ON (or USING) so the join condition is visible and stable against schema changes.
What a strong answer covers
- USING (col) joins on a shared column name and collapses it to a single output column.
- NATURAL JOIN auto-joins on *every* commonly-named column — implicit and fragile.
- A later schema change (new same-named column) silently alters a NATURAL JOIN's keys.
- Prefer explicit ON; USING is acceptable, NATURAL JOIN is widely discouraged.
Quick self-check
Why do most style guides discourage NATURAL JOIN?
Follow-ups they push on
- How does USING change which columns appear in `SELECT *`?
- Why can adding a column break an existing NATURAL JOIN with no error?
Red flag Relying on NATURAL JOIN, then having a future migration add a same-named column that silently joins on it and quietly changes the result set.
source: PostgreSQL docs — Joined Tables (USING and NATURAL) ↗