JOINs
All six JOIN types, the difference between INNER and LEFT from memory, and the classic gotcha where a WHERE filter silently turns a LEFT JOIN into an INNER one.
A JOIN combines rows from two tables by matching a condition (the ON clause).
The only thing that ever changes between JOIN types is what happens to rows that
don’t find a match — everything else is the same. Hold onto that and the six
variants collapse into one idea with a few switches.
The six JOIN types
| Join | Keeps | Reach for it when | Watch out |
|---|---|---|---|
| INNER | rows matching in both | you only care about pairs that exist | silently hides unmatched rows |
| LEFT | all left + right matches | “all X, plus their Y if any” | filtering the right table in WHERE |
| RIGHT | all right + left matches | rarely — usually rewrite as a LEFT | readability; flip the table order instead |
| FULL OUTER | all rows, both sides | reconciling two sets (diffs) | MySQL lacks it — emulate with UNION |
| CROSS | every combination | generating combinations deliberately | 10k × 10k = 100M rows |
| SELF | table joined to itself | hierarchies (employee → manager) | always alias both copies |
INNER vs LEFT — the one you must know cold
Given customers(id, name) and orders(id, customer_id), list every customer
and how many orders they’ve placed — including customers with none. That word
“including” is the tell: you need a LEFT JOIN.
SELECT c.name,
COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY order_count DESC;An INNER JOIN here would silently drop every customer who never ordered — a
classic off-by-a-whole-segment reporting bug. COUNT(o.id) (not COUNT(*))
correctly reports 0 for those customers, because o.id is NULL on the
unmatched rows and COUNT ignores NULL.
Self joins
A self join is just a normal join where both sides happen to be the same table, distinguished by aliases. The canonical case is a hierarchy:
SELECT e.name AS employee,
m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.id = e.manager_id; -- LEFT keeps the CEO (no manager)
Against the seed, that returns all eight employees; only Grace Hopper has a NULL
manager (she has no manager_id). Swap the LEFT for an INNER and Grace vanishes
— the exact bug from the pitfall below, in miniature.
You can try all of these live in the SQL playground below once you’ve worked the objectives.
01 Learning objectives
0 / 7 done02 Curated reading
03 SQL challenges
SQL-06List each order's id alongside the name of the customer who placed it.
04 Knowledge check
- 01easy
An INNER JOIN returns:
- 02easy
A CROSS JOIN of a 1,000-row and a 1,000-row table yields 1,000,000 rows.
- 03medium
You LEFT JOIN customers→orders and add `WHERE orders.status='shipped'`. Unexpectedly, customers with no orders disappear. Why?
- 04medium
To list employees alongside their manager (same table), you use a:
- 05hard
table_a has 3 rows; for a given a, b has 0, 1, 2 matches respectively. How many rows does `a LEFT JOIN b` produce?
05 Interview questions
browse all ↗What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.
-
What's the difference between an INNER JOIN and a LEFT JOIN, and what's the classic LEFT JOIN bug?
INNER JOINkeeps only rows that match in both tables;LEFT JOINkeeps all left rows, fillingNULLwhere the right side has no match.The bug: a
WHEREpredicate on a *right-table* column silently turns a LEFT JOIN into an INNER JOIN, becauseNULLfails the filter and those unmatched rows vanish. Fix by moving the condition into theONclause:LEFT JOIN orders o ON o.cust_id = c.id AND o.status = 'paid'keeps customers with no paid order.Follow-ups they push on- Where do you put a filter on the *left* table — does it matter?
- Emulate a FULL OUTER JOIN in MySQL, which lacks it.
Red flag Saying LEFT JOIN 'returns more rows' instead of 'preserves unmatched left rows', and not catching the WHERE-vs-ON filter trap.
source: DataLemur — SQL Interview Questions ↗ -
Identify the top two highest-grossing products within each category in 2022, returning category, product, and total spend.
Aggregate spend per (category, product), rank within each category, keep rank <= 2:
WITH g AS (SELECT category, product, SUM(spend) AS total FROM product_spend WHERE EXTRACT(YEAR FROM tx_date) = 2022 GROUP BY category, product), r AS (SELECT *, RANK() OVER (PARTITION BY category ORDER BY total DESC) AS rk FROM g) SELECT category, product, total FROM r WHERE rk <= 2;This is the 'top-N-per-group' pattern: GROUP BY for the metric, a window RANK to rank within partitions.
Follow-ups they push on- RANK vs DENSE_RANK vs ROW_NUMBER for breaking ties on equal spend?
- Why can't you filter on the window function in the same SELECT's WHERE?
Red flag Using a global ORDER BY + LIMIT 2 (gives the top 2 overall, not per category), or referencing the window alias in WHERE instead of wrapping it in a CTE/subquery.
source: DataLemur — Amazon 'Highest-Grossing Items' ↗ -
Write a self-join to list each employee alongside their manager's name from an employees(id, name, manager_id) table.
Join the table to itself with two aliases:
SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;Use
LEFT JOIN(not INNER) so the CEO, whosemanager_idis NULL, still appears with a NULL manager. Aliases (e,m) are mandatory to disambiguate the two copies.Follow-ups they push on- How would you go more than one level up (whole chain to the CEO)?
- Recursive CTE for an arbitrary-depth org chart?
Red flag Using INNER JOIN and silently dropping the top-level employee, or forgetting aliases so the columns are ambiguous.
source: PG Exercises — JOINs ↗ -
MySQL has no FULL OUTER JOIN. How do you emulate one?
Take the union of a LEFT JOIN and a RIGHT JOIN:
SELECT * FROM a LEFT JOIN b ON a.id = b.id UNION SELECT * FROM a RIGHT JOIN b ON a.id = b.id;The
LEFThalf gives all ofaplus matches; theRIGHThalf gives all ofbplus matches;UNION(not UNION ALL) dedups the rows that matched on both sides.Follow-ups they push on- Why UNION and not UNION ALL here?
- How to find rows present in exactly one side (anti-join / symmetric difference)?
Red flag Using UNION ALL and double-counting matched rows, or assuming MySQL silently supports FULL OUTER JOIN.
source: PostgreSQL docs — Joins (table expressions) ↗ -
Find products that exist in Amazon's catalog but NOT in the partner's catalog (an anti-join).
Three idiomatic ways; the
LEFT JOIN … IS NULLanti-join is the workhorse:SELECT a.product FROM amazon a LEFT JOIN partner p ON a.product = p.product WHERE p.product IS NULL;Alternatives:
NOT EXISTS (SELECT 1 FROM partner p WHERE p.product = a.product)(NULL-safe, often the planner's favourite) orEXCEPT. PreferNOT EXISTSoverNOT INwhen the right column can be NULL.Follow-ups they push on- Why is NOT IN dangerous when the subquery may return a NULL?
- Performance: NOT EXISTS vs LEFT JOIN/IS NULL vs EXCEPT?
Red flag Using `NOT IN` with a nullable column (a single NULL makes the whole predicate return no rows), or forgetting the `IS NULL` filter in the LEFT-JOIN form.
source: StrataScratch — Amazon 'Exclusive Amazon Products' ↗ -
Why can a JOIN return more rows than either input table, and how do you avoid accidental row explosion?
A join multiplies rows wherever the join key is not unique on the other side: if one customer has 3 orders, joining customers->orders yields 3 rows for that customer. A many-to-many join multiplies both sides — fan-out.
This silently corrupts aggregates:
SUM(amount)double-counts if you joined in a second one-to-many table first. Guard against it by joining on unique/PK columns, pre-aggregating one side in a CTE before joining, or checking the grain of every join.Follow-ups they push on- What is the 'grain' of a result set and why track it?
- A CROSS JOIN of 10k x 10k rows — how many rows, and when is that intentional?
Red flag Blaming 'duplicate data' when the real cause is joining on a non-unique key, or summing a measure after a fan-out join and reporting inflated totals.
source: StrataScratch — Amazon SQL Interview Questions ↗ -
Explain the three physical join algorithms (nested loop, hash join, merge join) and when a planner picks each.
Nested loop: for each outer row, probe the inner table — great when one side is tiny or there's an index on the inner join key; O(n*m) without an index.
Hash join: build a hash table on the smaller input's key, probe with the larger — best for large, unindexed equality joins; needs memory and only does equi-joins.
Merge join: sort both inputs on the key, then walk them in lockstep — wins when inputs are already sorted (e.g. from an index) or for range conditions.
The planner chooses by estimated row counts and available indexes; you see them in
EXPLAIN.Follow-ups they push on- Why can't a hash join serve `a.x < b.y`?
- How does a missing index push a join from nested-loop to a costly hash join?
Red flag Thinking the JOIN keyword maps to one fixed algorithm — the optimizer picks the physical operator based on stats and indexes.
source: PostgreSQL docs — Planner / Optimizer (join methods) ↗ -
For each Friday, count the total likes a post received from the poster's friends, where the like happened after the post was created.
Friendship is usually stored one-directional, so first symmetrize it with
UNION ALLof (a,b) and (b,a). Then join posts to that friend list and to likes, requiring the liker to be a friend andlike_ts > post_ts, and filter to Fridays:... WHERE EXTRACT(DOW FROM like_date) = 5 AND like_ts > post_tsthenGROUP BY like_date. UseCOUNT(DISTINCT …)if a friend could like the same post twice.This is a Meta-style 'SQL as a tool for product reasoning' question: the schema modelling (bidirectional friendship, temporal ordering) is the real test.
Follow-ups they push on- Why UNION ALL rather than UNION when symmetrizing friendships?
- How does the day-of-week number differ across MySQL/Postgres?
Red flag Treating friendship as already bidirectional and undercounting, or forgetting the `like_ts > post_ts` temporal guard.
source: StrataScratch — Meta "Friday's Likes Count" ↗ -
What does a CROSS JOIN do, and name a legitimate use for it.
A
CROSS JOINproduces the Cartesian product — every row of A paired with every row of B (n*m rows, no ON clause). 10k x 10k = 100M rows, so it's usually a bug from a missing join condition.Legit uses: generating a complete grid (every store x every day to fill gaps for a report), pairing each row against a small constants/calendar table, or building combinations. Often written deliberately as
CROSS JOIN generate_series(...).Follow-ups they push on- How does an unintended CROSS JOIN usually sneak in?
- Difference between CROSS JOIN and an INNER JOIN with `ON 1=1`?
Red flag Not recognizing that a comma-join with no WHERE join condition is effectively a CROSS JOIN that explodes row counts.
source: PostgreSQL docs — Joined Tables (CROSS JOIN) ↗ -
Walk through INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOIN — what rows does each keep?
All join two tables on a predicate; they differ in which unmatched rows survive:
- INNER — only rows that match on both sides.
- LEFT (OUTER) — all left rows; right columns areNULLwhere no match.
- RIGHT (OUTER) — all right rows; left columnsNULLwhere no match (a LEFT JOIN with the tables swapped).
- FULL OUTER — all rows from both sides;NULLon whichever side is missing.
- CROSS — every left row paired with every right row (Cartesian product, noON).Mental model: INNER is the intersection, LEFT/RIGHT keep one side whole, FULL keeps the union, CROSS multiplies.
What a strong answer coversINNER = matches only; the unmatched rows on both sides are dropped.
LEFT keeps all left rows; RIGHT keeps all right rows (mirror images).
FULL OUTER keeps unmatched rows from both sides, padding the missing side with NULL.
CROSS produces n*m rows with no join condition.
RIGHT JOIN is rarely written by hand — people flip the tables and use LEFT for readability.
Quick self-check`customers` has 10 rows; `orders` has 4 rows, all belonging to just 2 of those customers (one order each... wait, 4 orders across 2 customers). How many rows does `customers LEFT JOIN orders ON customers.id = orders.cust_id` return?
-
No — the 2 customers with orders contribute one row per order, so they expand beyond a single row each.
-
Correct — 8 customers with no order give 8 rows (NULL order), plus 4 order rows for the matched customers = 12.
-
That's what an INNER JOIN returns; LEFT JOIN also keeps the 8 customers with no orders.
-
No — only the 8 unmatched customers add NULL rows; total is 8 + 4 = 12, not 14.
Follow-ups they push on- Which join would you use to find rows present in A but missing in B?
- How do you reproduce a FULL OUTER JOIN in MySQL, which lacks it?
Red flag Describing LEFT/RIGHT as 'returns more rows' rather than 'preserves the unmatched rows of one side', or thinking CROSS JOIN needs an ON clause.
source: PostgreSQL docs — Joined Tables (join types) ↗ -
A LEFT JOIN with `WHERE right_table.col = 'x'` returns fewer rows than expected. What happened, and what's the fix?
The
WHEREon a right-table column silently demotes the LEFT JOIN to an INNER JOIN. Unmatched left rows haveNULLinright_table.col, andNULL = 'x'isUNKNOWN, soWHEREdiscards exactly the rows the LEFT JOIN was meant to preserve.Fix: move the predicate into the
ONclause —LEFT JOIN r ON r.fk = l.id AND r.col = 'x'— so it filters which right rows *match* without dropping unmatched left rows. The rule: conditions that should *preserve* the outer side go inON; conditions that should *filter the final result* go inWHERE. (AWHERE right.col IS NULLis the deliberate exception — that's the anti-join idiom.)What a strong answer coversA WHERE predicate on the null-able (right) side turns LEFT JOIN into INNER JOIN.
Cause:
NULL = 'x'evaluates to UNKNOWN, so the padded unmatched rows are filtered out.Fix: put the right-side condition in
ON, notWHERE.ONcontrols matching (preserves the outer side);WHEREfilters the joined result.Exception:
WHERE right.col IS NULLis intentional — it's the anti-join pattern.
Quick self-checkYou want every customer plus their 2024 orders (customers with no 2024 order should still appear). Which is correct?
-
Wrong — the WHERE on o.year drops customers with no 2024 order (their o.year is NULL), making it effectively an INNER JOIN.
-
Correct — moving the year condition into ON keeps all customers, attaching only 2024 orders (NULL where none).
-
Wrong — INNER JOIN drops customers who have no 2024 order entirely.
-
Close but flawed — it keeps customers with no orders at all, yet still drops customers whose only orders are non-2024.
Follow-ups they push on- Why is a predicate on the LEFT (preserved) table the same in ON or WHERE here?
- How does this differ for an INNER JOIN, where ON vs WHERE are interchangeable?
Red flag Putting a right-table filter in WHERE and not realizing you've turned an outer join into an inner join, losing the unmatched rows you wanted.
source: PostgreSQL docs — Joined Tables (ON vs WHERE for outer joins) ↗ -
Find customers who have never placed an order — and explain three ways to write it.
This is the canonical anti-join. Three idioms:
1. LEFT JOIN / IS NULL:
SELECT c.name FROM customers c LEFT JOIN orders o ON o.cust_id = c.id WHERE o.id IS NULL;
2. NOT EXISTS (usually the planner's favourite, and NULL-safe):SELECT name FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
3. NOT IN — works *only* if the subquery column can't be NULL:WHERE c.id NOT IN (SELECT cust_id FROM orders WHERE cust_id IS NOT NULL);Prefer
NOT EXISTSfor safety and performance; reach forLEFT JOIN … IS NULLwhen you also need columns from the joined table.What a strong answer coversAnti-join = 'rows in A with no match in B'.
LEFT JOIN +
WHERE matched_col IS NULLkeeps only the unmatched left rows.NOT EXISTSis NULL-safe and typically optimizes to an efficient anti-join.NOT INbreaks (returns nothing) if the subquery yields a single NULL — guard withWHERE col IS NOT NULL.
Follow-ups they push on- Why is NOT EXISTS safer than NOT IN here?
- Which form lets you also return data from the orders table?
Red flag Using `NOT IN (SELECT cust_id FROM orders)` when `cust_id` can be NULL — one NULL makes the predicate UNKNOWN for every row and the query returns nothing.
source: LeetCode 183 — Customers Who Never Order ↗ -
What's the difference between joining in the ON clause versus filtering in WHERE for an INNER JOIN — does it matter?
For an INNER JOIN, a predicate produces the same result whether you put it in
ONorWHERE— both filter the matched set, and the optimizer treats them equivalently.For OUTER joins it matters enormously: an
ONcondition decides which rows *match* (unmatched outer rows are still kept and padded with NULL), while aWHEREcondition filters the *final* result *after* the NULLs are added — which can erase the preserved rows. So the safe habit is: join keys and match conditions inON; result-set filters inWHERE; and remember the distinction only collapses for INNER joins.What a strong answer coversINNER JOIN: ON vs WHERE give identical results — equivalent to the optimizer.
OUTER JOIN: ON affects *matching* (preserves unmatched rows); WHERE filters *after* padding.
Best practice: put the relationship/keys in ON, post-join filters in WHERE.
The 'it doesn't matter' rule applies *only* to inner joins.
Quick self-checkFor `a INNER JOIN b ON a.id = b.aid AND b.active = true` vs `a INNER JOIN b ON a.id = b.aid WHERE b.active = true`, the results are…
-
Correct — for an INNER JOIN the extra predicate filters the matched set the same way in either clause.
-
No — that's the OUTER-join behavior; for INNER joins both forms are equivalent.
-
No — both filter to the same matched rows for an inner join.
-
No — the optimizer treats them equivalently; there's no inherent performance difference for inner joins.
Follow-ups they push on- Show a case where moving a predicate from WHERE to ON changes a LEFT JOIN's output.
- Does the optimizer reorder ON vs WHERE predicates for an inner join?
Red flag Over-generalizing 'ON and WHERE are the same' from inner joins to outer joins, where they produce different result sets.
source: Use The Index, Luke! — Join Operations ↗ -
An index exists on the join column of one table but the JOIN is still slow. What index considerations apply to joins?
For a nested-loop join the engine iterates the outer table and probes the inner table once per row, so the index that matters is on the inner table's join column — the side being looked up. If only the outer table's column is indexed, each probe still scans the inner table.
Checklist: (1) index the inner/probed side's join key; (2) make the join columns the same type — an implicit cast (e.g.
intvsvarchar) makes the predicate non-sargable and skips the index; (3) for big unindexed equi-joins a hash join may be the right plan, not a fix; (4) readEXPLAINto see whether it chose nested-loop vs hash and whether the index is actually used.What a strong answer coversNested-loop joins need the index on the inner (probed) table's join column.
Mismatched column types force an implicit cast -> non-sargable -> index ignored.
Both join columns should share a type and, ideally, collation.
A hash join on large unindexed inputs can be the correct plan, not a bug.
Use EXPLAIN to confirm which join algorithm and index the planner actually picked.
Follow-ups they push on- Which table's column should carry the index in a nested-loop join?
- How does an int-vs-varchar join key defeat an index?
Red flag Indexing only the driving (outer) table and expecting fast probes, or joining columns of different types and silently losing the index to an implicit cast.
source: Use The Index, Luke! — Nested Loops / indexing joins ↗ -
Why does summing a measure go wrong after joining two one-to-many tables, and how do you fix the double-counting?
Joining a parent to two child tables (orders has many line_items *and* many payments) creates a Cartesian fan-out: each order's rows = items x payments. Now
SUM(payment.amount)is multiplied by the number of line items, andSUM(item.qty)is multiplied by the number of payments — every total is inflated.Fix by pre-aggregating each child to the parent's grain in its own subquery/CTE before joining:
WITH it AS (SELECT order_id, SUM(qty) q FROM items GROUP BY order_id), pm AS (SELECT order_id, SUM(amount) a FROM payments GROUP BY order_id) SELECT … FROM orders o LEFT JOIN it … LEFT JOIN pm …. Each child is now one row per order, so no fan-out. Always know the grain of each table you join.What a strong answer coversJoining one parent to two one-to-many children multiplies rows (items x payments).
Aggregates over the fanned-out rows double/triple-count.
Fix: pre-aggregate each child to the parent grain in separate CTEs/subqueries, *then* join.
COUNT(DISTINCT …)can patch a single measure but doesn't fix multiple measures cleanly.Track the 'grain' (one row per what?) at every join step.
Follow-ups they push on- Why doesn't COUNT(DISTINCT) fully solve it when you need two sums?
- What is the 'grain' of a result set and how do you reason about it?
Red flag Joining several one-to-many tables in one flat query and trusting SUM — the totals are inflated by the cross-product of the child rows.
source: StrataScratch — SQL JOIN Interview Questions ↗ -
Using a USING clause or NATURAL JOIN instead of ON — what are they and why are they risky?
JOIN … USING (col)joins on equally-named columns and merges them into one output column (so you writecol, nota.col).NATURAL JOINgoes further and joins on all identically-named columns automatically, with noON/USINGat all.USINGis fine and concise.NATURAL JOINis dangerous: adding an unrelated same-named column later (acreated_atoridon both tables) silently changes the join key and corrupts results with no error. Most style guides banNATURAL JOINand prefer an explicitON(orUSING) so the join condition is visible and stable against schema changes.What a strong answer coversUSING (col)joins on a shared column name and collapses it to a single output column.NATURAL JOINauto-joins on *every* commonly-named column — implicit and fragile.A later schema change (new same-named column) silently alters a NATURAL JOIN's keys.
Prefer explicit
ON;USINGis acceptable,NATURAL JOINis widely discouraged.
Quick self-checkWhy do most style guides discourage NATURAL JOIN?
-
No — performance is the same; the objection is correctness/maintainability, not speed.
-
Correct — the join condition is implicit, so schema changes can alter results with no error.
-
No — it uses indexes like any equi-join; the issue is the hidden, mutable join condition.
-
No — NATURAL JOIN uses *all* commonly-named columns, which is precisely the problem.
Follow-ups they push on- How does USING change which columns appear in `SELECT *`?
- Why can adding a column break an existing NATURAL JOIN with no error?
Red flag Relying on NATURAL JOIN, then having a future migration add a same-named column that silently joins on it and quietly changes the result set.
source: PostgreSQL docs — Joined Tables (USING and NATURAL) ↗