Advanced querying
Subqueries (correlated vs not), CTEs for readability, and window functions (ROW_NUMBER/RANK/LAG/LEAD/running totals) vs GROUP BY.
Once you can join and group, the next leap is composing queries — nesting one inside another, naming intermediate results, and computing values that look across rows without collapsing them. This chapter covers subqueries, CTEs, and the one that surprises people most: window functions.
Subqueries: correlated vs not
A non-correlated subquery is self-contained — it could run on its own — and the database evaluates it once. A correlated subquery references the outer row, so conceptually it re-runs for every outer row (the optimizer often rewrites it, but that’s the mental model). That distinction is also a performance tell: a non-correlated subquery is computed once and cached; a naive correlated one is the shape that turns into “1 + N” work, the SQL cousin of the N+1 problem you’ll meet in chapter 3.4.
Find customers whose order count beats the overall average. The inner query computes a single number, independent of any outer row:
SELECT customer_id, COUNT(*) AS orders
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > (
SELECT AVG(cnt)
FROM ( SELECT COUNT(*) AS cnt
FROM orders GROUP BY customer_id ) per_customer
);The subquery returns the average orders-per-customer, which is exactly 2.0 across
the ten customers who have ordered. Customers 1, 4, and 9 (three orders each) clear
the bar; everyone with one or two does not.
Here the subquery references o.customer_id from the outer row, so it’s evaluated
per customer:
SELECT o.id, o.customer_id, o.order_date
FROM orders o
WHERE o.order_date = (
SELECT MAX(o2.order_date)
FROM orders o2
WHERE o2.customer_id = o.customer_id -- the correlation
);Returns one row per customer: their latest order. (We’ll see a cleaner window-function version of exactly this below — and it’s the version you’d actually ship.)
CTEs: name your steps
A CTE (WITH name AS (...)) is a subquery you pull out and name. It doesn’t usually
change performance — it changes readability. A query with three nested subqueries
becomes a top-to-bottom pipeline you can read like prose.
WITH per_customer AS (
SELECT customer_id, COUNT(*) AS orders
FROM orders
GROUP BY customer_id
)
SELECT customer_id, orders
FROM per_customer
WHERE orders > (SELECT AVG(orders) FROM per_customer)
ORDER BY orders DESC;Identical result to the nested version (customers 1, 4, 9), but per_customer is
defined once and referenced twice — no copy-pasted subquery. CTEs also enable
recursive queries (WITH RECURSIVE) for walking hierarchies like the
employees manager chain — start at the CEO, then repeatedly join each level’s
reports onto the previous level until no more rows are added.
Window functions vs GROUP BY
This is the centerpiece. GROUP BY collapses each group into one summary row —
you lose the individual rows. A window function computes a group-aware value but
keeps every row. The OVER (...) clause defines the window:
PARTITION BY splits rows into groups, ORDER BY orders rows within each window.
| GROUP BY | Window function | |
|---|---|---|
| Row count out | one per group | same as input (every row kept) |
| Can see detail + summary together | no | yes — detail row and its group stat side by side |
| Syntax | GROUP BY col | func() OVER (PARTITION BY col) |
| Use for | totals/averages per group | rankings, running totals, row-to-row deltas |
Rank each customer’s orders newest-first, then keep only the latest. PARTITION BY
restarts the numbering for each customer:
WITH ranked AS (
SELECT id, customer_id, order_date,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM orders
)
SELECT id, customer_id, order_date
FROM ranked
WHERE rn = 1; -- the most recent order per customerThis is the canonical “top-N-per-group” pattern, and against the seed it returns ten
rows — one per ordering customer, e.g. order 14 for customer 1 (2024-07-30), order
20 for customer 9 (2024-10-22). ROW_NUMBER always gives a unique 1,2,3…; RANK
would give ties the same number and skip the next (1,1,3); DENSE_RANK ties without
skipping (1,1,2).
LAG reaches back to the previous row in the window (LEAD reaches forward).
Here we compare each order’s date to the customer’s prior order:
SELECT id, customer_id, order_date,
LAG(order_date) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS prev_order_date
FROM orders
ORDER BY customer_id, order_date;For customer 1’s three orders, prev_order_date is NULL on the first, then the
prior date on each subsequent row — perfect for computing gaps or deltas without a
self-join. (Wrapping a date difference around this is how you’d compute “days since
last order” per row in one pass.)
A running total is SUM(...) OVER (ORDER BY ...) — the window frame defaults to
“all rows from the start up to the current row.” This sums line-item revenue
chronologically by order:
SELECT o.id AS order_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_total,
SUM(SUM(oi.quantity * oi.unit_price)) OVER (
ORDER BY o.order_date, o.id
) AS running_total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id, o.order_date
ORDER BY o.order_date, o.id;Each row shows that order’s revenue plus the cumulative revenue up to that date —
order 1 contributes 165, order 3 brings the running total to 505, and so on up
the timeline. (The nested SUM(SUM(...)) looks odd: the inner SUM is the aggregate
from GROUP BY, the outer SUM(...) OVER accumulates those per-order totals.)
Try the “latest order per customer” query in the playground, then swap ROW_NUMBER
for RANK and add a tie to see the numbering change.
01 Learning objectives
0 / 2 done02 Curated reading
03 SQL challenges
SQL-10Show the name and price of products priced ABOVE the average price of all products.
04 Knowledge check
- 01medium
A CTE (WITH …) primarily improves:
- 02hard
Which lets you rank rows WITHOUT collapsing them into groups?
05 Interview questions
browse all ↗What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.
-
Find the second-highest distinct salary in an Employee table; return NULL if there isn't one.
Order distinct salaries and skip the top one:
SELECT (SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1) AS second_highest;Wrapping it in an outer
SELECTmakes the resultNULL(not an empty set) when there's no second salary. Alternative:DENSE_RANK() OVER (ORDER BY salary DESC)and keep rank = 2.DISTINCT/DENSE_RANKmatters so duplicate top salaries don't count as two ranks.Follow-ups they push on- Generalize to the Nth-highest salary.
- Why DENSE_RANK rather than RANK or ROW_NUMBER here?
Red flag Using `MAX(salary) WHERE salary < MAX(salary)` incorrectly, or forgetting DISTINCT so two employees tied at the top hide the real second salary; also returning an empty set instead of NULL.
source: LeetCode 176 — Second Highest Salary ↗ -
What's the difference between a correlated and a non-correlated subquery, and why does it matter for performance?
A non-correlated subquery is self-contained — it runs once and its result is reused (e.g.
WHERE salary > (SELECT AVG(salary) FROM emp)).A correlated subquery references a column from the outer query, so conceptually it re-runs once per outer row (e.g.
WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept = e1.dept)). That can be O(n) executions and slow, though modern planners often rewrite simple cases into joins.Follow-ups they push on- Rewrite a correlated subquery as a JOIN or window function.
- When is EXISTS preferable to IN with a subquery?
Red flag Calling every subquery 'correlated', or assuming a correlated subquery always re-executes literally (optimizers may decorrelate it).
source: LeetCode 185 — Department Top Three Salaries (correlated subquery) ↗ -
When would you use a CTE (WITH clause) over a subquery or a temp table?
A CTE names an intermediate result so you can reference it (sometimes multiple times) and read the query top-to-bottom — mainly a readability win, and the only way to write a recursive query (
WITH RECURSIVE).Vs a subquery: same logic, clearer structure. Vs a temp table: a CTE is scoped to the single statement and (usually) not materialized to disk. Note: in some engines a CTE is an optimization fence (older Postgres materialized them); Postgres 12+ inlines non-recursive CTEs unless you say
MATERIALIZED.Follow-ups they push on- Write a recursive CTE to walk an org hierarchy.
- When does a CTE act as an optimization barrier?
Red flag Claiming CTEs are always faster — pre-12 Postgres materialized them, which could be slower than an inlined subquery.
source: PostgreSQL docs — WITH Queries (Common Table Expressions) ↗ -
How does a window function differ from GROUP BY?
GROUP BYcollapses each group into one row — you lose the individual rows. A window function (… OVER (PARTITION BY …)) computes an aggregate/rank across a window of rows but keeps every row, attaching the result alongside.So to show each employee *and* their department's average salary in the same row, you need
AVG(salary) OVER (PARTITION BY dept), not GROUP BY. Window functions also give youROW_NUMBER/RANK/LAG/LEADand running totals, which GROUP BY can't express.Follow-ups they push on- Give a running total with `SUM(x) OVER (ORDER BY d)`.
- Difference between PARTITION BY and a plain GROUP BY?
Red flag Saying they're interchangeable — GROUP BY reduces row count, a window function preserves it.
source: PostgreSQL docs — Window Functions ↗ -
What's the difference between ROW_NUMBER, RANK, and DENSE_RANK on tied values?
On a tie of two rows ranked 1st:
-
ROW_NUMBER— always unique, arbitrary among ties: 1, 2, 3, 4 …
-RANK— ties share a rank, then it skips: 1, 1, 3, 4 …
-DENSE_RANK— ties share a rank, no gap: 1, 1, 2, 3 …Pick ROW_NUMBER for 'one row per group / dedup', RANK/DENSE_RANK for leaderboards. 'Top 3 salaries including ties' usually wants DENSE_RANK <= 3.
Follow-ups they push on- Which one for 'top N salaries, ties count as one place'?
- How to make ROW_NUMBER deterministic when the ORDER BY has ties?
Red flag Using ROW_NUMBER for a 'top N including ties' question and arbitrarily dropping tied rows, or confusing RANK's gaps with DENSE_RANK's continuity.
source: StrataScratch — Amazon 'Top-Rated Support Employees' (DENSE_RANK) ↗ -
Compute the month-over-month percentage change in revenue using a window function.
Aggregate to monthly revenue, then use
LAGto reach the previous month:WITH m AS (SELECT DATE_TRUNC('month', tx) AS mth, SUM(amount) AS rev FROM orders GROUP BY 1) SELECT mth, ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY mth)) / LAG(rev) OVER (ORDER BY mth), 2) AS pct_change FROM m ORDER BY mth;LAG(rev) OVER (ORDER BY mth)pulls the prior row's value; the first month is NULL (no prior). Multiply by100.0to force float division.Follow-ups they push on- Use LEAD instead — what changes?
- Why might integer division give you 0% everywhere?
Red flag Integer division truncating the ratio to 0, or self-joining the table to itself on month-1 instead of the cleaner LAG.
source: StrataScratch — Amazon 'Monthly Percentage Difference' ↗ -
Write a running (cumulative) total of daily sales ordered by date.
A windowed
SUMwith anORDER BYgives a running total:SELECT sale_date, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM sales;Adding
ORDER BYinsideOVERswitches the default frame from 'whole partition' to 'start … current row', which is exactly a cumulative sum. AddPARTITION BY regionto get one running total per region.Follow-ups they push on- Default window frame with vs without ORDER BY?
- RANGE vs ROWS for the frame — when do they differ?
Red flag Omitting ORDER BY in OVER (you get the grand total on every row, not a running one), or being surprised by RANGE's behavior on duplicate dates.
source: PostgreSQL docs — Window Function Calls (frames) ↗ -
Find users with three or more consecutive days of activity (a gap-and-islands problem).
Classic 'gaps and islands': subtract a
ROW_NUMBERfrom the date to give every consecutive run the same anchor:WITH d AS (SELECT user_id, day, day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day))::int AS grp FROM activity) SELECT user_id, COUNT(*) AS streak FROM d GROUP BY user_id, grp HAVING COUNT(*) >= 3;Within one user, on consecutive days both
dayand the row number increase by 1, soday - row_numberis constant across a streak and changes at each gap — grouping on it isolates each island.Follow-ups they push on- Adapt it to find the *longest* streak per user.
- How would LAG-based gap detection compare?
Red flag Trying to detect consecutiveness with a single self-join on day+1 (breaks for runs longer than 2), or forgetting to PARTITION BY user.
source: StrataScratch — Meta 'User Streaks' (gap-and-island, LAG/DENSE_RANK) ↗ -
What is a recursive CTE, and how does it walk an org hierarchy down to arbitrary depth?
A recursive CTE has two parts joined by
UNION [ALL]: an anchor (the starting rows) and a recursive member that references the CTE itself, iterating until it adds no new rows.WITH RECURSIVE chain AS (SELECT id, name, manager_id, 1 AS lvl FROM emp WHERE id = :root UNION ALL SELECT e.id, e.name, e.manager_id, c.lvl + 1 FROM emp e JOIN chain c ON e.manager_id = c.id) SELECT * FROM chain;The anchor seeds the root; each pass joins employees onto the rows found so far, descending one level. It's the standard way to traverse trees/graphs (org charts, category trees, bill-of-materials) that a fixed number of self-joins can't handle.
What a strong answer coversStructure: anchor member
UNION ALLrecursive member that references the CTE.The recursive step runs repeatedly, feeding its output back in, until it produces no new rows.
Used for arbitrary-depth trees/graphs: org charts, category trees, BOM, threaded comments.
Guard against cycles (a depth cap or a visited-set) or the recursion never terminates.
Follow-ups they push on- How do you prevent infinite recursion if the hierarchy has a cycle?
- UNION vs UNION ALL in a recursive CTE — what changes?
Red flag Forgetting the termination behavior on cyclic data (infinite loop) or trying to express arbitrary depth with a fixed chain of self-joins.
source: PostgreSQL docs — WITH Queries (Recursive Queries) ↗ -
Why can't you put a window function in a WHERE clause, and how do you filter on its result?
Window functions are computed in the SELECT step, which logically runs *after*
WHERE,GROUP BY, andHAVING. SoWHERE rn = 1referencingROW_NUMBER() … AS rnerrors — the window result doesn't exist yet whenWHEREis evaluated.The fix is to compute the window function in an inner query (a subquery or CTE) and filter on its alias in the outer query:
WITH r AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn FROM emp) SELECT * FROM r WHERE rn = 1;. This 'rank-then-filter' wrapper is the single most common window-function pattern in interviews.What a strong answer coversWindow functions evaluate in SELECT, after WHERE/GROUP BY/HAVING.
Referencing a window alias in the same query's WHERE/HAVING is an error.
Wrap it in a CTE/subquery and filter on the alias in the outer query.
This 'rank in inner, filter in outer' is the top-N-per-group backbone.
Quick self-checkYou want the single highest-paid employee per department. Which is valid?
-
Invalid — `r` is a window alias evaluated in SELECT, not visible to WHERE.
-
Correct — the window function is computed in the inner query, and the outer query filters on its alias.
-
Invalid — HAVING also runs before SELECT, so the window function can't be filtered there either.
-
This computes a boolean column but doesn't filter rows — you'd still get every employee.
Follow-ups they push on- Could you ever use a window function in HAVING? (no — same reason)
- How does this relate to the top-N-per-group pattern?
Red flag Writing `WHERE ROW_NUMBER() OVER (...) = 1` directly and being surprised by a syntax/semantic error instead of wrapping it in a subquery.
source: PostgreSQL docs — Window Function Processing ↗ -
What's the difference between EXISTS and IN with a subquery, and when does each win?
IN (subquery)materializes the subquery's values and checks membership;EXISTS (subquery)is a correlated semi-join that returns true as soon as one matching row is found (short-circuits).Semantically the big difference is NULL handling:
NOT INreturns no rows if the subquery yields a NULL, whereasNOT EXISTSis NULL-safe — so preferNOT EXISTSfor anti-joins. Performance-wise, modern optimizers often rewrite both into the same semi-/anti-join, butEXISTStends to win when the subquery is large (it can stop early) andINreads fine for small, NULL-free value lists. UseEXISTSwhen you only test *existence*; useINfor a short, known set.What a strong answer coversINtests membership in a value set;EXISTStests whether any correlated row exists (short-circuits).NOT IN+ a NULL in the subquery returns zero rows;NOT EXISTSis NULL-safe.Optimizers frequently rewrite both to semi-joins, so results — not raw form — usually drive the plan.
Rule of thumb: EXISTS for existence tests / large subqueries; IN for small NULL-free lists.
Quick self-check`SELECT * FROM a WHERE a.x NOT IN (SELECT b.y FROM b)` where `b.y` contains one NULL. Result?
-
No — the NULL poisons the comparison; you don't get the 'sensible' answer.
-
Correct — `x NOT IN (…, NULL)` evaluates to UNKNOWN for every row (never TRUE), so nothing is returned.
-
No — the NULL makes the predicate UNKNOWN, not TRUE, so rows are filtered out, not all kept.
-
No — that's the point: NOT EXISTS would be NULL-safe and return the sensible rows, unlike NOT IN here.
Follow-ups they push on- Show the NULL case where NOT IN and NOT EXISTS diverge.
- Why can EXISTS stop scanning after the first match?
Red flag Treating IN and EXISTS as always identical and getting burned by `NOT IN` with a nullable subquery column returning no rows.
source: PostgreSQL docs — Subquery Expressions (EXISTS / IN) ↗ -
Pivot a tall table (one row per month) into a wide one (a column per month) in SQL.
The portable, engine-agnostic way is conditional aggregation —
SUM(CASE WHEN …)per target column:SELECT product, SUM(CASE WHEN month = 'Jan' THEN revenue END) AS jan, SUM(CASE WHEN month = 'Feb' THEN revenue END) AS feb FROM sales GROUP BY product;Each
CASEisolates one month's value; theGROUP BYcollapses to one row per product. You must enumerate the target columns explicitly — SQL's result shape is fixed at plan time, so a truly dynamic pivot needs generated SQL or an engine extension (Postgrescrosstab, SQL ServerPIVOT).What a strong answer coversConditional aggregation: one
SUM(CASE WHEN key = 'X' THEN val END)per output column.GROUP BYthe row dimension; each CASE picks out one pivot value.Output columns must be hard-coded — SQL can't return a runtime-variable number of columns.
Dynamic pivots need generated SQL or extensions (Postgres
crosstab, T-SQLPIVOT).
Follow-ups they push on- How would you handle a column set that isn't known until query time?
- How do you un-pivot (wide back to tall)?
Red flag Expecting a single SQL statement to produce a dynamic, data-dependent number of columns — the column list is fixed at plan time.
source: PostgreSQL docs — tablefunc (crosstab / pivot) ↗ -
Compare INTERSECT, EXCEPT (MINUS), and UNION — and how do they handle duplicates?
All three are set operators combining two result sets with matching column counts/types, and all remove duplicates by default (each has an
ALLvariant to keep them):-
UNION— rows in either set.
-INTERSECT— rows in both sets.
-EXCEPT(Oracle calls itMINUS) — rows in the first set not in the second.They compare whole rows and treat
NULLs as equal to each other for this purpose (unlike=).EXCEPTis a clean way to express an anti-join, andINTERSECTa semi-join, when you're comparing identically-shaped queries.What a strong answer coversUNION = either, INTERSECT = both, EXCEPT/MINUS = first-minus-second.
All dedup by default;
UNION ALL/INTERSECT ALL/EXCEPT ALLkeep duplicates.They match on the entire row and treat NULL = NULL (unlike
=).EXCEPT is a tidy anti-join; INTERSECT a tidy semi-join for same-shaped queries.
Oracle uses
MINUS; most others useEXCEPT.
Quick self-check`SELECT id FROM a EXCEPT SELECT id FROM b` returns…
-
No — that's INTERSECT; EXCEPT subtracts the second set.
-
Correct — EXCEPT returns first-set rows absent from the second and dedups by default.
-
No — that's UNION.
-
No — plain EXCEPT dedups; you'd need EXCEPT ALL to keep duplicates.
Follow-ups they push on- How do set operators treat NULLs differently from a `=` comparison?
- Rewrite an EXCEPT query as a NOT EXISTS anti-join.
Red flag Forgetting these dedup by default (surprising row counts), or assuming `EXCEPT` exists in Oracle, where it's `MINUS`.
source: PostgreSQL docs — Combining Queries (UNION/INTERSECT/EXCEPT) ↗ -
Use NTILE / percentile window functions to bucket users into quartiles by spend.
NTILE(n)splits ordered rows intonroughly-equal buckets and labels each row 1..n:SELECT user_id, spend, NTILE(4) OVER (ORDER BY spend DESC) AS quartile FROM users;Quartile 1 is the top quarter of spenders. NTILE distributes any remainder to the earliest buckets, so groups can differ by one row. If you instead want a *value* threshold (the spend at the 90th percentile), use
PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY spend)(an ordered-set aggregate), not NTILE — NTILE buckets *rows*, percentiles compute a *value*.What a strong answer coversNTILE(n) OVER (ORDER BY …)assigns each row a bucket number 1..n of near-equal size.Uneven counts: the first buckets get the extra rows.
NTILE labels *rows by rank position*; it does not compute a threshold value.
For a percentile *value*, use
PERCENTILE_CONT/PERCENTILE_DISC … WITHIN GROUP.
Follow-ups they push on- Difference between NTILE(4) and PERCENTILE_CONT(0.25)?
- How does NTILE distribute rows when the count isn't divisible by n?
Red flag Using NTILE to get a percentile *threshold value* (it returns bucket labels, not the value at a percentile) or assuming all NTILE buckets have exactly equal size.
source: PostgreSQL docs — Window Functions (NTILE) & Aggregate (percentile) ↗