3.3 [J][A] 14 interview Q's

Advanced querying

Subqueries (correlated vs not), CTEs for readability, and window functions (ROW_NUMBER/RANK/LAG/LEAD/running totals) vs GROUP BY.

Once you can join and group, the next leap is composing queries — nesting one inside another, naming intermediate results, and computing values that look across rows without collapsing them. This chapter covers subqueries, CTEs, and the one that surprises people most: window functions.

Subqueries: correlated vs not

A non-correlated subquery is self-contained — it could run on its own — and the database evaluates it once. A correlated subquery references the outer row, so conceptually it re-runs for every outer row (the optimizer often rewrites it, but that’s the mental model). That distinction is also a performance tell: a non-correlated subquery is computed once and cached; a naive correlated one is the shape that turns into “1 + N” work, the SQL cousin of the N+1 problem you’ll meet in chapter 3.4.

Non-correlated: customers above the average order count

Find customers whose order count beats the overall average. The inner query computes a single number, independent of any outer row:

SELECT  customer_id, COUNT(*) AS orders
FROM    orders
GROUP BY customer_id
HAVING  COUNT(*) > (
          SELECT AVG(cnt)
          FROM ( SELECT COUNT(*) AS cnt
                 FROM orders GROUP BY customer_id ) per_customer
        );

The subquery returns the average orders-per-customer, which is exactly 2.0 across the ten customers who have ordered. Customers 1, 4, and 9 (three orders each) clear the bar; everyone with one or two does not.

Correlated: each customer's most recent order

Here the subquery references o.customer_id from the outer row, so it’s evaluated per customer:

SELECT  o.id, o.customer_id, o.order_date
FROM    orders o
WHERE   o.order_date = (
          SELECT MAX(o2.order_date)
          FROM   orders o2
          WHERE  o2.customer_id = o.customer_id   -- the correlation
        );

Returns one row per customer: their latest order. (We’ll see a cleaner window-function version of exactly this below — and it’s the version you’d actually ship.)

CTEs: name your steps

A CTE (WITH name AS (...)) is a subquery you pull out and name. It doesn’t usually change performance — it changes readability. A query with three nested subqueries becomes a top-to-bottom pipeline you can read like prose.

The same average-beating query, as a CTE

WITH per_customer AS (
  SELECT customer_id, COUNT(*) AS orders
  FROM   orders
  GROUP BY customer_id
)
SELECT customer_id, orders
FROM   per_customer
WHERE  orders > (SELECT AVG(orders) FROM per_customer)
ORDER BY orders DESC;

Identical result to the nested version (customers 1, 4, 9), but per_customer is defined once and referenced twice — no copy-pasted subquery. CTEs also enable recursive queries (WITH RECURSIVE) for walking hierarchies like the employees manager chain — start at the CEO, then repeatedly join each level’s reports onto the previous level until no more rows are added.

Window functions vs GROUP BY

This is the centerpiece. GROUP BY collapses each group into one summary row — you lose the individual rows. A window function computes a group-aware value but keeps every row. The OVER (...) clause defines the window: PARTITION BY splits rows into groups, ORDER BY orders rows within each window.

GROUP BY vs a window function over the same data. GROUP BY folds the three rows of customer 1 into one row carrying the count (3) — the individual orders are gone. The window function COUNT(*) OVER (PARTITION BY customer_id) stamps that same 3 onto every original row and keeps them all, so you see each order alongside its group stat.

	GROUP BY	Window function
Row count out	one per group	same as input (every row kept)
Can see detail + summary together	no	yes — detail row and its group stat side by side
Syntax	`GROUP BY col`	`func() OVER (PARTITION BY col)`
Use for	totals/averages per group	rankings, running totals, row-to-row deltas

Same grouping idea; the difference is whether rows survive.

ROW_NUMBER / RANK: latest order per customer, ranked

Rank each customer’s orders newest-first, then keep only the latest. PARTITION BY restarts the numbering for each customer:

WITH ranked AS (
  SELECT  id, customer_id, order_date,
          ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
          ) AS rn
  FROM    orders
)
SELECT id, customer_id, order_date
FROM   ranked
WHERE  rn = 1;          -- the most recent order per customer

This is the canonical “top-N-per-group” pattern, and against the seed it returns ten rows — one per ordering customer, e.g. order 14 for customer 1 (2024-07-30), order 20 for customer 9 (2024-10-22). ROW_NUMBER always gives a unique 1,2,3…; RANK would give ties the same number and skip the next (1,1,3); DENSE_RANK ties without skipping (1,1,2).

LAG: change in order value vs the previous order

LAG reaches back to the previous row in the window (LEAD reaches forward). Here we compare each order’s date to the customer’s prior order:

SELECT  id, customer_id, order_date,
        LAG(order_date) OVER (
          PARTITION BY customer_id ORDER BY order_date
        ) AS prev_order_date
FROM    orders
ORDER BY customer_id, order_date;

For customer 1’s three orders, prev_order_date is NULL on the first, then the prior date on each subsequent row — perfect for computing gaps or deltas without a self-join. (Wrapping a date difference around this is how you’d compute “days since last order” per row in one pass.)

Running total: cumulative spend over time

A running total is SUM(...) OVER (ORDER BY ...) — the window frame defaults to “all rows from the start up to the current row.” This sums line-item revenue chronologically by order:

SELECT  o.id AS order_id,
        o.order_date,
        SUM(oi.quantity * oi.unit_price) AS order_total,
        SUM(SUM(oi.quantity * oi.unit_price)) OVER (
          ORDER BY o.order_date, o.id
        ) AS running_total
FROM    orders o
JOIN    order_items oi ON oi.order_id = o.id
GROUP BY o.id, o.order_date
ORDER BY o.order_date, o.id;

Each row shows that order’s revenue plus the cumulative revenue up to that date — order 1 contributes 165, order 3 brings the running total to 505, and so on up the timeline. (The nested SUM(SUM(...)) looks odd: the inner SUM is the aggregate from GROUP BY, the outer SUM(...) OVER accumulates those per-order totals.)

You can't filter on a window function in WHERE

Window functions are computed after WHERE/GROUP BY, in the same logical phase as SELECT. So this is illegal:

-- ERROR: window function not allowed here
SELECT id, ROW_NUMBER() OVER (...) AS rn
FROM orders
WHERE rn = 1;          -- rn doesn't exist yet

Wrap it in a CTE or subquery and filter on the outside — exactly the WHERE rn = 1 pattern shown above. This is the window-function analogue of the “aggregates can’t go in WHERE, use HAVING” rule from chapter 3.1.

From the trenches

Window functions are the single highest-leverage SQL feature that engineers skip — and the skip shows up as needlessly slow, needlessly complex code. The most common case: “give me the latest row per group” (latest order per customer, current price per product, most recent status per ticket). Without windows, people reach for a correlated subquery (WHERE order_date = (SELECT MAX(...) WHERE customer_id = …)) that re-scans the table once per row, or a self-join that quietly double-counts when two rows tie on the max date. The ROW_NUMBER() OVER (PARTITION BY … ORDER BY … DESC) + WHERE rn = 1 pattern does it in a single pass, breaks ties deterministically, and reads like the sentence it implements. A second trap is believing a window function changes the row count — it never does. If your result suddenly has fewer rows, you accidentally added a GROUP BY or a DISTINCT; the window itself only adds a column, it never collapses or drops rows. Internalize those two facts — “one pass, same row count” — and a whole category of slow, fragile reporting SQL collapses into a few clean lines.

Try the “latest order per customer” query in the playground, then swap ROW_NUMBER for RANK and add a tie to see the numbering change.

01 Learning objectives

0 / 2 done

02 Curated reading

PostgreSQL docs — Window Functions
optional 15m — Authoritative intro to window functions.

03 SQL challenges

SQL challenges · ecommerce.db

0/5 solved

SQL-10Show the name and price of products priced ABOVE the average price of all products.

04 Knowledge check

knowledge check2 questions · pass ≥ 70%

01medium
A CTE (WITH …) primarily improves:
02hard
Which lets you rank rows WITHOUT collapsing them into groups?

05 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

Commonly asked junior coding very common Find the second-highest distinct salary in an Employee table; return NULL if there isn't one.
Order distinct salaries and skip the top one:
SELECT (SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1) AS second_highest;
Wrapping it in an outer SELECT makes the result NULL (not an empty set) when there's no second salary. Alternative: DENSE_RANK() OVER (ORDER BY salary DESC) and keep rank = 2. DISTINCT/DENSE_RANK matters so duplicate top salaries don't count as two ranks.
Follow-ups they push on
- Generalize to the Nth-highest salary.
- Why DENSE_RANK rather than RANK or ROW_NUMBER here?
Red flag Using `MAX(salary) WHERE salary < MAX(salary)` incorrectly, or forgetting DISTINCT so two employees tied at the top hide the real second salary; also returning an empty set instead of NULL.
source: LeetCode 176 — Second Highest Salary ↗
Commonly asked mid concept common What's the difference between a correlated and a non-correlated subquery, and why does it matter for performance?
A non-correlated subquery is self-contained — it runs once and its result is reused (e.g. WHERE salary > (SELECT AVG(salary) FROM emp)).
A correlated subquery references a column from the outer query, so conceptually it re-runs once per outer row (e.g. WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept = e1.dept)). That can be O(n) executions and slow, though modern planners often rewrite simple cases into joins.
Follow-ups they push on
- Rewrite a correlated subquery as a JOIN or window function.
- When is EXISTS preferable to IN with a subquery?
Red flag Calling every subquery 'correlated', or assuming a correlated subquery always re-executes literally (optimizers may decorrelate it).
source: LeetCode 185 — Department Top Three Salaries (correlated subquery) ↗
Commonly asked mid concept common When would you use a CTE (WITH clause) over a subquery or a temp table?
A CTE names an intermediate result so you can reference it (sometimes multiple times) and read the query top-to-bottom — mainly a readability win, and the only way to write a recursive query (WITH RECURSIVE).
Vs a subquery: same logic, clearer structure. Vs a temp table: a CTE is scoped to the single statement and (usually) not materialized to disk. Note: in some engines a CTE is an optimization fence (older Postgres materialized them); Postgres 12+ inlines non-recursive CTEs unless you say MATERIALIZED.
Follow-ups they push on
- Write a recursive CTE to walk an org hierarchy.
- When does a CTE act as an optimization barrier?
Red flag Claiming CTEs are always faster — pre-12 Postgres materialized them, which could be slower than an inlined subquery.
source: PostgreSQL docs — WITH Queries (Common Table Expressions) ↗
Meta mid concept very common How does a window function differ from GROUP BY?
GROUP BY collapses each group into one row — you lose the individual rows. A window function (… OVER (PARTITION BY …)) computes an aggregate/rank across a window of rows but keeps every row, attaching the result alongside.
So to show each employee *and* their department's average salary in the same row, you need AVG(salary) OVER (PARTITION BY dept), not GROUP BY. Window functions also give you ROW_NUMBER/RANK/LAG/LEAD and running totals, which GROUP BY can't express.
Follow-ups they push on
- Give a running total with `SUM(x) OVER (ORDER BY d)`.
- Difference between PARTITION BY and a plain GROUP BY?
Red flag Saying they're interchangeable — GROUP BY reduces row count, a window function preserves it.
source: PostgreSQL docs — Window Functions ↗
Amazon mid concept very common What's the difference between ROW_NUMBER, RANK, and DENSE_RANK on tied values?
On a tie of two rows ranked 1st:
- ROW_NUMBER — always unique, arbitrary among ties: 1, 2, 3, 4 …
- RANK — ties share a rank, then it skips: 1, 1, 3, 4 …
- DENSE_RANK — ties share a rank, no gap: 1, 1, 2, 3 …
Pick ROW_NUMBER for 'one row per group / dedup', RANK/DENSE_RANK for leaderboards. 'Top 3 salaries including ties' usually wants DENSE_RANK <= 3.
Follow-ups they push on
- Which one for 'top N salaries, ties count as one place'?
- How to make ROW_NUMBER deterministic when the ORDER BY has ties?
Red flag Using ROW_NUMBER for a 'top N including ties' question and arbitrarily dropping tied rows, or confusing RANK's gaps with DENSE_RANK's continuity.
source: StrataScratch — Amazon 'Top-Rated Support Employees' (DENSE_RANK) ↗
Amazon senior coding common Compute the month-over-month percentage change in revenue using a window function.
Aggregate to monthly revenue, then use LAG to reach the previous month:
WITH m AS (SELECT DATE_TRUNC('month', tx) AS mth, SUM(amount) AS rev FROM orders GROUP BY 1) SELECT mth, ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY mth)) / LAG(rev) OVER (ORDER BY mth), 2) AS pct_change FROM m ORDER BY mth;
LAG(rev) OVER (ORDER BY mth) pulls the prior row's value; the first month is NULL (no prior). Multiply by 100.0 to force float division.
Follow-ups they push on
- Use LEAD instead — what changes?
- Why might integer division give you 0% everywhere?
Red flag Integer division truncating the ratio to 0, or self-joining the table to itself on month-1 instead of the cleaner LAG.
source: StrataScratch — Amazon 'Monthly Percentage Difference' ↗
Commonly asked mid coding common Write a running (cumulative) total of daily sales ordered by date.
A windowed SUM with an ORDER BY gives a running total:
SELECT sale_date, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM sales;
Adding ORDER BY inside OVER switches the default frame from 'whole partition' to 'start … current row', which is exactly a cumulative sum. Add PARTITION BY region to get one running total per region.
Follow-ups they push on
- Default window frame with vs without ORDER BY?
- RANGE vs ROWS for the frame — when do they differ?
Red flag Omitting ORDER BY in OVER (you get the grand total on every row, not a running one), or being surprised by RANGE's behavior on duplicate dates.
source: PostgreSQL docs — Window Function Calls (frames) ↗
Meta senior coding occasional Find users with three or more consecutive days of activity (a gap-and-islands problem).
Classic 'gaps and islands': subtract a ROW_NUMBER from the date to give every consecutive run the same anchor:
WITH d AS (SELECT user_id, day, day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day))::int AS grp FROM activity) SELECT user_id, COUNT(*) AS streak FROM d GROUP BY user_id, grp HAVING COUNT(*) >= 3;
Within one user, on consecutive days both day and the row number increase by 1, so day - row_number is constant across a streak and changes at each gap — grouping on it isolates each island.
Follow-ups they push on
- Adapt it to find the *longest* streak per user.
- How would LAG-based gap detection compare?
Red flag Trying to detect consecutiveness with a single self-join on day+1 (breaks for runs longer than 2), or forgetting to PARTITION BY user.
source: StrataScratch — Meta 'User Streaks' (gap-and-island, LAG/DENSE_RANK) ↗
Commonly asked senior coding common What is a recursive CTE, and how does it walk an org hierarchy down to arbitrary depth?
A recursive CTE has two parts joined by UNION [ALL]: an anchor (the starting rows) and a recursive member that references the CTE itself, iterating until it adds no new rows.
WITH RECURSIVE chain AS (SELECT id, name, manager_id, 1 AS lvl FROM emp WHERE id = :root UNION ALL SELECT e.id, e.name, e.manager_id, c.lvl + 1 FROM emp e JOIN chain c ON e.manager_id = c.id) SELECT * FROM chain;
The anchor seeds the root; each pass joins employees onto the rows found so far, descending one level. It's the standard way to traverse trees/graphs (org charts, category trees, bill-of-materials) that a fixed number of self-joins can't handle.
What a strong answer covers
- Structure: anchor member UNION ALL recursive member that references the CTE.
- The recursive step runs repeatedly, feeding its output back in, until it produces no new rows.
- Used for arbitrary-depth trees/graphs: org charts, category trees, BOM, threaded comments.
- Guard against cycles (a depth cap or a visited-set) or the recursion never terminates.
Follow-ups they push on
- How do you prevent infinite recursion if the hierarchy has a cycle?
- UNION vs UNION ALL in a recursive CTE — what changes?
Red flag Forgetting the termination behavior on cyclic data (infinite loop) or trying to express arbitrary depth with a fixed chain of self-joins.
source: PostgreSQL docs — WITH Queries (Recursive Queries) ↗
★ must-know Amazon mid trick common Why can't you put a window function in a WHERE clause, and how do you filter on its result?
Window functions are computed in the SELECT step, which logically runs *after* WHERE, GROUP BY, and HAVING. So WHERE rn = 1 referencing ROW_NUMBER() … AS rn errors — the window result doesn't exist yet when WHERE is evaluated.
The fix is to compute the window function in an inner query (a subquery or CTE) and filter on its alias in the outer query: WITH r AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn FROM emp) SELECT * FROM r WHERE rn = 1;. This 'rank-then-filter' wrapper is the single most common window-function pattern in interviews.
What a strong answer covers
- Window functions evaluate in SELECT, after WHERE/GROUP BY/HAVING.
- Referencing a window alias in the same query's WHERE/HAVING is an error.
- Wrap it in a CTE/subquery and filter on the alias in the outer query.
- This 'rank in inner, filter in outer' is the top-N-per-group backbone.
Quick self-check
You want the single highest-paid employee per department. Which is valid?
Follow-ups they push on
- Could you ever use a window function in HAVING? (no — same reason)
- How does this relate to the top-N-per-group pattern?
Red flag Writing `WHERE ROW_NUMBER() OVER (...) = 1` directly and being surprised by a syntax/semantic error instead of wrapping it in a subquery.
source: PostgreSQL docs — Window Function Processing ↗
Commonly asked mid concept common What's the difference between EXISTS and IN with a subquery, and when does each win?
IN (subquery) materializes the subquery's values and checks membership; EXISTS (subquery) is a correlated semi-join that returns true as soon as one matching row is found (short-circuits).
Semantically the big difference is NULL handling: NOT IN returns no rows if the subquery yields a NULL, whereas NOT EXISTS is NULL-safe — so prefer NOT EXISTS for anti-joins. Performance-wise, modern optimizers often rewrite both into the same semi-/anti-join, but EXISTS tends to win when the subquery is large (it can stop early) and IN reads fine for small, NULL-free value lists. Use EXISTS when you only test *existence*; use IN for a short, known set.
What a strong answer covers
- IN tests membership in a value set; EXISTS tests whether any correlated row exists (short-circuits).
- NOT IN + a NULL in the subquery returns zero rows; NOT EXISTS is NULL-safe.
- Optimizers frequently rewrite both to semi-joins, so results — not raw form — usually drive the plan.
- Rule of thumb: EXISTS for existence tests / large subqueries; IN for small NULL-free lists.
Quick self-check
`SELECT * FROM a WHERE a.x NOT IN (SELECT b.y FROM b)` where `b.y` contains one NULL. Result?
Follow-ups they push on
- Show the NULL case where NOT IN and NOT EXISTS diverge.
- Why can EXISTS stop scanning after the first match?
Red flag Treating IN and EXISTS as always identical and getting burned by `NOT IN` with a nullable subquery column returning no rows.
source: PostgreSQL docs — Subquery Expressions (EXISTS / IN) ↗
Commonly asked mid coding occasional Pivot a tall table (one row per month) into a wide one (a column per month) in SQL.
The portable, engine-agnostic way is conditional aggregation — SUM(CASE WHEN …) per target column:
SELECT product, SUM(CASE WHEN month = 'Jan' THEN revenue END) AS jan, SUM(CASE WHEN month = 'Feb' THEN revenue END) AS feb FROM sales GROUP BY product;
Each CASE isolates one month's value; the GROUP BY collapses to one row per product. You must enumerate the target columns explicitly — SQL's result shape is fixed at plan time, so a truly dynamic pivot needs generated SQL or an engine extension (Postgres crosstab, SQL Server PIVOT).
What a strong answer covers
- Conditional aggregation: one SUM(CASE WHEN key = 'X' THEN val END) per output column.
- GROUP BY the row dimension; each CASE picks out one pivot value.
- Output columns must be hard-coded — SQL can't return a runtime-variable number of columns.
- Dynamic pivots need generated SQL or extensions (Postgres crosstab, T-SQL PIVOT).
Follow-ups they push on
- How would you handle a column set that isn't known until query time?
- How do you un-pivot (wide back to tall)?
Red flag Expecting a single SQL statement to produce a dynamic, data-dependent number of columns — the column list is fixed at plan time.
source: PostgreSQL docs — tablefunc (crosstab / pivot) ↗
Commonly asked mid concept occasional Compare INTERSECT, EXCEPT (MINUS), and UNION — and how do they handle duplicates?
All three are set operators combining two result sets with matching column counts/types, and all remove duplicates by default (each has an ALL variant to keep them):
- UNION — rows in either set.
- INTERSECT — rows in both sets.
- EXCEPT (Oracle calls it MINUS) — rows in the first set not in the second.
They compare whole rows and treat NULLs as equal to each other for this purpose (unlike =). EXCEPT is a clean way to express an anti-join, and INTERSECT a semi-join, when you're comparing identically-shaped queries.
What a strong answer covers
- UNION = either, INTERSECT = both, EXCEPT/MINUS = first-minus-second.
- All dedup by default; UNION ALL / INTERSECT ALL / EXCEPT ALL keep duplicates.
- They match on the entire row and treat NULL = NULL (unlike =).
- EXCEPT is a tidy anti-join; INTERSECT a tidy semi-join for same-shaped queries.
- Oracle uses MINUS; most others use EXCEPT.
Quick self-check
`SELECT id FROM a EXCEPT SELECT id FROM b` returns…
Follow-ups they push on
- How do set operators treat NULLs differently from a `=` comparison?
- Rewrite an EXCEPT query as a NOT EXISTS anti-join.
Red flag Forgetting these dedup by default (surprising row counts), or assuming `EXCEPT` exists in Oracle, where it's `MINUS`.
source: PostgreSQL docs — Combining Queries (UNION/INTERSECT/EXCEPT) ↗
Commonly asked mid coding occasional Use NTILE / percentile window functions to bucket users into quartiles by spend.
NTILE(n) splits ordered rows into n roughly-equal buckets and labels each row 1..n:
SELECT user_id, spend, NTILE(4) OVER (ORDER BY spend DESC) AS quartile FROM users;
Quartile 1 is the top quarter of spenders. NTILE distributes any remainder to the earliest buckets, so groups can differ by one row. If you instead want a *value* threshold (the spend at the 90th percentile), use PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY spend) (an ordered-set aggregate), not NTILE — NTILE buckets *rows*, percentiles compute a *value*.
What a strong answer covers
- NTILE(n) OVER (ORDER BY …) assigns each row a bucket number 1..n of near-equal size.
- Uneven counts: the first buckets get the extra rows.
- NTILE labels *rows by rank position*; it does not compute a threshold value.
- For a percentile *value*, use PERCENTILE_CONT/PERCENTILE_DISC … WITHIN GROUP.
Follow-ups they push on
- Difference between NTILE(4) and PERCENTILE_CONT(0.25)?
- How does NTILE distribute rows when the count isn't divisible by n?
Red flag Using NTILE to get a percentile *threshold value* (it returns bucket labels, not the value at a percentile) or assuming all NTILE buckets have exactly equal size.
source: PostgreSQL docs — Window Functions (NTILE) & Aggregate (percentile) ↗