Relational model & SQL basics
Keys, SELECT/WHERE/ORDER BY from scratch, and GROUP BY with aggregates — plus the WHERE-vs-HAVING distinction. Practice it live in the playground.
SQL is a declarative language: you describe the rows you want, and the database
figures out how to fetch them. This chapter rebuilds the spine — keys, the
SELECT skeleton, and the one distinction interviewers love to probe: WHERE
versus HAVING. Everything here runs against the playground schema below.
Keys: what makes a row findable
A primary key answers “which row is this?” — it is unique, never NULL, and the
database usually builds an index on it automatically. A foreign key is how one
table points at another: orders.customer_id holds a customers.id, and the
database can refuse to insert an order for a customer that doesn’t exist
(referential integrity). A composite key uses more than one column because no
single column is unique on its own — order_items has no id; a row is identified
by the (order_id, product_id) pair. A unique key is like a primary key’s
relaxed cousin: it forbids duplicates but tolerates NULL and can coexist with
other unique constraints on the same table.
The SELECT skeleton
Every query you write is some subset of this shape, and the clauses are executed in a different order than they’re written. That mismatch is the source of most beginner confusion, so it’s worth memorizing:
Written order: SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY … LIMIT
Logical order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
The practical consequence: WHERE runs before grouping (so it can’t see
aggregates), and ORDER BY runs after SELECT (so it can use column aliases).
List USA customers, newest signup first:
SELECT name, country, signup_date
FROM customers
WHERE country = 'USA'
ORDER BY signup_date DESC;Returns the three USA rows (Ava Stone, Noah Kim, Ethan Brown) ordered by date,
with Ethan Brown (2023-07-09) at the top. ORDER BY ... DESC sorts descending;
drop DESC for ascending (the default).
GROUP BY + aggregates
GROUP BY folds rows that share a value into one row per group, and aggregate
functions summarize each group. The iron rule: every column in SELECT must
either appear in GROUP BY or be wrapped in an aggregate. Anything else is
ambiguous — the database wouldn’t know which of the grouped rows to show.
SELECT category,
COUNT(*) AS num_products,
AVG(price) AS avg_price,
MAX(price) AS top_price
FROM products
GROUP BY category
ORDER BY avg_price DESC;One row per category. Computers leads on avg_price (~1263 — three pricey
laptops/tablet, top 1850), Accessories trails (~38). COUNT(*) counts rows in
the group; AVG/MAX summarize the price column within it. Note ORDER BY avg_price works because ordering happens after the aggregates are computed.
WHERE vs HAVING — the classic confusion
This is the distinction the source flags by name, and it falls straight out of the logical execution order above:
| Clause | Runs | Filters | Can it see aggregates? |
|---|---|---|---|
WHERE | before GROUP BY | individual rows | No — aggregates don't exist yet |
HAVING | after GROUP BY | whole groups | Yes — that's its whole job |
“Among orders placed from April 2024 onward, which customers have more than
one order?” The date restriction is a per-row test (WHERE); the “more than one”
test is a per-group test on a count (HAVING):
SELECT customer_id,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-04-01' -- row filter, before grouping
GROUP BY customer_id
HAVING COUNT(*) > 1 -- group filter, after grouping
ORDER BY order_count DESC;WHERE thins the rows first — and it genuinely thins them: customer 1’s two early
orders (Jan 5, Feb 11) are filtered out, leaving them just one surviving order, so
they drop out of the result entirely. After grouping, HAVING keeps only buckets
with a count above 1. Three customers survive: 9 (three orders), then 4 and
8 (two each). Run it without the WHERE line and customer 1 reappears with
three — proof the date filter changed the outcome.
Open the playground below and run the GROUP BY + HAVING example — then change > 1
to >= 1, or delete the WHERE line, and watch the result set grow.
01 Learning objectives
0 / 3 done02 Curated reading
03 SQL challenges
SQL-01List the names of all customers from Canada, in alphabetical order.
04 Knowledge check
- 01easy
WHERE vs HAVING — which filters AFTER grouping?
- 02easy
A FOREIGN KEY's job is to:
- 03medium
Which aggregate ignores NULLs when counting a specific column, e.g. COUNT(col)?
- 04medium
You can reference a column alias defined in SELECT inside the WHERE clause.
05 Interview questions
browse all ↗What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.
-
What is the difference between a PRIMARY KEY, a UNIQUE constraint, and a FOREIGN KEY?
A primary key uniquely identifies a row: it is
UNIQUEandNOT NULL, and there is exactly one per table.A UNIQUE constraint also forbids duplicates but *does* allow a
NULL(one, in most engines), and a table can have many of them.A foreign key is a column whose values must exist as a key in another table — it enforces referential integrity (you cannot insert an order for a customer that does not exist, and the DB can block/cascade deletes).
Follow-ups they push on- What is a composite key?
- Can a foreign key reference a UNIQUE column instead of a primary key?
Red flag Saying a primary key is 'just a unique column' and forgetting the implicit NOT NULL, or claiming a table can have several primary keys (it has one, possibly composite).
source: DataLemur — Amazon SQL Interview Questions ↗ -
Write a query to return all employees in the Engineering department earning more than 100000, sorted by salary descending.
Straight
SELECT … WHERE … ORDER BY:SELECT name, salary FROM employees WHERE department = 'Engineering' AND salary > 100000 ORDER BY salary DESC;Watch the clause order —
WHEREfilters rows,ORDER BYruns last. String literals are single-quoted; double quotes mean an identifier in standard SQL.Follow-ups they push on- Add a tie-breaker so equal salaries sort by name.
- Return only the top 5 — LIMIT vs TOP vs FETCH FIRST?
Red flag Using double quotes around the string literal (an identifier in standard SQL/Postgres), or putting ORDER BY before WHERE.
source: PG Exercises — Basic ↗ -
What is the difference between WHERE and HAVING, and why can't you put an aggregate in WHERE?
WHEREfilters individual rows before grouping;HAVINGfilters groups after theGROUP BYruns.An aggregate like
COUNT(*)is not known until rows are grouped, so it cannot appear inWHERE— it belongs inHAVING. Example:SELECT dept, COUNT(*) FROM emp WHERE active = true GROUP BY dept HAVING COUNT(*) > 5;—activeis filtered per-row, the head-count per-group.Follow-ups they push on- Logical order of evaluation of FROM/WHERE/GROUP BY/HAVING/SELECT/ORDER BY?
- Can you reference a SELECT alias in HAVING?
Red flag Putting `WHERE COUNT(*) > 5`, or believing HAVING is just 'WHERE for the GROUP BY query' with no semantic difference.
source: PostgreSQL docs — GROUP BY and HAVING ↗ -
Write a SQL query to get the average star rating for each product for each month.
Extract the month from the timestamp, group by it and the product, average the stars:
SELECT EXTRACT(MONTH FROM submit_date) AS mth, product_id, ROUND(AVG(stars), 2) AS avg_stars FROM reviews GROUP BY mth, product_id ORDER BY mth, product_id;Every non-aggregated SELECT column must appear in GROUP BY. ROUND tidies the output.
Follow-ups they push on- Group by year-and-month so January 2023 and January 2024 don't collapse together.
- How would NULL stars affect AVG?
Red flag Selecting a column that is neither grouped nor aggregated (errors in Postgres, silently picks an arbitrary row in old MySQL), or grouping by month only so different years merge.
source: DataLemur — Amazon 'Average Review Ratings' ↗ -
What does NULL mean in SQL, and why does `WHERE col = NULL` return nothing?
NULLis 'unknown', not a value. Any comparison withNULLusing=/<>yieldsUNKNOWN(not true), so the row is dropped —WHERE col = NULLalways returns zero rows.Use the dedicated operators:
WHERE col IS NULL/IS NOT NULL. Note aggregates skip NULLs (AVG,COUNT(col)) butCOUNT(*)counts the row regardless.Follow-ups they push on- What does `NULL = NULL` evaluate to?
- How does NULL behave inside NOT IN (subquery) — and why is that a trap?
Red flag Treating NULL as a value you can equality-test, or assuming `NOT IN` works when the subquery can yield a NULL (it then returns no rows).
source: PostgreSQL docs — Comparison Functions and Operators ↗ -
Find all duplicate email addresses in a Person table (emails appearing more than once).
Group by the column and keep groups of size > 1:
SELECT email FROM person GROUP BY email HAVING COUNT(*) > 1;This is the canonical 'GROUP BY + HAVING COUNT' pattern. To actually delete dupes you would keep
MIN(id)per group and remove the rest.Follow-ups they push on- Now delete the duplicates, keeping the row with the smallest id.
- Could a self-join solve this too? Compare it to GROUP BY.
Red flag Using `WHERE COUNT(*) > 1`, or `SELECT DISTINCT` (which hides duplicates rather than finding them).
source: LeetCode 196 — Duplicate Emails ↗ -
What is the difference between DELETE, TRUNCATE, and DROP?
DELETEremoves rows one at a time, can have aWHERE, fires triggers, is fully transactional and rollback-able.TRUNCATEempties the whole table in one fast metadata operation — no per-row WHERE, usually resets identity counters, minimal logging.DROPremoves the table definition itself (and its data) from the schema.Mnemonic: DELETE = some/all rows, TRUNCATE = all rows fast, DROP = the table is gone.
Follow-ups they push on- Is TRUNCATE transactional in Postgres? (Yes.) In other engines?
- Which of these can you roll back?
Red flag Claiming TRUNCATE can take a WHERE clause, or that DELETE and TRUNCATE are interchangeable (triggers, identity reset, and speed differ).
source: PostgreSQL docs — TRUNCATE ↗ -
What is the difference between UNION and UNION ALL, and which is faster?
UNIONconcatenates two result sets and removes duplicates (an implicit DISTINCT, which costs a sort/hash).UNION ALLkeeps every row, including duplicates.UNION ALLis faster because it skips the dedup step — prefer it whenever you know the inputs are already disjoint or duplicates are acceptable. Both require the same column count and compatible types in each branch.Follow-ups they push on- When is UNION (with dedup) actually required?
- Difference between UNION and a FULL OUTER JOIN?
Red flag Defaulting to UNION everywhere and paying for a needless dedup, or assuming the column lists must have identical names (only count/type must match).
source: StrataScratch — Meta SQL Interview Questions ↗ -
What is the logical order of evaluation of a SELECT's clauses, and why does it explain why you can't use a SELECT alias in WHERE?
Although you *write*
SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY, the engine *logically* evaluates them as:FROM/joins ->WHERE->GROUP BY->HAVING->SELECT(where aliases are assigned) ->DISTINCT->ORDER BY->LIMIT.Because
SELECTruns afterWHERE, an alias defined inSELECTdoesn't yet exist whenWHEREis evaluated — soWHERE total > 100referencing aSELECT … AS totalerrors.ORDER BYruns last, which is why it *can* see SELECT aliases. (MySQL leniently allows aliases in some clauses as an extension, but the standard doesn't.)What a strong answer coversLogical order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.
Aliases are created in the
SELECTstep, soWHERE/GROUP BY/HAVINGgenerally can't see them.ORDER BYis the one clause that *can* reference SELECT aliases — it runs after SELECT.Workaround: repeat the expression, or wrap the query in a subquery/CTE and filter on the alias outside.
Quick self-checkGiven `SELECT price * qty AS total FROM line_items WHERE total > 100;`, what happens?
-
No — WHERE runs before SELECT assigns the `total` alias, so the alias isn't visible there.
-
Correct — the alias is created in the SELECT step, which is evaluated after WHERE.
-
No — the database raises an error rather than silently ignoring an unknown column.
-
No — Postgres errors; MySQL has extensions that allow aliases in some clauses, so behavior differs.
Follow-ups they push on- Why can ORDER BY use a SELECT alias but WHERE cannot?
- Does MySQL deviate from this? (it allows aliases in GROUP BY/HAVING as an extension)
Red flag Assuming the written clause order is the execution order, then being confused why a SELECT alias is 'not recognized' in WHERE or GROUP BY.
source: PostgreSQL docs — SELECT (clause processing order) ↗ -
What's the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
COUNT(*)counts rows, including rows where every column is NULL.COUNT(col)counts rows wherecolis not NULL — NULLs are skipped.COUNT(DISTINCT col)counts the number of distinct non-NULL values.So on a column with NULLs,
COUNT(*)>=COUNT(col)>=COUNT(DISTINCT col). This trips people up in 'how many customers placed an order' style questions, where a LEFT JOIN leaves NULLs andCOUNT(*)over-counts.What a strong answer coversCOUNT(*)counts every row regardless of NULLs.COUNT(col)ignores rows wherecol IS NULL.COUNT(DISTINCT col)ignores NULLs and collapses duplicates.After a LEFT JOIN, count a non-null right-side column (not
*) to avoid counting unmatched rows.
Quick self-checkA `votes(id, choice)` column has 5 rows; `choice` is NULL in 2 of them, and the 3 non-null values are 'A','A','B'. What are COUNT(*), COUNT(choice), COUNT(DISTINCT choice)?
-
No — COUNT(choice) skips the 2 NULLs, so it's 3, not 5.
-
Correct — 5 rows total; 3 non-null values; 2 distinct non-null values (A, B).
-
No — COUNT(*) counts all 5 rows, and there are only 2 distinct values.
-
No — only two distinct non-null values exist ('A' and 'B'), so COUNT(DISTINCT) is 2.
Follow-ups they push on- After a LEFT JOIN, why does COUNT(*) over-report and COUNT(right_col) fix it?
- Is COUNT(1) any different from COUNT(*)? (no — same thing)
Red flag Assuming COUNT(col) counts all rows like COUNT(*), or using COUNT(*) after a LEFT JOIN and counting the NULL-filled unmatched rows.
source: PostgreSQL docs — Aggregate Functions ↗ -
Find the employee(s) with the highest salary in each department.
The robust way is a window rank so ties are kept:
WITH r AS (SELECT name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk FROM employee) SELECT name, department, salary FROM r WHERE rk = 1;A correlated-subquery form also works:
WHERE salary = (SELECT MAX(salary) FROM employee e2 WHERE e2.department = e1.department). UseRANK(notROW_NUMBER) so two employees tied at the top of a department both appear.What a strong answer coversPartition by department, order by salary DESC, keep the top rank.
RANK() = 1keeps all ties;ROW_NUMBER() = 1arbitrarily keeps just one.Equivalent correlated subquery: compare each salary to that department's
MAX.A global
ORDER BY salary DESC LIMIT 1is wrong — it returns one row overall, not one per department.
Follow-ups they push on- Why RANK rather than ROW_NUMBER if ties should all be returned?
- Rewrite it without a window function using a correlated subquery.
Red flag Using `ORDER BY salary DESC LIMIT 1` (top earner overall, not per department) or ROW_NUMBER, which silently drops tied top earners.
source: LeetCode 184 — Department Highest Salary ↗ -
What does `WHERE status NOT IN ('shipped', 'delivered')` do to rows where status is NULL, and why?
It excludes them — a row with
status = NULLis *not* returned, even though NULL is obviously 'not shipped and not delivered' to a human.NOT IN (...)expands tostatus <> 'shipped' AND status <> 'delivered'. ComparingNULL <> anythingyieldsUNKNOWN, so the wholeANDisUNKNOWN, andWHEREkeeps only rows that areTRUE. To include NULLs you must say so explicitly:WHERE status NOT IN (...) OR status IS NULL.What a strong answer coversNOT INis sugar for a chain of<>comparisons joined byAND.Any comparison with NULL is
UNKNOWN, andWHEREkeeps onlyTRUErows — so NULL rows drop.The far more dangerous case: a NULL inside the list makes
NOT INreturn *no rows at all*.Add
OR col IS NULLto include NULLs, or preferNOT EXISTSwhich is NULL-safe.
Quick self-check`orders` has statuses 'shipped', 'pending', and NULL. `SELECT * FROM orders WHERE status NOT IN ('shipped');` returns…
-
Correct — `NULL <> 'shipped'` is UNKNOWN, so NULL rows fail the WHERE and only 'pending' survives.
-
No — NULL doesn't satisfy the comparison; it evaluates to UNKNOWN, not TRUE.
-
No — that catastrophe happens when a NULL is *inside the IN list*, not when a row's column is NULL.
-
No — the NULL rows are also dropped, so it isn't simply 'everything but shipped'.
Follow-ups they push on- What happens if the value list itself contains a NULL (e.g. from a subquery)?
- How does NOT EXISTS avoid this NULL trap?
Red flag Expecting NULL rows to satisfy a `NOT IN`/`<>` filter, or using `NOT IN (subquery)` where the subquery can yield NULL and silently returning zero rows.
source: PostgreSQL docs — Row and Array Comparisons (IN / NOT IN and NULL) ↗ -
What is the difference between CHAR, VARCHAR, and TEXT, and when does the choice matter?
CHAR(n)is fixed-length — it pads with spaces ton, so it suits truly fixed codes (a 2-char country code, a fixed hash).VARCHAR(n)is variable-length with a declared max, erroring if you exceed it.TEXTis variable-length with no practical limit.In PostgreSQL there is no performance difference between them — the manual recommends
textorvarcharand noteschar(n)is usually the *slowest* due to padding. The length limit is mainly a data-integrity constraint, not an optimization. (In some other engines, like older MySQL row formats, fixed vs variable length had storage implications.)What a strong answer coversCHAR(n): fixed length, space-padded — only for genuinely fixed-width values.VARCHAR(n): variable length with an enforced maximum.TEXT: variable length, effectively unlimited.In Postgres these perform the same; a length cap is a constraint, not a speed win.
Follow-ups they push on- Does a VARCHAR(255) store faster than VARCHAR(1000) in Postgres? (no)
- When is CHAR(n) actually the right choice?
Red flag Believing a smaller VARCHAR(n) is faster or saves space in Postgres, or using CHAR for general text and getting surprised by trailing-space padding.
source: PostgreSQL docs — Character Types ↗ -
How do you classify employees into salary bands ('low'/'mid'/'high') in a single SELECT?
Use a
CASEexpression, which is SQL's inline if/else:SELECT name, salary, CASE WHEN salary < 50000 THEN 'low' WHEN salary < 100000 THEN 'mid' ELSE 'high' END AS band FROM employee;The searched
CASEevaluatesWHENbranches top-to-bottom and returns the first match, so order the boundaries carefully. With noELSE, unmatched rows getNULL. You can also wrapCASEinside an aggregate (SUM(CASE WHEN … THEN 1 ELSE 0 END)) for conditional counts — the classic pivot trick.What a strong answer coversCASE WHEN … THEN … [WHEN …] ELSE … ENDreturns the first matching branch.Branches are evaluated in order — overlapping conditions resolve to the first true one.
Omitting
ELSEyieldsNULLfor unmatched rows.SUM(CASE WHEN cond THEN 1 ELSE 0 END)does conditional counting / pivoting.
Follow-ups they push on- Rewrite a conditional COUNT using SUM(CASE WHEN …).
- What's the difference between a simple CASE and a searched CASE?
Red flag Ordering CASE branches so a broad condition shadows a narrower one, or forgetting that without ELSE the result is NULL, not 0.
source: PostgreSQL docs — Conditional Expressions (CASE) ↗