3.1 ★ core [J] 14 interview Q's

Relational model & SQL basics

Keys, SELECT/WHERE/ORDER BY from scratch, and GROUP BY with aggregates — plus the WHERE-vs-HAVING distinction. Practice it live in the playground.

SQL is a declarative language: you describe the rows you want, and the database figures out how to fetch them. This chapter rebuilds the spine — keys, the SELECT skeleton, and the one distinction interviewers love to probe: WHERE versus HAVING. Everything here runs against the playground schema below.

Keys: what makes a row findable

A primary key answers “which row is this?” — it is unique, never NULL, and the database usually builds an index on it automatically. A foreign key is how one table points at another: orders.customer_id holds a customers.id, and the database can refuse to insert an order for a customer that doesn’t exist (referential integrity). A composite key uses more than one column because no single column is unique on its own — order_items has no id; a row is identified by the (order_id, product_id) pair. A unique key is like a primary key’s relaxed cousin: it forbids duplicates but tolerates NULL and can coexist with other unique constraints on the same table.

The SELECT skeleton

Every query you write is some subset of this shape, and the clauses are executed in a different order than they’re written. That mismatch is the source of most beginner confusion, so it’s worth memorizing:

Written order:    SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY … LIMIT
Logical order:    FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

The practical consequence: WHERE runs before grouping (so it can’t see aggregates), and ORDER BY runs after SELECT (so it can use column aliases).

Logical evaluation order. Rows enter at FROM and are progressively filtered, grouped, filtered again, projected, then sorted — which is why WHERE can't see an aggregate (it runs upstream of GROUP BY) but ORDER BY can use a SELECT alias (it runs downstream).

SELECT … WHERE … ORDER BY

List USA customers, newest signup first:

SELECT  name, country, signup_date
FROM    customers
WHERE   country = 'USA'
ORDER BY signup_date DESC;

Returns the three USA rows (Ava Stone, Noah Kim, Ethan Brown) ordered by date, with Ethan Brown (2023-07-09) at the top. ORDER BY ... DESC sorts descending; drop DESC for ascending (the default).

GROUP BY + aggregates

GROUP BY folds rows that share a value into one row per group, and aggregate functions summarize each group. The iron rule: every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate. Anything else is ambiguous — the database wouldn’t know which of the grouped rows to show.

Revenue per product category

SELECT  category,
        COUNT(*)        AS num_products,
        AVG(price)      AS avg_price,
        MAX(price)      AS top_price
FROM    products
GROUP BY category
ORDER BY avg_price DESC;

One row per category. Computers leads on avg_price (~1263 — three pricey laptops/tablet, top 1850), Accessories trails (~38). COUNT(*) counts rows in the group; AVG/MAX summarize the price column within it. Note ORDER BY avg_price works because ordering happens after the aggregates are computed.

WHERE vs HAVING — the classic confusion

This is the distinction the source flags by name, and it falls straight out of the logical execution order above:

Clause	Runs	Filters	Can it see aggregates?
`WHERE`	before `GROUP BY`	individual rows	No — aggregates don't exist yet
`HAVING`	after `GROUP BY`	whole groups	Yes — that's its whole job

WHERE filters rows going IN; HAVING filters groups coming OUT.

WHERE and HAVING in one query

“Among orders placed from April 2024 onward, which customers have more than one order?” The date restriction is a per-row test (WHERE); the “more than one” test is a per-group test on a count (HAVING):

SELECT  customer_id,
        COUNT(*) AS order_count
FROM    orders
WHERE   order_date >= '2024-04-01'        -- row filter, before grouping
GROUP BY customer_id
HAVING  COUNT(*) > 1                       -- group filter, after grouping
ORDER BY order_count DESC;

WHERE thins the rows first — and it genuinely thins them: customer 1’s two early orders (Jan 5, Feb 11) are filtered out, leaving them just one surviving order, so they drop out of the result entirely. After grouping, HAVING keeps only buckets with a count above 1. Three customers survive: 9 (three orders), then 4 and 8 (two each). Run it without the WHERE line and customer 1 reappears with three — proof the date filter changed the outcome.

Don't put an aggregate in WHERE

This fails — WHERE runs before any grouping, so COUNT(*) isn’t defined yet:

-- ERROR: aggregate functions are not allowed in WHERE
SELECT customer_id FROM orders
WHERE COUNT(*) > 1
GROUP BY customer_id;

Move the aggregate test to HAVING. Conversely, don’t push a plain row condition into HAVING “because it’s after GROUP BY” — it works, but filtering rows in WHERE first is cheaper (fewer rows to group).

From the trenches

The most expensive bugs in reporting aren’t crashes — they’re queries that run fine and return a wrong number nobody questions. The classic is three-valued logic: NULL is neither equal nor unequal to anything, so WHERE status = 'active' silently excludes every row where status IS NULL, and COUNT(col) quietly skips them too. A dashboard sums “revenue per region,” a handful of orders have a NULL region, and the totals are subtly low — for months — until finance reconciles against the ledger and the numbers don’t tie out. There’s no error to grep for; the query is valid, just not asking what its author thought. The defensive habits: prefer COUNT(*) when you mean “how many rows,” reach for COUNT(col) deliberately when you mean “how many non-null,” and use COALESCE(col, 'unknown') before grouping so NULLs become a visible bucket instead of vanishing.

Open the playground below and run the GROUP BY + HAVING example — then change > 1 to >= 1, or delete the WHERE line, and watch the result set grow.

01 Learning objectives

0 / 3 done

02 Curated reading

PG Exercises
essential 30m — Hands-on graded SQL on a sample schema.
PostgreSQL Tutorial — SELECT
optional 15m — Syntax refresher for the basics.

03 SQL challenges

SQL challenges · ecommerce.db

0/5 solved

SQL-01List the names of all customers from Canada, in alphabetical order.

04 Knowledge check

knowledge check4 questions · pass ≥ 70%

01easy
WHERE vs HAVING — which filters AFTER grouping?
02easy
A FOREIGN KEY's job is to:
03medium
Which aggregate ignores NULLs when counting a specific column, e.g. COUNT(col)?
04medium
You can reference a column alias defined in SELECT inside the WHERE clause.

05 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

Amazon junior concept very common What is the difference between a PRIMARY KEY, a UNIQUE constraint, and a FOREIGN KEY?
A primary key uniquely identifies a row: it is UNIQUE and NOT NULL, and there is exactly one per table.
A UNIQUE constraint also forbids duplicates but *does* allow a NULL (one, in most engines), and a table can have many of them.
A foreign key is a column whose values must exist as a key in another table — it enforces referential integrity (you cannot insert an order for a customer that does not exist, and the DB can block/cascade deletes).
Follow-ups they push on
- What is a composite key?
- Can a foreign key reference a UNIQUE column instead of a primary key?
Red flag Saying a primary key is 'just a unique column' and forgetting the implicit NOT NULL, or claiming a table can have several primary keys (it has one, possibly composite).
source: DataLemur — Amazon SQL Interview Questions ↗
Commonly asked junior coding very common Write a query to return all employees in the Engineering department earning more than 100000, sorted by salary descending.
Straight SELECT … WHERE … ORDER BY:
SELECT name, salary FROM employees WHERE department = 'Engineering' AND salary > 100000 ORDER BY salary DESC;
Watch the clause order — WHERE filters rows, ORDER BY runs last. String literals are single-quoted; double quotes mean an identifier in standard SQL.
Follow-ups they push on
- Add a tie-breaker so equal salaries sort by name.
- Return only the top 5 — LIMIT vs TOP vs FETCH FIRST?
Red flag Using double quotes around the string literal (an identifier in standard SQL/Postgres), or putting ORDER BY before WHERE.
source: PG Exercises — Basic ↗
Commonly asked junior concept very common What is the difference between WHERE and HAVING, and why can't you put an aggregate in WHERE?
WHERE filters individual rows before grouping; HAVING filters groups after the GROUP BY runs.
An aggregate like COUNT(*) is not known until rows are grouped, so it cannot appear in WHERE — it belongs in HAVING. Example: SELECT dept, COUNT(*) FROM emp WHERE active = true GROUP BY dept HAVING COUNT(*) > 5; — active is filtered per-row, the head-count per-group.
Follow-ups they push on
- Logical order of evaluation of FROM/WHERE/GROUP BY/HAVING/SELECT/ORDER BY?
- Can you reference a SELECT alias in HAVING?
Red flag Putting `WHERE COUNT(*) > 5`, or believing HAVING is just 'WHERE for the GROUP BY query' with no semantic difference.
source: PostgreSQL docs — GROUP BY and HAVING ↗
Amazon mid coding common Write a SQL query to get the average star rating for each product for each month.
Extract the month from the timestamp, group by it and the product, average the stars:
SELECT EXTRACT(MONTH FROM submit_date) AS mth, product_id, ROUND(AVG(stars), 2) AS avg_stars FROM reviews GROUP BY mth, product_id ORDER BY mth, product_id;
Every non-aggregated SELECT column must appear in GROUP BY. ROUND tidies the output.
Follow-ups they push on
- Group by year-and-month so January 2023 and January 2024 don't collapse together.
- How would NULL stars affect AVG?
Red flag Selecting a column that is neither grouped nor aggregated (errors in Postgres, silently picks an arbitrary row in old MySQL), or grouping by month only so different years merge.
source: DataLemur — Amazon 'Average Review Ratings' ↗
Commonly asked junior trick common What does NULL mean in SQL, and why does `WHERE col = NULL` return nothing?
NULL is 'unknown', not a value. Any comparison with NULL using =/<> yields UNKNOWN (not true), so the row is dropped — WHERE col = NULL always returns zero rows.
Use the dedicated operators: WHERE col IS NULL / IS NOT NULL. Note aggregates skip NULLs (AVG, COUNT(col)) but COUNT(*) counts the row regardless.
Follow-ups they push on
- What does `NULL = NULL` evaluate to?
- How does NULL behave inside NOT IN (subquery) — and why is that a trap?
Red flag Treating NULL as a value you can equality-test, or assuming `NOT IN` works when the subquery can yield a NULL (it then returns no rows).
source: PostgreSQL docs — Comparison Functions and Operators ↗
Commonly asked junior coding common Find all duplicate email addresses in a Person table (emails appearing more than once).
Group by the column and keep groups of size > 1:
SELECT email FROM person GROUP BY email HAVING COUNT(*) > 1;
This is the canonical 'GROUP BY + HAVING COUNT' pattern. To actually delete dupes you would keep MIN(id) per group and remove the rest.
Follow-ups they push on
- Now delete the duplicates, keeping the row with the smallest id.
- Could a self-join solve this too? Compare it to GROUP BY.
Red flag Using `WHERE COUNT(*) > 1`, or `SELECT DISTINCT` (which hides duplicates rather than finding them).
source: LeetCode 196 — Duplicate Emails ↗
Commonly asked junior concept common What is the difference between DELETE, TRUNCATE, and DROP?
DELETE removes rows one at a time, can have a WHERE, fires triggers, is fully transactional and rollback-able.
TRUNCATE empties the whole table in one fast metadata operation — no per-row WHERE, usually resets identity counters, minimal logging.
DROP removes the table definition itself (and its data) from the schema.
Mnemonic: DELETE = some/all rows, TRUNCATE = all rows fast, DROP = the table is gone.
Follow-ups they push on
- Is TRUNCATE transactional in Postgres? (Yes.) In other engines?
- Which of these can you roll back?
Red flag Claiming TRUNCATE can take a WHERE clause, or that DELETE and TRUNCATE are interchangeable (triggers, identity reset, and speed differ).
source: PostgreSQL docs — TRUNCATE ↗
Commonly asked junior concept common What is the difference between UNION and UNION ALL, and which is faster?
UNION concatenates two result sets and removes duplicates (an implicit DISTINCT, which costs a sort/hash). UNION ALL keeps every row, including duplicates.
UNION ALL is faster because it skips the dedup step — prefer it whenever you know the inputs are already disjoint or duplicates are acceptable. Both require the same column count and compatible types in each branch.
Follow-ups they push on
- When is UNION (with dedup) actually required?
- Difference between UNION and a FULL OUTER JOIN?
Red flag Defaulting to UNION everywhere and paying for a needless dedup, or assuming the column lists must have identical names (only count/type must match).
source: StrataScratch — Meta SQL Interview Questions ↗
Commonly asked mid concept common What is the logical order of evaluation of a SELECT's clauses, and why does it explain why you can't use a SELECT alias in WHERE?
Although you *write* SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY, the engine *logically* evaluates them as: FROM/joins -> WHERE -> GROUP BY -> HAVING -> SELECT (where aliases are assigned) -> DISTINCT -> ORDER BY -> LIMIT.
Because SELECT runs after WHERE, an alias defined in SELECT doesn't yet exist when WHERE is evaluated — so WHERE total > 100 referencing a SELECT … AS total errors. ORDER BY runs last, which is why it *can* see SELECT aliases. (MySQL leniently allows aliases in some clauses as an extension, but the standard doesn't.)
What a strong answer covers
- Logical order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.
- Aliases are created in the SELECT step, so WHERE/GROUP BY/HAVING generally can't see them.
- ORDER BY is the one clause that *can* reference SELECT aliases — it runs after SELECT.
- Workaround: repeat the expression, or wrap the query in a subquery/CTE and filter on the alias outside.
Quick self-check
Given `SELECT price * qty AS total FROM line_items WHERE total > 100;`, what happens?
Follow-ups they push on
- Why can ORDER BY use a SELECT alias but WHERE cannot?
- Does MySQL deviate from this? (it allows aliases in GROUP BY/HAVING as an extension)
Red flag Assuming the written clause order is the execution order, then being confused why a SELECT alias is 'not recognized' in WHERE or GROUP BY.
source: PostgreSQL docs — SELECT (clause processing order) ↗
★ must-know Commonly asked junior concept very common What's the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
COUNT(*) counts rows, including rows where every column is NULL. COUNT(col) counts rows where col is not NULL — NULLs are skipped. COUNT(DISTINCT col) counts the number of distinct non-NULL values.
So on a column with NULLs, COUNT(*) >= COUNT(col) >= COUNT(DISTINCT col). This trips people up in 'how many customers placed an order' style questions, where a LEFT JOIN leaves NULLs and COUNT(*) over-counts.
What a strong answer covers
- COUNT(*) counts every row regardless of NULLs.
- COUNT(col) ignores rows where col IS NULL.
- COUNT(DISTINCT col) ignores NULLs and collapses duplicates.
- After a LEFT JOIN, count a non-null right-side column (not *) to avoid counting unmatched rows.
Quick self-check
A `votes(id, choice)` column has 5 rows; `choice` is NULL in 2 of them, and the 3 non-null values are 'A','A','B'. What are COUNT(*), COUNT(choice), COUNT(DISTINCT choice)?
Follow-ups they push on
- After a LEFT JOIN, why does COUNT(*) over-report and COUNT(right_col) fix it?
- Is COUNT(1) any different from COUNT(*)? (no — same thing)
Red flag Assuming COUNT(col) counts all rows like COUNT(*), or using COUNT(*) after a LEFT JOIN and counting the NULL-filled unmatched rows.
source: PostgreSQL docs — Aggregate Functions ↗
Amazon mid coding very common Find the employee(s) with the highest salary in each department.
The robust way is a window rank so ties are kept:
WITH r AS (SELECT name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk FROM employee) SELECT name, department, salary FROM r WHERE rk = 1;
A correlated-subquery form also works: WHERE salary = (SELECT MAX(salary) FROM employee e2 WHERE e2.department = e1.department). Use RANK (not ROW_NUMBER) so two employees tied at the top of a department both appear.
What a strong answer covers
- Partition by department, order by salary DESC, keep the top rank.
- RANK() = 1 keeps all ties; ROW_NUMBER() = 1 arbitrarily keeps just one.
- Equivalent correlated subquery: compare each salary to that department's MAX.
- A global ORDER BY salary DESC LIMIT 1 is wrong — it returns one row overall, not one per department.
Follow-ups they push on
- Why RANK rather than ROW_NUMBER if ties should all be returned?
- Rewrite it without a window function using a correlated subquery.
Red flag Using `ORDER BY salary DESC LIMIT 1` (top earner overall, not per department) or ROW_NUMBER, which silently drops tied top earners.
source: LeetCode 184 — Department Highest Salary ↗
Commonly asked mid trick common What does `WHERE status NOT IN ('shipped', 'delivered')` do to rows where status is NULL, and why?
It excludes them — a row with status = NULL is *not* returned, even though NULL is obviously 'not shipped and not delivered' to a human.
NOT IN (...) expands to status <> 'shipped' AND status <> 'delivered'. Comparing NULL <> anything yields UNKNOWN, so the whole AND is UNKNOWN, and WHERE keeps only rows that are TRUE. To include NULLs you must say so explicitly: WHERE status NOT IN (...) OR status IS NULL.
What a strong answer covers
- NOT IN is sugar for a chain of <> comparisons joined by AND.
- Any comparison with NULL is UNKNOWN, and WHERE keeps only TRUE rows — so NULL rows drop.
- The far more dangerous case: a NULL inside the list makes NOT IN return *no rows at all*.
- Add OR col IS NULL to include NULLs, or prefer NOT EXISTS which is NULL-safe.
Quick self-check
`orders` has statuses 'shipped', 'pending', and NULL. `SELECT * FROM orders WHERE status NOT IN ('shipped');` returns…
Follow-ups they push on
- What happens if the value list itself contains a NULL (e.g. from a subquery)?
- How does NOT EXISTS avoid this NULL trap?
Red flag Expecting NULL rows to satisfy a `NOT IN`/`<>` filter, or using `NOT IN (subquery)` where the subquery can yield NULL and silently returning zero rows.
source: PostgreSQL docs — Row and Array Comparisons (IN / NOT IN and NULL) ↗
Commonly asked junior concept occasional What is the difference between CHAR, VARCHAR, and TEXT, and when does the choice matter?
CHAR(n) is fixed-length — it pads with spaces to n, so it suits truly fixed codes (a 2-char country code, a fixed hash). VARCHAR(n) is variable-length with a declared max, erroring if you exceed it. TEXT is variable-length with no practical limit.
In PostgreSQL there is no performance difference between them — the manual recommends text or varchar and notes char(n) is usually the *slowest* due to padding. The length limit is mainly a data-integrity constraint, not an optimization. (In some other engines, like older MySQL row formats, fixed vs variable length had storage implications.)
What a strong answer covers
- CHAR(n): fixed length, space-padded — only for genuinely fixed-width values.
- VARCHAR(n): variable length with an enforced maximum.
- TEXT: variable length, effectively unlimited.
- In Postgres these perform the same; a length cap is a constraint, not a speed win.
Follow-ups they push on
- Does a VARCHAR(255) store faster than VARCHAR(1000) in Postgres? (no)
- When is CHAR(n) actually the right choice?
Red flag Believing a smaller VARCHAR(n) is faster or saves space in Postgres, or using CHAR for general text and getting surprised by trailing-space padding.
source: PostgreSQL docs — Character Types ↗
Commonly asked junior coding common How do you classify employees into salary bands ('low'/'mid'/'high') in a single SELECT?
Use a CASE expression, which is SQL's inline if/else:
SELECT name, salary, CASE WHEN salary < 50000 THEN 'low' WHEN salary < 100000 THEN 'mid' ELSE 'high' END AS band FROM employee;
The searched CASE evaluates WHEN branches top-to-bottom and returns the first match, so order the boundaries carefully. With no ELSE, unmatched rows get NULL. You can also wrap CASE inside an aggregate (SUM(CASE WHEN … THEN 1 ELSE 0 END)) for conditional counts — the classic pivot trick.
What a strong answer covers
- CASE WHEN … THEN … [WHEN …] ELSE … END returns the first matching branch.
- Branches are evaluated in order — overlapping conditions resolve to the first true one.
- Omitting ELSE yields NULL for unmatched rows.
- SUM(CASE WHEN cond THEN 1 ELSE 0 END) does conditional counting / pivoting.
Follow-ups they push on
- Rewrite a conditional COUNT using SUM(CASE WHEN …).
- What's the difference between a simple CASE and a searched CASE?
Red flag Ordering CASE branches so a broad condition shadows a narrower one, or forgetting that without ELSE the result is NULL, not 0.
source: PostgreSQL docs — Conditional Expressions (CASE) ↗