> cs·fundamentals
interview 0% 30m read
3.5 ★ core [J][A] 14 interview Q's

Schema design & transactions

Normalization (1NF–BCNF) and when to denormalize, ACID, and the isolation levels with the exact anomaly each prevents (dirty / non-repeatable / phantom read).

Schema design decides how cleanly your data models reality; transactions decide what happens when many users touch it at once. This chapter covers normalization (and when to break it), the ACID guarantees, and the part interviewers drill hardest — the isolation levels and the exact anomaly each one prevents.

Normalization: 1NF → BCNF

Normalization is a ladder. Each form removes a specific kind of redundancy:

  • 1NF — atomic columns. Each cell holds a single value; no comma-separated lists, no repeating groups. A tags column of "a,b,c" violates 1NF.
  • 2NF — no partial dependencies. 1NF and every non-key column depends on the whole composite key, not just part of it. If order_items(order_id, product_id) stored product_name, that depends only on product_id (half the key) → split it out into products.
  • 3NF — no transitive dependencies. 2NF and non-key columns depend only on the key, not on another non-key column. Storing a customer’s country and country_dialing_code together is a transitive dependency (code depends on country, not the customer) → move country → code to its own table.
  • BCNF — every determinant is a candidate key. A stricter 3NF that closes an edge case where a non-key column determines part of a candidate key. In practice, “3NF + a rare corner case.”

The one-line mnemonic interviewers want: “the key, the whole key, and nothing but the key.” — that’s 2NF (whole key) and 3NF (nothing but the key).

before — transitive dependencyorder_idcustomercountrycode7EmmaBrazil+559LucasBrazil+5516EmmaBrazil+55”+55” duplicated 3× → change it once, miss a row, data liesafter — 3NF (one fact, one place)order_idcustcountry7EmmaBrazil9LucasBrazil16EmmaBrazilcountrycodeBrazil+55USA+1code stored once
The 3NF split. On the left, an orders table that crams the customer's country and its dialing code inline — a transitive dependency, since code is a fact about the country, not the order. Edit Brazil's code in one row and the others go stale (an update anomaly). Normalizing pulls country → code into its own table; each fact now lives in exactly one place, and orders just references it.

ACID

A transaction’s guarantees are ACID:

  • Atomicity — all-or-nothing. The classic example: a bank transfer debits one account and credits another; either both happen or neither does. A crash mid-way rolls back.
  • Consistency — the transaction moves the DB from one valid state to another, respecting all constraints (foreign keys, CHECKs, uniqueness). It never leaves half-applied invariants.
  • Isolation — concurrent transactions don’t step on each other; the result is as if they ran in some serial order. How much isolation is the dial we tune below.
  • Durability — once committed, it survives a crash (written to the write-ahead log / disk).
A transaction that must be atomic

Moving an order’s line item between two orders must be all-or-nothing:

BEGIN;
  UPDATE order_items SET order_id = 2 WHERE order_id = 1 AND product_id = 2;
  UPDATE orders SET status = 'pending' WHERE id = 1;
COMMIT;   -- ROLLBACK here would undo BOTH statements

Without the transaction, a crash between the two UPDATEs leaves the data inconsistent — the line item moved to order 2 but order 1 never marked pending. BEGIN … COMMIT makes them a single atomic unit.

Isolation levels and the anomalies they prevent

This is the table to memorize. As you raise the isolation level, you forbid one more anomaly — at the cost of more locking/contention. First, the three anomalies:

  • Dirty read — you read a row another transaction has written but not yet committed; if it rolls back, you read data that never officially existed.
  • Non-repeatable read — you read a row twice in one transaction and get different values, because another transaction committed an UPDATE in between.
  • Phantom read — you run the same WHERE query twice and the set of rows changes (new rows appear / disappear), because another transaction committed an INSERT/DELETE matching your filter.
dirty readT2T1UPDATE bal=50read 50ROLLBACKread a value that never committednon-repeatable readT2T1read 50UPDATE→80, COMMITread 80same row, value changed mid-txnphantom readT2T15 rowsINSERT, COMMIT6 rowssame WHERE, row set changedeach level up forbids one more anomaly:READ UNCOMMITTEDallows all threeREAD COMMITTEDno dirtyREPEATABLE READ+ no non-repeatableSERIALIZABLE+ no phantoms
The three anomalies on a timeline, each between two concurrent transactions T1 and T2. Dirty read: T1 reads a value T2 hasn't committed (and may roll back). Non-repeatable read: T1 reads the same row twice and the value changes because T2 committed an UPDATE in between. Phantom read: T1 runs the same WHERE twice and the row count changes because T2 committed an INSERT.
Isolation levelDirty readNon-repeatable readPhantom read
READ UNCOMMITTEDpossiblepossiblepossible
READ COMMITTEDpreventedpossiblepossible
REPEATABLE READpreventedpreventedpossible*
SERIALIZABLEpreventedpreventedprevented
Each level up forbids one more anomaly. (*ANSI allows phantoms at REPEATABLE READ; some engines like PostgreSQL/InnoDB prevent them at this level via snapshot/next-key locking.)

The progression is the thing to internalize: READ UNCOMMITTED allows everything (dirty reads included). READ COMMITTED — the common default in PostgreSQL/Oracle — only ever reads committed data, killing dirty reads. REPEATABLE READ — MySQL/InnoDB’s default — additionally guarantees a row you’ve read won’t change under you. SERIALIZABLE is the strongest: transactions behave as if run one-at-a-time, eliminating phantoms too.

Locking

Isolation is implemented partly with locks. The two basic modes:

  • Shared (S) lock — held for reads; multiple shared locks can coexist on the same row (many readers, no writer).
  • Exclusive (X) lock — held for writes; it blocks all other locks (one writer, no one else) until released at commit.

The hazard is a deadlock: transaction A holds row 1 and wants row 2 while B holds row 2 and wants row 1. The database detects the cycle and aborts one of them — which your application must catch and retry. (Many modern databases sidestep read/write blocking entirely with MVCC, keeping multiple row versions so readers never block writers.)

ORMs vs raw SQL

ORMRaw SQL
Speed of developmentfast — models, migrations, less boilerplateslower — hand-write everything
Safetyparameterized by default (SQL-injection resistant)you must parameterize manually
Control / performancecan hide bad queries (N+1!), harder to tunefull control; tune exactly
Complex queriesawkward — window functions, CTEs fight the abstractionnatural
Most teams use an ORM for CRUD and drop to raw SQL for the hot/complex queries.

The pragmatic answer to “ORM or raw SQL?” is both: lean on the ORM for the 90% of CRUD where productivity and injection-safety matter, and drop to raw SQL for reporting, window functions, and the few queries you need to hand-tune. The ORM’s biggest trap is the N+1 problem from chapter 3.4 — it makes the expensive thing invisible.

01 Learning objectives

0 / 5 done

02 Curated reading

03 Knowledge check

knowledge check4 questions · pass ≥ 70%
  1. 01easy

    The 'D' in ACID stands for:

  2. 02medium

    Denormalizing (duplicating data) can be a deliberate choice to speed up reads.

  3. 03hard

    T1 reads a row; T2 updates+commits it; T1 re-reads and sees a different value. This anomaly is a:

  4. 04hard

    Which isolation level prevents dirty reads but still allows non-repeatable reads?

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • Commonly asked mid concept common Explain 1NF, 2NF, and 3NF each in a sentence, with an example violation.

    1NF — atomic columns, no repeating groups/arrays in a cell (violated by a comma-separated phones column).

    2NF — 1NF plus no non-key column depends on only part of a composite key (in (order_id, product_id) -> product_name, product_name depends on product_id alone — split it out).

    3NF — 2NF plus no transitive dependency: non-key columns depend only on the key (storing zip and city, where zip -> city, is a transitive dependency; move it to a zip table).

    Mnemonic: 'the key, the whole key, and nothing but the key.'

    Red flag Reciting the names without being able to name a concrete violation, or conflating 2NF (partial dependency) with 3NF (transitive dependency).

    source: Wikipedia — Database normalization ↗
  • Commonly asked mid concept common When would you deliberately denormalize a schema?

    Denormalize to trade write/consistency cost for read speed when reads dominate and joins are the bottleneck. Common cases: duplicating a category_name onto an orders table to avoid a join on every report; precomputed counts/totals (a comment_count column) to skip aggregation; materialized views; read-optimized analytics tables.

    The cost: every duplicated fact must be kept in sync on write (triggers, app logic, or background jobs), risking drift. Rule of thumb: normalize first for correctness, denormalize surgically where a measured read path demands it.

    Red flag Denormalizing prematurely 'for performance' without a measured hot path, then fighting update anomalies and data drift.

    source: Wikipedia — Denormalization ↗
  • Amazon mid concept very common What does ACID stand for, and what does each property actually guarantee?

    Atomicity — a transaction is all-or-nothing; partial failure rolls the whole thing back.

    Consistency — a committed transaction moves the DB from one valid state to another, preserving constraints/invariants.

    Isolation — concurrent transactions don't see each other's uncommitted, in-flight state (degree set by the isolation level).

    Durability — once committed, the change survives a crash (write-ahead log / fsync).

    Classic example: a bank transfer must debit and credit atomically, never leaving money half-moved.

    Red flag Conflating ACID's 'Consistency' (constraint preservation) with the distributed-systems 'consistency' of CAP — different concepts.

    source: PostgreSQL docs — Transactions ↗
  • Commonly asked senior concept very common Define dirty read, non-repeatable read, and phantom read, and map each to the isolation level that prevents it.

    Dirty read — you read another transaction's uncommitted change (which may be rolled back). Prevented at READ COMMITTED and above.

    Non-repeatable read — you read a row twice and get different values because another committed transaction updated it between reads. Prevented at REPEATABLE READ and above.

    Phantom read — you re-run a range query and new rows appear (or vanish) because another transaction inserted/deleted matching rows. Prevented at SERIALIZABLE.

    So the ladder is READ UNCOMMITTED -> READ COMMITTED -> REPEATABLE READ -> SERIALIZABLE, each forbidding one more anomaly.

    Red flag Swapping non-repeatable (an UPDATE to existing rows) with phantom (INSERT/DELETE changing which rows match), or assuming every engine maps the levels identically.

    source: PostgreSQL docs — Transaction Isolation ↗
  • Commonly asked senior trick occasional PostgreSQL's REPEATABLE READ prevents phantom reads, which the SQL standard doesn't require at that level. Why?

    Because Postgres implements isolation with MVCC + snapshots, not range locks. At REPEATABLE READ it takes one consistent snapshot at the first statement and every read in the transaction sees the database exactly as of that snapshot — so new rows inserted by others are invisible, eliminating phantoms too.

    The SQL standard only *requires* REPEATABLE READ to block dirty + non-repeatable reads; Postgres is strictly stronger. (Its SERIALIZABLE adds Serializable Snapshot Isolation to also catch write-skew.) Takeaway: the named levels are minimum guarantees — engines often exceed them, so verify per-engine.

    Red flag Assuming the SQL-standard anomaly table is literally true for every database — engine implementations (MVCC vs locking) change the real guarantees.

    source: PostgreSQL docs — Repeatable Read Isolation Level ↗
  • Commonly asked senior concept common Explain shared vs exclusive locks and how a deadlock arises.

    A shared (read) lock lets many transactions hold it at once but blocks writers. An exclusive (write) lock is held by exactly one transaction and blocks everyone else on that resource. Shared/shared is compatible; anything with exclusive is not.

    A deadlock is a cycle of waits: T1 holds A and wants B; T2 holds B and wants A — neither can proceed. The DB detects the cycle and aborts one transaction (the 'deadlock victim'); your code should catch the error and retry. Avoid them by acquiring locks in a consistent order and keeping transactions short.

    Red flag Thinking the DB hangs forever on a deadlock — it detects the cycle and kills a victim; the app must handle the retry. Also confusing a deadlock with a long lock-wait.

    source: PostgreSQL docs — Explicit Locking / Deadlocks ↗
  • Commonly asked senior design common Two users buy the last item in stock at the same time and you oversell. How do you prevent the race with the database?

    It's a lost-update / check-then-act race: both read stock = 1, both decrement. Fixes:

    - Pessimistic lock: SELECT stock FROM items WHERE id = ? FOR UPDATE inside a transaction — the second buyer blocks until the first commits, then sees 0.
    - Atomic conditional write: UPDATE items SET stock = stock - 1 WHERE id = ? AND stock > 0 and check the affected-row count — zero rows means it was already sold out. No separate read needed.
    - Optimistic concurrency: a version column, UPDATE … WHERE version = ?; retry on conflict. Best under low contention.

    The atomic conditional UPDATE is usually the simplest correct answer.

    Red flag Doing read-then-write in application code without a lock or atomic update and assuming the transaction wrapper alone prevents the lost update (it doesn't at READ COMMITTED).

    source: PostgreSQL docs — Explicit Locking (Row-Level Locks / FOR UPDATE) ↗
  • Commonly asked mid concept common ORM vs raw SQL — what are the trade-offs and when do you drop to raw SQL?

    ORM wins on productivity and safety: less boilerplate, parameterized queries (SQL-injection resistant by default), migrations, mapping rows to objects, DB portability.

    Raw SQL wins on control and performance: complex joins, window functions, CTEs, query-plan tuning, and bulk operations the ORM expresses poorly or N+1's.

    Practical stance: ORM for the 90% of CRUD, drop to raw/handwritten SQL (most ORMs allow it) for hot, complex, or analytical queries. The ORM's biggest footgun is hidden N+1 queries.

    Red flag Treating it as religious all-or-nothing, or not knowing the ORM's N+1 / lazy-loading traps and over-fetching.

    source: StrataScratch — SQL Interview Questions: The Ultimate Guide ↗
  • Commonly asked senior trick occasional What is a write-skew anomaly, and why can it slip past REPEATABLE READ / snapshot isolation?

    Write skew: two transactions each read an overlapping set of rows, each checks an invariant that currently holds, then each writes to a different row — and the combined result violates the invariant that neither saw broken.

    Classic case: a hospital requires >=1 doctor on call. Two on-call doctors each run 'if more than one is on call, I can go off-call', both read 2-on-call (true), both update their own row, and now zero are on call. Snapshot isolation / REPEATABLE READ doesn't catch it because the two transactions write *disjoint* rows — there's no write-write conflict, only a read-write dependency cycle. Only SERIALIZABLE (in Postgres, Serializable Snapshot Isolation) detects the dependency cycle and aborts one.

    What a strong answer covers
    • Write skew: concurrent transactions read overlapping data, then write *disjoint* rows, breaking an invariant.

    • Snapshot isolation misses it because there's no write-write conflict to detect.

    • It's a read-write dependency cycle, not a lost update.

    • Only SERIALIZABLE (SSI in Postgres) prevents it; or use explicit SELECT … FOR UPDATE to materialize the conflict.

    Quick self-check

    Which isolation level is required to reliably prevent write skew?

    Red flag Believing REPEATABLE READ/snapshot isolation prevents all anomalies — it still allows write skew, which needs SERIALIZABLE or explicit locking.

    source: PostgreSQL docs — Serializable Isolation Level (write skew) ↗
  • ★ must-know Commonly asked senior concept common Optimistic vs pessimistic concurrency control — how do they work and when do you pick each?

    Pessimistic: assume conflicts are likely, so lock the row up front (SELECT … FOR UPDATE) and hold it until commit; others wait. Correct and simple, but locks reduce concurrency and risk deadlocks and lock-wait timeouts.

    Optimistic: assume conflicts are rare, so don't lock — read a version/timestamp, and at write time do UPDATE … WHERE id = ? AND version = :read_version. If zero rows update, someone else changed it: abort and retry. No locks held during the user's think-time.

    Pick pessimistic for high contention / short critical sections where retries would thrash; optimistic for low contention and long read-think-write cycles (web forms, APIs) where holding a lock across a round-trip is unacceptable.

    What a strong answer covers
    • Pessimistic = lock first (FOR UPDATE); others block until commit.

    • Optimistic = no lock; detect conflict at write via a version/timestamp check, then retry.

    • High contention favors pessimistic (avoid retry storms); low contention favors optimistic.

    • Optimistic avoids holding a lock across user think-time / network round-trips.

    • Both need a transaction; optimistic additionally needs retry logic in the app.

    Quick self-check

    A web 'edit profile' form is open for minutes before submit; conflicts are rare. Best concurrency strategy?

    Red flag Using optimistic locking under heavy contention (constant retry/abort churn), or holding a pessimistic lock across a user's think-time and serializing everyone.

    source: PostgreSQL docs — Concurrency Control / Explicit Locking ↗
  • Commonly asked junior concept common What does referential integrity mean, and what are ON DELETE CASCADE / RESTRICT / SET NULL?

    Referential integrity is the guarantee that a foreign key always points at a row that exists (or is NULL) — you can't have an order for a customer who was deleted. The DB enforces it for you.

    The ON DELETE (and ON UPDATE) clause decides what happens to children when the parent is deleted:

    - RESTRICT / NO ACTION — block the delete if children exist (the safe default).
    - CASCADE — delete the children too.
    - SET NULL — keep children but null out their FK (requires a nullable column).
    - SET DEFAULT — set the FK to its default.

    Choose CASCADE only when children are truly owned by the parent (an order's line items); use RESTRICT for shared/important references to avoid accidental mass deletes.

    What a strong answer covers
    • Referential integrity: every FK value must match an existing PK (or be NULL).

    • RESTRICT/NO ACTION blocks deleting a parent that still has children.

    • CASCADE deletes the children with the parent — powerful but easy to mass-delete by accident.

    • SET NULL/SET DEFAULT keep the child but clear/replace its FK.

    • FK enforcement requires an index (often the child FK column) for the check to be efficient.

    Red flag Adding `ON DELETE CASCADE` everywhere and triggering a surprise mass-delete, or forgetting that SET NULL needs the FK column to be nullable.

    source: PostgreSQL docs — Foreign Keys (referential actions) ↗
  • Commonly asked senior concept occasional What is BCNF and how does it differ from 3NF? Give a case where a table is in 3NF but not BCNF.

    BCNF (Boyce-Codd Normal Form) is a stricter 3NF: for *every* non-trivial functional dependency X -> Y, X must be a superkey. 3NF allows a narrow exception — a dependency is OK if its right side is a *prime* attribute (part of some candidate key) — and BCNF removes that exception.

    The textbook case needs overlapping candidate keys. Table (student, course, instructor) where each course is taught by one instructor (instructor -> course) and a student takes a course with one instructor ({student, course} -> instructor). Candidate keys are {student, course} and {student, instructor}. The dependency instructor -> course has a non-superkey left side, so it violates BCNF — yet the table is in 3NF because course is a prime attribute. Fix: split into (instructor, course) and (student, instructor).

    What a strong answer covers
    • BCNF: every non-trivial FD's determinant (left side) must be a superkey — no exceptions.

    • 3NF permits a dependency whose right side is a prime (key) attribute; BCNF forbids it.

    • Violations require overlapping/composite candidate keys.

    • BCNF decomposition can occasionally sacrifice dependency-preservation — a real trade-off.

    Red flag Claiming 3NF and BCNF are identical — they diverge precisely when a non-key attribute determines a prime attribute under overlapping candidate keys.

    source: Wikipedia — Boyce-Codd normal form ↗
  • Commonly asked mid concept common What is the difference between surrogate and natural primary keys, and what are the trade-offs?

    A natural key is a real-world attribute already unique (SSN, ISBN, email, country code). A surrogate key is a system-generated, meaningless identifier (auto-increment id, UUID) added solely to identify the row.

    Surrogates win in practice: they're stable (a natural key like email can change, breaking every FK referencing it), compact, and uniform. Naturals avoid an extra column and can prevent duplicate business rows. Common pattern: use a surrogate PK for joins/FKs and a UNIQUE constraint on the natural key to enforce business uniqueness. Note the UUID choice matters: random UUIDv4 PKs fragment a clustered index (random insert order); UUIDv7/ULID are time-ordered to avoid that.

    What a strong answer covers
    • Natural key = meaningful real-world attribute; surrogate = synthetic id (serial/UUID).

    • Surrogates are stable under business changes; natural keys can mutate and cascade.

    • Best practice: surrogate PK + a UNIQUE constraint on the natural key.

    • Random UUIDv4 as a clustered PK hurts insert locality; prefer UUIDv7/ULID or bigserial.

    Red flag Using a mutable natural key (email/phone) as the PK so a single change cascades through every foreign key, or choosing random UUIDv4 PKs and fragmenting the clustered index.

    source: Wikipedia — Surrogate key ↗
  • Commonly asked senior concept occasional Why should long-running transactions be avoided, especially under MVCC?

    Under MVCC, an UPDATE/DELETE doesn't overwrite — it creates a new row version and leaves the old one as a 'dead tuple' until no transaction could still need it. A long-running (or idle-in-transaction) transaction holds an old snapshot open, so the vacuum/garbage-collector can't reclaim those dead tuples — leading to table/index bloat, slower scans, and transaction-ID wraparound pressure in Postgres.

    Long transactions also hold locks longer (more contention and deadlock risk) and amplify lost-update windows. The fix: keep transactions short, never leave one open across user think-time or external API calls, batch large mutations, and watch for idle in transaction connections.

    What a strong answer covers
    • MVCC keeps old row versions until no open snapshot needs them.

    • A long/idle transaction pins an old snapshot, blocking VACUUM from reclaiming dead tuples -> bloat.

    • It also holds locks longer (contention, deadlocks) and, in Postgres, raises wraparound risk.

    • Keep transactions short; never span user think-time or slow external calls; batch big writes.

    Red flag Opening a transaction, then making a slow external API call or waiting on user input inside it — pinning the MVCC snapshot, blocking vacuum, and bloating the table.

    source: PostgreSQL docs — Routine Vacuuming (dead tuples / bloat) ↗