3.6 ★ core [A] 14 interview Q's

NoSQL

The four NoSQL families and when each fits, BASE vs ACID, and sharding vs replication vs partitioning.

“NoSQL” isn’t one thing — it’s four families of databases that each drop some relational guarantees to win something else (scale, flexibility, or a data shape SQL handles badly). This chapter is about matching the family to the workload, and understanding the consistency and scaling trade-offs underneath them all.

The four families

The mistake is treating NoSQL as a monolith. Each family is shaped around a different access pattern — pick by how you read and write, not by hype.

| Family | Data shape | Reach for it when | Examples |
|---|---|---|---|
| Document | JSON-like documents, flexible schema | content, catalogs, user profiles: nested data read/written whole | MongoDB, Couchbase |
| Key-value | opaque value behind a key | caching, sessions, leaderboards: fastest by-key access | Redis, DynamoDB |
| Wide-column | partitioned rows, flexible columns | time-series, IoT, write-heavy at massive scale | Cassandra, HBase |
| Graph | nodes + edges | social graphs, fraud rings, recommendations: relationship traversal | Neo4j, Neptune |

Choose by access pattern: whole-document, by-key, write-scale, or relationship traversal.

A few sharper framings:

  • Document shines when your entity is naturally a nested blob you load and save as a unit (an order with its line items embedded), and the schema evolves often. You trade ad-hoc multi-table joins for read locality. The flip side: data you embed in two documents is duplicated, so a fact that changes must be updated in every copy — the denormalization trade from chapter 3.5, now baked into the model.
  • Key-value is the simplest model and the fastest: O(1) by key, nothing else. It’s the workhorse behind caches (chapter 2.8), session stores, and leaderboards (Redis sorted sets). You cannot query by value — if you need “all users in Brazil,” a pure key-value store can’t help; you’d maintain that index yourself.
  • Wide-column is built for write throughput across a cluster. Cassandra has no joins and you design the table per query — the opposite of normalization. You duplicate data into one table per access pattern, because a write is cheap and a scan across partitions is the thing you must never do. Great for time-series/IoT firehoses.
  • Graph wins when the relationships are the point. “Friends of friends who bought X” is a cheap traversal in Neo4j (hop edge to edge) and an expensive pile of self-joins in SQL that gets worse with each degree of separation.
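The key-value bullet's point about maintaining an index yourself can be sketched as follows. This is a minimal illustration over a toy in-memory store; `KeyValueStore`, `UserDirectory`, and the `country` field are hypothetical names chosen for the example, not any product's API.

```python
from collections import defaultdict


class KeyValueStore:
    """Toy key-value store: values are opaque, lookups are by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class UserDirectory:
    """Keeps a hand-maintained country -> user-id index next to the store."""

    def __init__(self):
        self.store = KeyValueStore()
        self.by_country = defaultdict(set)

    def save_user(self, user_id, profile):
        old = self.store.get(user_id)
        if old is not None:
            # Drop the stale index entry before re-indexing the new value.
            self.by_country[old["country"]].discard(user_id)
        self.store.put(user_id, profile)
        self.by_country[profile["country"]].add(user_id)

    def users_in(self, country):
        # The store cannot answer this; only the side index can.
        return sorted(self.by_country.get(country, set()))
```

Every write now touches two structures, which is the cost the bullet describes: the application, not the database, keeps the index consistent.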

BASE vs ACID

Relational databases give you ACID (chapter 3.5): strong, immediate consistency. Many distributed NoSQL stores instead offer BASE:

  • Basically Available — the system stays up and answers, even during partitions (it favors availability).
  • Soft state — state may change over time without new input, as replicas converge.
  • Eventually consistent — given no new writes, all replicas eventually agree; but a read right after a write might return stale data.

This is the CAP theorem (chapter 2.7) playing out: under a network partition you must choose consistency or availability, and BASE systems lean toward availability. The trade is concrete — a value you just wrote on one node may not appear on another for a moment.
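The stale-read window can be made concrete with a small simulation. This is a sketch under simplifying assumptions (two in-process "replicas" and an explicit `sync()` step standing in for asynchronous replication); the class and method names are hypothetical.

```python
class Replica:
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on one replica and reach the other only when sync() runs."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self._pending = []  # replication log not yet applied to the secondary

    def write(self, key, value):
        self.primary.data[key] = value
        self._pending.append((key, value))

    def read(self, key, from_secondary=False):
        replica = self.secondary if from_secondary else self.primary
        return replica.data.get(key)

    def sync(self):
        # Apply the backlog in order; afterwards both replicas agree.
        for key, value in self._pending:
            self.secondary.data[key] = value
        self._pending.clear()


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
stale = store.read("cart:42", from_secondary=True)   # None: not replicated yet
store.sync()
fresh = store.read("cart:42", from_secondary=True)   # ["book"]: replicas converged
```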

| | ACID (relational) | BASE (many NoSQL) |
|---|---|---|
| Consistency | strong, immediate | eventual |
| Availability under partition | may reject writes | stays available |
| Best for | money, inventory, anything that must be exactly right | feeds, analytics, caches, high-write at scale |
| Mental model | correctness first | availability + scale first |

Pick ACID when wrong data is unacceptable; BASE when scale/uptime matter more than a moment's staleness.

Scaling: sharding vs replication vs partitioning

These three terms get used interchangeably and shouldn’t be — they solve different problems:

[Figure: replication (a leader and two followers, each holding all keys A–Z) beside sharding (a router sending keys A–M to shard 1 and N–Z to shard 2)]
Replication vs sharding, side by side. Replication keeps a FULL copy of all data on every node: writes go to the leader and stream to followers, so reads scale and a follower can take over if the leader dies — but every write still funnels to one leader. Sharding SPLITS the data across nodes by a shard key (A–M here, N–Z there): each node owns a disjoint slice, so writes and storage scale, but a query that spans shards must fan out and recombine.

| Technique | What it does | Scales | Cost |
|---|---|---|---|
| Replication | copies of the data on multiple servers (leader-follower) | reads + availability (HA) | doesn't scale writes; replica lag |
| Sharding | splits data across servers by a shard key | writes + storage | cross-shard queries/joins get hard |
| Partitioning | splits one dataset into parts (rows = horizontal, columns = vertical) | manageability + query pruning | choosing the partition key |

Replication = copies; sharding = split across machines; partitioning = split a dataset.

  • Replication keeps full copies of the data on multiple nodes, usually leader-follower: writes go to the leader and stream to followers. It buys you high availability (a follower is promoted if the leader dies) and read scaling (route reads to read replicas). It does not scale writes — every write still hits the single leader — and followers may lag.
  • Sharding splits the data across nodes by a shard key (e.g. customers A–M on shard 1, N–Z on shard 2). Each node owns a slice, so writes and storage scale horizontally. The cost: queries spanning shards (and joins across them) become expensive or impossible, and a bad shard key creates hotspots. Pick a key that spreads load evenly. With range-based sharding, avoid a monotonically increasing key like a timestamp, because all of today's writes land on the one shard that owns the newest range.
  • Partitioning is splitting a single dataset into parts. Horizontal partitioning splits by rows (often the same idea as sharding when the parts live on different servers); vertical partitioning splits by columns (rarely-used columns to a separate table). It improves manageability and lets the engine prune irrelevant partitions.
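The difference between a targeted read and a cross-shard fan-out can be sketched in a few lines. This is an illustration only, assuming two in-memory shards split on the first letter of the key as in the A–M / N–Z example; `ShardedStore` and its methods are hypothetical names.

```python
class ShardedStore:
    """Two shards split by the first letter of the key: A-M and N-Z."""

    def __init__(self):
        self.shards = {"A-M": {}, "N-Z": {}}

    def shard_for(self, key):
        # The router: the shard key alone decides which node owns the row.
        return "A-M" if key[0].upper() <= "M" else "N-Z"

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        # Targeted read: exactly one shard is consulted.
        return self.shards[self.shard_for(key)].get(key)

    def count_where(self, predicate):
        # Scatter-gather: a query not keyed by the shard key visits every shard
        # and the partial results are recombined here.
        return sum(
            1
            for shard in self.shards.values()
            for value in shard.values()
            if predicate(value)
        )
```

In a real deployment each shard would typically also be replicated; that layer is left out here to keep the routing logic visible.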

03 Knowledge check

3 questions · pass ≥ 70%

  1. (medium) Caching, sessions, and leaderboards are the sweet spot of which NoSQL family?
  2. (medium) BASE (vs ACID) emphasises:
  3. (hard) Sharding primarily scales ___, while replication primarily scales ___.

04 Interview questions

What gets asked on this topic, with how to approach each question, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • Commonly asked mid concept very common Name the four main NoSQL families and a use case where each beats a relational DB.

    Document (MongoDB) — flexible JSON-like docs; content, catalogs, user profiles where the shape varies.

    Key-value (Redis, DynamoDB) — fastest by-key access; caching, sessions, leaderboards, rate counters.

    Wide-column (Cassandra, HBase) — massive distributed write scale; time-series, IoT, event logs.

    Graph (Neo4j) — relationship-heavy traversals; social graphs, fraud rings, recommendations.

    The through-line: each optimizes a specific access pattern that relational tables + joins serve poorly at scale.

    Red flag Treating 'NoSQL' as one thing, or claiming it's 'schemaless so always better' — each family has a narrow sweet spot.

    source: MongoDB — Types of NoSQL Databases ↗
  • Commonly asked mid concept very common In MongoDB, when do you embed a sub-document vs reference another collection?

    Embed when the child is owned by and always read with the parent, the relationship is one-to-few, and the embedded data doesn't grow unbounded — e.g. a user's addresses inside the user document. One read fetches everything; no join.

    Reference (store an ObjectId, join with $lookup or a second query) when the child is large, shared across parents (many-to-many), updated independently, or the array would grow without bound (a celebrity's millions of followers). This avoids the 16MB document cap and write amplification.

    Rule: model around your access patterns, not entities — 'data that is accessed together should be stored together.'

    Red flag Reflexively normalizing like a relational schema, or embedding an unbounded growing array that eventually hits the 16MB document limit.

    source: MongoDB — Data Modeling Introduction ↗
  • Commonly asked mid concept common What is BASE and how does it differ from ACID?

    BASE = Basically Available, Soft state, Eventual consistency. It's the consistency model many NoSQL/distributed stores choose: stay available and partition-tolerant, accept that replicas converge *eventually* rather than being instantly consistent.

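    ACID, by contrast, insists every transaction leaves the DB in a valid state and isolated from concurrent transactions. BASE relaxes that to gain availability and horizontal scale. It's the practical face of the CAP theorem: under a network partition you pick availability (BASE/AP) or consistency (CP, where strongly consistent ACID stores usually sit). Note that CAP's consistency (every read sees the latest write) is a different property from the C in ACID (transactions preserve invariants). Use BASE where stale-by-seconds reads are fine (feeds, product views); use ACID where they aren't (payments).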

    Red flag Equating 'NoSQL' with 'no transactions' — many (MongoDB, DynamoDB) now offer ACID transactions; BASE is a choice, not an inherent limitation.

    source: MongoDB — ACID Transactions / Database Consistency ↗
  • Commonly asked senior concept common Explain sharding vs replication vs partitioning. How are they different?

    Replication — keep copies of the same data on multiple nodes (leader-follower). Goal: high availability + read scaling + durability. It does *not* increase write capacity (one leader takes writes).

    Sharding — split the dataset into disjoint pieces across nodes by a shard key, each node owning a subset. Goal: scale writes and storage beyond one machine.

    Partitioning — the general term for splitting a table: horizontal = rows split across partitions (sharding is horizontal partitioning across servers); vertical = columns split into separate tables.

    In practice you combine them: shard for write scale, then replicate each shard for HA.

    Red flag Using sharding and replication interchangeably — replicas are full copies (HA + reads), shards are disjoint subsets (write/storage scale).

    source: MongoDB — Sharding ↗
  • Commonly asked senior design occasional How would you choose a shard key, and what goes wrong with a bad one?

    A good shard key has high cardinality, even write distribution, and matches your query pattern so most queries hit one shard (targeted, not scatter-gather).

    Failure modes: a monotonically increasing key (timestamp, auto-increment id) sends all new writes to one shard — a 'hot shard'. A low-cardinality key (country, status) can't split finely enough. A key that doesn't appear in queries forces every query to fan out to all shards (broadcast).

    Mitigations: hashed shard keys to spread writes, or compound keys (e.g. user_id + time) to keep related data together while distributing load.

    Red flag Picking an auto-increment or timestamp shard key and creating a permanent hotspot, or a key absent from common queries forcing broadcasts.

    source: MongoDB — Choose a Shard Key ↗
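The hot-shard failure mode and the hashed-key mitigation can be compared directly. The sketch below assumes four shards, made-up range boundaries, and SHA-256 as the hash; real systems choose their own hash function and chunk boundaries.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical range boundaries: shard 0 owns keys < 1000, shard 1 < 2000,
# shard 2 < 3000, and shard 3 owns everything from 3000 upward.
BOUNDARIES = [1000, 2000, 3000]


def range_shard(key: int) -> int:
    return bisect.bisect_right(BOUNDARIES, key)


def hashed_shard(key: int, num_shards: int = 4) -> int:
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# A monotonically increasing key (auto-increment id or timestamp):
new_writes = range(5000, 6000)

range_load = Counter(range_shard(k) for k in new_writes)
hashed_load = Counter(hashed_shard(k) for k in new_writes)

# range_load puts all 1000 writes on shard 3, the owner of the newest range.
# hashed_load spreads the same writes across all four shards.
```

The trade-off is that hashing destroys key ordering, so range scans over the shard key turn into scatter-gather queries.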
  • Commonly asked mid concept common What is the MongoDB aggregation pipeline, and how does it map to SQL?

    The aggregation pipeline passes documents through ordered stages, each transforming the stream and feeding the next — like Unix pipes for data.

    Rough SQL mapping: $match ~ WHERE, $group ~ GROUP BY (+ aggregates), $project ~ SELECT (shape columns), $sort ~ ORDER BY, $limit/$skip ~ LIMIT/OFFSET, $lookup ~ LEFT JOIN, $unwind ~ flatten an array into rows.

    Stage order matters for performance: put $match and $sort early so they can use indexes and shrink the working set before expensive $group/$lookup.

    Red flag Ordering stages so `$match` comes after a `$group`/`$lookup`, defeating index use and processing far more documents than necessary.

    source: MongoDB — Aggregation Pipeline ↗
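To make the stage-to-SQL mapping concrete, here is a small pipeline written as plain data, next to a pure-Python function that mimics what those stages compute. The `orders` documents and their field names are invented for illustration, and `run_equivalent` is only an analogy for the stage semantics, not how MongoDB executes a pipeline.

```python
from collections import defaultdict

orders = [
    {"customer": "ana", "status": "paid", "amount": 30},
    {"customer": "bo", "status": "paid", "amount": 50},
    {"customer": "ana", "status": "paid", "amount": 45},
    {"customer": "cy", "status": "refunded", "amount": 99},
    {"customer": "cy", "status": "paid", "amount": 10},
]

# The pipeline as you would hand it to a MongoDB driver's aggregate call.
pipeline = [
    {"$match": {"status": "paid"}},                                   # WHERE
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},   # GROUP BY
    {"$sort": {"total": -1}},                                         # ORDER BY
    {"$limit": 2},                                                    # LIMIT
]


def run_equivalent(docs):
    """Pure-Python stand-in for the four stages above."""
    matched = (d for d in docs if d["status"] == "paid")
    totals = defaultdict(int)
    for d in matched:
        totals[d["customer"]] += d["amount"]
    grouped = [{"_id": c, "total": t} for c, t in totals.items()]
    grouped.sort(key=lambda g: g["total"], reverse=True)
    return grouped[:2]


top_customers = run_equivalent(orders)
# [{'_id': 'ana', 'total': 75}, {'_id': 'bo', 'total': 50}]
```

Filtering first means the grouping step only ever sees the four paid orders, which is the same reason `$match` belongs at the front of a real pipeline.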
  • Commonly asked mid concept common Why use Redis for caching, and what are the main eviction/expiry concerns?

    Redis is an in-memory key-value store, so reads/writes are microsecond-fast — ideal as a cache in front of a slower primary DB, plus sessions, rate limiters, and leaderboards (sorted sets).

    Key concerns: set a TTL (EXPIRE) so stale data ages out; pick an eviction policy for when memory is full (allkeys-lru, allkeys-lfu, volatile-ttl, etc.); and have a cache-invalidation strategy on writes (write-through, or delete-on-update). Watch for stampede — many requests recomputing a hot key the instant it expires — mitigated by locks or jittered TTLs.

    Red flag Caching without a TTL or invalidation plan (serving stale data forever), or ignoring eviction so the cache silently drops keys under memory pressure.

    source: Redis — Key eviction (docs) ↗
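The idea behind an LRU eviction policy can be sketched with an ordered dictionary. This is a textbook exact-LRU cache for illustration; Redis itself approximates LRU by sampling keys rather than tracking a full ordering, and `LRUCache` is a hypothetical class, not a Redis client.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")       # "a" is now the most recently used key
cache.set("c", 3)    # over capacity: "b" is evicted, not "a"
```

The eviction is silent from the caller's point of view, which is why the red flag above warns about ignoring it.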
  • Commonly asked mid concept common When would you NOT use NoSQL — i.e., when is a relational database still the right call?

    Choose relational when you need strong multi-row transactions / ACID (money, inventory, bookings), flexible ad-hoc queries and joins across well-structured related data, constraints and referential integrity enforced by the DB, and a stable schema.

    NoSQL earns its place for huge scale on a known access pattern, flexible/evolving document shapes, or relationship-traversal workloads. The honest senior answer is 'it depends on access patterns and consistency needs' — and modern Postgres (JSONB, partitioning, logical replication) covers many cases people reach for NoSQL for.

    Red flag Picking NoSQL for hype/scale you don't have, then reimplementing joins and transactions in application code; or assuming relational 'can't scale'.

    source: MongoDB — NoSQL vs SQL Databases ↗
  • ★ must-know Commonly asked senior concept common State the CAP theorem and explain why 'CA' isn't a real choice for a distributed database.

    CAP says that when a network partition (P) happens, a distributed system can preserve at most one of Consistency (every read sees the latest write) and Availability (every request gets a non-error response) — you must drop one.

    'CA' isn't a meaningful pick because partitions *will* happen in any real network — you don't get to opt out of P. So the real choice during a partition is CP (refuse/error to stay consistent — e.g. a leader-based store rejecting writes it can't replicate) or AP (answer with possibly-stale data and reconcile later — Dynamo-style stores). When there's *no* partition, a good system gives both C and A; CAP only forces the trade *during* a partition. PACELC extends it: else (no partition) you still trade latency vs consistency.

    What a strong answer covers
    • Under a partition you choose Consistency or Availability, not both.

    • Partitions are unavoidable in real networks, so P isn't optional — 'CA' is a non-choice.

    • CP = stay consistent, reject/err during partition; AP = stay available, serve stale, reconcile.

    • CAP only bites *during* a partition; PACELC adds the latency-vs-consistency trade for normal operation.

    Quick self-check

    During a network partition, a payment system that must never double-charge should behave as…

    Red flag Treating CAP as 'pick any two' (you can't drop P) or thinking it forces a permanent global trade rather than one that only applies during a partition.

    source: Wikipedia — CAP theorem ↗
  • Amazon senior concept occasional Why is NoSQL data modeling driven by access patterns, and what does DynamoDB single-table design illustrate?

    Relational modeling normalizes by entity and joins at read time. NoSQL stores (especially DynamoDB) have no joins and charge for every access, so you model queries first: list the access patterns, then design keys so each query is a single, indexed key lookup.

    Single-table design takes this to the extreme — multiple entity types (users, orders, items) share one table, distinguished by a composite primary key (a generic partition key + sort key, often PK/SK with prefixes like USER#123 / ORDER#456). Related items share a partition so one query fetches them together without a join, and secondary indexes (GSIs) serve alternate patterns. The cost is a rigid, query-specific schema that's painful to change when access patterns evolve.

    What a strong answer covers
    • No joins + per-request cost -> design around queries, not entities.

    • List access patterns first, then shape partition/sort keys so each is one key lookup.

    • Single-table design co-locates related items in a partition via prefixed composite keys.

    • Secondary indexes (GSIs) add alternate access patterns; the schema is rigid to new ones.

    Red flag Modeling a NoSQL store like a normalized relational schema and then needing joins the database can't do, forcing N round-trips or client-side joins.

    source: AWS docs — DynamoDB single-table design / data modeling ↗
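The composite-key idea can be illustrated without DynamoDB at all. The sketch below models one table as partitions of sorted items; `SingleTable`, `put_item`, and `query` are hypothetical names for this toy, and the `USER#` / `ORDER#` prefixes follow the convention described above.

```python
class SingleTable:
    """Toy table: items grouped by partition key (PK), ordered by sort key (SK)."""

    def __init__(self):
        self._partitions = {}  # PK -> {SK -> item}

    def put_item(self, pk, sk, attrs):
        self._partitions.setdefault(pk, {})[sk] = {"PK": pk, "SK": sk, **attrs}

    def query(self, pk, sk_prefix=""):
        # One partition is read; the SK prefix narrows it to one entity type.
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition) if sk.startswith(sk_prefix)]


table = SingleTable()
table.put_item("USER#123", "PROFILE", {"name": "Ana"})
table.put_item("USER#123", "ORDER#456", {"total": 30})
table.put_item("USER#123", "ORDER#789", {"total": 45})
table.put_item("USER#999", "ORDER#001", {"total": 12})

# "All orders for user 123" is a single-partition lookup, no join required.
orders_for_123 = table.query("USER#123", sk_prefix="ORDER#")
# "User 123 with everything attached" reads the same partition without a prefix.
everything_for_123 = table.query("USER#123")
```

A question such as "all orders over 40 across users" has no key that serves it here, which is the rigidity the answer above mentions.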
  • Commonly asked senior concept common What is eventual consistency, and how do read-your-writes and quorum reads/writes fit in?

    Eventual consistency: replicas may temporarily disagree, but with no new writes they all converge to the same value. The window means a read just after a write can return stale data.

    Stronger guarantees layer on top. Read-your-writes ensures *you* always see your own latest write (route your reads to a replica known to have it, or to the leader). Quorum tunes consistency per operation: with N replicas, require W acks on write and R replicas on read; if R + W > N the read and write sets overlap, so a read is guaranteed to see the latest acknowledged write (strong consistency) — at the cost of latency/availability. Dynamo-style systems expose W/R so you trade consistency against speed per call.

    What a strong answer covers
    • Eventual consistency: replicas converge once writes stop; reads can be briefly stale.

    • Read-your-writes: a session always sees its own latest write.

    • Quorum: pick W (write acks) and R (read replicas) out of N.

    • R + W > N guarantees overlap -> a read sees the latest committed write (strong consistency).

    • Higher R/W means stronger consistency but more latency and less availability.

    Quick self-check

    With N=3 replicas, which (W, R) configuration guarantees strongly-consistent reads?

    Red flag Assuming eventual consistency means 'never consistent', or thinking any single quorum value is right — R/W are a per-workload latency-vs-consistency dial.

    source: Wikipedia — Eventual consistency ↗
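The overlap claim behind R + W > N can be checked by brute force. The sketch below enumerates every possible write set and read set for a small cluster and confirms they always intersect exactly when R + W > N; the function name is hypothetical, and it models only set overlap, not the other things real systems need (such as version comparison or handling concurrent writes).

```python
from itertools import combinations


def always_overlap(n, w, r):
    """True if every W-replica write set shares a replica with every R-replica read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


N = 3
overlapping_configs = [
    (w, r)
    for w in range(1, N + 1)
    for r in range(1, N + 1)
    if always_overlap(N, w, r)
]
# [(1, 3), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
```

With N=3, (W=2, R=2) is the usual balanced choice: any two replicas that answer a read must include at least one of the two that acknowledged the write.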
  • Commonly asked mid concept occasional When is a graph database the right tool, and why does it beat relational recursive joins for deep traversals?

    Use a graph DB (Neo4j) when relationships are first-class and traversals are deep/variable-length: social graphs ('friends of friends of friends'), fraud rings, recommendation paths, dependency/permission graphs.

    In a relational store, each 'hop' is another self-join, and a 4-hop query means 4 joins whose cost compounds with table size — the optimizer re-finds matching rows by index lookup each level. A graph DB uses index-free adjacency: each node directly stores pointers to its neighbors, so traversing one more hop is O(neighbors of the current node), independent of total graph size. That makes variable-depth path queries (shortest path, reachability) both fast and natural to express (Cypher's MATCH (a)-[:FRIEND*1..4]->(b)).

    What a strong answer covers
    • Graph DBs shine when relationships and multi-hop traversal are the core workload.

    • Index-free adjacency: nodes point straight at neighbors, so a hop is local, not a global index lookup.

    • Relational deep traversal = N self-joins whose cost compounds with table size.

    • Variable-length paths (shortest path, reachability) are awkward in SQL, native in graph query languages.

    Red flag Forcing a deeply-connected, variable-depth traversal into repeated SQL self-joins/recursive CTEs and watching it degrade as hop count and table size grow.

    source: Neo4j — Graph Database Concepts (index-free adjacency) ↗
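The hop-by-hop traversal can be sketched with plain adjacency lists, which is the in-memory analogue of nodes pointing straight at their neighbors. The graph data and the `within_hops` helper are invented for illustration; a graph database would express the same query declaratively.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return every node reachable from start in at most max_hops edges."""
    seen = {start}
    reached = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        # Each hop costs only the neighbors of the current node,
        # not a lookup over the whole edge set.
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached


friends = {
    "ana": ["bo", "cy"],
    "bo": ["dee"],
    "cy": ["dee", "eli"],
    "dee": ["fay"],
    "eli": [],
    "fay": [],
}

one_hop = within_hops(friends, "ana", 1)      # {"bo", "cy"}
two_hops = within_hops(friends, "ana", 2)     # adds "dee" and "eli"
three_hops = within_hops(friends, "ana", 3)   # adds "fay"
```

Changing the depth is a one-argument change here, whereas the SQL version needs another self-join (or a recursive CTE) per level.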
  • Commonly asked mid concept common Cache-aside vs write-through vs write-behind — compare the caching strategies.

    Cache-aside (lazy): the app checks the cache; on a miss it reads the DB and populates the cache, and on writes it updates the DB and *invalidates* the key. Simple and resilient (cache down ≠ data loss), but every miss (a cold key, or one that expired or was evicted) pays the slower DB read, and there's a brief staleness window.

    Write-through: writes go to cache and DB synchronously, so the cache is always fresh — at the cost of higher write latency and caching data that may never be read.

    Write-behind (write-back): writes hit the cache and are flushed to the DB asynchronously — lowest write latency, highest throughput, but risks data loss if the cache fails before flushing and adds complexity. Cache-aside is the common default for read-heavy web workloads.

    What a strong answer covers
    • Cache-aside: app-managed, populate on miss, invalidate on write — simple, resilient, can serve stale briefly.

    • Write-through: write cache+DB together — always fresh, slower writes, may cache unread data.

    • Write-behind: async flush to DB — fastest writes, but risks loss on cache failure.

    • Default to cache-aside for read-heavy systems; reserve write-behind for write-heavy, loss-tolerant cases.

    Quick self-check

    Which strategy has the **lowest write latency** but the **highest risk of data loss**?

    Red flag Choosing write-behind for data you can't afford to lose, or running cache-aside without an invalidation step so the cache serves stale data after every update.

    source: AWS — Caching strategies (lazy loading / write-through) ↗
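The cache-aside flow (populate on miss, invalidate on write) can be sketched as below. A plain dict stands in for the database and another for the cache; `CacheAside`, its `ttl_seconds` parameter, and the `db_reads` counter are hypothetical and exist only to make the behaviour observable.

```python
import time


class CacheAside:
    """App-managed cache in front of a slower store (here, a plain dict)."""

    def __init__(self, db, ttl_seconds=60.0, clock=time.monotonic):
        self.db = db
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}   # key -> (value, expires_at)
        self.db_reads = 0  # counts trips to the backing store

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                      # cache hit
        self.db_reads += 1                       # miss or expired: go to the DB
        value = self.db.get(key)
        self._cache[key] = (value, self.clock() + self.ttl)
        return value

    def update(self, key, value):
        self.db[key] = value                     # write the source of truth first
        self._cache.pop(key, None)               # then invalidate, do not update


db = {"user:1": "Ana"}
cache = CacheAside(db)
cache.get("user:1")                 # miss: reads the DB and fills the cache
cache.get("user:1")                 # hit: served from the cache
cache.update("user:1", "Ana Maria")
latest = cache.get("user:1")        # miss again, so the new value is returned
```

Without the `pop` in `update`, the third read would return the old name until the TTL expired, which is the stale-after-update trap in the red flag above.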
  • Commonly asked mid trick occasional A NoSQL store is 'schemaless' — what does that actually mean, and what's the catch?

    'Schemaless' means the database doesn't enforce a fixed schema — different documents in a collection can have different fields, and you can add a field without a migration. It's better called schema-on-read: the structure is interpreted by the application when it reads, rather than enforced by the database on write.

    The catch is the schema doesn't disappear — it moves into your application code, which must handle missing fields, mixed types, and old document shapes (versioning) forever. Without DB-level constraints you can silently write inconsistent data, so mature NoSQL stores add optional validation (MongoDB JSON Schema validators) and teams still enforce structure in code. 'Flexible' is the upside; 'no guardrails' is the downside.

    What a strong answer covers
    • Schemaless = the DB doesn't enforce structure; really 'schema-on-read'.

    • The schema moves into application code, which must tolerate missing/old/variant shapes.

    • Flexibility speeds iteration but removes the DB's data-integrity guardrails.

    • Mitigate with optional validators (MongoDB schema validation) and explicit document versioning.

    Quick self-check

    A 'schemaless' document store most accurately means…

    Red flag Believing 'schemaless' means no schema to manage — the schema is just enforced (or not) in application code, where inconsistencies accumulate silently.

    source: MongoDB — Schema Validation ↗
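What "the schema moves into application code" looks like in practice can be sketched with a reader that tolerates two document shapes. The `schema_version` field, the v1/v2 layouts, and `read_user` are all invented for this example.

```python
def read_user(doc):
    """Normalize old and new document shapes into one in-memory structure."""
    version = doc.get("schema_version", 1)  # documents written before versioning count as v1
    if version == 1:
        # v1 stored a single "name" string.
        first, _, last = doc.get("name", "").partition(" ")
    else:
        # v2 split the name into two fields.
        first = doc.get("first_name", "")
        last = doc.get("last_name", "")
    return {
        "first_name": first,
        "last_name": last,
        "email": doc.get("email"),  # optional in both versions
    }


old_doc = {"name": "Ana Silva"}
new_doc = {
    "schema_version": 2,
    "first_name": "Bo",
    "last_name": "Li",
    "email": "bo@example.com",
}

old_user = read_user(old_doc)
new_user = read_user(new_doc)
```

Every branch in this function is schema that the database is not enforcing, and it has to stay until the last v1 document is migrated or deleted.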