NoSQL

The four NoSQL families and when each fits, BASE vs ACID, and sharding vs replication vs partitioning.

“NoSQL” isn’t one thing — it’s four families of databases that each drop some relational guarantees to win something else (scale, flexibility, or a data shape SQL handles badly). This chapter is about matching the family to the workload, and understanding the consistency and scaling trade-offs underneath them all.

The four families

The mistake is treating NoSQL as a monolith. Each family is shaped around a different access pattern — pick by how you read and write, not by hype.

Family	Data shape	Reach for it when	Examples
Document	JSON-like documents, flexible schema	content, catalogs, user profiles — nested data read/written whole	MongoDB, Couchbase
Key-value	opaque value behind a key	caching, sessions, leaderboards — fastest by-key access	Redis, DynamoDB
Wide-column	partitioned rows, flexible columns	time-series, IoT, write-heavy at massive scale	Cassandra, HBase
Graph	nodes + edges	social graphs, fraud rings, recommendations — relationship traversal	Neo4j, Neptune

Choose by access pattern: whole-document, by-key, write-scale, or relationship traversal.

A few sharper framings:

Document shines when your entity is naturally a nested blob you load and save as a unit (an order with its line items embedded), and the schema evolves often. You trade ad-hoc multi-table joins for read locality. The flip side: data you embed in two documents is duplicated, so a fact that changes must be updated in every copy — the denormalization trade from chapter 3.5, now baked into the model.
Key-value is the simplest model and the fastest: O(1) by key, nothing else. It’s the workhorse behind caches (chapter 2.8), session stores, and leaderboards (Redis sorted sets). You cannot query by value — if you need “all users in Brazil,” a pure key-value store can’t help; you’d maintain that index yourself.
Wide-column is built for write throughput across a cluster. Cassandra has no joins and you design the table per query — the opposite of normalization. You duplicate data into one table per access pattern, because a write is cheap and a scan across partitions is the thing you must never do. Great for time-series/IoT firehoses.
Graph wins when the relationships are the point. “Friends of friends who bought X” is a cheap traversal in Neo4j (hop edge to edge) and an expensive pile of self-joins in SQL that gets worse with each degree of separation.

BASE vs ACID

Relational databases give you ACID (chapter 3.5): strong, immediate consistency. Many distributed NoSQL stores instead offer BASE:

Basically Available — the system stays up and answers, even during partitions (it favors availability).
Soft state — state may change over time without new input, as replicas converge.
Eventually consistent — given no new writes, all replicas eventually agree; but a read right after a write might return stale data.

This is the CAP theorem (chapter 2.7) playing out: under a network partition you must choose consistency or availability, and BASE systems lean toward availability. The trade is concrete — a value you just wrote on one node may not appear on another for a moment.

	ACID (relational)	BASE (many NoSQL)
Consistency	strong, immediate	eventual
Availability under partition	may reject writes	stays available
Best for	money, inventory, anything that must be exactly right	feeds, analytics, caches, high-write at scale
Mental model	correctness first	availability + scale first

Pick ACID when wrong data is unacceptable; BASE when scale/uptime matter more than a moment's staleness.

Scaling: sharding vs replication vs partitioning

These three terms get used interchangeably and shouldn’t be — they solve different problems:

Replication vs sharding, side by side. Replication keeps a FULL copy of all data on every node: writes go to the leader and stream to followers, so reads scale and a follower can take over if the leader dies — but every write still funnels to one leader. Sharding SPLITS the data across nodes by a shard key (A–M here, N–Z there): each node owns a disjoint slice, so writes and storage scale, but a query that spans shards must fan out and recombine.

Technique	What it does	Scales	Cost
Replication	copies of the data on multiple servers (leader-follower)	reads + availability (HA)	doesn't scale writes; replica lag
Sharding	splits data across servers by a shard key	writes + storage	cross-shard queries/joins get hard
Partitioning	splits one dataset into parts (rows = horizontal, columns = vertical)	manageability + query pruning	choosing the partition key

Replication = copies; sharding = split across machines; partitioning = split a dataset.

Replication keeps full copies of the data on multiple nodes, usually leader-follower: writes go to the leader and stream to followers. It buys you high availability (a follower is promoted if the leader dies) and read scaling (route reads to read replicas). It does not scale writes — every write still hits the single leader — and followers may lag.
Sharding splits the data across nodes by a shard key (e.g. customers A–M on shard 1, N–Z on shard 2). Each node owns a slice, so writes and storage scale horizontally. The cost: queries spanning shards (and joins across them) become expensive or impossible, and a bad shard key creates hotspots — pick a key that spreads load evenly, never a monotonically increasing one like a timestamp (all today’s writes land on one shard).
Partitioning is splitting a single dataset into parts. Horizontal partitioning splits by rows (often the same idea as sharding when the parts live on different servers); vertical partitioning splits by columns (rarely-used columns to a separate table). It improves manageability and lets the engine prune irrelevant partitions.

From the trenches Discord

Discord’s messages live in a wide-column store, and their public engineering posts are a clinic in why you pick that family — and where its sharp edges are. Their data shape is a textbook fit: messages keyed by (channel_id, bucket), append-heavy, read in time order, far too large and too write-hot for a single relational node. So they sharded across a Cassandra cluster. But two wide-column realities bit them. First, hot partitions: a handful of enormous channels concentrated reads and writes onto a few nodes while the rest sat idle — the hotspot failure mode of a skewed key. Second, Cassandra’s eventual consistency plus tombstones (deletes are markers, not removals) meant deleted messages lingered as tombstones that slowed reads, and JVM garbage-collection pauses caused latency spikes precisely when traffic peaked. Their fix wasn’t to abandon the wide-column model — it was the right model — but to migrate the engine to ScyllaDB (a C++ Cassandra-compatible rewrite, no GC pauses) and add a Rust data-services layer to coalesce duplicate concurrent reads of the same hot partition. The lesson: NoSQL at scale isn’t “no operations” — you trade joins and strong consistency for write throughput, and you inherit a different set of problems (hotspots, tombstones, compaction, eventual-consistency reads) you must design around from day one.

read the writeup ↗ discord.com

Key points

Document (MongoDB) for flexible nested entities; key-value (Redis/DynamoDB) for fastest by-key access; wide-column (Cassandra) for write-heavy massive scale; graph (Neo4j) for relationship traversal.
Match the family to the access pattern, not the trend — and a well-indexed relational DB is still the right default for most apps.
ACID = strong/immediate consistency (use for money/inventory); BASE = Basically Available, Soft state, Eventually consistent (use for scale/uptime).
Replication = copies → scales reads + HA, not writes, and lags. Sharding = split across servers by key → scales writes/storage, but cross-shard queries are hard and a bad key creates hotspots. Partitioning = split a dataset (rows = horizontal, columns = vertical).
Read replicas scale reads but serve potentially stale data — read from the leader when you need read-your-writes.

01 Learning objectives

0 / 6 done

02 Curated reading

Designing Data-Intensive Applications (book)
optional eng blog 60m — The definitive deep dive on replication, partitioning, consistency.

03 Knowledge check

knowledge check3 questions · pass ≥ 70%

01medium
Caching, sessions, and leaderboards are the sweet spot of which NoSQL family?
02medium
BASE (vs ACID) emphasises:
03hard
Sharding primarily scales ___, while replication primarily scales ___.

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

Commonly asked mid concept very common Name the four main NoSQL families and a use case where each beats a relational DB.
Document (MongoDB) — flexible JSON-like docs; content, catalogs, user profiles where the shape varies.
Key-value (Redis, DynamoDB) — fastest by-key access; caching, sessions, leaderboards, rate counters.
Wide-column (Cassandra, HBase) — massive distributed write scale; time-series, IoT, event logs.
Graph (Neo4j) — relationship-heavy traversals; social graphs, fraud rings, recommendations.
The through-line: each optimizes a specific access pattern that relational tables + joins serve poorly at scale.
Follow-ups they push on
- Why is a graph DB better than SQL recursive joins for 'friends of friends of friends'?
- Document vs wide-column — how do their data models differ?
Red flag Treating 'NoSQL' as one thing, or claiming it's 'schemaless so always better' — each family has a narrow sweet spot.
source: MongoDB — Types of NoSQL Databases ↗
Commonly asked mid concept very common In MongoDB, when do you embed a sub-document vs reference another collection?
Embed when the child is owned by and always read with the parent, the relationship is one-to-few, and the embedded data doesn't grow unbounded — e.g. a user's addresses inside the user document. One read fetches everything; no join.
Reference (store an ObjectId, join with $lookup or a second query) when the child is large, shared across parents (many-to-many), updated independently, or the array would grow without bound (a celebrity's millions of followers). This avoids the 16MB document cap and write amplification.
Rule: model around your access patterns, not entities — 'data that is accessed together should be stored together.'
Follow-ups they push on
- What's MongoDB's document size limit, and how does it force referencing?
- How would you model a comments-on-posts relationship?
Red flag Reflexively normalizing like a relational schema, or embedding an unbounded growing array that eventually hits the 16MB document limit.
source: MongoDB — Data Modeling Introduction ↗
Commonly asked mid concept common What is BASE and how does it differ from ACID?
BASE = Basically Available, Soft state, Eventual consistency. It's the consistency model many NoSQL/distributed stores choose: stay available and partition-tolerant, accept that replicas converge *eventually* rather than being instantly consistent.
Vs ACID, which insists every transaction leaves the DB strongly consistent and isolated. BASE relaxes that to gain availability and horizontal scale. It's the practical face of the CAP theorem: under a network partition you pick availability (BASE/AP) or consistency (ACID/CP). Use BASE where stale-by-seconds reads are fine (feeds, product views); use ACID where they aren't (payments).
Follow-ups they push on
- State the CAP theorem and which corner BASE sits in.
- Give a feature where eventual consistency is unacceptable.
Red flag Equating 'NoSQL' with 'no transactions' — many (MongoDB, DynamoDB) now offer ACID transactions; BASE is a choice, not an inherent limitation.
source: MongoDB — ACID Transactions / Database Consistency ↗
Commonly asked senior concept common Explain sharding vs replication vs partitioning. How are they different?
Replication — keep copies of the same data on multiple nodes (leader-follower). Goal: high availability + read scaling + durability. It does *not* increase write capacity (one leader takes writes).
Sharding — split the dataset into disjoint pieces across nodes by a shard key, each node owning a subset. Goal: scale writes and storage beyond one machine.
Partitioning — the general term for splitting a table: horizontal = rows split across partitions (sharding is horizontal partitioning across servers); vertical = columns split into separate tables.
In practice you combine them: shard for write scale, then replicate each shard for HA.
Follow-ups they push on
- Why does replication alone not scale writes?
- How do you choose a shard key, and what's a hot-shard / hotspot?
Red flag Using sharding and replication interchangeably — replicas are full copies (HA + reads), shards are disjoint subsets (write/storage scale).
source: MongoDB — Sharding ↗
Commonly asked senior design occasional How would you choose a shard key, and what goes wrong with a bad one?
A good shard key has high cardinality, even write distribution, and matches your query pattern so most queries hit one shard (targeted, not scatter-gather).
Failure modes: a monotonically increasing key (timestamp, auto-increment id) sends all new writes to one shard — a 'hot shard'. A low-cardinality key (country, status) can't split finely enough. A key that doesn't appear in queries forces every query to fan out to all shards (broadcast).
Mitigations: hashed shard keys to spread writes, or compound keys (e.g. user_id + time) to keep related data together while distributing load.
Follow-ups they push on
- Why is a hashed shard key better for write distribution but worse for range queries?
- What is a scatter-gather query and why is it slow?
Red flag Picking an auto-increment or timestamp shard key and creating a permanent hotspot, or a key absent from common queries forcing broadcasts.
source: MongoDB — Choose a Shard Key ↗
Commonly asked mid concept common What is the MongoDB aggregation pipeline, and how does it map to SQL?
The aggregation pipeline passes documents through ordered stages, each transforming the stream and feeding the next — like Unix pipes for data.
Rough SQL mapping: $match ~ WHERE, $group ~ GROUP BY (+ aggregates), $project ~ SELECT (shape columns), $sort ~ ORDER BY, $limit/$skip ~ LIMIT/OFFSET, $lookup ~ LEFT JOIN, $unwind ~ flatten an array into rows.
Stage order matters for performance: put $match and $sort early so they can use indexes and shrink the working set before expensive $group/$lookup.
Follow-ups they push on
- Why put $match as early as possible in the pipeline?
- What does $unwind do and when is it needed before $group?
Red flag Ordering stages so `$match` comes after a `$group`/`$lookup`, defeating index use and processing far more documents than necessary.
source: MongoDB — Aggregation Pipeline ↗
Commonly asked mid concept common Why use Redis for caching, and what are the main eviction/expiry concerns?
Redis is an in-memory key-value store, so reads/writes are microsecond-fast — ideal as a cache in front of a slower primary DB, plus sessions, rate limiters, and leaderboards (sorted sets).
Key concerns: set a TTL (EXPIRE) so stale data ages out; pick an eviction policy for when memory is full (allkeys-lru, allkeys-lfu, volatile-ttl, etc.); and have a cache-invalidation strategy on writes (write-through, or delete-on-update). Watch for stampede — many requests recomputing a hot key the instant it expires — mitigated by locks or jittered TTLs.
Follow-ups they push on
- Cache-aside vs write-through vs write-behind?
- What is a cache stampede / thundering herd, and how do you avoid it?
Red flag Caching without a TTL or invalidation plan (serving stale data forever), or ignoring eviction so the cache silently drops keys under memory pressure.
source: Redis — Key eviction (docs) ↗
Commonly asked mid concept common When would you NOT use NoSQL — i.e., when is a relational database still the right call?
Choose relational when you need strong multi-row transactions / ACID (money, inventory, bookings), flexible ad-hoc queries and joins across well-structured related data, constraints and referential integrity enforced by the DB, and a stable schema.
NoSQL earns its place for huge scale on a known access pattern, flexible/evolving document shapes, or relationship-traversal workloads. The honest senior answer is 'it depends on access patterns and consistency needs' — and modern Postgres (JSONB, partitioning, logical replication) covers many cases people reach for NoSQL for.
Follow-ups they push on
- How does Postgres JSONB blur the SQL/NoSQL line?
- Polyglot persistence — when is mixing both justified?
Red flag Picking NoSQL for hype/scale you don't have, then reimplementing joins and transactions in application code; or assuming relational 'can't scale'.
source: MongoDB — NoSQL vs SQL Databases ↗
★ must-know Commonly asked senior concept common State the CAP theorem and explain why 'CA' isn't a real choice for a distributed database.
CAP says that when a network partition (P) happens, a distributed system can preserve at most one of Consistency (every read sees the latest write) and Availability (every request gets a non-error response) — you must drop one.
'CA' isn't a meaningful pick because partitions *will* happen in any real network — you don't get to opt out of P. So the real choice during a partition is CP (refuse/error to stay consistent — e.g. a leader-based store rejecting writes it can't replicate) or AP (answer with possibly-stale data and reconcile later — Dynamo-style stores). When there's *no* partition, a good system gives both C and A; CAP only forces the trade *during* a partition. PACELC extends it: else (no partition) you still trade latency vs consistency.
What a strong answer covers
- Under a partition you choose Consistency or Availability, not both.
- Partitions are unavoidable in real networks, so P isn't optional — 'CA' is a non-choice.
- CP = stay consistent, reject/err during partition; AP = stay available, serve stale, reconcile.
- CAP only bites *during* a partition; PACELC adds the latency-vs-consistency trade for normal operation.
Quick self-check
During a network partition, a payment system that must never double-charge should behave as…
Follow-ups they push on
- What does PACELC add to CAP?
- Give a real CP store and a real AP store and the workload each suits.
Red flag Treating CAP as 'pick any two' (you can't drop P) or thinking it forces a permanent global trade rather than one that only applies during a partition.
source: Wikipedia — CAP theorem ↗
Amazon senior concept occasional Why is NoSQL data modeling driven by access patterns, and what does DynamoDB single-table design illustrate?
Relational modeling normalizes by entity and joins at read time. NoSQL stores (especially DynamoDB) have no joins and charge for every access, so you model queries first: list the access patterns, then design keys so each query is a single, indexed key lookup.
Single-table design takes this to the extreme — multiple entity types (users, orders, items) share one table, distinguished by a composite primary key (a generic partition key + sort key, often PK/SK with prefixes like USER#123 / ORDER#456). Related items share a partition so one query fetches them together without a join, and secondary indexes (GSIs) serve alternate patterns. The cost is a rigid, query-specific schema that's painful to change when access patterns evolve.
What a strong answer covers
- No joins + per-request cost -> design around queries, not entities.
- List access patterns first, then shape partition/sort keys so each is one key lookup.
- Single-table design co-locates related items in a partition via prefixed composite keys.
- Secondary indexes (GSIs) add alternate access patterns; the schema is rigid to new ones.
Follow-ups they push on
- How does a composite (partition + sort) key let one query return several related items?
- What's the downside when a brand-new access pattern appears later?
Red flag Modeling a NoSQL store like a normalized relational schema and then needing joins the database can't do, forcing N round-trips or client-side joins.
source: AWS docs — DynamoDB single-table design / data modeling ↗
Commonly asked senior concept common What is eventual consistency, and how do read-your-writes and quorum reads/writes fit in?
Eventual consistency: replicas may temporarily disagree, but with no new writes they all converge to the same value. The window means a read just after a write can return stale data.
Stronger guarantees layer on top. Read-your-writes ensures *you* always see your own latest write (route your reads to a replica known to have it, or to the leader). Quorum tunes consistency per operation: with N replicas, require W acks on write and R replicas on read; if R + W > N the read and write sets overlap, so a read is guaranteed to see the latest acknowledged write (strong consistency) — at the cost of latency/availability. Dynamo-style systems expose W/R so you trade consistency against speed per call.
What a strong answer covers
- Eventual consistency: replicas converge once writes stop; reads can be briefly stale.
- Read-your-writes: a session always sees its own latest write.
- Quorum: pick W (write acks) and R (read replicas) out of N.
- R + W > N guarantees overlap -> a read sees the latest committed write (strong consistency).
- Higher R/W means stronger consistency but more latency and less availability.
Quick self-check
With N=3 replicas, which (W, R) configuration guarantees strongly-consistent reads?
Follow-ups they push on
- Why does R + W > N guarantee a read sees the newest write?
- What's a tunable example — W=N for strong writes vs W=1 for fast writes?
Red flag Assuming eventual consistency means 'never consistent', or thinking any single quorum value is right — R/W are a per-workload latency-vs-consistency dial.
source: Wikipedia — Eventual consistency ↗
Commonly asked mid concept occasional When is a graph database the right tool, and why does it beat relational recursive joins for deep traversals?
Use a graph DB (Neo4j) when relationships are first-class and traversals are deep/variable-length: social graphs ('friends of friends of friends'), fraud rings, recommendation paths, dependency/permission graphs.
In a relational store, each 'hop' is another self-join, and a 4-hop query means 4 joins whose cost compounds with table size — the optimizer re-finds matching rows by index lookup each level. A graph DB uses index-free adjacency: each node directly stores pointers to its neighbors, so traversing one more hop is O(neighbors of the current node), independent of total graph size. That makes variable-depth path queries (shortest path, reachability) both fast and natural to express (Cypher's MATCH (a)-[:FRIEND*1..4]->(b)).
What a strong answer covers
- Graph DBs shine when relationships and multi-hop traversal are the core workload.
- Index-free adjacency: nodes point straight at neighbors, so a hop is local, not a global index lookup.
- Relational deep traversal = N self-joins whose cost compounds with table size.
- Variable-length paths (shortest path, reachability) are awkward in SQL, native in graph query languages.
Follow-ups they push on
- What is index-free adjacency, concretely?
- Could a recursive CTE handle this in SQL, and where does it fall down at scale?
Red flag Forcing a deeply-connected, variable-depth traversal into repeated SQL self-joins/recursive CTEs and watching it degrade as hop count and table size grow.
source: Neo4j — Graph Database Concepts (index-free adjacency) ↗
Commonly asked mid concept common Cache-aside vs write-through vs write-behind — compare the caching strategies.
Cache-aside (lazy): the app checks the cache; on a miss it reads the DB and populates the cache, and on writes it updates the DB and *invalidates* the key. Simple and resilient (cache down ≠ data loss), but the first read after a miss/eviction is slow and there's a brief staleness window.
Write-through: writes go to cache and DB synchronously, so the cache is always fresh — at the cost of higher write latency and caching data that may never be read.
Write-behind (write-back): writes hit the cache and are flushed to the DB asynchronously — lowest write latency, highest throughput, but risks data loss if the cache fails before flushing and adds complexity. Cache-aside is the common default for read-heavy web workloads.
What a strong answer covers
- Cache-aside: app-managed, populate on miss, invalidate on write — simple, resilient, can serve stale briefly.
- Write-through: write cache+DB together — always fresh, slower writes, may cache unread data.
- Write-behind: async flush to DB — fastest writes, but risks loss on cache failure.
- Default to cache-aside for read-heavy systems; reserve write-behind for write-heavy, loss-tolerant cases.
Quick self-check
Which strategy has the **lowest write latency** but the **highest risk of data loss**?
Follow-ups they push on
- Why does write-behind risk data loss, and how do you mitigate it?
- How do you avoid a cache stampede when a hot cache-aside key expires?
Red flag Choosing write-behind for data you can't afford to lose, or running cache-aside without an invalidation step so the cache serves stale data after every update.
source: AWS — Caching strategies (lazy loading / write-through) ↗
Commonly asked mid trick occasional A NoSQL store is 'schemaless' — what does that actually mean, and what's the catch?
'Schemaless' means the database doesn't enforce a fixed schema — different documents in a collection can have different fields, and you can add a field without a migration. It's better called schema-on-read: the structure is interpreted by the application when it reads, rather than enforced by the database on write.
The catch is the schema doesn't disappear — it moves into your application code, which must handle missing fields, mixed types, and old document shapes (versioning) forever. Without DB-level constraints you can silently write inconsistent data, so mature NoSQL stores add optional validation (MongoDB JSON Schema validators) and teams still enforce structure in code. 'Flexible' is the upside; 'no guardrails' is the downside.
What a strong answer covers
- Schemaless = the DB doesn't enforce structure; really 'schema-on-read'.
- The schema moves into application code, which must tolerate missing/old/variant shapes.
- Flexibility speeds iteration but removes the DB's data-integrity guardrails.
- Mitigate with optional validators (MongoDB schema validation) and explicit document versioning.
Quick self-check
A 'schemaless' document store most accurately means…
Follow-ups they push on
- How does schema-on-read differ from schema-on-write?
- How do you evolve millions of existing documents to a new shape?
Red flag Believing 'schemaless' means no schema to manage — the schema is just enforced (or not) in application code, where inconsistencies accumulate silently.
source: MongoDB — Schema Validation ↗