2.6 [A] 13 interview Q's

Messaging & event-driven architecture

Queues and pub/sub (Kafka, RabbitMQ, SQS), why at-least-once delivery forces idempotent consumers, and event-driven vs request/response tradeoffs.

Messaging decouples producers from consumers in time: the sender drops a message and moves on, the receiver processes when it can. That buys you resilience and scale — at the cost of having to design for at-least-once delivery, which is the source of most messaging bugs. Master one idea and the rest follows: the broker can deliver a message twice, so your consumer must be idempotent.

Key vocabulary

Queue vs pub/sub: A queue delivers each message to one consumer (work distribution). Pub/sub fans a message out to every subscriber (broadcast).
Log vs broker: Kafka is a durable, replayable log — consumers track an offset and can re-read. A traditional broker (RabbitMQ) routes a message and deletes it once acked.
At-least-once delivery: The broker guarantees a message arrives, but may deliver it more than once on retry/failure. This is the default — and the reason consumers must be idempotent.
Consumer group / offset: In a log like Kafka, a consumer group reads a partition and commits an offset marking how far it has processed. Crash before committing and it re-reads from the last commit — which is precisely why duplicates appear.
Idempotent consumer: Processing the same message twice has the same effect as processing it once — usually via a dedup key or an ‘already-processed’ check before the side effect.

Queue vs pub/sub: the two delivery shapes

FIG 1 · queue vs pub/sub A queue load-balances each message to exactly one worker; pub/sub fans every message out to all subscribers. Same producer, opposite delivery.

A queue is for work distribution: N workers compete, each message handled once — scale throughput by adding workers. Pub/sub is for broadcast: one event, many independent reactions (an order-placed event triggers billing, inventory, email, and analytics, each its own subscriber). Kafka blurs the line — a topic is durable, and each consumer group gets its own independent cursor, so it can act like a queue (one group, load-balanced across partitions) or like pub/sub (many groups, each reading the full stream).

Picking a broker

Broker	Use when	Avoid when	Gotcha
Kafka	high-throughput streaming, event sourcing, replayable log, multiple independent consumers	you need complex per-message routing or low operational overhead	ordering is per-partition only; consumers manage offsets
RabbitMQ	flexible routing, task queues, request/reply, priority	millions of msgs/sec or you need long-term replay	messages are gone after ack — no rewind
SQS (managed)	you want zero ops on AWS, simple durable queue	you need ordering everywhere or sub-ms latency	standard SQS is at-least-once + unordered (FIFO queues cost throughput)

Kafka = durable replayable log; RabbitMQ = rich routing broker; SQS = managed and operationally cheap.

The decisive axis is do you need to re-read history? Kafka retains the log, so a new consumer can replay from the beginning and you can re-derive state after a bug — the foundation of event sourcing. RabbitMQ and SQS consume-and-delete: once a message is acked it’s gone, which keeps them simple but means there’s no rewind. Kafka’s other superpower (and its main constraint) is per-partition ordering: messages are ordered within a partition but not across them, so you partition by a key (e.g. user_id) when order matters for that key.

Why at-least-once forces idempotent consumers

If a consumer crashes after doing the work but before acknowledging, the broker has no way to know the work succeeded — so it redelivers, and the same message gets processed twice. You generally can’t get true exactly-once across independent systems (the consumer and the broker can’t commit atomically together), so the pragmatic stance is: accept at-least-once delivery and make the consumer idempotent instead. “Exactly-once” in marketing usually means “at-least-once delivery plus idempotent processing” — which is exactly the pattern below.

An idempotent consumer with a dedup key

on message(msg):
  if seen_store.exists(msg.id):        # already processed this exact message?
      ack(msg); return                 # drop the duplicate, still ack

  process(msg)                         # the real side effect (charge, email, write)
  seen_store.put(msg.id, ttl=24h)      # remember we did it
  ack(msg)

The bug this fixes is the redelivery after a post-work crash. Even if the broker resends msg.id, the seen_store check makes the second attempt a no-op. There’s a subtle ordering issue, though: the read-then-write above can itself race if two copies of the message are processed concurrently. The robust version folds the dedup into the side effect atomically — for a DB write, an INSERT ... ON CONFLICT DO NOTHING keyed on msg.id is naturally idempotent and needs no separate store:

INSERT INTO processed_payments (msg_id, amount, charged_at)
VALUES (:msg_id, :amount, now())
ON CONFLICT (msg_id) DO NOTHING;   -- second delivery inserts nothing → no double charge

Event-driven vs request/response

Request/response is synchronous and coupled: the caller blocks and learns the outcome immediately — simple to reason about and debug, but a slow or down dependency stalls the caller, and load spikes propagate straight through. Event-driven is asynchronous and decoupled: the producer doesn’t wait or even know who consumes — resilient (a down consumer just means a growing backlog, not a failed request), naturally absorbs spikes (the queue buffers), and scales each consumer independently. The price is eventual consistency (the effect happens soon, not now) and harder traceability (a single user action fans out across services, so you need correlation IDs and distributed tracing to follow it).

From the trenches Netflix

Netflix runs one of the largest event-driven estates in the industry: trillions of events a day flow through Kafka pipelines that power everything from real-time recommendations to operational telemetry to the “Keystone” data platform. The architecture leans hard on the decoupling above — services emit events and move on; downstream consumers (personalization, analytics, billing) each read the stream at their own pace and scale independently.

The recurring lesson from their engineering writeups is that at-least-once is the operating assumption, not an edge case. At Netflix’s volume, brokers rebalance, consumers restart, and network blips happen constantly, so duplicates are routine rather than rare — and every consumer is built to tolerate them. Their work also surfaces the harder operational truths the happy path hides: back-pressure when a consumer falls behind, partition skew when one key dominates, and the need for replay tooling so a bad deploy can be re-processed from the durable log rather than losing data. It’s the clearest large-scale validation of “design idempotent consumers and embrace the log.”

read the writeup ↗ netflixtechblog.com

01 Learning objectives

0 / 4 done

02 Curated reading

Netflix Tech Blog — event-driven & streaming architecture
optional eng blog 20m — Real Kafka-scale streaming systems at Netflix.

03 Knowledge check

knowledge check3 questions · pass ≥ 70%

01medium
At-least-once delivery means consumers must be:
02medium
A durable, replayable, high-throughput event log is the sweet spot of:
03medium
“At-least-once” delivery means…

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

Commonly asked mid concept very common Kafka vs RabbitMQ vs SQS — what's the conceptual difference and when does each fit?
Kafka is a durable, append-only distributed log: consumers track an offset and can replay; built for high-throughput streaming and event sourcing, with retention so multiple consumer groups read the same stream independently. RabbitMQ is a traditional broker with smart routing (exchanges, queues, bindings) and per-message acks — great for complex routing and classic task queues, but messages typically vanish once consumed. SQS is a fully managed AWS queue — minimal ops, at-least-once delivery, near-infinite scale, but no replay and limited ordering (FIFO queues excepted).
Pick Kafka for streaming/replay/high-throughput, RabbitMQ for rich routing and work queues, SQS when you want managed simplicity on AWS.
Follow-ups they push on
- Why can Kafka replay events but RabbitMQ usually can't?
- What does a consumer offset give you that an ack doesn't?
Red flag Calling Kafka 'just a queue'. It's a retained log — consumers read by offset and can replay, which a delete-on-consume queue can't.
source: Hello Interview — Kafka deep dive ↗
Commonly asked mid concept very common Why does at-least-once delivery force you to build idempotent consumers?
Most brokers guarantee at-least-once delivery: if a consumer processes a message but crashes before acking, the broker redelivers it, so duplicates are inevitable. If processing isn't idempotent, a duplicate means double-charging, double-emailing, or double-incrementing.
Make the consumer idempotent: dedupe on a stable message id (record processed ids and skip repeats), or design the operation so reapplying it is a no-op (upsert, set-to-value instead of increment). Then redelivery is harmless.
Follow-ups they push on
- How would you dedupe by message id, and where do you store seen ids?
- Why is 'set status = SHIPPED' safer than 'increment count'?
Red flag Assuming each message arrives exactly once. At-least-once is the norm; build for duplicates.
source: AWS — SQS at-least-once delivery ↗
Commonly asked senior trick common Is exactly-once delivery real? Explain the nuance.
Exactly-once *network delivery* is generally impossible — you can't simultaneously guarantee no loss and no duplicates across an unreliable network (two-generals problem). What systems offer is exactly-once processing / effective-once: at-least-once delivery plus idempotent or transactional handling so the observable effect happens once.
Kafka's 'exactly-once semantics' works this way: idempotent producers and transactions tie the consume-process-produce cycle together so duplicates don't produce duplicate effects. The honest framing: dedup + transactions give exactly-once *effects*, not magically-once delivery.
Follow-ups they push on
- How does Kafka achieve its exactly-once semantics? (idempotent producer + transactions)
- Why is the consumer side still your responsibility for external side effects?
Red flag Claiming a broker delivers exactly once over the wire. Real systems get exactly-once *effects* via idempotency/transactions, not exactly-once delivery.
source: Confluent — Exactly-once semantics in Kafka ↗
Commonly asked mid concept common What is a dead-letter queue and when does a message land there?
A dead-letter queue (DLQ) is a side queue where messages go after they repeatedly fail to be processed (exceeding a max-receive/retry count) or can't be delivered. It stops a single 'poison' message from being redelivered forever and blocking the main queue.
Operationally you alert on DLQ depth, inspect the failed messages, fix the bug or bad data, and replay them back to the main queue. Without a DLQ, a permanently-failing message either loops endlessly or gets silently dropped.
Follow-ups they push on
- How do you decide the max-receive count before dead-lettering?
- What's a poison message and how does a DLQ contain it?
Red flag Having no DLQ, so a poison message either blocks the queue with infinite retries or is lost silently. Always have a parking lot.
source: AWS — Amazon SQS dead-letter queues ↗
Commonly asked senior concept common Event-driven vs request/response — what do you gain and what do you give up?
Request/response is synchronous and simple: the caller waits and gets an answer or an error, with a clear linear flow that's easy to reason about and debug. But it couples services temporally — if the callee is down, the caller fails — and it doesn't absorb spikes.
Event-driven publishes events and lets consumers react asynchronously: it decouples producers from consumers, buffers load (the queue absorbs spikes), and lets you add new consumers without touching the producer. The costs are eventual consistency, harder end-to-end debugging/tracing, and the need for idempotency and ordering handling. Use events for fan-out, decoupling, and load-leveling; use request/response when you need an immediate answer.
Follow-ups they push on
- How does a queue provide back-pressure / load-leveling?
- What new failure modes does async introduce?
Red flag Going event-driven everywhere and losing the simple synchronous read paths. Async adds eventual consistency and tracing complexity — use it where decoupling actually pays.
source: Netflix Tech Blog — event-driven architecture ↗
Commonly asked senior concept occasional How does Kafka preserve message ordering, and what's the catch?
Kafka guarantees ordering only within a partition, not across a topic. Messages with the same partition key (e.g. userId) always land in the same partition and are consumed in order, so per-key ordering holds.
The catch: you only get parallelism by having multiple partitions, and across partitions there's no global order. So you trade total ordering for throughput. If you need strict global ordering you're limited to one partition (no parallelism) — the usual move is to choose a partition key that makes per-key ordering sufficient.
Follow-ups they push on
- How do you pick a partition key for per-entity ordering?
- Why can't you both have many partitions and total ordering?
Red flag Assuming a Kafka topic is globally ordered. Ordering is per-partition; cross-partition order is undefined.
source: Hello Interview — Kafka deep dive ↗
Commonly asked senior design occasional How would you reliably publish an event after committing a DB write (the dual-write problem)?
The trap (a dual write) is committing to the DB and then publishing to the broker as two separate steps — a crash in between leaves them inconsistent (event lost, or published but DB rolled back).
The standard fix is the transactional outbox: in the same DB transaction as the business write, insert the event into an outbox table. A separate relay (polling or change-data-capture like Debezium) reads the outbox and publishes to the broker, marking rows sent. Because the write and the outbox insert commit atomically, the event is never lost; the relay gives at-least-once publishing, so consumers stay idempotent.
Follow-ups they push on
- Why not just publish then write, or write then publish?
- How does CDC / Debezium read the outbox?
Red flag Doing a naive dual write (commit DB, then send to Kafka). A failure between the two desynchronizes your DB and your event stream.
source: microservices.io — Transactional Outbox ↗
★ must-know Commonly asked mid concept very common Point-to-point queue vs publish/subscribe — what's the difference and when do you use each?
In a point-to-point queue, each message is delivered to exactly one consumer among possibly many competing workers — it's a work queue for distributing tasks (e.g. resize one image once, no matter how many workers are running). In publish/subscribe, each message is fanned out to *every* subscriber, so N independent services all react to the same event.
Use point-to-point to load-balance work across a pool (competing consumers); use pub/sub to broadcast an event to multiple independent consumers. Kafka models pub/sub via consumer groups: across groups it's fan-out, within a group it's point-to-point load balancing.
What a strong answer covers
- Queue (point-to-point): one message → exactly one of the competing consumers.
- Pub/sub: one message → every subscriber (fan-out).
- Queues load-balance work; pub/sub broadcasts events.
- Kafka consumer groups: fan-out across groups, load-balance within a group.
Quick self-check
You need an OrderPlaced event to trigger email, inventory, AND analytics services independently. Which model?
Follow-ups they push on
- How do Kafka consumer groups give you both models?
- What's the 'competing consumers' pattern?
Red flag Using a single shared queue when you actually need every service to see the event — only one consumer will get each message, and the others silently miss it.
source: AWS — Pub/sub messaging vs message queues ↗
Commonly asked mid concept common What is consumer lag in Kafka, and what does growing lag tell you?
Consumer lag is the gap between the latest offset produced to a partition (the log-end offset) and the offset the consumer group has committed — i.e. how many messages are produced-but-not-yet-processed. Steady or near-zero lag means consumers keep up; growing lag means the consumers can't process as fast as producers write.
It's a primary health/alerting signal. Remedies for chronic lag: add consumers (up to the partition count — that's the parallelism ceiling), add partitions, speed up per-message processing, or batch. Spiky lag that drains is fine; monotonically rising lag predicts an eventual backlog blowup.
What a strong answer covers
- Lag = latest produced offset − consumer's committed offset (unprocessed backlog).
- Rising lag = consumers slower than producers.
- Max parallelism is bounded by partition count — more consumers than partitions sit idle.
- A core metric to alert on for streaming health.
Follow-ups they push on
- Why can't you scale consumers beyond the partition count?
- How would you reduce lag without adding partitions?
Red flag Adding consumers beyond the number of partitions to cut lag. Extra consumers in a group just idle — you must increase partitions to raise parallelism.
source: Confluent — Monitoring consumer lag ↗
Commonly asked senior design occasional Choreography vs orchestration for the Saga pattern — how do you keep a multi-service transaction consistent?
With no distributed ACID transaction across microservices, a Saga breaks a business transaction into a sequence of local transactions, each publishing an event; if a step fails, compensating transactions undo the prior steps (e.g. refund a charge after inventory reservation fails).
Choreography: services react to each other's events with no central coordinator — loosely coupled but the end-to-end flow is implicit and hard to trace. Orchestration: a central orchestrator explicitly drives each step and triggers compensations — easier to reason about and monitor, but the orchestrator is a coupling point. Choose choreography for simple, few-step flows; orchestration as step count and error handling grow.
Follow-ups they push on
- What's a compensating transaction, and why isn't it the same as a rollback?
- When does choreography's implicit flow become a liability?
Red flag Trying to span microservices with one ACID transaction (e.g. distributed 2PC everywhere). Sagas with compensations are the practical model; 2PC scales and fails poorly across services.
source: microservices.io — Saga pattern ↗
Commonly asked senior concept occasional How do RabbitMQ acks and the prefetch (QoS) setting affect throughput and reliability?
With manual acks, RabbitMQ keeps a message 'unacked' until the consumer confirms it; if the consumer dies first, the message is requeued — that's how at-least-once delivery and crash safety work. Auto-ack trades that safety for speed (a crash mid-processing loses the message).
Prefetch (basic.qos) caps how many unacked messages a consumer may hold at once. Prefetch=1 gives the fairest load distribution (a slow consumer won't hoard a backlog) but adds round-trip overhead; a higher prefetch boosts throughput by pipelining but can let one consumer grab a big batch while others idle. Tune prefetch to balance fairness against throughput for your processing time.
What a strong answer covers
- Manual ack = message redelivered if the consumer dies before acking (at-least-once).
- Auto-ack is faster but loses in-flight messages on crash.
- Prefetch limits unacked messages per consumer.
- Low prefetch → fair distribution; high prefetch → throughput but possible hoarding.
Follow-ups they push on
- Why does prefetch=1 give the fairest distribution but lower throughput?
- What happens to unacked messages when a consumer connection drops?
Red flag Using auto-ack for work you can't afford to lose, or leaving prefetch unbounded so one consumer grabs the whole queue while others starve.
source: RabbitMQ — Consumer Acknowledgements and Publisher Confirms ↗
Commonly asked mid trick common Trick: does Kafka delete a message once a consumer reads it? What actually controls retention?
No — this is the key mental-model shift from traditional queues. Kafka is a durable log; reading a message does not remove it. The consumer just advances its offset (a bookmark), and the data stays on disk for everyone else to read. Multiple consumer groups can read the same messages independently, and a group can rewind its offset to replay.
Retention is governed by configured policy, not consumption: time-based (retention.ms, e.g. 7 days) or size-based (retention.bytes), or log compaction (keep the latest value per key). Messages age out by policy regardless of whether anyone consumed them.
What a strong answer covers
- Reading does NOT delete — consumers advance an offset (a bookmark).
- Data persists for all consumer groups; rewinding the offset replays.
- Retention is by time/size policy or log compaction, independent of consumption.
- Contrast: a traditional queue typically deletes on consume.
Quick self-check
What happens to a Kafka message after a consumer reads it?
Follow-ups they push on
- What is log compaction and when do you use it?
- How does offset-as-bookmark enable replay and reprocessing?
Red flag Treating Kafka like a delete-on-read queue. Messages persist until the retention policy expires them — consumption only moves an offset.
source: Confluent — Kafka topics and retention ↗
Commonly asked senior concept occasional How does a message queue provide back-pressure and load leveling, and what's the risk if you ignore queue depth?
A queue decouples producer rate from consumer rate: during a spike, messages buffer in the queue instead of overwhelming the downstream service, which keeps processing at its sustainable rate — that's load leveling (the queue-based load-leveling pattern). It smooths bursts into a steady drain.
But a queue is finite. If producers persistently outpace consumers, queue depth grows unbounded: latency climbs (messages wait longer), memory/disk fills, and you risk hitting limits or processing hours-stale data. Back-pressure is signaling producers to slow down (reject, throttle, or block) when depth crosses a threshold. Always monitor and alert on queue depth/age, cap the queue, and decide a shed/back-pressure policy — a queue defers overload, it doesn't eliminate it.
What a strong answer covers
- Queue buffers bursts so consumers drain at a sustainable rate (load leveling).
- Back-pressure = signaling producers to slow when the queue fills.
- Unbounded growth → rising latency, stale data, resource exhaustion.
- Monitor depth/age; cap the queue and define a shedding/back-pressure policy.
Follow-ups they push on
- How do you implement back-pressure when producers and consumers are decoupled?
- Why is a growing queue a latency problem even before it's a capacity problem?
Red flag Treating the queue as infinite elastic buffer. If consumers are chronically slower than producers, the queue just defers the overload while latency and staleness balloon.
source: Microsoft — Queue-Based Load Leveling pattern ↗