Messaging & event-driven architecture
Queues and pub/sub (Kafka, RabbitMQ, SQS), why at-least-once delivery forces idempotent consumers, and event-driven vs request/response tradeoffs.
Messaging decouples producers from consumers in time: the sender drops a message and moves on, the receiver processes when it can. That buys you resilience and scale — at the cost of having to design for at-least-once delivery, which is the source of most messaging bugs. Master one idea and the rest follows: the broker can deliver a message twice, so your consumer must be idempotent.
Queue vs pub/sub: the two delivery shapes
A queue is for work distribution: N workers compete, each message handled once — scale throughput by adding workers. Pub/sub is for broadcast: one event, many independent reactions (an order-placed event triggers billing, inventory, email, and analytics, each its own subscriber). Kafka blurs the line — a topic is durable, and each consumer group gets its own independent cursor, so it can act like a queue (one group, load-balanced across partitions) or like pub/sub (many groups, each reading the full stream).
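The two shapes can be sketched with a toy in-memory log. This is a simplification under stated assumptions: `ToyTopic`, `publish`, and `poll` are hypothetical names rather than any broker's API, and the log has a single partition, so consumers in one group share a cursor (a real Kafka group balances by assigning partitions to consumers instead).

```python
from collections import defaultdict


class ToyTopic:
    """Toy single-partition log; illustration only, not a real broker client."""

    def __init__(self):
        self.log = []                      # retained messages, never deleted on read
        self.cursors = defaultdict(int)    # one read position per consumer group

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        """Return the next unread message for this group, or None if caught up."""
        position = self.cursors[group]
        if position >= len(self.log):
            return None
        self.cursors[group] = position + 1
        return self.log[position]


topic = ToyTopic()
for order_id in ("order-1", "order-2", "order-3", "order-4"):
    topic.publish(order_id)

# Queue shape: two workers in ONE group compete, so each message is handled once.
worker_a, worker_b = [], []
while True:
    first = topic.poll("billing")
    if first is None:
        break
    worker_a.append(first)
    second = topic.poll("billing")
    if second is not None:
        worker_b.append(second)

# Pub/sub shape: a SECOND group has its own cursor and reads the full stream.
analytics = []
while (message := topic.poll("analytics")) is not None:
    analytics.append(message)

print(worker_a, worker_b, analytics)
```

The two billing workers split the four orders between them, while the analytics group still sees all four, because reading only moves a per-group cursor.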
Picking a broker
| Broker | Use when | Avoid when | Gotcha |
|---|---|---|---|
| Kafka | high-throughput streaming, event sourcing, replayable log, multiple independent consumers | you need complex per-message routing or low operational overhead | ordering is per-partition only; consumers manage offsets |
| RabbitMQ | flexible routing, task queues, request/reply, priority | millions of msgs/sec or you need long-term replay | messages are gone after ack — no rewind |
| SQS (managed) | you want zero ops on AWS, simple durable queue | you need ordering everywhere or sub-ms latency | standard SQS is at-least-once + unordered (FIFO queues cost throughput) |
The decisive axis is whether you need to re-read history. Kafka retains the log, so a new consumer can replay from the beginning and you can re-derive state after a bug, which is the foundation of event sourcing. RabbitMQ and SQS consume-and-delete: once a message is acked it’s gone, which keeps them simple but means there’s no rewind. Kafka’s other superpower (and its main constraint) is per-partition ordering: messages are ordered within a partition but not across partitions, so you partition by a key (e.g. user_id) when order matters for that key.
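Partitioning by key can be sketched in a few lines. This is a minimal illustration, not Kafka's implementation: the CRC32 hash is a stand-in chosen for the example (Kafka's Java client uses a murmur2 hash for keyed records), and `partition_for` and `route` are hypothetical helper names.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, payload) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-2", "deleted"),
    ("user-1", "deleted"),
]
partitions = route(events, num_partitions=3)

# Every event for a given key sits in one partition, in the order it was produced.
per_key = {}
for key in ("user-1", "user-2"):
    home = partitions[partition_for(key, 3)]
    per_key[key] = [payload for k, payload in home if k == key]
print(per_key)
```

Whatever partition each key hashes to, its events stay in production order relative to each other. Nothing is promised about how user-1's events interleave with user-2's.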
Why at-least-once forces idempotent consumers
If a consumer crashes after doing the work but before acknowledging, the broker has no way to know the work succeeded — so it redelivers, and the same message gets processed twice. You generally can’t get true exactly-once across independent systems (the consumer and the broker can’t commit atomically together), so the pragmatic stance is: accept at-least-once delivery and make the consumer idempotent instead. “Exactly-once” in marketing usually means “at-least-once delivery plus idempotent processing” — which is exactly the pattern below.

    on message(msg):
        if seen_store.exists(msg.id):      # already processed this exact message?
            ack(msg); return               # drop the duplicate, still ack
        process(msg)                       # the real side effect (charge, email, write)
        seen_store.put(msg.id, ttl=24h)    # remember we did it
        ack(msg)

The bug this fixes is redelivery after a crash that happens once the work is recorded but before the ack: even if the broker resends msg.id, the seen_store check makes the second attempt a no-op. Two gaps remain, though. A crash between process and seen_store.put still leads to reprocessing, and the read-then-write above can itself race if two copies of the message are processed concurrently. The robust version folds the dedup into the side effect atomically. For a DB write, an INSERT ... ON CONFLICT DO NOTHING keyed on msg.id is naturally idempotent and needs no separate store:

    INSERT INTO processed_payments (msg_id, amount, charged_at)
    VALUES (:msg_id, :amount, now())
    ON CONFLICT (msg_id) DO NOTHING;  -- second delivery inserts nothing → no double charge

Event-driven vs request/response
Request/response is synchronous and coupled: the caller blocks and learns the outcome immediately — simple to reason about and debug, but a slow or down dependency stalls the caller, and load spikes propagate straight through. Event-driven is asynchronous and decoupled: the producer doesn’t wait or even know who consumes — resilient (a down consumer just means a growing backlog, not a failed request), naturally absorbs spikes (the queue buffers), and scales each consumer independently. The price is eventual consistency (the effect happens soon, not now) and harder traceability (a single user action fans out across services, so you need correlation IDs and distributed tracing to follow it).
01 Learning objectives
02 Curated reading
03 Knowledge check
- 01 (medium) At-least-once delivery means consumers must be:
- 02 (medium) A durable, replayable, high-throughput event log is the sweet spot of:
- 03 (medium) “At-least-once” delivery means…
04 Interview questions
What gets asked on this topic: how to approach each question, the follow-ups, and the trap.
Kafka vs RabbitMQ vs SQS — what's the conceptual difference and when does each fit?
Kafka is a durable, append-only distributed log: consumers track an offset and can replay; built for high-throughput streaming and event sourcing, with retention so multiple consumer groups read the same stream independently. RabbitMQ is a traditional broker with smart routing (exchanges, queues, bindings) and per-message acks — great for complex routing and classic task queues, but messages typically vanish once consumed. SQS is a fully managed AWS queue — minimal ops, at-least-once delivery, near-infinite scale, but no replay and limited ordering (FIFO queues excepted).
Pick Kafka for streaming/replay/high-throughput, RabbitMQ for rich routing and work queues, SQS when you want managed simplicity on AWS.
Follow-ups they push on:
- Why can Kafka replay events but RabbitMQ usually can't?
- What does a consumer offset give you that an ack doesn't?
Red flag: Calling Kafka 'just a queue'. It's a retained log: consumers read by offset and can replay, which a delete-on-consume queue can't.
source: Hello Interview, Kafka deep dive
Why does at-least-once delivery force you to build idempotent consumers?
Most brokers guarantee at-least-once delivery: if a consumer processes a message but crashes before acking, the broker redelivers it, so duplicates are inevitable. If processing isn't idempotent, a duplicate means double-charging, double-emailing, or double-incrementing.
Make the consumer idempotent: dedupe on a stable message id (record processed ids and skip repeats), or design the operation so reapplying it is a no-op (upsert, set-to-value instead of increment). Then redelivery is harmless.
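Deduping on a stable message id can be sketched with an in-memory SQLite database. This is an illustration under assumptions: the `processed_payments` table and `handle_payment` helper are hypothetical, and the `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_payments (msg_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)


def handle_payment(conn, msg_id, amount):
    """Record the payment once; return True only if this delivery did the work."""
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "INSERT INTO processed_payments (msg_id, amount) VALUES (?, ?) "
            "ON CONFLICT (msg_id) DO NOTHING",
            (msg_id, amount),
        )
    return cursor.rowcount == 1


first = handle_payment(conn, "msg-42", 1999)
second = handle_payment(conn, "msg-42", 1999)   # simulated redelivery
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM processed_payments").fetchone()
print(first, second, total)
```

The primary key does the dedup, so there is no separate check-then-write step for two concurrent copies of the message to race through.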
Follow-ups they push on:
- How would you dedupe by message id, and where do you store seen ids?
- Why is 'set status = SHIPPED' safer than 'increment count'?
Red flag: Assuming each message arrives exactly once. At-least-once is the norm; build for duplicates.
source: AWS, SQS at-least-once delivery
Is exactly-once delivery real? Explain the nuance.
Exactly-once *network delivery* is generally impossible — you can't simultaneously guarantee no loss and no duplicates across an unreliable network (two-generals problem). What systems offer is exactly-once processing / effective-once: at-least-once delivery plus idempotent or transactional handling so the observable effect happens once.
Kafka's 'exactly-once semantics' works this way: idempotent producers and transactions tie the consume-process-produce cycle together so duplicates don't produce duplicate effects. The honest framing: dedup + transactions give exactly-once *effects*, not magically-once delivery.
Follow-ups they push on:
- How does Kafka achieve its exactly-once semantics? (idempotent producer + transactions)
- Why is the consumer side still your responsibility for external side effects?
Red flag: Claiming a broker delivers exactly once over the wire. Real systems get exactly-once *effects* via idempotency/transactions, not exactly-once delivery.
source: Confluent, Exactly-once semantics in Kafka
What is a dead-letter queue and when does a message land there?
A dead-letter queue (DLQ) is a side queue where messages go after they repeatedly fail to be processed (exceeding a max-receive/retry count) or can't be delivered. It stops a single 'poison' message from being redelivered forever and blocking the main queue.
Operationally you alert on DLQ depth, inspect the failed messages, fix the bug or bad data, and replay them back to the main queue. Without a DLQ, a permanently-failing message either loops endlessly or gets silently dropped.
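The max-receive rule can be sketched with two in-memory queues. This is a toy model, not a broker API: `MAX_RECEIVES`, `drain`, and the message dict layout are hypothetical, and a `ValueError` stands in for any processing failure.

```python
from collections import deque

MAX_RECEIVES = 3  # hypothetical threshold; real systems tune this per workload


def drain(main_queue, dead_letters, handler):
    """Process until the main queue is empty, parking messages that keep failing."""
    processed = []
    while main_queue:
        message = main_queue.popleft()
        message["receives"] += 1
        try:
            handler(message["body"])
            processed.append(message["body"])
        except ValueError:
            if message["receives"] >= MAX_RECEIVES:
                dead_letters.append(message)   # give up: park it for inspection
            else:
                main_queue.append(message)     # retry later, behind other work
    return processed


def handler(body):
    if body == "poison":
        raise ValueError("cannot parse message")


main_queue = deque({"body": b, "receives": 0} for b in ("a", "poison", "b"))
dead_letters = []
processed = drain(main_queue, dead_letters, handler)
print(processed, [m["body"] for m in dead_letters], dead_letters[0]["receives"])
```

The poison message is retried up to the limit and then parked, so the healthy messages around it still get processed and the loop terminates.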
Follow-ups they push on:
- How do you decide the max-receive count before dead-lettering?
- What's a poison message and how does a DLQ contain it?
Red flag: Having no DLQ, so a poison message either blocks the queue with infinite retries or is lost silently. Always have a parking lot.
source: AWS, Amazon SQS dead-letter queues
Event-driven vs request/response — what do you gain and what do you give up?
Request/response is synchronous and simple: the caller waits and gets an answer or an error, with a clear linear flow that's easy to reason about and debug. But it couples services temporally — if the callee is down, the caller fails — and it doesn't absorb spikes.
Event-driven publishes events and lets consumers react asynchronously: it decouples producers from consumers, buffers load (the queue absorbs spikes), and lets you add new consumers without touching the producer. The costs are eventual consistency, harder end-to-end debugging/tracing, and the need for idempotency and ordering handling. Use events for fan-out, decoupling, and load-leveling; use request/response when you need an immediate answer.
Follow-ups they push on:
- How does a queue provide back-pressure / load-leveling?
- What new failure modes does async introduce?
Red flag: Going event-driven everywhere and losing the simple synchronous read paths. Async adds eventual consistency and tracing complexity, so use it where decoupling actually pays.
source: Netflix Tech Blog, event-driven architecture
How does Kafka preserve message ordering, and what's the catch?
Kafka guarantees ordering only within a partition, not across a topic. Messages with the same partition key (e.g. userId) always land in the same partition (as long as the partition count doesn't change) and are consumed in order, so per-key ordering holds.
The catch: you only get parallelism by having multiple partitions, and across partitions there's no global order. So you trade total ordering for throughput. If you need strict global ordering you're limited to one partition (no parallelism); the usual move is to choose a partition key that makes per-key ordering sufficient.
Follow-ups they push on:
- How do you pick a partition key for per-entity ordering?
- Why can't you both have many partitions and total ordering?
Red flag: Assuming a Kafka topic is globally ordered. Ordering is per-partition; cross-partition order is undefined.
source: Hello Interview, Kafka deep dive
How would you reliably publish an event after committing a DB write (the dual-write problem)?
The trap (a dual write) is committing to the DB and then publishing to the broker as two separate steps — a crash in between leaves them inconsistent (event lost, or published but DB rolled back).
The standard fix is the transactional outbox: in the same DB transaction as the business write, insert the event into an outbox table. A separate relay (polling or change-data-capture like Debezium) reads the outbox and publishes to the broker, marking rows sent. Because the write and the outbox insert commit atomically, the event is never lost; the relay gives at-least-once publishing, so consumers must still be idempotent.
Follow-ups they push on:
- Why not just publish then write, or write then publish?
- How does CDC / Debezium read the outbox?
Red flag: Doing a naive dual write (commit DB, then send to Kafka). A failure between the two desynchronizes your DB and your event stream.
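A polling version of the transactional outbox can be sketched with an in-memory SQLite database. Everything here is illustrative: the `orders` and `outbox` schemas and the `place_order` and `relay_once` helpers are hypothetical, and `publish` is any callable standing in for a broker client.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(conn, order_id):
    """Business write and outbox insert commit together, or not at all."""
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        event = json.dumps({"type": "OrderPlaced", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def relay_once(conn, publish):
    """Publish unsent rows in order, marking each sent only after publish returns."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)  # a crash right after this line causes a re-publish later
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
place_order(conn, 1)
place_order(conn, 2)
relay_once(conn, published.append)
relay_once(conn, published.append)  # nothing left to send on the second pass
print(published)
```

Marking a row sent only after the publish call returns is what makes the relay at-least-once: a crash between the two steps re-publishes that event on the next pass, which is why the consumer side still needs dedup.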
source: microservices.io, Transactional Outbox
Point-to-point queue vs publish/subscribe — what's the difference and when do you use each?
In a point-to-point queue, each message is delivered to exactly one consumer among possibly many competing workers — it's a work queue for distributing tasks (e.g. resize one image once, no matter how many workers are running). In publish/subscribe, each message is fanned out to *every* subscriber, so N independent services all react to the same event.
Use point-to-point to load-balance work across a pool (competing consumers); use pub/sub to broadcast an event to multiple independent consumers. Kafka models pub/sub via consumer groups: across groups it's fan-out, within a group it's point-to-point load balancing.
What a strong answer covers:
- Queue (point-to-point): one message → exactly one of the competing consumers.
- Pub/sub: one message → every subscriber (fan-out).
- Queues load-balance work; pub/sub broadcasts events.
- Kafka consumer groups: fan-out across groups, load-balance within a group.
Quick self-check: You need an OrderPlaced event to trigger email, inventory, AND analytics services independently. Which model?
- Wrong: each message goes to only one consumer; two services miss it.
- Correct: every subscriber receives its own copy of the event.
- Wrong: that distributes, not broadcasts; each event hits one service.
- Wrong: doesn't fan out and recouples the producer.
Follow-ups they push on:
- How do Kafka consumer groups give you both models?
- What's the 'competing consumers' pattern?
Red flag: Using a single shared queue when you actually need every service to see the event. Only one consumer will get each message, and the others silently miss it.
source: AWS, Pub/sub messaging vs message queues
What is consumer lag in Kafka, and what does growing lag tell you?
Consumer lag is the gap between the latest offset produced to a partition (the log-end offset) and the offset the consumer group has committed — i.e. how many messages are produced-but-not-yet-processed. Steady or near-zero lag means consumers keep up; growing lag means the consumers can't process as fast as producers write.
It's a primary health/alerting signal. Remedies for chronic lag: add consumers (up to the partition count — that's the parallelism ceiling), add partitions, speed up per-message processing, or batch. Spiky lag that drains is fine; monotonically rising lag predicts an eventual backlog blowup.
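The lag arithmetic is simple enough to show directly. The offsets below are a made-up snapshot, and `consumer_lag` is a hypothetical helper rather than a Kafka client call.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus the group's committed offset."""
    return {
        # Treating a partition with no commit as offset 0 is a simplification.
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical snapshot: partition -> offset.
log_end_offsets = {0: 1_500, 1: 1_420, 2: 1_610}
committed_offsets = {0: 1_500, 1: 1_400, 2: 1_100}

lag = consumer_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

Here partition 0 is caught up, partition 1 is slightly behind, and partition 2 carries almost all of the backlog. A per-partition breakdown like this often points at one hot key or one slow consumer rather than a group-wide problem.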
What a strong answer covers:
- Lag = latest produced offset − consumer's committed offset (unprocessed backlog).
- Rising lag = consumers slower than producers.
- Max parallelism is bounded by partition count; more consumers than partitions sit idle.
- A core metric to alert on for streaming health.
Follow-ups they push on:
- Why can't you scale consumers beyond the partition count?
- How would you reduce lag without adding partitions?
Red flag: Adding consumers beyond the number of partitions to cut lag. Extra consumers in a group just idle; you must increase partitions to raise parallelism.
source: Confluent, Monitoring consumer lag
Choreography vs orchestration for the Saga pattern — how do you keep a multi-service transaction consistent?
With no distributed ACID transaction across microservices, a Saga breaks a business transaction into a sequence of local transactions, each publishing an event; if a step fails, compensating transactions undo the prior steps (e.g. refund a charge after inventory reservation fails).
Choreography: services react to each other's events with no central coordinator — loosely coupled but the end-to-end flow is implicit and hard to trace. Orchestration: a central orchestrator explicitly drives each step and triggers compensations — easier to reason about and monitor, but the orchestrator is a coupling point. Choose choreography for simple, few-step flows; orchestration as step count and error handling grow.
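The orchestration variant can be sketched as a loop that remembers which steps finished and undoes them in reverse on failure. This is a toy model: `run_saga` and the step names are hypothetical, the actions are stubs, and a `RuntimeError` stands in for any step failure.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; on failure undo completed ones in reverse."""
    completed = []
    log = []
    for name, action, compensation in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensation))
        except RuntimeError:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


def reserve_inventory():
    raise RuntimeError("out of stock")   # simulated failure in the second step


steps = [
    ("charge_card", lambda: None, lambda: None),        # compensation would be a refund
    ("reserve_inventory", reserve_inventory, lambda: None),
    ("send_confirmation", lambda: None, lambda: None),
]
ok, log = run_saga(steps)
print(ok, log)
```

The charge is compensated rather than rolled back: it really happened and was visible to other systems, so the undo is a new forward action (a refund), and the confirmation step never runs.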
Follow-ups they push on:
- What's a compensating transaction, and why isn't it the same as a rollback?
- When does choreography's implicit flow become a liability?
Red flag: Trying to span microservices with one ACID transaction (e.g. distributed 2PC everywhere). Sagas with compensations are the practical model; 2PC scales poorly and handles failures badly across services.
source: microservices.io, Saga pattern
How do RabbitMQ acks and the prefetch (QoS) setting affect throughput and reliability?
With manual acks, RabbitMQ keeps a message 'unacked' until the consumer confirms it; if the consumer dies first, the message is requeued — that's how at-least-once delivery and crash safety work. Auto-ack trades that safety for speed (a crash mid-processing loses the message).
Prefetch (basic.qos) caps how many unacked messages a consumer may hold at once. Prefetch=1 gives the fairest load distribution (a slow consumer won't hoard a backlog) but adds round-trip overhead; a higher prefetch boosts throughput by pipelining but can let one consumer grab a big batch while others idle. Tune prefetch to balance fairness against throughput for your processing time.
What a strong answer covers:
- Manual ack = message redelivered if the consumer dies before acking (at-least-once).
- Auto-ack is faster but loses in-flight messages on crash.
- Prefetch limits unacked messages per consumer.
- Low prefetch → fair distribution; high prefetch → throughput but possible hoarding.
Follow-ups they push on:
- Why does prefetch=1 give the fairest distribution but lower throughput?
- What happens to unacked messages when a consumer connection drops?
Red flag: Using auto-ack for work you can't afford to lose, or leaving prefetch unbounded so one consumer grabs the whole queue while others starve.
source: RabbitMQ, Consumer Acknowledgements and Publisher Confirms
Trick: does Kafka delete a message once a consumer reads it? What actually controls retention?
No — this is the key mental-model shift from traditional queues. Kafka is a durable log; reading a message does not remove it. The consumer just advances its offset (a bookmark), and the data stays on disk for everyone else to read. Multiple consumer groups can read the same messages independently, and a group can rewind its offset to replay.
Retention is governed by configured policy, not consumption: time-based (retention.ms, e.g. 7 days), size-based (retention.bytes), or log compaction (keep the latest value per key). Messages age out by policy regardless of whether anyone consumed them.
What a strong answer covers:
- Reading does NOT delete; consumers advance an offset (a bookmark).
- Data persists for all consumer groups; rewinding the offset replays.
- Retention is by time/size policy or log compaction, independent of consumption.
- Contrast: a traditional queue typically deletes on consume.
Quick self-check: What happens to a Kafka message after a consumer reads it?
- Wrong: that's a traditional queue, not Kafka's log.
- Correct: Kafka is a retained log; reading moves a bookmark.
- Wrong: DLQs are for failures, not normal reads.
- Wrong: retention is policy-based, not consumption-based.
Follow-ups they push on:
- What is log compaction and when do you use it?
- How does offset-as-bookmark enable replay and reprocessing?
Red flag: Treating Kafka like a delete-on-read queue. Messages persist until the retention policy expires them; consumption only moves an offset.
source: Confluent, Kafka topics and retention
How does a message queue provide back-pressure and load leveling, and what's the risk if you ignore queue depth?
A queue decouples producer rate from consumer rate: during a spike, messages buffer in the queue instead of overwhelming the downstream service, which keeps processing at its sustainable rate — that's load leveling (the queue-based load-leveling pattern). It smooths bursts into a steady drain.
But a queue is finite. If producers persistently outpace consumers, queue depth keeps growing: latency climbs (messages wait longer), memory/disk fills, and you risk hitting limits or processing hours-stale data. Back-pressure is signaling producers to slow down (reject, throttle, or block) when depth crosses a threshold. Always monitor and alert on queue depth/age, cap the queue, and decide a shed/back-pressure policy. A queue defers overload; it doesn't eliminate it.
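Capping the queue and rejecting on overflow can be sketched with the standard library's bounded `queue.Queue`. This is a single-process illustration: the cap of 3 and the `try_publish` helper are hypothetical, and a real producer would typically retry with backoff or shed load when told to back off.

```python
import queue

buffer = queue.Queue(maxsize=3)   # the cap; 3 is an arbitrary illustrative size


def try_publish(message):
    """Return True if accepted, False if the producer is told to back off."""
    try:
        buffer.put_nowait(message)
        return True
    except queue.Full:
        return False


accepted = [try_publish(n) for n in range(5)]   # a burst of 5 into a queue of 3
depth_after_burst = buffer.qsize()

buffer.get_nowait()                              # the consumer drains one message
accepted_after_drain = try_publish(99)
print(accepted, depth_after_burst, accepted_after_drain)
```

The first three messages are buffered (load leveling), the next two are rejected (back-pressure), and capacity returns as soon as the consumer drains one. Swapping `put_nowait` for a blocking `put` with a timeout would throttle the producer instead of rejecting.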
What a strong answer covers:
- Queue buffers bursts so consumers drain at a sustainable rate (load leveling).
- Back-pressure = signaling producers to slow when the queue fills.
- Unbounded growth → rising latency, stale data, resource exhaustion.
- Monitor depth/age; cap the queue and define a shedding/back-pressure policy.
Follow-ups they push on:
- How do you implement back-pressure when producers and consumers are decoupled?
- Why is a growing queue a latency problem even before it's a capacity problem?
Red flag: Treating the queue as an infinitely elastic buffer. If consumers are chronically slower than producers, the queue just defers the overload while latency and staleness balloon.
source: Microsoft, Queue-Based Load Leveling pattern