> cs·fundamentals
interview 0% 22m read

Messaging & event-driven architecture

Queues and pub/sub (Kafka, RabbitMQ, SQS), why at-least-once delivery forces idempotent consumers, and event-driven vs request/response tradeoffs.

Messaging decouples producers from consumers in time: the sender drops a message and moves on, the receiver processes when it can. That buys you resilience and scale — at the cost of having to design for at-least-once delivery, which is the source of most messaging bugs. Master one idea and the rest follows: the broker can deliver a message twice, so your consumer must be idempotent.

Queue vs pub/sub: the two delivery shapes

Left: a producer sends to a queue, which delivers each message to one of three workers. Right: a producer sends to a topic, which fans every message out to all three subscribers.QUEUE — one consumer eachproducerqueueworker 1worker 2worker 3PUB/SUB — all subscribersproducertopicsub Asub Bsub C
FIG 1 · queue vs pub/sub A queue load-balances each message to exactly one worker; pub/sub fans every message out to all subscribers. Same producer, opposite delivery.

A queue is for work distribution: N workers compete, each message handled once — scale throughput by adding workers. Pub/sub is for broadcast: one event, many independent reactions (an order-placed event triggers billing, inventory, email, and analytics, each its own subscriber). Kafka blurs the line — a topic is durable, and each consumer group gets its own independent cursor, so it can act like a queue (one group, load-balanced across partitions) or like pub/sub (many groups, each reading the full stream).

Picking a broker

BrokerUse whenAvoid whenGotcha
Kafkahigh-throughput streaming, event sourcing, replayable log, multiple independent consumersyou need complex per-message routing or low operational overheadordering is per-partition only; consumers manage offsets
RabbitMQflexible routing, task queues, request/reply, prioritymillions of msgs/sec or you need long-term replaymessages are gone after ack — no rewind
SQS (managed)you want zero ops on AWS, simple durable queueyou need ordering everywhere or sub-ms latencystandard SQS is at-least-once + unordered (FIFO queues cost throughput)
Kafka = durable replayable log; RabbitMQ = rich routing broker; SQS = managed and operationally cheap.

The decisive axis is do you need to re-read history? Kafka retains the log, so a new consumer can replay from the beginning and you can re-derive state after a bug — the foundation of event sourcing. RabbitMQ and SQS consume-and-delete: once a message is acked it’s gone, which keeps them simple but means there’s no rewind. Kafka’s other superpower (and its main constraint) is per-partition ordering: messages are ordered within a partition but not across them, so you partition by a key (e.g. user_id) when order matters for that key.

Why at-least-once forces idempotent consumers

If a consumer crashes after doing the work but before acknowledging, the broker has no way to know the work succeeded — so it redelivers, and the same message gets processed twice. You generally can’t get true exactly-once across independent systems (the consumer and the broker can’t commit atomically together), so the pragmatic stance is: accept at-least-once delivery and make the consumer idempotent instead. “Exactly-once” in marketing usually means “at-least-once delivery plus idempotent processing” — which is exactly the pattern below.

An idempotent consumer with a dedup key
on message(msg):
  if seen_store.exists(msg.id):        # already processed this exact message?
      ack(msg); return                 # drop the duplicate, still ack

  process(msg)                         # the real side effect (charge, email, write)
  seen_store.put(msg.id, ttl=24h)      # remember we did it
  ack(msg)

The bug this fixes is the redelivery after a post-work crash. Even if the broker resends msg.id, the seen_store check makes the second attempt a no-op. There’s a subtle ordering issue, though: the read-then-write above can itself race if two copies of the message are processed concurrently. The robust version folds the dedup into the side effect atomically — for a DB write, an INSERT ... ON CONFLICT DO NOTHING keyed on msg.id is naturally idempotent and needs no separate store:

INSERT INTO processed_payments (msg_id, amount, charged_at)
VALUES (:msg_id, :amount, now())
ON CONFLICT (msg_id) DO NOTHING;   -- second delivery inserts nothing → no double charge

Event-driven vs request/response

Request/response is synchronous and coupled: the caller blocks and learns the outcome immediately — simple to reason about and debug, but a slow or down dependency stalls the caller, and load spikes propagate straight through. Event-driven is asynchronous and decoupled: the producer doesn’t wait or even know who consumes — resilient (a down consumer just means a growing backlog, not a failed request), naturally absorbs spikes (the queue buffers), and scales each consumer independently. The price is eventual consistency (the effect happens soon, not now) and harder traceability (a single user action fans out across services, so you need correlation IDs and distributed tracing to follow it).

01 Learning objectives

0 / 4 done

02 Curated reading

03 Knowledge check

knowledge check3 questions · pass ≥ 70%
  1. 01medium

    At-least-once delivery means consumers must be:

  2. 02medium

    A durable, replayable, high-throughput event log is the sweet spot of:

  3. 03medium

    “At-least-once” delivery means…

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • Commonly asked mid concept very common Kafka vs RabbitMQ vs SQS — what's the conceptual difference and when does each fit?

    Kafka is a durable, append-only distributed log: consumers track an offset and can replay; built for high-throughput streaming and event sourcing, with retention so multiple consumer groups read the same stream independently. RabbitMQ is a traditional broker with smart routing (exchanges, queues, bindings) and per-message acks — great for complex routing and classic task queues, but messages typically vanish once consumed. SQS is a fully managed AWS queue — minimal ops, at-least-once delivery, near-infinite scale, but no replay and limited ordering (FIFO queues excepted).

    Pick Kafka for streaming/replay/high-throughput, RabbitMQ for rich routing and work queues, SQS when you want managed simplicity on AWS.

    Red flag Calling Kafka 'just a queue'. It's a retained log — consumers read by offset and can replay, which a delete-on-consume queue can't.

    source: Hello Interview — Kafka deep dive ↗
  • Commonly asked mid concept very common Why does at-least-once delivery force you to build idempotent consumers?

    Most brokers guarantee at-least-once delivery: if a consumer processes a message but crashes before acking, the broker redelivers it, so duplicates are inevitable. If processing isn't idempotent, a duplicate means double-charging, double-emailing, or double-incrementing.

    Make the consumer idempotent: dedupe on a stable message id (record processed ids and skip repeats), or design the operation so reapplying it is a no-op (upsert, set-to-value instead of increment). Then redelivery is harmless.

    Red flag Assuming each message arrives exactly once. At-least-once is the norm; build for duplicates.

    source: AWS — SQS at-least-once delivery ↗
  • Commonly asked senior trick common Is exactly-once delivery real? Explain the nuance.

    Exactly-once *network delivery* is generally impossible — you can't simultaneously guarantee no loss and no duplicates across an unreliable network (two-generals problem). What systems offer is exactly-once processing / effective-once: at-least-once delivery plus idempotent or transactional handling so the observable effect happens once.

    Kafka's 'exactly-once semantics' works this way: idempotent producers and transactions tie the consume-process-produce cycle together so duplicates don't produce duplicate effects. The honest framing: dedup + transactions give exactly-once *effects*, not magically-once delivery.

    Red flag Claiming a broker delivers exactly once over the wire. Real systems get exactly-once *effects* via idempotency/transactions, not exactly-once delivery.

    source: Confluent — Exactly-once semantics in Kafka ↗
  • Commonly asked mid concept common What is a dead-letter queue and when does a message land there?

    A dead-letter queue (DLQ) is a side queue where messages go after they repeatedly fail to be processed (exceeding a max-receive/retry count) or can't be delivered. It stops a single 'poison' message from being redelivered forever and blocking the main queue.

    Operationally you alert on DLQ depth, inspect the failed messages, fix the bug or bad data, and replay them back to the main queue. Without a DLQ, a permanently-failing message either loops endlessly or gets silently dropped.

    Red flag Having no DLQ, so a poison message either blocks the queue with infinite retries or is lost silently. Always have a parking lot.

    source: AWS — Amazon SQS dead-letter queues ↗
  • Commonly asked senior concept common Event-driven vs request/response — what do you gain and what do you give up?

    Request/response is synchronous and simple: the caller waits and gets an answer or an error, with a clear linear flow that's easy to reason about and debug. But it couples services temporally — if the callee is down, the caller fails — and it doesn't absorb spikes.

    Event-driven publishes events and lets consumers react asynchronously: it decouples producers from consumers, buffers load (the queue absorbs spikes), and lets you add new consumers without touching the producer. The costs are eventual consistency, harder end-to-end debugging/tracing, and the need for idempotency and ordering handling. Use events for fan-out, decoupling, and load-leveling; use request/response when you need an immediate answer.

    Red flag Going event-driven everywhere and losing the simple synchronous read paths. Async adds eventual consistency and tracing complexity — use it where decoupling actually pays.

    source: Netflix Tech Blog — event-driven architecture ↗
  • Commonly asked senior concept occasional How does Kafka preserve message ordering, and what's the catch?

    Kafka guarantees ordering only within a partition, not across a topic. Messages with the same partition key (e.g. userId) always land in the same partition and are consumed in order, so per-key ordering holds.

    The catch: you only get parallelism by having multiple partitions, and across partitions there's no global order. So you trade total ordering for throughput. If you need strict global ordering you're limited to one partition (no parallelism) — the usual move is to choose a partition key that makes per-key ordering sufficient.

    Red flag Assuming a Kafka topic is globally ordered. Ordering is per-partition; cross-partition order is undefined.

    source: Hello Interview — Kafka deep dive ↗
  • Commonly asked senior design occasional How would you reliably publish an event after committing a DB write (the dual-write problem)?

    The trap (a dual write) is committing to the DB and then publishing to the broker as two separate steps — a crash in between leaves them inconsistent (event lost, or published but DB rolled back).

    The standard fix is the transactional outbox: in the same DB transaction as the business write, insert the event into an outbox table. A separate relay (polling or change-data-capture like Debezium) reads the outbox and publishes to the broker, marking rows sent. Because the write and the outbox insert commit atomically, the event is never lost; the relay gives at-least-once publishing, so consumers stay idempotent.

    Red flag Doing a naive dual write (commit DB, then send to Kafka). A failure between the two desynchronizes your DB and your event stream.

    source: microservices.io — Transactional Outbox ↗
  • ★ must-know Commonly asked mid concept very common Point-to-point queue vs publish/subscribe — what's the difference and when do you use each?

    In a point-to-point queue, each message is delivered to exactly one consumer among possibly many competing workers — it's a work queue for distributing tasks (e.g. resize one image once, no matter how many workers are running). In publish/subscribe, each message is fanned out to *every* subscriber, so N independent services all react to the same event.

    Use point-to-point to load-balance work across a pool (competing consumers); use pub/sub to broadcast an event to multiple independent consumers. Kafka models pub/sub via consumer groups: across groups it's fan-out, within a group it's point-to-point load balancing.

    What a strong answer covers
    • Queue (point-to-point): one message → exactly one of the competing consumers.

    • Pub/sub: one message → every subscriber (fan-out).

    • Queues load-balance work; pub/sub broadcasts events.

    • Kafka consumer groups: fan-out across groups, load-balance within a group.

    Quick self-check

    You need an OrderPlaced event to trigger email, inventory, AND analytics services independently. Which model?

    Red flag Using a single shared queue when you actually need every service to see the event — only one consumer will get each message, and the others silently miss it.

    source: AWS — Pub/sub messaging vs message queues ↗
  • Commonly asked mid concept common What is consumer lag in Kafka, and what does growing lag tell you?

    Consumer lag is the gap between the latest offset produced to a partition (the log-end offset) and the offset the consumer group has committed — i.e. how many messages are produced-but-not-yet-processed. Steady or near-zero lag means consumers keep up; growing lag means the consumers can't process as fast as producers write.

    It's a primary health/alerting signal. Remedies for chronic lag: add consumers (up to the partition count — that's the parallelism ceiling), add partitions, speed up per-message processing, or batch. Spiky lag that drains is fine; monotonically rising lag predicts an eventual backlog blowup.

    What a strong answer covers
    • Lag = latest produced offset − consumer's committed offset (unprocessed backlog).

    • Rising lag = consumers slower than producers.

    • Max parallelism is bounded by partition count — more consumers than partitions sit idle.

    • A core metric to alert on for streaming health.

    Red flag Adding consumers beyond the number of partitions to cut lag. Extra consumers in a group just idle — you must increase partitions to raise parallelism.

    source: Confluent — Monitoring consumer lag ↗
  • Commonly asked senior design occasional Choreography vs orchestration for the Saga pattern — how do you keep a multi-service transaction consistent?

    With no distributed ACID transaction across microservices, a Saga breaks a business transaction into a sequence of local transactions, each publishing an event; if a step fails, compensating transactions undo the prior steps (e.g. refund a charge after inventory reservation fails).

    Choreography: services react to each other's events with no central coordinator — loosely coupled but the end-to-end flow is implicit and hard to trace. Orchestration: a central orchestrator explicitly drives each step and triggers compensations — easier to reason about and monitor, but the orchestrator is a coupling point. Choose choreography for simple, few-step flows; orchestration as step count and error handling grow.

    Red flag Trying to span microservices with one ACID transaction (e.g. distributed 2PC everywhere). Sagas with compensations are the practical model; 2PC scales and fails poorly across services.

    source: microservices.io — Saga pattern ↗
  • Commonly asked senior concept occasional How do RabbitMQ acks and the prefetch (QoS) setting affect throughput and reliability?

    With manual acks, RabbitMQ keeps a message 'unacked' until the consumer confirms it; if the consumer dies first, the message is requeued — that's how at-least-once delivery and crash safety work. Auto-ack trades that safety for speed (a crash mid-processing loses the message).

    Prefetch (basic.qos) caps how many unacked messages a consumer may hold at once. Prefetch=1 gives the fairest load distribution (a slow consumer won't hoard a backlog) but adds round-trip overhead; a higher prefetch boosts throughput by pipelining but can let one consumer grab a big batch while others idle. Tune prefetch to balance fairness against throughput for your processing time.

    What a strong answer covers
    • Manual ack = message redelivered if the consumer dies before acking (at-least-once).

    • Auto-ack is faster but loses in-flight messages on crash.

    • Prefetch limits unacked messages per consumer.

    • Low prefetch → fair distribution; high prefetch → throughput but possible hoarding.

    Red flag Using auto-ack for work you can't afford to lose, or leaving prefetch unbounded so one consumer grabs the whole queue while others starve.

    source: RabbitMQ — Consumer Acknowledgements and Publisher Confirms ↗
  • Commonly asked mid trick common Trick: does Kafka delete a message once a consumer reads it? What actually controls retention?

    No — this is the key mental-model shift from traditional queues. Kafka is a durable log; reading a message does not remove it. The consumer just advances its offset (a bookmark), and the data stays on disk for everyone else to read. Multiple consumer groups can read the same messages independently, and a group can rewind its offset to replay.

    Retention is governed by configured policy, not consumption: time-based (retention.ms, e.g. 7 days) or size-based (retention.bytes), or log compaction (keep the latest value per key). Messages age out by policy regardless of whether anyone consumed them.

    What a strong answer covers
    • Reading does NOT delete — consumers advance an offset (a bookmark).

    • Data persists for all consumer groups; rewinding the offset replays.

    • Retention is by time/size policy or log compaction, independent of consumption.

    • Contrast: a traditional queue typically deletes on consume.

    Quick self-check

    What happens to a Kafka message after a consumer reads it?

    Red flag Treating Kafka like a delete-on-read queue. Messages persist until the retention policy expires them — consumption only moves an offset.

    source: Confluent — Kafka topics and retention ↗
  • Commonly asked senior concept occasional How does a message queue provide back-pressure and load leveling, and what's the risk if you ignore queue depth?

    A queue decouples producer rate from consumer rate: during a spike, messages buffer in the queue instead of overwhelming the downstream service, which keeps processing at its sustainable rate — that's load leveling (the queue-based load-leveling pattern). It smooths bursts into a steady drain.

    But a queue is finite. If producers persistently outpace consumers, queue depth grows unbounded: latency climbs (messages wait longer), memory/disk fills, and you risk hitting limits or processing hours-stale data. Back-pressure is signaling producers to slow down (reject, throttle, or block) when depth crosses a threshold. Always monitor and alert on queue depth/age, cap the queue, and decide a shed/back-pressure policy — a queue defers overload, it doesn't eliminate it.

    What a strong answer covers
    • Queue buffers bursts so consumers drain at a sustainable rate (load leveling).

    • Back-pressure = signaling producers to slow when the queue fills.

    • Unbounded growth → rising latency, stale data, resource exhaustion.

    • Monitor depth/age; cap the queue and define a shedding/back-pressure policy.

    Red flag Treating the queue as infinite elastic buffer. If consumers are chronically slower than producers, the queue just defers the overload while latency and staleness balloon.

    source: Microsoft — Queue-Based Load Leveling pattern ↗