Deployment strategies

Blue-green, canary, and rolling deploys — define each and give the tradeoff.

Shipping a new version is rarely an instant flip — it’s a transition during which old and new code coexist, and your strategy decides who sees the new version, for how long, and how fast you can undo it. The three you must be able to define and contrast are blue-green, canary, and rolling. The single axis that organizes all of them: how big is the blast radius if the new version is broken, and how fast can you roll back?

Key vocabulary

Blue-green: Run two full environments — blue (live) and green (new). Deploy to green, test it, then flip the load balancer to send 100% of traffic to green at once. Rollback = flip back.
Canary: Release the new version to a small slice of traffic (1–5%), watch its metrics, then progressively widen the rollout if it's healthy — or roll back the slice if not.
Rolling: Replace instances of the old version with the new one a few at a time, in batches, until the whole fleet is updated. No second environment needed.
Blast radius: How many users/requests a bad release can hurt before you stop it. Canary minimizes it; blue-green exposes everyone the instant you flip.
Backward compatibility: The property that new and old code can run simultaneously against the same database and APIs. Every gradual strategy requires it — especially for schema changes.
Expand/contract: A two-phase schema migration: first expand (additively add the new shape so both versions work), migrate the code, then later contract (remove the old shape) once nothing needs it.

The three strategies side by side

All three exist to reduce the blast radius of a bad deploy; they trade off cost, rollback speed, and how exposed users are during the transition.

Figure 1 (three rollout shapes): blue-green flips everyone at once; canary ramps a small slice while watching metrics; rolling swaps instances batch by batch.

Strategy	How it rolls out	Rollback	Cost / tradeoff
Blue-green	deploy to idle env, flip all traffic at once	instant — flip back to blue	needs 2× infrastructure; all users move together so a missed bug hits everyone at once
Canary	1% → 10% → 50% → 100%, watching metrics	fast — only the small slice was exposed	smallest blast radius, but slowest and needs good metrics/automation to judge health
Rolling	replace instances batch by batch	slow — must roll the new version back out	no extra infra, but mixed versions serve traffic mid-deploy and rollback isn't instant

Blue-green buys instant rollback with double infra; canary buys the smallest blast radius with the most machinery; rolling is cheapest but exposes mixed versions.

The one-line summary to keep: blue-green = instant cutover and instant rollback, but you pay for two environments and everyone moves at once. Canary = expose the fewest users to a bad release, at the cost of a slower, metrics-driven rollout. Rolling = cheapest (the Kubernetes default), but you accept a window where both versions serve traffic and rollback means rolling again.

A canary ramp, step by step

A canary isn’t a single switch — it’s a sequence of gates. At each step you shift a bit more traffic, then pause to compare the new version’s health against the stable one before proceeding.

A metrics-gated canary ramp

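# Illustrative pseudo-config, loosely modeled on progressive-delivery tools such as
# Argo Rollouts; the field names are not an exact schema.
# A canary advances through weighted steps; at each pause the
# automation compares the canary's error rate + latency to baseline.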
strategy:
  canary:
    steps:
      - setWeight: 1      # 1% of traffic to v2
      - pause: { duration: 5m }          # bake; watch error rate & p99
      - analysis: { metric: error-rate, threshold: 1% }  # auto-abort if exceeded
      - setWeight: 10
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 10m }
      - setWeight: 100    # full promotion — v2 becomes stable

The promotion is gated on metrics, not a timer alone: if the analysis step sees the canary's error rate cross the threshold, it auto-aborts and shifts all traffic back to v1, the stable version. At 1% weight, a bad release hurts roughly 1% of requests for about five minutes; that small blast radius is what canary's extra complexity buys.
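The gating logic described above can be sketched in a few lines of Python. This is a minimal illustration, not any controller's real implementation: `Health`, `run_canary`, `read_health`, and the two thresholds are hypothetical names and values chosen for the example, and the bake-time wait is left as a comment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Health:
    error_rate: float  # fraction of failed requests, e.g. 0.004 = 0.4%
    p99_ms: float      # 99th-percentile latency in milliseconds


def run_canary(
    weights: List[int],
    read_health: Callable[[int], Tuple[Health, Health]],
    max_error_delta: float = 0.01,  # assumed guardrail: +1 percentage point
    max_p99_ratio: float = 1.2,     # assumed guardrail: 20% slower than baseline
) -> Tuple[str, int]:
    """Walk the ramp; return ("promoted", final_weight) or ("aborted", weight)."""
    for weight in weights:
        # A real system would set the traffic weight here, then wait out the bake time.
        canary, baseline = read_health(weight)
        too_many_errors = canary.error_rate - baseline.error_rate > max_error_delta
        too_slow = canary.p99_ms > baseline.p99_ms * max_p99_ratio
        if too_many_errors or too_slow:
            return ("aborted", weight)  # all traffic goes back to the stable version
    return ("promoted", weights[-1])


def healthy(weight: int) -> Tuple[Health, Health]:
    return Health(0.002, 180.0), Health(0.002, 175.0)


def breaks_at_10(weight: int) -> Tuple[Health, Health]:
    baseline = Health(0.002, 175.0)
    canary = Health(0.05, 180.0) if weight >= 10 else Health(0.002, 180.0)
    return canary, baseline


RAMP = [1, 10, 50, 100]
result_ok = run_canary(RAMP, healthy)
result_bad = run_canary(RAMP, breaks_at_10)
```

The second run stops at the 10% step, so the regression never reaches the 50% or 100% stages.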

Why mixed versions force backward compatibility

The two-version window during a rolling deploy

During a rolling update, the fleet temporarily looks like this:

Load Balancer
   ├─ pod (v1)   ← still serving real traffic
   ├─ pod (v1)
   ├─ pod (v2)   ← already serving real traffic
   └─ pod (v2)
        │
        ▼
   shared database  ← BOTH versions read & write it

Both versions hit the same database at the same time. If v2 ships a migration that renames a column, every v1 pod still querying the old name starts throwing errors — and so does v2 if you roll back after the rename. The fix is the expand/contract pattern: first deploy a schema that adds the new column (expand) so both versions work, migrate code to use it, and only later remove the old column (contract) once no running version needs it.
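A column rename done as expand/contract might look like the sketch below, using an in-memory SQLite database so it runs anywhere. The table, the column names (`fullname`, `display_name`), and the helper functions are all hypothetical. It is also simplified: writes from v1 pods during the overlap would leave `display_name` empty, so a real migration would need a trigger or a repeated backfill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")


def v1_read(db):
    # Old code: only knows the old column name.
    return [row[0] for row in db.execute("SELECT fullname FROM users ORDER BY id")]


def v2_read(db):
    # New code: only reads the new column name.
    return [row[0] for row in db.execute("SELECT display_name FROM users ORDER BY id")]


def v2_write(db, name):
    # During the overlap, new code writes both columns so v1 readers keep working.
    db.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


# Expand: add the new column next to the old one and backfill it.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")
v2_write(conn, "Grace Hopper")

during_v1 = v1_read(conn)  # v1 pods still work
during_v2 = v2_read(conn)  # v2 pods work too

# Contract: only after no v1 pod remains, rebuild the table without the old column.
conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT)")
conn.execute("INSERT INTO users_new SELECT id, display_name FROM users")
conn.execute("DROP TABLE users")
conn.execute("ALTER TABLE users_new RENAME TO users")
after_v2 = v2_read(conn)
```

Between the expand and contract steps, either version can be rolled out or rolled back without a query failing.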

This is the deep point that separates a textbook deploy strategy from a real one: gradual rollouts only work if the new and old code are mutually compatible against shared state. Canary and rolling both run mixed versions; blue-green avoids the mixed application tier but still shares the database, so schema changes must be backward-compatible regardless.

02 Knowledge check

1 question · pass ≥ 70%

01 · medium
Which strategy lets you instantly roll the entire fleet back by flipping a switch?

03 Interview questions

What gets asked on this topic, with how to approach each question, the follow-ups, and the trap. Company tags are best-effort and sourced.

Commonly asked · mid · concept · very common
Compare blue-green, canary, and rolling deployments: define each and give the tradeoff.
Blue-green: run two full environments; blue serves prod while green gets the new version, then flip all traffic at once. Tradeoff: instant rollback (flip back), but you pay for double infrastructure.
Canary: release to a small slice of traffic/users first, watch metrics, then ramp up. Tradeoff: limits blast radius and catches real-world bugs early, but needs good monitoring and automated rollback, and the rollout is slower.
Rolling: replace instances in batches in place until all run the new version. Tradeoff: no extra infrastructure and simple, but both versions run simultaneously during the roll, rollback is slower, and bugs surface gradually.
Choice comes down to risk tolerance, infra budget, and how fast you need to recover.
Follow-ups they push on
- Which strategies require your two versions to be backward/forward compatible at the same time?
- How does a canary differ from a rolling deploy mechanically?
Red flag Confusing canary with rolling — canary targets a *traffic/user* slice and is metric-gated; rolling replaces *instances* batch by batch regardless of who they serve.
source: Unleash — Comparing deployment strategies ↗
Commonly asked · senior · design · common
Your service uses blue-green deploys. A migration adds a NOT NULL column. Why is this dangerous, and how do you ship it safely?
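During the flip (and any rollback), old and new code may both run against the *same* database. The old code does not know about the new column, so if the column is NOT NULL with no default, every insert from the old code fails. Rollback hits the same problem, because flipping back to blue puts that old code in front of the new schema; a destructive migration (dropping or renaming a column) makes rollback impossible outright.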
Safe approach is expand/contract (a.k.a. parallel change):
1. Expand: add the column as nullable / with a default — old and new code both work.
2. Deploy code that writes the new column on every insert and update.
3. Backfill existing rows.
4. Contract: only after all code uses it, add the NOT NULL constraint and drop old paths.
The rule: schema changes must be backward-compatible with the version still running.
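The failure and the safe path can be reproduced with an in-memory SQLite database. This is a sketch under assumptions: the `users` table, the `email` column, and the placeholder backfill value are invented for the example, and the "unsafe" table is created directly in its post-migration shape rather than migrated in place.

```python
import sqlite3


def old_code_insert(db, name):
    # Old (blue) code: written before the email column existed.
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))


# Unsafe: the schema as it looks after a one-shot "add NOT NULL column" migration.
unsafe = sqlite3.connect(":memory:")
unsafe.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)"
)
try:
    old_code_insert(unsafe, "ada")
    old_insert_survives_unsafe = True
except sqlite3.IntegrityError:
    # NOT NULL constraint failed: the old code supplies no email.
    old_insert_survives_unsafe = False

# Safe, step 1 (expand): add the column as nullable so old inserts still succeed.
safe = sqlite3.connect(":memory:")
safe.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
safe.execute("ALTER TABLE users ADD COLUMN email TEXT")
old_code_insert(safe, "ada")

# Step 2: new code writes the column.
safe.execute(
    "INSERT INTO users (name, email) VALUES (?, ?)", ("grace", "grace@example.com")
)

# Step 3: backfill rows written by old code (placeholder value is an assumption).
safe.execute("UPDATE users SET email = 'unknown@example.invalid' WHERE email IS NULL")
remaining_nulls = safe.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]

# Step 4 (contract) would add the NOT NULL constraint only once remaining_nulls
# stays at 0 and no old code is running.
```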
Follow-ups they push on
- How does the same problem bite a rolling deploy?
- Why should you never rename a column in a single migration?
Red flag Coupling a destructive/forward-only schema change to the same release as the code that needs it — it breaks the still-running old version and blocks rollback.
source: Martin Fowler — ParallelChange (expand/contract) ↗
Commonly asked · senior · concept · common
When would you choose blue-green over canary, and vice versa?
Blue-green suits big-bang releases where you want an instant, all-or-nothing cutover and the cleanest possible rollback — e.g. a major version where running both versions side by side for long is undesirable, and you can afford the duplicate environment.
Canary suits fast-evolving services where you want to validate a change against *real* production traffic before full exposure, and where a small percentage of affected users is an acceptable way to catch regressions monitoring can detect.
Real answer mentions constraints: canary needs solid metrics + automated rollback; blue-green needs budget for two environments and a story for shared state (DB, caches).
Follow-ups they push on
- What makes automated rollback feasible for a canary but trickier for blue-green?
- How do feature flags let you decouple deploy from release entirely?
Red flag Recommending canary without acknowledging it is useless without good observability to decide promote-vs-roll-back.
source: TechTarget — canary vs blue/green vs rolling ↗
Commonly asked · senior · concept · occasional
What's the difference between a deployment and a release, and why does the distinction matter?
Deploy = getting new code running in production. Release = exposing that behavior to users. Feature flags let you separate the two: you can deploy dark code that is off, then flip it on (release) independently — and turn it off without a redeploy.
Why it matters: it shrinks risk. Deploys become routine and frequent; releases become a business decision (flag on for 5%, then 50%, then all). Rollback of a feature is a config flip, not a redeploy. It also enables trunk-based development — unfinished work hides behind a flag instead of a long-lived branch.
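A percentage rollout behind a flag can be sketched with a stable hash, so the same user always lands in the same bucket and raising the percentage only ever adds users. The flag name `new-checkout` and the functions `bucket` and `is_enabled` are hypothetical; real flag services add targeting rules and remote configuration on top of this idea.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """The flag is on for users whose bucket falls below the rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


users = [f"user-{i}" for i in range(1000)]
on_at_0 = {u for u in users if is_enabled("new-checkout", u, 0)}
on_at_5 = {u for u in users if is_enabled("new-checkout", u, 5)}
on_at_50 = {u for u in users if is_enabled("new-checkout", u, 50)}
on_at_100 = {u for u in users if is_enabled("new-checkout", u, 100)}
```

Setting the percentage back to 0 is the "rollback": a config change, with no redeploy involved.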
Follow-ups they push on
- What is the operational cost of accumulating stale feature flags?
- How do flags enable canary-style releases without canary infrastructure?
Red flag Conflating deploy and release — assuming code is live to users the instant it is deployed, when a flag may gate it.
source: Martin Fowler — Feature Toggles ↗
Commonly asked · mid · concept · occasional
What is a deployment rollback, and why is 'roll forward' often preferred in practice?
Rollback restores the previous known-good version after a bad deploy. With blue-green it is a traffic flip; with rolling it means re-deploying the old image batch by batch.
Many mature teams prefer to roll forward (ship a fix as a new deploy) because rollback can be unsafe when the bad version has already written incompatible data or run a forward-only migration. You cannot easily 'un-migrate', and an old binary running against a new schema can fail outright or write corrupt data.
Strong answer: keep deploys small and frequent so the diff to fix or revert is tiny, make migrations backward-compatible so rollback stays an option, and automate whichever path you choose.
Follow-ups they push on
- When is rollback strictly impossible?
- How do small, frequent deploys make both rollback and roll-forward safer?
Red flag Assuming rollback is always safe — irreversible migrations or data written by the new version can make rolling back worse than rolling forward.
source: Google SRE Book — Release Engineering ↗
Commonly asked · senior · concept · occasional
How does Kubernetes implement a rolling update, and what knobs control its safety?
A Deployment's RollingUpdate strategy spins up new-version Pods and tears down old ones gradually, governed by two knobs:
- maxUnavailable — how many Pods below the desired count you tolerate during the roll (availability floor).
- maxSurge — how many extra Pods above desired you allow (capacity ceiling).
Kubernetes only routes traffic to a Pod once its readiness probe passes, so a broken new version that never becomes ready stalls the rollout instead of taking traffic. kubectl rollout undo reverts to the prior ReplicaSet.
For canary/blue-green you layer in a service mesh or progressive-delivery controller (Argo Rollouts, Flagger) — vanilla Deployments only do rolling.
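To make the two knobs concrete, the sketch below resolves them for a given replica count. The function names are hypothetical. The rounding (percentages round up for maxSurge, down for maxUnavailable) and the 25% defaults follow the Kubernetes Deployment documentation; the both-zero check is a simplified stand-in for the API server's validation.

```python
from typing import Dict, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percent string like '25%' into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-replicas * percent // 100)  # integer ceiling
        return replicas * percent // 100          # integer floor
    return int(value)


def rollout_bounds(
    replicas: int,
    max_surge: Union[int, str] = "25%",
    max_unavailable: Union[int, str] = "25%",
) -> Dict[str, int]:
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # No Pod may be added and none may be removed, so the roll cannot start.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "max_total_pods": replicas + surge,            # capacity ceiling
        "min_available_pods": replicas - unavailable,  # availability floor
    }


defaults_for_10 = rollout_bounds(10)
zero_downtime = rollout_bounds(4, max_surge=1, max_unavailable=0)
try:
    rollout_bounds(3, max_surge=0, max_unavailable=0)
    both_zero_rejected = False
except ValueError:
    both_zero_rejected = True
```

With the defaults and 10 replicas, the rollout may run up to 13 Pods in total and must keep at least 8 available.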
Follow-ups they push on
- Why is a correct readiness probe essential for a safe rolling update?
- What would maxSurge=0 together with maxUnavailable=0 mean, and why could such a rollout never make progress (which is why Kubernetes rejects the combination)?
Red flag Forgetting readiness probes — without them Kubernetes sends traffic to Pods that are up but not actually ready to serve.
source: Kubernetes — Rolling updates ↗
Commonly asked · senior · concept · common
What's the operational cost of feature flags, and how do you keep them from becoming tech debt?
Flags decouple deploy from release and are powerful, but each one adds a branch to your code's runtime behavior. Costs: combinatorial explosion (N flags = 2^N possible states you can't all test), stale flags that linger long after a rollout completes and confuse readers, and the risk of a flag becoming a permanent, undocumented config knob.
The fix is treating flags as short-lived by default: a release toggle exists only to ramp a feature, and you delete it (and its dead branch) the moment the feature is 100% rolled out. Distinguish flag *kinds* — release toggles are transient; ops/kill-switches and permissioning toggles are long-lived and managed differently. Track flags in a registry with an owner and an expiry, and add cleanup to the definition of done.
What a strong answer covers
- Each flag doubles the runtime state space — 2^N combinations quickly become untestable.
- Stale release toggles are tech debt: dead branches that mislead readers and rot.
- Categorize toggles — release (short-lived) vs ops/kill-switch and permissioning (long-lived) — and manage each differently.
- Give every flag an owner and expiry; deleting the flag is part of finishing the feature.
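A flag registry with owners and expiry dates could be as small as the sketch below. The `Flag` fields, the flag names, and the dates are invented for illustration; the point is only that stale release toggles become something a script can list.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Flag:
    name: str
    owner: str
    kind: str                # "release", "ops", or "permission"
    expires: Optional[date]  # long-lived ops/permission flags may have no expiry


def stale_release_flags(flags: List[Flag], today: date) -> List[str]:
    """Release toggles past their expiry date: candidates for deletion."""
    return [
        f.name
        for f in flags
        if f.kind == "release" and f.expires is not None and f.expires < today
    ]


def state_space(n_boolean_flags: int) -> int:
    """Number of on/off combinations for N independent boolean flags."""
    return 2 ** n_boolean_flags


registry = [
    Flag("new-checkout", "payments-team", "release", date(2024, 1, 31)),
    Flag("search-v2", "search-team", "release", date(2024, 12, 31)),
    Flag("disable-recommendations", "sre", "ops", None),
]
stale = stale_release_flags(registry, today=date(2024, 6, 1))
combinations_for_10_flags = state_space(10)
```

Running a check like this in CI is one way to make flag cleanup part of the definition of done.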
Follow-ups they push on
- Why are short-lived release toggles managed differently from long-lived kill-switches?
- How does an unbounded set of flags undermine your test strategy?
Red flag Leaving release toggles in the code after the feature is fully rolled out — they accumulate into untested, confusing dead branches and a combinatorial test nightmare.
source: Martin Fowler — Feature Toggles (Managing toggles) ↗
Google · mid · concept · common
Why should deployments be automated and repeatable rather than a manual checklist?
Manual deploys are slow, error-prone, and unrepeatable — the same human running the same steps will eventually skip one under pressure, and the process lives in one person's head. Automation makes the deploy deterministic and self-documenting: the pipeline *is* the runbook.
The SRE principle is that releases should be hermetic and reproducible — build from a known, version-controlled source with pinned tools so the same inputs always produce the same artifact, independent of the machine running the build. Combined with automated tests as gates, this lets you deploy frequently and safely, and makes rollback a known, rehearsed action rather than improvisation during an incident.
Frequent small automated deploys also shrink each change's blast radius — easier to test, easier to bisect, easier to revert.
What a strong answer covers
- Manual steps are non-repeatable and fail under pressure; automation makes deploys deterministic.
- Builds should be hermetic/reproducible — pinned source and tools, same inputs → same artifact.
- Automated test gates let you deploy frequently and safely, with rehearsed rollback.
- Small, frequent, automated releases shrink each deploy's blast radius.
Follow-ups they push on
- What does a 'hermetic build' mean and why does it aid reproducibility?
- How does deploy frequency relate to the size of each change's risk?
Red flag Relying on a manual, tribal-knowledge deploy checklist — it doesn't scale, drifts from reality, and turns every release into a risk that only one person can run.
source: Google SRE Book — Release Engineering ↗
Commonly asked · senior · design · occasional
What metrics actually decide whether to promote or roll back a canary?
A canary is only as good as the signal you judge it by. The decision should be automated and metric-gated, comparing the canary against the baseline (the current stable version) over the same window — not against historical numbers, since traffic shifts.
Watch the user-facing signals: error rate, latency percentiles (p95/p99, not the mean), and request success/throughput, plus key business metrics where relevant (checkout completion, sign-ups). Saturation of the canary's resources is a secondary guard. If any guardrail metric on the canary is statistically worse than baseline beyond a threshold, auto-roll-back; otherwise ramp traffic up in stages.
The pitfalls to design around: too short a bake time (a slow leak or a cache that hasn't warmed won't show yet), too little canary traffic (no statistical power), and comparing against the wrong baseline.
What a strong answer covers
- Compare the canary against the concurrent baseline, over the same window — not against historical data.
- Gate on user-facing signals: error rate, latency percentiles, success/throughput, plus business KPIs.
- Automate the verdict: breach a guardrail → auto-roll-back; otherwise ramp up in stages.
- Give it enough bake time and traffic volume for the signal to be statistically meaningful.
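One way to express "statistically worse than the concurrent baseline" is a one-sided two-proportion z-test on error counts, sketched below. This is an assumption about method, not what any particular canary tool does: the function name, the minimum sample size of 1000 requests, and the z threshold of 2.33 (roughly 1% one-sided significance) are illustrative choices.

```python
import math
from typing import Optional


def canary_is_worse(
    canary_errors: int,
    canary_total: int,
    base_errors: int,
    base_total: int,
    z_threshold: float = 2.33,
    min_requests: int = 1000,
) -> Optional[bool]:
    """True = roll back, False = no evidence of regression, None = too little traffic."""
    if canary_total < min_requests or base_total < min_requests:
        return None  # not enough statistical power to judge either way
    p_canary = canary_errors / canary_total
    p_base = base_errors / base_total
    pooled = (canary_errors + base_errors) / (canary_total + base_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / base_total))
    if std_err == 0:
        return False  # identical all-success (or all-failure) samples
    z = (p_canary - p_base) / std_err
    return z > z_threshold


# Both sides measured over the same window.
clear_regression = canary_is_worse(60, 2000, 200, 20000)   # 3% vs 1%
within_noise = canary_is_worse(21, 2000, 200, 20000)       # 1.05% vs 1%
too_little_traffic = canary_is_worse(5, 100, 200, 20000)   # only 100 canary requests
```

The `None` result matters as much as the other two: a canary with too little traffic should extend its bake rather than be promoted by default.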
Follow-ups they push on
- Why compare against a concurrent baseline rather than yesterday's numbers?
- What kind of bug would a 10-minute canary with 1% traffic still miss?
Red flag Promoting a canary too quickly or on too little traffic — slow leaks, cold caches, and rare-path errors don't surface in a short, low-volume bake, so the 'green' canary ships a latent bug.
source: Google SRE Workbook — Canarying Releases ↗
Commonly asked · mid · concept · common
What's the difference between continuous delivery and continuous deployment?
Both build on continuous integration (merge and test small changes frequently) and keep main in an always-releasable state. The difference is the last step.
Continuous delivery: every change that passes the pipeline is *ready* to deploy, but the actual push to production is a manual decision — a human clicks the button. You can release any time, on demand.
Continuous deployment: there is no manual gate — every change that passes all automated checks deploys to production automatically. It demands very strong test coverage, automated rollback, and good observability, because nothing stops a bad change but the pipeline itself.
So: continuous *delivery* makes release a one-click choice; continuous *deployment* removes the click. Many teams practice continuous delivery and reserve fully automatic deployment for services where they trust their safety nets.
What a strong answer covers
- Both rest on CI and an always-releasable main.
- Continuous delivery = always *ready* to ship, but a human triggers the production release.
- Continuous deployment = every passing change auto-ships to prod, with no manual gate.
- Continuous deployment requires strong automated tests, rollback, and observability to be safe.
Quick self-check
What is the single distinguishing feature of continuous *deployment* versus continuous *delivery*?
Follow-ups they push on
- What safety nets must be in place before you trust full continuous deployment?
- Where do feature flags fit in a continuous-deployment pipeline?
Red flag Using 'CD' loosely — interviewers care that you distinguish *delivery* (manual release trigger) from *deployment* (fully automatic), and know the latter's higher safety-net bar.
source: Atlassian — Continuous integration vs delivery vs deployment ↗
Commonly asked · senior · debug · occasional
Your canary shows no errors and gets promoted to 100%, then production falls over an hour later. What likely went wrong?
The canary passed because the failure mode wasn't *visible at canary scale or duration*. The usual suspects:
- A slow resource leak (memory, file descriptors, connection-pool exhaustion) that only crosses the limit after an hour of uptime — the short bake never reached it.
- A load/scale effect: at 1% traffic a new query or lock was fine; at 100% it saturates the database or a downstream dependency that the small canary never pressured.
- Cold→warm transitions: caches were warm on the old fleet but the new version's cache was cold under full load, or a thundering-herd on cutover.
- Time/cron-triggered behavior (a batch job, TTL expiry) that simply hadn't fired during the canary window.
Response: roll back (or roll forward a fix), then fix the *process* — longer bake time, load-aware canary analysis, and dependency/saturation metrics, not just error rate.
What a strong answer covers
- Canaries miss bugs that need time (leaks, scheduled jobs) or scale (DB/lock saturation at full traffic) to manifest.
- 1% traffic gives little statistical power for rare paths and puts almost no pressure on shared downstreams.
- Cold caches / thundering herd on full cutover can sink a version that looked fine warm.
- Fix the process: longer bake time, watch saturation and dependencies, not error rate alone.
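The slow-leak case is simple arithmetic, sketched below with made-up numbers: a process that starts at 400 MB, leaks 10 MB per minute, and has a 1000 MB limit. The function names and the trend-based check are hypothetical; they illustrate why a gate on memory growth can catch what an error-rate gate cannot.

```python
from typing import List, Tuple


def minutes_until_limit(current_mb: float, limit_mb: float, leak_mb_per_min: float) -> float:
    """Time until memory reaches the limit at a constant leak rate."""
    if leak_mb_per_min <= 0:
        return float("inf")
    return (limit_mb - current_mb) / leak_mb_per_min


def leak_suspected(
    samples: List[Tuple[float, float]], limit_mb: float, horizon_min: float
) -> bool:
    """samples are (minute, memory_mb) pairs; extrapolate the first-to-last slope."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    slope = (m1 - m0) / (t1 - t0)
    return minutes_until_limit(m1, limit_mb, slope) <= horizon_min


# 400 MB at start, 1000 MB limit, 10 MB/min leak: the limit is hit after 60 minutes.
time_to_oom = minutes_until_limit(400, 1000, 10)

# A 15-minute bake ends at 550 MB: well under the limit, so no errors yet.
bake_samples = [(0, 400.0), (15, 550.0)]
still_under_limit = bake_samples[-1][1] < 1000

# A trend check over the same samples projects the limit within the next 4 hours.
flagged_by_trend = leak_suspected(bake_samples, limit_mb=1000, horizon_min=240)
```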
Follow-ups they push on
- How would a memory leak escape a 15-minute canary but kill the fleet in an hour?
- Why can a query be fine at 1% traffic and lethal at 100%?
Red flag Trusting a short, low-traffic canary as proof of safety — error-rate-only, brief canaries are blind to leaks, scale effects, and time-triggered behavior.
source: Google SRE Workbook — Canarying Releases ↗
Commonly asked · senior · concept · common
Why does any zero-downtime deploy require old and new versions to be compatible, and what breaks if they aren't?
Rolling, canary, and blue-green (during the flip) all have a window where both versions serve traffic simultaneously against shared state — the same database, the same message formats, the same caches and clients. If the versions aren't mutually compatible, that window corrupts data or throws errors.
Concrete breakages: the new version writes a message/field the old version can't parse (or vice versa); the new schema drops or renames a column the old code still reads; a client gets a v2 response from the new instance then a v1 from the old one on the next request. Rollback is the same problem in reverse — the old version must tolerate data the new version already wrote.
The discipline is backward and forward compatibility via expand/contract: change in additive, tolerant steps (add before you read, deploy before you require, contract only after everything is upgraded) so any two adjacent versions can coexist.
What a strong answer covers
- Every zero-downtime strategy has a window where N and N+1 run together on shared state.
- Incompatibility there means corrupted data or runtime errors, not a clean failure.
- Rollback needs the same property in reverse: old code must tolerate data new code wrote.
- The cure is expand/contract / parallel change — additive, tolerant steps so adjacent versions coexist.
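For message formats, "additive and tolerant" usually means the old reader ignores fields it does not know and the new reader supplies a default for fields that may be missing. The sketch below assumes a hypothetical JSON order message where v2 adds a `currency` field; the field names and the `USD` default are illustrative.

```python
import json


def v1_parse(raw: str) -> dict:
    """Old reader: picks the fields it knows and ignores anything extra."""
    msg = json.loads(raw)
    return {"order_id": msg["order_id"], "amount": msg["amount"]}


def v2_parse(raw: str) -> dict:
    """New reader: the added field is optional, with a default for old messages."""
    msg = json.loads(raw)
    return {
        "order_id": msg["order_id"],
        "amount": msg["amount"],
        "currency": msg.get("currency", "USD"),  # assumed default
    }


v1_message = json.dumps({"order_id": 7, "amount": 1250})
v2_message = json.dumps({"order_id": 8, "amount": 900, "currency": "EUR"})

old_reads_new = v1_parse(v2_message)  # forward compatibility
new_reads_old = v2_parse(v1_message)  # backward compatibility
```

Had v2 renamed `amount` instead of adding a field, `v1_parse` would raise a `KeyError` on every v2 message during the overlap window.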
Follow-ups they push on
- How does expand/contract make a column rename safe across a rolling deploy?
- Why does message-queue schema evolution need both forward and backward compatibility?
Red flag Assuming a deploy is atomic — for the duration of any rolling/canary/blue-green cutover two versions coexist, so a breaking change to schema or wire format corrupts the in-flight overlap.
source: Martin Fowler — ParallelChange (expand/contract) ↗