Cloud fundamentals

Compute/storage/networking primitives, regions & availability zones, and IAM/least privilege.

Underneath every cloud provider’s hundreds of services sit three primitives — compute, storage, and networking — plus two cross-cutting concepts that govern everything: the geography of regions and availability zones (for resilience and latency), and IAM (who can do what, ideally as little as necessary). Learn these and a new service is usually just a managed variation on something you already understand: a queue, a database, a function runner.

Key vocabulary

Compute: Where your code runs, on a spectrum from most to least self-managed: VMs/instances (EC2, Compute Engine) → containers (ECS/EKS, Cloud Run) → serverless functions (Lambda, Cloud Functions). Less management means less control and more vendor coupling.
Storage: Three shapes: object (S3/GCS: flat key→blob, effectively unlimited capacity, HTTP API, for files/media/backups), block (EBS: a raw virtual disk you attach to one VM), and file (EFS/NFS: a shared filesystem many machines mount).
Region vs Availability Zone (AZ): A region is a geographic area (us-east-1); an AZ is one or more isolated datacenters within it. AZs in a region have low-latency links but independent power/cooling/network, so spreading across AZs survives the failure of a single AZ.
VPC: Virtual Private Cloud — your own isolated network in the provider, carved into subnets (public = internet-reachable, private = internal-only). The boundary inside which your compute and databases talk.
IAM: Identity & Access Management — the system of identities (users, roles, service accounts) and policies that grant permissions to actions on resources. The control plane for who can do what.

The three primitives

Almost every workload decomposes into these. Pick the compute model by how much operational control you need versus how little you want to manage; pick the storage shape by the access pattern.

Primitive	Options (AWS example)	Reach for it when	Watch out
Compute — VM	EC2	Full OS control, long-running, legacy/stateful workloads	You patch + scale + manage the OS yourself
Compute — container	ECS / EKS, Fargate	Portable services, microservices, want orchestration	Cluster/orchestration overhead (see K8s)
Compute — serverless	Lambda	Spiky/event-driven, want zero idle cost + no servers	Cold starts, time/size limits, vendor lock-in
Storage — object	S3	Files, images, video, backups, static sites, data lakes	Eventually-consistent listings historically; not a filesystem
Storage — block	EBS	A boot/data disk for a single VM (databases on a VM)	Attaches to one instance/AZ at a time
Networking	VPC + subnets + SG	Isolating + connecting everything above	Mis-scoped security groups = exposed databases

Compute = where code runs; storage = how data is shaped; networking = how it all connects + who's reachable.
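
The selection rule above can be sketched as a small decision helper. This is an illustrative simplification rather than a sizing tool: the `Workload` fields and the `pick_compute` / `pick_storage` names are assumptions made up for this example, and real choices also weigh cost, latency, and team experience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    needs_os_control: bool = False         # custom kernel, legacy agents, full OS access
    event_driven_and_spiky: bool = False   # short bursts, idle most of the time
    needs_mounted_disk: bool = False       # an OS or database expects a disk / POSIX paths
    shared_by_many_machines: bool = False  # several hosts read and write the same files


def pick_compute(workload: Workload) -> str:
    """Prefer the least-managed model that still meets the control requirement."""
    if workload.needs_os_control:
        return "vm"
    if workload.event_driven_and_spiky:
        return "serverless"
    return "container"


def pick_storage(workload: Workload) -> str:
    """Choose the storage shape from the access pattern."""
    if workload.needs_mounted_disk and workload.shared_by_many_machines:
        return "file"
    if workload.needs_mounted_disk:
        return "block"
    return "object"


thumbnailer = Workload(event_driven_and_spiky=True)
print(pick_compute(thumbnailer), pick_storage(thumbnailer))  # serverless object
```

The order of the checks encodes the trade-off in the table: a hard need for OS control wins over everything else, and object storage is the fallback whenever nothing needs a mounted disk.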

Regions and availability zones — designing for failure

The geography is an availability tool. Deploy across multiple AZs within a region and a single datacenter losing power doesn’t take your service down — the load balancer routes to the healthy AZ. Deploy across multiple regions for disaster recovery and to put data near distant users (lower latency), at the cost of cross-region replication complexity and data-residency considerations.

FIG 1 · Region with three AZs. A region is a geography; the AZs inside it are independent datacenters on low-latency links. Spreading replicas across AZs means one datacenter losing power doesn't take the service down.
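
To make the multi-AZ argument concrete, here is a minimal sketch that places replicas round-robin across zones and counts how many survive the loss of any single zone. The function names are hypothetical, and real schedulers (auto-scaling groups, Kubernetes) apply more elaborate placement rules.

```python
from itertools import cycle


def place_replicas(replica_count: int, zones: list[str]) -> list[str]:
    """Assign replicas to zones round-robin so they are spread as evenly as possible."""
    if not zones:
        raise ValueError("need at least one zone")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def survivors_after_zone_loss(placement: list[str], failed_zone: str) -> int:
    """Count replicas that are still running when one zone goes dark."""
    return sum(1 for zone in placement if zone != failed_zone)


def worst_case_survivors(placement: list[str]) -> int:
    """Fewest replicas left after losing whichever single zone hurts the most."""
    return min(survivors_after_zone_loss(placement, zone) for zone in set(placement))


multi_az = place_replicas(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
single_az = place_replicas(6, ["us-east-1a"])

print(worst_case_survivors(multi_az))   # 4: any one zone can fail and 4 replicas remain
print(worst_case_survivors(single_az))  # 0: the only zone failing takes everything down
```

The same six replicas give very different failure behavior depending only on placement, which is the point of treating geography as an availability tool.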

IAM and least privilege

IAM ties identities (users, roles, service accounts) to policies that allow or deny specific actions on specific resources. The governing principle is least privilege: grant the minimum permissions needed for the task, nothing more — so a leaked credential or a compromised service can do limited damage.

A least-privilege IAM policy

This policy lets a service read and write objects in one bucket and nothing else. No wildcard actions, no other buckets, no bucket deletion. The trailing /* in the resource ARN matches objects inside that one bucket only.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadWriteAppAssets",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::myapp-assets-prod/*"
    }
  ]
}

Contrast with the anti-pattern "Action": "s3:*" on "Resource": "*" — that grants every S3 action on every bucket in the account. The scoped policy above means a compromise of this service exposes exactly one bucket’s objects, not your entire S3 footprint. Prefer assigning policies to roles that compute assumes (no long-lived keys to leak) over static access keys on a user.
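
One way to catch the anti-pattern early is a small lint pass over policy documents before they are deployed. The sketch below is an assumption-laden illustration, not an AWS tool: `find_wildcard_grants` is a made-up name, and it only flags the two broad patterns discussed above (a wildcard action such as `s3:*` or `*`, and a bare `*` resource) in `Allow` statements.

```python
import json


def _as_list(value):
    """IAM allows a single string or a list for Action/Resource; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def find_wildcard_grants(policy: dict) -> list[str]:
    """Return human-readable findings for over-broad Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement object is also valid
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        label = statement.get("Sid", f"statement[{index}]")
        for action in _as_list(statement.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"{label}: wildcard action {action!r}")
        for resource in _as_list(statement.get("Resource", [])):
            # Only a bare "*" is flagged; "arn:...:bucket/*" is scoped to one bucket.
            if resource == "*":
                findings.append(f"{label}: wildcard resource '*'")
    return findings


scoped = json.loads("""
{"Version": "2012-10-17",
 "Statement": [{"Sid": "ReadWriteAppAssets", "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": "arn:aws:s3:::myapp-assets-prod/*"}]}
""")
broad = {"Version": "2012-10-17",
         "Statement": {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}}

print(find_wildcard_grants(scoped))  # []
print(len(find_wildcard_grants(broad)))  # 2
```

A check like this fits naturally in CI next to the infrastructure code, so a broad grant added "to get it working" is caught in review rather than in production.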

Key points

Three primitives: compute (VM → container → serverless, trading control for less management), storage (object / block / file by access pattern), networking (VPC + subnets connecting it all).
Object storage (S3/GCS) is the default for files/media/backups: flat, HTTP-accessed, and effectively unlimited in capacity; block storage is a disk for one VM.
A region is a geography; an AZ is an isolated datacenter (or group of datacenters) within it. Span multiple AZs for standard HA; go multi-region only for DR, residency, or global latency.
IAM governs who can do what; apply least privilege — scope every policy to specific actions and resource ARNs, default to deny, and prefer short-lived roles over static keys.
Wildcard policies and public buckets are the classic causes of cloud breaches; narrow grants and private-by-default are the fix.

Interview questions

What gets asked on this topic: for each question, how to approach it, the follow-ups, and the trap.

Commonly asked · mid · concept · very common
What is the difference between a region and an availability zone, and how do you use them for high availability?
A region is a geographic area (e.g. us-east-1). Inside each region are multiple availability zones (AZs) — physically separate data centers with independent power, cooling, and networking, connected by high-bandwidth, low-latency links (single-digit ms).
For high availability, spread your workload across multiple AZs in a region: if one AZ loses power, the others keep serving, and a load balancer routes around the failed zone. That protects against a data-center-level failure with negligible latency cost. Going multi-region adds protection against a whole-region outage and lets you serve users closer to them, but it is far more complex (cross-region replication, data consistency, higher latency between regions). The pragmatic default is multi-AZ within one region; reach for multi-region when you genuinely need regional fault tolerance or global low latency.
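
A back-of-the-envelope calculation shows why a second or third AZ helps so much. The sketch assumes zones fail independently and uses a hypothetical 99% per-zone availability; real failures can be correlated (a bad deploy or a regional control-plane issue hits every zone), so treat the result as an upper bound rather than a promise.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(per_zone_availability: float, zone_count: int) -> float:
    """Probability that at least one zone is up, assuming independent failures."""
    if not 0.0 <= per_zone_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if zone_count < 1:
        raise ValueError("need at least one zone")
    return 1 - (1 - per_zone_availability) ** zone_count


def expected_downtime_minutes(availability: float) -> float:
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


for zones in (1, 2, 3):
    availability = combined_availability(0.99, zones)  # 0.99 is a made-up figure
    print(zones, f"{availability:.6f}", f"{expected_downtime_minutes(availability):.1f} min/yr")
```

Each extra zone multiplies the chance of total outage by the per-zone failure probability, which is why multi-AZ buys a large availability gain for little latency cost.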
Follow-ups they push on
- Why is multi-AZ the common HA default rather than multi-region?
- What new problems does going multi-region introduce?
Red flag: Confusing the two, or running everything in a single AZ and calling it 'in the cloud so it's highly available'; one AZ failure then takes the whole service down.
source: AWS, Regions and Availability Zones
Commonly asked · senior · concept · very common
Explain the principle of least privilege in cloud IAM, with a concrete example.
Least privilege means every identity (user, role, service) gets exactly the permissions it needs to do its job and nothing more. The smaller the granted permission set, the smaller the blast radius if those credentials leak or the service is compromised.
Concrete example: a Lambda that only reads from one bucket should have a policy granting s3:GetObject scoped to that specific bucket's ARN — not s3:* on *. Wildcards like Action: * / Resource: * are the classic violation. In practice: prefer roles with temporary credentials over long-lived access keys, scope policies to specific actions and resource ARNs, start from deny and add only what is needed, and review/trim permissions over time. Pair it with separation of duties so no single role can both deploy and exfiltrate.
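
The read-only example can be generated rather than hand-written, which makes the scoped form the easy path. This is a minimal sketch: `read_only_object_policy` and the bucket name `myapp-reports-prod` are hypothetical, and the name check is a simplified version of the S3 bucket naming rules, included mainly so that a `*` can never slip into the ARN.

```python
import json
import re

# Simplified bucket-name check: 3-63 characters, lowercase letters, digits,
# dots and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def read_only_object_policy(bucket_name: str) -> dict:
    """Build a policy that allows reading objects from exactly one bucket."""
    if not _BUCKET_NAME.fullmatch(bucket_name):
        raise ValueError(f"not a plausible bucket name: {bucket_name!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],  # no PutObject, no DeleteObject, no s3:*
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


print(json.dumps(read_only_object_policy("myapp-reports-prod"), indent=2))
```

Attaching the resulting document to the role the Lambda assumes, rather than to a user with static keys, keeps both the permissions and the credentials narrow.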
Follow-ups they push on
- Why prefer IAM roles with temporary credentials over static access keys?
- How do you discover and trim over-broad permissions after the fact?
Red flag: Granting broad wildcard policies (`s3:*` on `*`) 'to get it working' and never tightening them; one leaked key then has the run of the whole account.
source: AWS, IAM security best practices (least privilege)
Commonly asked · mid · concept · common
Walk me through the core cloud compute, storage, and networking primitives and when you'd reach for each.
Compute: VMs (EC2-style — full control, you manage the OS), containers (ECS/EKS — packaged apps, orchestrated), and serverless functions (Lambda — event-driven, no servers to manage, scales to zero). Move up that ladder as you want less operational overhead and more elasticity.
Storage: object storage (S3 — cheap, durable, infinite-scale blobs: images, backups, static assets), block storage (EBS — a virtual disk attached to one VM, for databases/filesystems), and file storage (EFS/NFS — a shared filesystem across many machines). Match the access pattern: blobs over HTTP -> object; a disk for one instance -> block; shared POSIX filesystem -> file.
Networking: a VPC is your isolated private network; subnets segment it (public vs private); security groups are instance-level firewalls; and a load balancer spreads traffic across instances. The skill is mapping a workload to the cheapest primitive that fits its access and durability needs.
Follow-ups they push on
- When would you pick object storage over block storage?
- When does serverless make sense vs a long-running container?
Red flag: Reaching for a full VM you have to patch and babysit when a managed/serverless option fits, or running a database on object storage (wrong access pattern) instead of block storage.
source: AWS, Types of cloud computing / core services
Commonly asked · mid · concept · common
What is the cloud shared responsibility model, and why does it matter?
Security is split between the provider and you. The provider is responsible for security OF the cloud — the physical data centers, hardware, the hypervisor, and the managed-service infrastructure. You are responsible for security IN the cloud — your data, IAM users and permissions, network config (security groups, public/private subnets), OS patching on VMs you run, and application-level security.
The line shifts with the service tier: with a raw VM you patch the OS; with a managed database the provider patches it but you still own access control and your data; with serverless even more moves to the provider, but IAM and data are always yours. It matters because most cloud breaches are customer-side misconfigurations — a public S3 bucket or an over-permissive IAM policy — not the provider being hacked.
Follow-ups they push on
- How does the responsibility line move between a self-managed VM and a managed service?
- Whose fault is a publicly exposed storage bucket under this model?
Red flag: Assuming 'the cloud provider handles security' end to end. IAM, data, and network configuration are always the customer's responsibility, and that is where most breaches actually happen.
source: AWS, Shared Responsibility Model
Commonly asked · mid · concept · common
What is the difference between vertical and horizontal scaling in the cloud, and which does the cloud make easy?
Vertical scaling (scale up) means giving one instance more resources — a bigger CPU/RAM tier. It is simple and needs no app changes, but you hit a hardware ceiling, usually need a restart/downtime to resize, and the single box is still a single point of failure.
Horizontal scaling (scale out) means adding more instances behind a load balancer. It scales effectively without limit and improves availability (lose one node, the rest serve), which is exactly what cloud auto-scaling groups automate — add instances when load rises, remove them when it falls. The catch is the app must be stateless (or externalize session state to a shared store like Redis) so any instance can handle any request. The cloud's elasticity is built around horizontal scaling; that is why 'make services stateless' is such a load-bearing design rule.
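
The "add instances when load rises, remove them when it falls" loop can be sketched with a proportional rule: scale the fleet by the ratio of the observed metric to the target, then clamp to configured bounds. This mirrors the formula documented for the Kubernetes Horizontal Pod Autoscaler; it is not the exact algorithm of any cloud auto-scaling group, and the function name and parameters are assumptions for illustration.

```python
import math


def desired_instance_count(current_count: int, current_metric: float,
                           target_metric: float, min_count: int, max_count: int) -> int:
    """Proportional scaling: keep the per-instance metric near the target."""
    if current_count < 1:
        raise ValueError("current_count must be at least 1")
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if min_count > max_count:
        raise ValueError("min_count cannot exceed max_count")
    # If 4 instances average 90% CPU against a 60% target, the fleet needs
    # 4 * 90 / 60 = 6 instances to bring the average back to the target.
    raw = math.ceil(current_count * current_metric / target_metric)
    return max(min_count, min(max_count, raw))


print(desired_instance_count(4, 90, 60, min_count=2, max_count=12))   # 6 (scale out)
print(desired_instance_count(4, 15, 60, min_count=2, max_count=12))   # 2 (scale in, floor applies)
print(desired_instance_count(10, 95, 50, min_count=2, max_count=12))  # 12 (ceiling applies)
```

The rule only works if any instance can serve any request, which is the statelessness requirement described above.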
Follow-ups they push on
- Why does horizontal scaling require stateless services?
- What does an auto-scaling group buy you over manually resizing an instance?
Red flag: Trying to scale a stateful, session-on-the-box service horizontally. Requests landing on a different instance lose the session, so you are forced back into sticky sessions or a single big vertical box.
source: AWS, Auto Scaling / scaling concepts
Commonly asked · senior · concept · occasional
Why might a company choose managed cloud services over self-hosting, and what are the tradeoffs?
Managed services (RDS instead of running your own Postgres, EKS instead of bootstrapping Kubernetes) shift operational burden to the provider: patching, backups, failover, scaling, and HA come built in, so a small team ships faster and pages less. You trade money and some control for time and reliability.
The tradeoffs: higher direct cost, less control over versions/tuning/internals, and vendor lock-in (managed offerings differ across clouds, raising switching cost). Self-hosting gives maximum control and can be cheaper at very large, steady scale, but you now own the on-call, the upgrades, and the failure modes. The senior answer weighs team size, scale, and how differentiating the capability is: do not burn your scarce engineers running undifferentiated infrastructure a managed service handles well.
Follow-ups they push on
- How does vendor lock-in factor into choosing a managed service?
- At what scale might self-hosting actually become the cheaper choice?
Red flag: Defaulting to self-hosting core infrastructure 'to save money' on a small team. The hidden cost is the engineering time and on-call burden of operating it, which usually dwarfs the managed-service bill.
source: AWS, What is managed services / cloud value
Commonly asked · mid · concept · very common
What is the difference between authentication and authorization in cloud IAM, and how do roles fit in?
Authentication answers 'who are you?' — proving identity (a user signing in, a service presenting credentials or a token). Authorization answers 'what are you allowed to do?' — evaluating policies to decide whether that proven identity may perform an action on a resource. Authn comes first; authz comes after. They're distinct: a correctly authenticated user can still be denied an action.
In cloud IAM, policies are the authorization rules (allow/deny on actions + resources), attached to identities. An IAM role is an identity with policies but no permanent credentials — instead, a trusted principal (an EC2 instance, a Lambda, another account, a federated user) assumes the role and receives temporary, auto-rotating credentials. That's why roles are the best-practice way to grant permissions to services: no long-lived access keys to leak.
So: authn = identity, authz = permissions (policies), and roles = a way to hand out scoped, temporary permissions to whoever/whatever assumes them.
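
The two steps can be shown as a toy request handler: first resolve who is calling, then evaluate policies for that identity. Everything here is a simplified assumption for illustration (the token table, the `handle_request` name, glob matching via `fnmatch`); real IAM evaluation has more layers, but the ordering it follows is the same: an explicit deny wins, an allow is needed, and the default is deny.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical credential table: token -> identity name.
KNOWN_TOKENS = {"token-abc": "report-service"}

# Hypothetical policies per identity, shaped like IAM statements.
POLICIES = {
    "report-service": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::myapp-assets-prod/*"},
        {"Effect": "Deny", "Action": "s3:*",
         "Resource": "arn:aws:s3:::myapp-assets-prod/secrets/*"},
    ]
}


def authenticate(token: str) -> Optional[str]:
    """Authentication: map a presented credential to an identity, or None."""
    return KNOWN_TOKENS.get(token)


def authorize(identity: str, action: str, resource: str) -> bool:
    """Authorization: explicit Deny wins, otherwise an Allow is required."""
    allowed = False
    for statement in POLICIES.get(identity, []):
        matches = (fnmatchcase(action, statement["Action"])
                   and fnmatchcase(resource, statement["Resource"]))
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return False
        allowed = True
    return allowed  # stays False when nothing matched: default deny


def handle_request(token: str, action: str, resource: str) -> str:
    identity = authenticate(token)
    if identity is None:
        return "401 unauthenticated"
    if not authorize(identity, action, resource):
        return "403 forbidden"  # identity is proven, permission is not granted
    return "200 ok"


print(handle_request("token-abc", "s3:PutObject", "arn:aws:s3:::myapp-assets-prod/logo.png"))
```

The last call prints `403 forbidden`: the caller is correctly authenticated, yet no policy allows `s3:PutObject`, which is exactly the case where authn succeeds and authz denies.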
What a strong answer covers
- Authentication = prove who you are; authorization = what you're allowed to do (policies).
- Authn happens first; an authenticated identity can still be denied by authorization.
- Policies encode authorization (allow/deny on actions + resources).
- An IAM role has no permanent credentials — principals assume it for temporary ones.
- Roles are best practice for services (EC2/Lambda): no long-lived keys to leak.
Quick self-check
An EC2 instance needs to read one S3 bucket. What is the best-practice way to grant this?
Follow-ups they push on
- Why are IAM roles with temporary credentials safer than static access keys for a service?
- Can an authenticated identity ever be denied? Why?
- What does it mean for a principal to 'assume' a role?
Red flag: Conflating authentication with authorization. Proving identity (authn) does not grant any permission; access is still decided by the policies evaluated at the authorization step.
source: AWS, IAM identities (roles) / how IAM works
Commonly asked · mid · concept · common
What is object storage (like S3), and why is it not a filesystem or a database?
Object storage stores data as objects — a blob of bytes plus metadata and a unique key — in a flat namespace (a bucket), accessed over HTTP APIs (GET/PUT), not a mounted disk. It's built for massive scale, very high durability (S3 famously targets eleven 9s by replicating across devices/AZs), and cheap capacity. Ideal for images, video, backups, logs, static website assets, and data-lake files.
Why it's not a filesystem: there are no real directories (the '/' in a key is cosmetic — it's a flat key space), you can't do partial in-place edits efficiently (you generally replace the whole object), and there's no POSIX file locking or low-latency random byte access like a block device. Why it's not a database: no transactions, no rich queries/joins, no secondary indexes — it's a key→blob store, not a query engine.
The skill is matching the access pattern: whole-blob read/write over HTTP, write-once-read-many, durability over mutability → object storage. Mutable structured records you query → a database. A disk for an OS/DB → block storage.
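
A tiny in-memory model makes the "flat key space" point tangible. This is a teaching sketch with made-up names (`FlatObjectStore`, `list_keys`), not how S3 is implemented: keys are plain strings, a "folder" is just a shared prefix, and a write replaces the whole object.

```python
class FlatObjectStore:
    """Toy object store: one flat dictionary of key -> bytes, no directories."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # A put replaces the entire object; there is no in-place partial edit.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Listing a folder" is only a string-prefix filter over all keys.
        return sorted(key for key in self._objects if key.startswith(prefix))


store = FlatObjectStore()
store.put("photos/2024/cat.jpg", b"cat-bytes")
store.put("photos/2024/dog.jpg", b"dog-bytes")
store.put("photos-archive/old.jpg", b"old-bytes")

print(store.list_keys("photos/"))  # the two keys under the "photos/" prefix
print(store.list_keys("photos"))   # all three keys: the prefix is just text
```

Dropping the trailing slash changes the result because nothing treats `/` specially; the directory structure exists only in how tools choose to display the keys.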
What a strong answer covers
- Objects = blob + metadata + key in a flat bucket namespace, accessed via HTTP APIs.
- Built for scale, extreme durability (S3 ~11 nines), and low cost — images, backups, logs, assets.
- Not a filesystem: no real directories, no efficient partial edits, no POSIX locking/random access.
- Not a database: no transactions, joins, or queries — it's key→blob.
- Match access pattern: whole-blob, write-once-read-many → object storage.
Quick self-check
Which workload is the BEST fit for object storage like S3?
Follow-ups they push on
- Why is the '/' in an S3 key not a real directory?
- When would block storage be the right choice over object storage?
- What makes object storage so durable?
Red flag: Using object storage as a database or a mutable filesystem. There are no transactions/queries and no efficient in-place edits, so a workload needing those will be slow, awkward, or incorrect.
source: AWS, What is object storage? (S3)
Commonly asked · mid · concept · common
Compare the IaaS, PaaS, and SaaS service models. Who manages what at each level?
It's a ladder of how much the provider manages vs you. IaaS (raw VMs, networking, storage — EC2) gives you the infrastructure; you still manage the OS, runtime, and app. Most control, most operational burden. PaaS (App Engine, Heroku, managed databases) hands you a platform — you push code and the provider runs the OS, runtime, scaling, and patching; you manage only your app and data. SaaS (Gmail, Salesforce) is finished software you just use; the provider manages essentially everything, you manage only your data and configuration.
The through-line is the shared responsibility line moving up as you go IaaS → PaaS → SaaS: you trade control and flexibility for less operational work. (Serverless/FaaS sits near PaaS — even the runtime instance is abstracted, scaling to zero.)
The senior framing: pick the highest level that still meets your control/customization needs, so you don't waste engineering effort managing layers a provider would handle for you.
What a strong answer covers
- IaaS (EC2): provider runs hardware/virtualization; you run OS, runtime, app — most control.
- PaaS (App Engine, managed DBs): push code; provider runs OS/runtime/scaling/patching.
- SaaS (Gmail, Salesforce): finished software; you manage only your data and config.
- The responsibility line moves up IaaS → PaaS → SaaS: less control, less ops burden.
- Pick the highest level that still meets your control needs to minimize wasted ops effort.
Quick self-check
On a managed PaaS, which layer are YOU still responsible for?
Follow-ups they push on
- Where does serverless / FaaS sit on this ladder?
- What do you give up moving from IaaS to PaaS?
- How does this map onto the shared responsibility model?
Red flag: Defaulting to IaaS and hand-managing OS/runtime/scaling when a PaaS would handle it; you pay in engineering time for control you don't actually need.
source: AWS, Types of cloud computing (IaaS/PaaS/SaaS)
Commonly asked · senior · concept · common
How do you control and reason about cloud cost? What's the difference between on-demand, reserved, and spot pricing?
Cloud's elasticity cuts both ways: pay-per-use is great until idle or oversized resources quietly bleed money. The compute pricing tiers trade flexibility for cost. On-demand is full price with no commitment, for spiky or unpredictable workloads. Reserved instances / savings plans commit to a 1- or 3-year term for a big discount, for steady, predictable baseline load. Spot uses spare capacity at up to ~90% off but can be reclaimed with little notice (a two-minute warning on AWS), for fault-tolerant, interruptible work (batch jobs, CI, stateless workers that can be killed and rescheduled).
The broader cost levers: right-size (most instances are over-provisioned), auto-scale so you pay for what you use and scale to zero where possible (serverless), watch egress/data-transfer (a sneaky cost), set lifecycle policies to tier cold data to cheaper storage, and tag resources so you can attribute spend. Set budgets and alerts so surprises page you, not finance.
Senior framing: match the pricing model to the workload's tolerance for interruption and predictability — steady baseline on reserved, bursts on on-demand, interruptible bulk on spot.
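
A quick calculation shows how matching the pricing model to the workload changes the bill. Every number below is a hypothetical placeholder (the hourly rate, both discounts, and the fleet sizes); actual prices vary by provider, region, and instance type, so only the shape of the comparison carries over.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours per year / 12


def monthly_cost(instance_count: int, on_demand_hourly_rate: float,
                 discount: float = 0.0, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running a fleet for some hours at a discount off the on-demand rate."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return instance_count * hours * on_demand_hourly_rate * (1.0 - discount)


# Hypothetical inputs, not real quotes.
RATE = 0.10               # on-demand price per instance-hour
RESERVED_DISCOUNT = 0.40  # assumed discount for a 1- or 3-year commitment
SPOT_DISCOUNT = 0.70      # assumed spot discount at the time of use

# Scenario: 10 steady baseline instances all month, plus 20 batch workers for 100 hours.
all_on_demand = monthly_cost(10, RATE) + monthly_cost(20, RATE, hours=100)
matched = (monthly_cost(10, RATE, RESERVED_DISCOUNT)
           + monthly_cost(20, RATE, SPOT_DISCOUNT, hours=100))

print(f"all on-demand: {all_on_demand:.2f}")  # 930.00
print(f"matched tiers: {matched:.2f}")        # 498.00
```

Under these assumed figures the same work costs roughly half as much once the steady baseline moves to reserved pricing and the interruptible batch work moves to spot.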
What a strong answer covers
- On-demand: full price, no commitment — spiky/unpredictable workloads.
- Reserved / savings plans: 1- or 3-year commitment for a big discount; suits steady baseline load.
- Spot: up to ~90% off spare capacity but reclaimable anytime — fault-tolerant, interruptible work.
- Levers: right-size, auto-scale/scale-to-zero, watch egress, tier cold data, tag for attribution.
- Set budgets + alerts so cost surprises page engineers early.
Follow-ups they push on
- What kind of workload is safe to run on spot instances, and what isn't?
- Why is data egress an easy cost to overlook?
- How does auto-scaling change your cost profile vs a fixed fleet?
Red flag: Running interruptible bulk work on full-price on-demand (or worse, putting a stateful production service on spot). The first can waste up to ~90% of the spend; the second gets reclaimed out from under you with little warning.
source: AWS, EC2 instance purchasing options (on-demand/reserved/spot)
Commonly asked · senior · concept · occasional
What does it mean for an architecture to be 'cloud-native', and why design for failure?
Cloud-native means building for the cloud's actual characteristics rather than lifting a fixed on-prem server into a VM. Core ideas: treat servers as cattle, not pets (instances are disposable and replaceable, not hand-tended); make services stateless so they scale horizontally and any instance can handle any request; externalize state to managed stores; automate provisioning with IaC; and design for failure — assume any instance, AZ, or dependency can die at any moment.
Why design for failure: at cloud scale, hardware *will* fail constantly — it's a statistical certainty, not an edge case. So you build in redundancy (multi-AZ), health checks and auto-replacement (a dead instance is terminated and a new one launched automatically), retries with backoff and circuit breakers for flaky dependencies, and graceful degradation. The famous expression of this is Netflix's Chaos Monkey, which kills production instances on purpose to prove the system survives.
Senior framing: the cloud doesn't give you reliability for free — it gives you the *primitives* (multiple AZs, auto-scaling, managed failover) and you must architect to use them.
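
As one concrete instance of designing for failure, here is a minimal sketch of retries with capped exponential backoff and jitter. The function name and defaults are assumptions for illustration, and treating only `ConnectionError` as retryable is a simplification; a production client would also distinguish retryable from permanent errors and usually sit behind a circuit breaker.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(); on ConnectionError, back off and retry a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The backoff ceiling doubles each attempt (base, 2x, 4x, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random amount up to the ceiling so many clients
            # retrying at once do not hit the dependency in lockstep.
            sleep(random.uniform(0, ceiling))


attempts = {"count": 0}


def flaky_dependency():
    """Hypothetical dependency that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky_dependency), attempts["count"])  # ok 3
```

The `sleep` parameter is injectable so the backoff can be tested without waiting; bounding the attempts matters as much as the backoff, since unbounded retries turn a brief outage into a self-inflicted overload.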
What a strong answer covers
- Cloud-native = build for the cloud's traits, not a lifted-and-shifted pet server.
- Cattle not pets: instances are disposable, replaced automatically, never hand-tended.
- Stateless services + externalized state enable horizontal scaling and easy replacement.
- Design for failure: at scale hardware *will* fail — redundancy, health checks, retries, circuit breakers.
- The cloud gives primitives (multi-AZ, auto-scale, failover); you must architect to use them.
Follow-ups they push on
- What does 'cattle not pets' mean for how you operate servers?
- Why is statelessness a prerequisite for treating instances as disposable?
- What is a circuit breaker protecting you from?
Red flag: Lifting an on-prem 'pet' server into a single cloud VM and calling it cloud-native. Without statelessness, redundancy, and automated replacement, you've just moved a single point of failure into someone else's data center.
source: AWS, Reliability pillar (Well-Architected Framework)
Commonly asked · senior · debug · occasional
An EC2 instance in a private subnet can't reach the internet to pull package updates. How do you diagnose and fix it?
A private subnet by definition has no route to an internet gateway, so instances there can't make outbound internet calls directly — that's the intended design, not a bug. The fix for *outbound-only* access is a NAT gateway: place it in a public subnet, and add a route in the private subnet's route table sending 0.0.0.0/0 to the NAT gateway. The NAT allows egress (and the return traffic for connections it initiated) but blocks unsolicited inbound — so the instance can pull updates while staying unreachable from the internet.
Work the diagnosis like a checklist down the path: (1) the private subnet's route table — is there a 0.0.0.0/0 → nat-... route? (2) the NAT gateway itself — is it in a *public* subnet that routes to an internet gateway? (3) security group outbound rules — egress allowed? (4) NACL — does the subnet's stateless ACL allow both the outbound request and the inbound return traffic? (5) DNS resolution working?
The senior tell: knowing that a NAT gateway (not an internet gateway) is the correct egress mechanism for private subnets, and checking the stateless NACL return-traffic rule that bites people.
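
The checklist can be expressed as a function over a simplified description of the network. The dictionary layout, field names, and resource IDs below are all hypothetical; in practice the same facts come from the provider's API or console, and this sketch only encodes the order of the checks.

```python
def diagnose_private_egress(vpc: dict, subnet_id: str) -> list[str]:
    """Walk the egress path for a private subnet and report what would block it."""
    problems = []
    subnet = vpc["subnets"][subnet_id]

    # (1) and (2): default route, and where the NAT gateway itself lives.
    target = subnet["routes"].get("0.0.0.0/0")
    if target is None:
        problems.append("no 0.0.0.0/0 route in the private subnet's route table")
    elif target.startswith("igw-"):
        problems.append("default route points at an internet gateway; the subnet is public")
    elif target not in vpc["nat_gateways"]:
        problems.append(f"default route targets unknown gateway {target}")
    else:
        nat_subnet = vpc["subnets"][vpc["nat_gateways"][target]["subnet_id"]]
        if not nat_subnet["routes"].get("0.0.0.0/0", "").startswith("igw-"):
            problems.append("NAT gateway's subnet has no route to an internet gateway")

    # (3) security group egress, (4) stateless NACL in both directions, (5) DNS.
    if not subnet["sg_allows_egress"]:
        problems.append("security group blocks outbound traffic")
    if not subnet["nacl_allows_outbound"]:
        problems.append("NACL blocks the outbound request")
    if not subnet["nacl_allows_return_inbound"]:
        problems.append("NACL blocks inbound return traffic")
    if not subnet["dns_resolves"]:
        problems.append("DNS resolution is failing")
    return problems


example_vpc = {
    "nat_gateways": {"nat-example": {"subnet_id": "subnet-public"}},
    "subnets": {
        "subnet-public": {
            "routes": {"0.0.0.0/0": "igw-example"},
            "sg_allows_egress": True, "nacl_allows_outbound": True,
            "nacl_allows_return_inbound": True, "dns_resolves": True,
        },
        "subnet-private": {
            "routes": {"0.0.0.0/0": "nat-example"},
            "sg_allows_egress": True, "nacl_allows_outbound": True,
            "nacl_allows_return_inbound": False, "dns_resolves": True,
        },
    },
}

print(diagnose_private_egress(example_vpc, "subnet-private"))
```

With the example data the only finding is the NACL return-traffic rule, the silent culprit called out above; routing and NAT placement are otherwise correct.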
What a strong answer covers
- Private subnet = no internet-gateway route by design; direct outbound fails as intended.
- Fix outbound-only access with a NAT gateway in a public subnet + a 0.0.0.0/0 → NAT private route.
- NAT allows egress + return traffic but blocks unsolicited inbound — instance stays private.
- Diagnose down the path: route table → NAT placement → SG egress → NACL (return traffic!) → DNS.
- Stateless NACLs must explicitly allow the inbound return traffic, a common silent culprit.
Follow-ups they push on
- Why a NAT gateway rather than an internet gateway for a private-subnet instance?
- Why must the NAT gateway itself live in a public subnet?
- Which stateless rule on a NACL commonly breaks return traffic?
Red flag: Attaching an internet gateway route to the private subnet 'to fix it'. That makes the subnet public and the instance internet-reachable, defeating the security design; the correct egress path is a NAT gateway.
source: AWS, NAT gateways