Cloud fundamentals

Compute/storage/networking primitives, regions & availability zones, and IAM/least privilege.

Underneath every cloud provider’s hundreds of services sit three primitives — compute, storage, and networking — plus two cross-cutting concepts that govern everything: the geography of regions and availability zones (for resilience and latency), and IAM (who can do what, ideally as little as necessary). Learn these and a new service is usually just a managed variation on something you already understand: a queue, a database, a function runner.

The three primitives

Almost every workload decomposes into these. Pick the compute model by how much operational control you need versus how much you are willing to manage; pick the storage shape by the access pattern.

Primitive | Options (AWS example) | Reach for it when | Watch out
Compute: VM | EC2 | Full OS control, long-running, legacy/stateful workloads | You patch, scale, and manage the OS yourself
Compute: container | ECS / EKS, Fargate | Portable services, microservices, want orchestration | Cluster/orchestration overhead (see K8s)
Compute: serverless | Lambda | Spiky/event-driven, want zero idle cost and no servers | Cold starts, time/size limits, vendor lock-in
Storage: object | S3 | Files, images, video, backups, static sites, data lakes | Eventually-consistent listings historically; not a filesystem
Storage: block | EBS | A boot/data disk for a single VM (databases on a VM) | Attaches to one instance/AZ at a time
Networking | VPC + subnets + SG | Isolating and connecting everything above | Mis-scoped security groups = exposed databases
Compute = where code runs; storage = how data is shaped; networking = how it all connects + who's reachable.
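
The "mis-scoped security groups = exposed databases" warning in the table can be made concrete with a small sketch. This is a simplified model, not a provider API: it treats a security group as a plain allow-list of (CIDR, port) rules, and the rule sets, addresses, and the `IngressRule` / `is_allowed` names are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    """One allow rule: traffic from `cidr` to `port` is permitted."""
    cidr: str
    port: int


def is_allowed(rules: list[IngressRule], source_ip: str, port: int) -> bool:
    """Allow-list semantics: traffic is dropped unless some rule matches."""
    return any(
        rule.port == port and ip_address(source_ip) in ip_network(rule.cidr)
        for rule in rules
    )


# Hypothetical rule sets for a database listening on port 5432.
too_open = [IngressRule(cidr="0.0.0.0/0", port=5432)]   # any source address
scoped = [IngressRule(cidr="10.0.1.0/24", port=5432)]   # app subnet only

internet_client = "203.0.113.7"  # documentation-range address, stands in for a stranger
app_server = "10.0.1.25"

print(is_allowed(too_open, internet_client, 5432))  # True: database is exposed
print(is_allowed(scoped, internet_client, 5432))    # False
print(is_allowed(scoped, app_server, 5432))         # True
```

The only difference between the two rule sets is the source CIDR, which is why a single over-broad rule is enough to expose a database.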

Regions and availability zones — designing for failure

The geography is an availability tool. Deploy across multiple AZs within a region and a single datacenter losing power doesn’t take your service down: the load balancer routes to the remaining healthy AZs. Deploy across multiple regions for disaster recovery and to put data near distant users (lower latency), at the cost of cross-region replication complexity and data-residency considerations.

[Figure: a large region box labeled us-east-1 contains three availability-zone boxes side by side (AZ-a, AZ-b, AZ-c), each with its own power/cooling and each holding an app instance; a load balancer above fans traffic to all three. AZ-c is down from a power outage, so the load balancer stops routing there.]
FIG 1 · region with three AZs. A region is a geography; the AZs inside it are independent datacenters on low-latency links. Spreading replicas across AZs means one datacenter losing power doesn't take the service down.
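
The routing behavior in the figure can be sketched in a few lines. This is a toy model under stated assumptions: a real load balancer runs periodic health checks and has more routing modes, while here health is a hand-written dictionary and `route_requests` is a hypothetical helper that round-robins over whatever is healthy.

```python
from itertools import cycle


def route_requests(az_health: dict[str, bool], n_requests: int) -> list[str]:
    """Round-robin n_requests across the AZs whose health check passes."""
    healthy = [az for az, ok in az_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy AZ: the service is unavailable")
    targets = cycle(healthy)
    return [next(targets) for _ in range(n_requests)]


# Mirrors the figure: az-c has lost power.
az_health = {"az-a": True, "az-b": True, "az-c": False}
print(route_requests(az_health, 4))  # ['az-a', 'az-b', 'az-a', 'az-b']
```

Requests keep flowing as long as at least one AZ is healthy, which is the whole point of spreading replicas.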

IAM and least privilege

IAM ties identities (users, roles, service accounts) to policies that allow or deny specific actions on specific resources. The governing principle is least privilege: grant the minimum permissions needed for the task, nothing more — so a leaked credential or a compromised service can do limited damage.

A least-privilege IAM policy

This policy lets a service read and write objects in one bucket and nothing else. No wildcard actions, no other buckets, no delete-bucket. The trailing /* in the resource only matches object keys inside that one bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadWriteAppAssets",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::myapp-assets-prod/*"
    }
  ]
}

Contrast with the anti-pattern "Action": "s3:*" on "Resource": "*" — that grants every S3 action on every bucket in the account. The scoped policy above means a compromise of this service exposes exactly one bucket’s objects, not your entire S3 footprint. Prefer assigning policies to roles that compute assumes (no long-lived keys to leak) over static access keys on a user.
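
To see the difference in blast radius, here is a deliberately simplified policy check. It is a sketch, not how IAM is implemented: it only handles Allow statements with wildcard matching, and ignores explicit Deny, conditions, and every other policy type. The `allows` helper and the `other-bucket` name are hypothetical.

```python
from fnmatch import fnmatchcase

scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ReadWriteAppAssets",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::myapp-assets-prod/*",
    }],
}

wildcard_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def allows(policy: dict, action: str, resource: str) -> bool:
    """True if any Allow statement matches both the action and the resource."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue  # simplification: explicit Deny is not modeled
        action_ok = any(fnmatchcase(action, p) for p in _as_list(stmt["Action"]))
        resource_ok = any(fnmatchcase(resource, p) for p in _as_list(stmt["Resource"]))
        if action_ok and resource_ok:
            return True
    return False  # nothing matched: implicit deny


asset = "arn:aws:s3:::myapp-assets-prod/logo.png"
print(allows(scoped_policy, "s3:GetObject", asset))                        # True
print(allows(scoped_policy, "s3:DeleteObject", asset))                     # False
print(allows(scoped_policy, "s3:GetObject", "arn:aws:s3:::other-bucket/x"))  # False
print(allows(wildcard_policy, "s3:DeleteBucket", "arn:aws:s3:::other-bucket"))  # True
```

The scoped policy refuses everything outside its two actions and one bucket; the wildcard policy says yes to whatever S3 request a stolen credential makes.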

02 Interview questions

What gets asked on this topic, with how to approach each question and the trap to avoid. Company tags are best-effort and sourced.

  • [Commonly asked · mid · concept · very common] What is the difference between a region and an availability zone, and how do you use them for high availability?

    A region is a geographic area (e.g. us-east-1). Inside each region are multiple availability zones (AZs) — physically separate data centers with independent power, cooling, and networking, connected by high-bandwidth, low-latency links (single-digit ms).

    For high availability, spread your workload across multiple AZs in a region: if one AZ loses power, the others keep serving, and a load balancer routes around the failed zone. That protects against a data-center-level failure with negligible latency cost. Going multi-region adds protection against a whole-region outage and lets you serve users closer to them, but it is far more complex (cross-region replication, data consistency, higher latency between regions). The pragmatic default is multi-AZ within one region; reach for multi-region when you genuinely need regional fault tolerance or global low latency.
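
A back-of-the-envelope calculation shows why multi-AZ is the pragmatic default. The sketch below rests on two assumptions that are not from any provider's documentation: each AZ is independently available 99.9% of the time (a made-up figure), and the service is up whenever at least one AZ is up. Real failures are partly correlated, so treat the results as an optimistic bound.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(per_az: float, n_azs: int) -> float:
    """Probability that at least one of n independent AZs is up."""
    return 1 - (1 - per_az) ** n_azs


for n in (1, 2, 3):
    availability = combined_availability(0.999, n)
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{n} AZ(s): availability {availability:.9f}, "
          f"about {downtime:.4f} minutes of downtime per year")
```

Under these assumptions, each extra AZ multiplies the expected downtime by the per-AZ failure probability, so the second AZ buys far more than the third.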

    Red flag: Confusing the two, or running everything in a single AZ and calling it 'in the cloud so it's highly available'; one AZ failure then takes the whole service down.

    source: AWS — Regions and Availability Zones ↗
  • [Commonly asked · senior · concept · very common] Explain the principle of least privilege in cloud IAM, with a concrete example.

    Least privilege means every identity (user, role, service) gets exactly the permissions it needs to do its job and nothing more. The smaller the granted permission set, the smaller the blast radius if those credentials leak or the service is compromised.

    Concrete example: a Lambda that only reads from one bucket should have a policy granting s3:GetObject scoped to that specific bucket's ARN — not s3:* on *. Wildcards like Action: * / Resource: * are the classic violation. In practice: prefer roles with temporary credentials over long-lived access keys, scope policies to specific actions and resource ARNs, start from deny and add only what is needed, and review/trim permissions over time. Pair it with separation of duties so no single role can both deploy and exfiltrate.

    Red flag: Granting broad wildcard policies (`s3:*` on `*`) 'to get it working' and never tightening them; one leaked key then has the run of every bucket in the account.

    source: AWS — IAM security best practices (least privilege) ↗
  • [Commonly asked · mid · concept · common] Walk me through the core cloud compute, storage, and networking primitives and when you'd reach for each.

    Compute: VMs (EC2-style — full control, you manage the OS), containers (ECS/EKS — packaged apps, orchestrated), and serverless functions (Lambda — event-driven, no servers to manage, scales to zero). Move up that ladder as you want less operational overhead and more elasticity.

    Storage: object storage (S3 — cheap, durable, infinite-scale blobs: images, backups, static assets), block storage (EBS — a virtual disk attached to one VM, for databases/filesystems), and file storage (EFS/NFS — a shared filesystem across many machines). Match the access pattern: blobs over HTTP -> object; a disk for one instance -> block; shared POSIX filesystem -> file.

    Networking: a VPC is your isolated private network; subnets segment it (public vs private); security groups are instance-level firewalls; and a load balancer spreads traffic across instances. The skill is mapping a workload to the cheapest primitive that fits its access and durability needs.

    Red flag: Reaching for a full VM you have to patch and babysit when a managed/serverless option fits, or running a database on object storage (wrong access pattern) instead of block storage.

    source: AWS — Types of cloud computing / core services ↗
  • [Commonly asked · mid · concept · common] What is the cloud shared responsibility model, and why does it matter?

    Security is split between the provider and you. The provider is responsible for security OF the cloud — the physical data centers, hardware, the hypervisor, and the managed-service infrastructure. You are responsible for security IN the cloud — your data, IAM users and permissions, network config (security groups, public/private subnets), OS patching on VMs you run, and application-level security.

    The line shifts with the service tier: with a raw VM you patch the OS; with a managed database the provider patches it but you still own access control and your data; with serverless even more moves to the provider, but IAM and data are always yours. It matters because most cloud breaches are customer-side misconfigurations — a public S3 bucket or an over-permissive IAM policy — not the provider being hacked.

    Red flag: Assuming 'the cloud provider handles security' end to end. IAM, data, and network configuration are always the customer's responsibility, and that is where most breaches actually happen.

    source: AWS — Shared Responsibility Model ↗
  • [Commonly asked · mid · concept · common] What is the difference between vertical and horizontal scaling in the cloud, and which does the cloud make easy?

    Vertical scaling (scale up) means giving one instance more resources — a bigger CPU/RAM tier. It is simple and needs no app changes, but you hit a hardware ceiling, usually need a restart/downtime to resize, and the single box is still a single point of failure.

    Horizontal scaling (scale out) means adding more instances behind a load balancer. It scales effectively without limit and improves availability (lose one node, the rest serve), which is exactly what cloud auto-scaling groups automate — add instances when load rises, remove them when it falls. The catch is the app must be stateless (or externalize session state to a shared store like Redis) so any instance can handle any request. The cloud's elasticity is built around horizontal scaling; that is why 'make services stateless' is such a load-bearing design rule.
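
The "add instances when load rises, remove them when it falls" behavior can be sketched as a proportional rule. This is an illustrative assumption, not a provider's actual scaling algorithm: the `desired_instances` function, the CPU figures, and the group bounds are all hypothetical.

```python
import math


def desired_instances(current: int, metric: float, target: float,
                      min_size: int, max_size: int) -> int:
    """Scale the fleet so the per-instance metric moves toward the target.

    If 4 instances run at 90% CPU and the target is 60%, the same load
    spread over 6 instances would sit at roughly 60%.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    raw = math.ceil(current * metric / target)
    # Clamp to the group's bounds so a metric spike cannot scale without limit.
    return max(min_size, min(max_size, raw))


print(desired_instances(4, 90.0, 60.0, min_size=2, max_size=10))   # 6
print(desired_instances(4, 15.0, 60.0, min_size=2, max_size=10))   # 2
print(desired_instances(4, 300.0, 60.0, min_size=2, max_size=10))  # 10
```

The rule only works if new instances can take traffic immediately, which is another way of stating the statelessness requirement.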

    Red flag: Trying to scale a stateful, session-on-the-box service horizontally. Requests landing on a different instance lose the session, so you are forced back into sticky sessions or a single big vertical box.

    source: AWS — Auto Scaling / scaling concepts ↗
  • [Commonly asked · senior · concept · occasional] Why might a company choose managed cloud services over self-hosting, and what are the tradeoffs?

    Managed services (RDS instead of running your own Postgres, EKS instead of bootstrapping Kubernetes) shift operational burden to the provider: patching, backups, failover, scaling, and HA come built in, so a small team ships faster and pages less. You trade money and some control for time and reliability.

    The tradeoffs: higher direct cost, less control over versions/tuning/internals, and vendor lock-in (managed offerings differ across clouds, raising switching cost). Self-hosting gives maximum control and can be cheaper at very large, steady scale, but you now own the on-call, the upgrades, and the failure modes. The senior answer weighs team size, scale, and how differentiating the capability is: do not burn your scarce engineers running undifferentiated infrastructure a managed service handles well.

    Red flag: Defaulting to self-hosting core infrastructure 'to save money' on a small team. The hidden cost is the engineering time and on-call burden of operating it, which usually dwarfs the managed-service bill.

    source: AWS — What is managed services / cloud value ↗
  • [Commonly asked · mid · concept · very common] What is the difference between authentication and authorization in cloud IAM, and how do roles fit in?

    Authentication answers 'who are you?' — proving identity (a user signing in, a service presenting credentials or a token). Authorization answers 'what are you allowed to do?' — evaluating policies to decide whether that proven identity may perform an action on a resource. Authn comes first; authz comes after. They're distinct: a correctly authenticated user can still be denied an action.

    In cloud IAM, policies are the authorization rules (allow/deny on actions + resources), attached to identities. An IAM role is an identity with policies but no permanent credentials — instead, a trusted principal (an EC2 instance, a Lambda, another account, a federated user) assumes the role and receives temporary, auto-rotating credentials. That's why roles are the best-practice way to grant permissions to services: no long-lived access keys to leak.

    So: authn = identity, authz = permissions (policies), and roles = a way to hand out scoped, temporary permissions to whoever/whatever assumes them.

    What a strong answer covers
    • Authentication = prove who you are; authorization = what you're allowed to do (policies).

    • Authn happens first; an authenticated identity can still be denied by authorization.

    • Policies encode authorization (allow/deny on actions + resources).

    • An IAM role has no permanent credentials — principals assume it for temporary ones.

    • Roles are best practice for services (EC2/Lambda): no long-lived keys to leak.
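
The two-step order (authenticate first, then authorize) can be sketched as two separate functions. Everything here is hypothetical and heavily simplified: the token table, the permission set, and the `report-service` identity are stand-ins, and a real system would verify signed, expiring credentials rather than look up a string.

```python
# Hypothetical credential and permission tables, for illustration only.
TOKENS = {"token-abc": "report-service"}
PERMISSIONS = {"report-service": {("s3:GetObject", "reports-bucket")}}


class AuthenticationError(Exception):
    """The caller could not prove who it is."""


class AuthorizationError(Exception):
    """The caller is known, but the action is not permitted."""


def authenticate(token: str) -> str:
    """Step 1: map a credential to an identity, or fail."""
    try:
        return TOKENS[token]
    except KeyError:
        raise AuthenticationError("unknown credential") from None


def authorize(identity: str, action: str, resource: str) -> None:
    """Step 2: check the proven identity against its permissions."""
    if (action, resource) not in PERMISSIONS.get(identity, set()):
        raise AuthorizationError(f"{identity} may not {action} on {resource}")


def handle(token: str, action: str, resource: str) -> str:
    identity = authenticate(token)          # who are you?
    authorize(identity, action, resource)   # what may you do?
    return f"{identity} performed {action} on {resource}"


print(handle("token-abc", "s3:GetObject", "reports-bucket"))
```

Calling `handle("token-abc", "s3:DeleteObject", "reports-bucket")` raises `AuthorizationError` even though authentication succeeded, which is the distinction the answer turns on.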

    Red flag: Conflating authentication with authorization. Proving identity (authn) does not grant any permission; access is still decided by the policies evaluated at the authorization step.

    source: AWS — IAM identities (roles) / how IAM works ↗
  • [Commonly asked · mid · concept · common] What is object storage (like S3), and why is it not a filesystem or a database?

    Object storage stores data as objects — a blob of bytes plus metadata and a unique key — in a flat namespace (a bucket), accessed over HTTP APIs (GET/PUT), not a mounted disk. It's built for massive scale, very high durability (S3 famously targets eleven 9s by replicating across devices/AZs), and cheap capacity. Ideal for images, video, backups, logs, static website assets, and data-lake files.

    Why it's not a filesystem: there are no real directories (the '/' in a key is cosmetic — it's a flat key space), you can't do partial in-place edits efficiently (you generally replace the whole object), and there's no POSIX file locking or low-latency random byte access like a block device. Why it's not a database: no transactions, no rich queries/joins, no secondary indexes — it's a key→blob store, not a query engine.

    The skill is matching the access pattern: whole-blob read/write over HTTP, write-once-read-many, durability over mutability → object storage. Mutable structured records you query → a database. A disk for an OS/DB → block storage.

    What a strong answer covers
    • Objects = blob + metadata + key in a flat bucket namespace, accessed via HTTP APIs.

    • Built for scale, extreme durability (S3 ~11 nines), and low cost — images, backups, logs, assets.

    • Not a filesystem: no real directories, no efficient partial edits, no POSIX locking/random access.

    • Not a database: no transactions, joins, or queries — it's key→blob.

    • Match access pattern: whole-blob, write-once-read-many → object storage.
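
A toy in-memory model makes the "flat key space" and "replace the whole object" points tangible. It is only a sketch of the data model: the `Bucket` class and its methods are hypothetical and are not any SDK's API, and real object stores add versioning, metadata, and much more.

```python
class Bucket:
    """Flat key -> bytes mapping; '/' in a key has no special meaning."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Writes replace the whole object; there is no in-place partial edit.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Directories" are just a string-prefix filter over the flat keys.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = Bucket()
bucket.put("photos/2024/cat.jpg", b"v1")
bucket.put("photos/2024/dog.jpg", b"v1")
bucket.put("photos-archive.zip", b"v1")

print(bucket.list_keys("photos/"))  # the two keys under the "photos/" prefix
print(bucket.list_keys("photos"))   # all three keys: the prefix is plain text

bucket.put("photos/2024/cat.jpg", b"v2")  # changing one byte means re-uploading
print(bucket.get("photos/2024/cat.jpg"))  # b'v2'
```

Note that `photos-archive.zip` matches the prefix `photos` but not `photos/`, which shows the slash is just another character in the key.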

    Red flag: Using object storage as a database or a mutable filesystem. There are no transactions/queries and no efficient in-place edits, so a workload needing those will be slow, awkward, or incorrect.

    source: AWS — What is object storage? (S3) ↗
  • [Commonly asked · mid · concept · common] Compare the IaaS, PaaS, and SaaS service models. Who manages what at each level?

    It's a ladder of how much the provider manages vs you. IaaS (raw VMs, networking, storage — EC2) gives you the infrastructure; you still manage the OS, runtime, and app. Most control, most operational burden. PaaS (App Engine, Heroku, managed databases) hands you a platform — you push code and the provider runs the OS, runtime, scaling, and patching; you manage only your app and data. SaaS (Gmail, Salesforce) is finished software you just use; the provider manages essentially everything, you manage only your data and configuration.

    The through-line is the shared responsibility line moving up as you go IaaS → PaaS → SaaS: you trade control and flexibility for less operational work. (Serverless/FaaS sits near PaaS — even the runtime instance is abstracted, scaling to zero.)

    The senior framing: pick the highest level that still meets your control/customization needs, so you don't waste engineering effort managing layers a provider would handle for you.

    What a strong answer covers
    • IaaS (EC2): provider runs hardware/virtualization; you run OS, runtime, app — most control.

    • PaaS (App Engine, managed DBs): push code; provider runs OS/runtime/scaling/patching.

    • SaaS (Gmail, Salesforce): finished software; you manage only your data and config.

    • The responsibility line moves up IaaS → PaaS → SaaS: less control, less ops burden.

    • Pick the highest level that still meets your control needs to minimize wasted ops effort.
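
The "responsibility line moves up" idea can be written down as a small lookup. The layer names and the cut-off points below are a simplification made for this sketch (real offerings blur the lines, and SaaS customers also own their configuration); `customer_layers` is a hypothetical helper.

```python
# Stack layers from bottom to top (simplified for illustration).
LAYERS = ["hardware", "virtualization", "os", "runtime", "application", "data"]

# The lowest layer the customer still manages under each model.
CUSTOMER_STARTS_AT = {"IaaS": "os", "PaaS": "application", "SaaS": "data"}


def customer_layers(model: str) -> list[str]:
    """Layers the customer manages; everything below is the provider's."""
    start = LAYERS.index(CUSTOMER_STARTS_AT[model])
    return LAYERS[start:]


for model in ("IaaS", "PaaS", "SaaS"):
    print(model, "->", customer_layers(model))
```

Whatever the model, the list never becomes empty: data stays with the customer at every level.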

    Red flag: Defaulting to IaaS and hand-managing OS/runtime/scaling when a PaaS would handle it; you pay in engineering time for control you don't actually need.

    source: AWS — Types of cloud computing (IaaS/PaaS/SaaS) ↗
  • [Commonly asked · senior · concept · common] How do you control and reason about cloud cost? What's the difference between on-demand, reserved, and spot pricing?

    Cloud's elasticity cuts both ways: pay-per-use is great until idle or oversized resources quietly bleed money. The compute pricing tiers trade flexibility for cost. On-demand is full price with no commitment, for spiky or unpredictable workloads. Reserved instances / savings plans commit to a 1- or 3-year term for a big discount, for steady, predictable baseline load. Spot uses spare capacity at up to ~90% off but can be reclaimed with little notice (about two minutes on AWS), for fault-tolerant, interruptible work (batch jobs, CI, stateless workers that can be killed and rescheduled).

    The broader cost levers: right-size (most instances are over-provisioned), auto-scale so you pay for what you use and scale to zero where possible (serverless), watch egress/data-transfer (a sneaky cost), set lifecycle policies to tier cold data to cheaper storage, and tag resources so you can attribute spend. Set budgets and alerts so surprises page you, not finance.

    Senior framing: match the pricing model to the workload's tolerance for interruption and predictability — steady baseline on reserved, bursts on on-demand, interruptible bulk on spot.

    What a strong answer covers
    • On-demand: full price, no commitment — spiky/unpredictable workloads.

    • Reserved / savings plans: 1- or 3-year commitment for a big discount, suited to steady baseline load.

    • Spot: up to ~90% off spare capacity but reclaimable at short notice, suited to fault-tolerant, interruptible work.

    • Levers: right-size, auto-scale/scale-to-zero, watch egress, tier cold data, tag for attribution.

    • Set budgets + alerts so cost surprises page engineers early.
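
A quick worked comparison shows what matching pricing model to workload is worth. Every number below is invented for the sketch: the hourly rate, the discount percentages, and the three workloads are assumptions, not real price-list figures.

```python
HOURS_PER_MONTH = 730
ON_DEMAND_RATE = 0.10  # hypothetical dollars per instance-hour

# Hypothetical discounts relative to on-demand.
DISCOUNT = {"on_demand": 0.0, "reserved": 0.40, "spot": 0.70}


def monthly_cost(instances: int, hours: float, pricing: str) -> float:
    """Cost of running `instances` for `hours` each under one pricing model."""
    return instances * hours * ON_DEMAND_RATE * (1 - DISCOUNT[pricing])


# Three hypothetical workloads: steady baseline, daytime bursts, nightly batch.
all_on_demand = (
    monthly_cost(4, HOURS_PER_MONTH, "on_demand")
    + monthly_cost(6, 100, "on_demand")
    + monthly_cost(10, 50, "on_demand")
)
matched = (
    monthly_cost(4, HOURS_PER_MONTH, "reserved")   # steady baseline
    + monthly_cost(6, 100, "on_demand")            # unpredictable bursts
    + monthly_cost(10, 50, "spot")                 # interruptible batch
)

print(f"all on-demand: ${all_on_demand:.2f}")  # all on-demand: $402.00
print(f"matched:       ${matched:.2f}")        # matched:       $250.20
```

With these made-up figures the same work costs well over a third less, and most of the saving comes from the always-on baseline rather than from spot.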

    Red flag: Running interruptible bulk work on full-price on-demand (or worse, putting a stateful production service on spot). The first can overpay by up to ~90%; the second gets reclaimed out from under you with little warning.

    source: AWS — EC2 instance purchasing options (on-demand/reserved/spot) ↗
  • [Commonly asked · senior · concept · occasional] What does it mean for an architecture to be 'cloud-native', and why design for failure?

    Cloud-native means building for the cloud's actual characteristics rather than lifting a fixed on-prem server into a VM. Core ideas: treat servers as cattle, not pets (instances are disposable and replaceable, not hand-tended); make services stateless so they scale horizontally and any instance can handle any request; externalize state to managed stores; automate provisioning with IaC; and design for failure — assume any instance, AZ, or dependency can die at any moment.

    Why design for failure: at cloud scale, hardware *will* fail constantly — it's a statistical certainty, not an edge case. So you build in redundancy (multi-AZ), health checks and auto-replacement (a dead instance is terminated and a new one launched automatically), retries with backoff and circuit breakers for flaky dependencies, and graceful degradation. The famous expression of this is Netflix's Chaos Monkey, which kills production instances on purpose to prove the system survives.

    Senior framing: the cloud doesn't give you reliability for free — it gives you the *primitives* (multiple AZs, auto-scaling, managed failover) and you must architect to use them.

    What a strong answer covers
    • Cloud-native = build for the cloud's traits, not a lifted-and-shifted pet server.

    • Cattle not pets: instances are disposable, replaced automatically, never hand-tended.

    • Stateless services + externalized state enable horizontal scaling and easy replacement.

    • Design for failure: at scale hardware *will* fail — redundancy, health checks, retries, circuit breakers.

    • The cloud gives primitives (multi-AZ, auto-scale, failover); you must architect to use them.
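
"Retries with backoff" from the list above can be sketched as follows. This is one common variant (exponential backoff with full jitter), shown under assumptions: only `ConnectionError` is treated as retryable, the delay parameters are arbitrary, and `retry_with_backoff` and `flaky` are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 0.1, max_delay: float = 5.0,
                       sleep: Callable[[float], object] = time.sleep) -> T:
    """Call operation, retrying on ConnectionError with capped, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The cap doubles each attempt; a random delay below it spreads
            # out clients so they do not all retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


# A dependency that fails twice, then recovers.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("dependency unavailable")
    return "ok"


delays: list[float] = []
result = retry_with_backoff(flaky, sleep=delays.append)  # record instead of sleeping
print(result, calls["n"], len(delays))  # ok 3 2
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in production code the default `time.sleep` applies.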

    Red flag: Lifting an on-prem 'pet' server into a single cloud VM and calling it cloud-native. Without statelessness, redundancy, and automated replacement, you've just moved a single point of failure into someone else's data center.

    source: AWS — Reliability pillar (Well-Architected Framework) ↗
  • [Commonly asked · senior · debug · occasional] An EC2 instance in a private subnet can't reach the internet to pull package updates. How do you diagnose and fix it?

    A private subnet by definition has no route to an internet gateway, so instances there can't make outbound internet calls directly — that's the intended design, not a bug. The fix for *outbound-only* access is a NAT gateway: place it in a public subnet, and add a route in the private subnet's route table sending 0.0.0.0/0 to the NAT gateway. The NAT allows egress (and the return traffic for connections it initiated) but blocks unsolicited inbound — so the instance can pull updates while staying unreachable from the internet.

    Work the diagnosis like a checklist down the path: (1) the private subnet's route table — is there a 0.0.0.0/0 → nat-... route? (2) the NAT gateway itself — is it in a *public* subnet that routes to an internet gateway? (3) security group outbound rules — egress allowed? (4) NACL — does the subnet's stateless ACL allow both the outbound request and the inbound return traffic? (5) DNS resolution working?

    The senior tell: knowing that a NAT gateway (not an internet gateway) is the correct egress mechanism for private subnets, and checking the stateless NACL return-traffic rule that bites people.

    What a strong answer covers
    • Private subnet = no internet-gateway route by design; direct outbound fails as intended.

    • Fix outbound-only access with a NAT gateway in a public subnet + a 0.0.0.0/0 → NAT private route.

    • NAT allows egress + return traffic but blocks unsolicited inbound — instance stays private.

    • Diagnose down the path: route table → NAT placement → SG egress → NACL (return traffic!) → DNS.

    • Stateless NACLs must explicitly allow the inbound return traffic, a common silent culprit.
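
The first step of that checklist, reading the route table, can be modeled with longest-prefix matching. This is a sketch of the routing decision only: the CIDRs, the destination address, and the `"local"` / `"nat-gateway"` target labels are hypothetical, and security groups, NACLs, and DNS are not modeled.

```python
from ipaddress import ip_address, ip_network
from typing import Optional


def next_hop(route_table: dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific matching route, or None."""
    dest = ip_address(destination)
    matches = [
        (ip_network(cidr), target)
        for cidr, target in route_table.items()
        if dest in ip_network(cidr)
    ]
    if not matches:
        return None  # no route: the packet is dropped
    # Longest prefix wins, so 10.0.0.0/16 beats 0.0.0.0/0 for in-VPC traffic.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


private_routes = {"10.0.0.0/16": "local"}  # private subnet: VPC-local route only
package_mirror = "198.51.100.10"           # documentation-range address

print(next_hop(private_routes, package_mirror))  # None

private_routes["0.0.0.0/0"] = "nat-gateway"      # the fix: default route to the NAT
print(next_hop(private_routes, package_mirror))  # nat-gateway
print(next_hop(private_routes, "10.0.2.15"))     # local
```

Before the fix, the lookup returns nothing for any internet address, which is exactly the symptom in the question; after it, in-VPC traffic still takes the more specific local route.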

    Red flag: Adding an internet gateway route to the private subnet 'to fix it'. That turns it into a public subnet, so any instance with a public IP becomes internet-reachable, defeating the security design. The correct egress path is a NAT gateway.

    source: AWS — NAT gateways ↗