AWS Roadmap
Gain practical AWS experience to confidently discuss cloud architecture, deploy Spring Boot applications, and handle AWS-focused interview questions — without relying on memorized answers.
Work through the phases at your own pace. The PDF suggests 8–10 weeks, but move faster where you already have context.
EC2
IAM
VPC
RDS
S3
CloudWatch
Docker
ECR
ECS
EKS
Lambda
API Gateway
DynamoDB
CloudFormation
Terraform
1
AWS Foundations
Before You Start — Prerequisites
This roadmap assumes the following. If any are missing, cover them first — they are assumed throughout every phase.
- Linux command line basics: navigating directories (cd, ls), reading/writing files, running commands with sudo. You will SSH into EC2 instances from Phase 2 onward.
- Java 17 + Maven or Gradle: you should be able to build a runnable Spring Boot JAR locally before Phase 2.
- Git: committing and pushing to a remote repository. Required for the CI/CD pipeline in the capstone.
- Basic networking concepts: what an IP address and port number are, what TCP/IP means. You do not need to be a network engineer — Phase 1 teaches the AWS-specific networking layer on top of this.
- A brand-new AWS account: use a dedicated email address. Do not use a corporate or shared account — you need root access for initial setup and need free tier eligibility.
Architecture this phase
VPC → Public Subnet (has route to Internet Gateway) · VPC → Private Subnet (no internet route, outbound via NAT Gateway)
Draw this yourself for better retention: draw.io · Official AWS Icons
Topics
AWS Global Infrastructure
Regions
What
A Region is a geographic area that contains multiple Availability Zones. Examples: us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore). AWS has 33+ regions as of 2024.
Why
You pick a Region based on three things: data residency (legal requirements to keep data in a country), latency (pick closest to your users), and service availability (not all services launch in all regions simultaneously).
Gotcha
- Resources are region-scoped. An EC2 instance in us-east-1 has nothing to do with one in eu-west-1 — they're completely separate.
- us-east-1 (N. Virginia) is where AWS launches new services first. Good for learning, but for production choose a region close to your actual users.
Availability Zones (AZs)
What
An AZ is one or more physically separate data centres within a Region, with independent power, cooling, and network connectivity. The AZs in a Region are connected to each other via low-latency links. Most Regions have three or more AZs; AWS documents a minimum of three per Region.
Why
Deploy your application across 2+ AZs so a data centre failure (fire, power outage, network issue) does not take down your service. This is the foundational pattern for high availability in AWS.
Gotcha
- AZ names are randomly mapped per AWS account. Your us-east-1a may point to a different physical data centre than someone else's us-east-1a. This is intentional to spread load.
- Each subnet lives in exactly one AZ. To span AZs, create one subnet per AZ.
Edge Locations
What
Edge locations are AWS Points of Presence (PoPs): servers positioned close to end users worldwide. There are 400+ edge locations globally, far more than the number of regions. They are used mainly by CloudFront (CDN) and Route 53 (DNS); a few other services, such as AWS Global Accelerator, also use them.
Why
When a user in Tokyo requests content from your S3 bucket in us-east-1, CloudFront serves it from the nearest Tokyo edge location instead — dramatically reducing latency.
Gotcha
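- Edge locations are not for general-purpose compute. You cannot run EC2 instances or containers here. They mainly cache CloudFront content and answer Route 53 DNS queries (CloudFront can also run small request-handling functions at the edge, but not your application).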
Networking
VPC (Virtual Private Cloud)
What
A VPC is your own isolated virtual network within AWS. You define an IP address range using CIDR notation (e.g., 10.0.0.0/16 gives you 65,536 IP addresses), then carve it into subnets, configure routing, and attach gateways. A VPC is regional: it spans all AZs in a region.
Why
Without a VPC, your resources would share a flat network with no isolation. VPC is the foundation of all AWS network security — everything runs inside one.
Gotcha
- Every AWS account has a default VPC (172.31.0.0/16) in every region. Avoid deleting it casually. If you do delete it, you can recreate it from the VPC console or with aws ec2 create-default-vpc; no support ticket is needed.
- For production and learning, always create a custom VPC. The default VPC has public subnets by default, which is not ideal for production.
- Two VPCs that need to communicate via VPC Peering cannot have overlapping CIDR ranges. Plan CIDR blocks upfront.
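The CIDR arithmetic above can be checked with Python's standard ipaddress module. This is a minimal sketch: the 10.0.0.0/16 range comes from the text, while the /24 subnet size and the second VPC range are assumptions chosen for illustration. AWS reserves five addresses in every subnet, which the sketch subtracts.

```python
import ipaddress

# VPC range from the text above.
vpc = ipaddress.ip_network("10.0.0.0/16")
vpc_size = vpc.num_addresses  # 2 ** (32 - 16)

# Assumption: carve the VPC into /24 subnets and keep the first four,
# e.g. one public and one private subnet in each of two AZs.
subnets = list(vpc.subnets(new_prefix=24))[:4]

# AWS reserves 5 addresses per subnet (network, router, DNS, future use, broadcast).
usable_per_subnet = subnets[0].num_addresses - 5


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks do not overlap (a peering precondition)."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


if __name__ == "__main__":
    print(vpc_size)
    print([str(s) for s in subnets])
    print(usable_per_subnet)
    print(can_peer("10.0.0.0/16", "10.1.0.0/16"))
    print(can_peer("10.0.0.0/16", "10.0.128.0/17"))
```

Running the overlap check before creating a second VPC is cheaper than discovering the clash when you try to peer them.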
Public Subnet
What
A subnet whose associated route table has a route directing internet-bound traffic (0.0.0.0/0) to an Internet Gateway. Resources in this subnet with a public IP can send and receive internet traffic.
Why
Internet-facing resources belong here: Application Load Balancers, NAT Gateways, and bastion hosts. You generally do not put application servers or databases here.
Gotcha
- Being in a public subnet does not automatically expose a resource. The resource also needs a public IP address, and the Security Group must allow the traffic.
Private Subnet
What
A subnet with no direct route to the internet. There is no 0.0.0.0/0 → IGW route in its route table. Resources here cannot receive inbound connections from the internet. Outbound internet access (e.g., for OS updates) goes through a NAT Gateway that sits in a public subnet.
Why
Defense in depth. Your RDS database or internal service belongs here. Even if a Security Group rule is accidentally misconfigured, there is no network path from the internet to the resource.
Gotcha
- Private subnet resources can still initiate outbound connections (to download packages, call external APIs) via NAT Gateway. NAT is outbound-only — it does not allow inbound connections from the internet.
Route Tables
What
A route table is a set of rules that tells the VPC where to send network traffic. Every subnet is associated with exactly one route table. A route of 0.0.0.0/0 → igw-xxx means: "send all internet-bound traffic to the Internet Gateway."
Why
The route table is what makes a subnet "public" or "private." Adding the IGW route is the single configuration change that gives a subnet internet access.
Gotcha
- The local route (e.g., 10.0.0.0/16 → local) is automatically added and cannot be removed. This allows all resources within the VPC to communicate with each other.
- Multiple subnets can share one route table, but a subnet can only be associated with one route table at a time.
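To make the "public vs private is decided by the route table" idea concrete, here is a small simulation of how a route is chosen. It is a sketch, not AWS code: route tables are modelled as plain dictionaries, pick_route is a hypothetical helper, and the targets local and igw-xxx are the placeholders used in the text. VPC routing picks the most specific (longest-prefix) matching route.

```python
import ipaddress
from typing import Dict, Optional


def pick_route(route_table: Dict[str, str], destination_ip: str) -> Optional[str]:
    """Return the target of the most specific route that matches destination_ip."""
    ip = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no matching route: the traffic has nowhere to go
    # Longest prefix wins, so 10.0.0.0/16 beats 0.0.0.0/0 for in-VPC addresses.
    return max(matches)[1]


# Illustrative route tables for a public and a private subnet.
public_rt = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-xxx"}
private_rt = {"10.0.0.0/16": "local"}

if __name__ == "__main__":
    print(pick_route(public_rt, "10.0.5.20"))
    print(pick_route(public_rt, "8.8.8.8"))
    print(pick_route(private_rt, "8.8.8.8"))
```

The only difference between the two tables is the 0.0.0.0/0 entry, which is exactly the single configuration change described above.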
Internet Gateway (IGW)
What
A horizontally-scaled, redundant, highly-available gateway that enables communication between your VPC and the internet. You attach it to the VPC itself (not to a subnet). Then you reference it in a subnet's route table to make that subnet public.
Why
Without an IGW, your VPC is completely isolated. Nothing can reach the internet, and the internet cannot reach anything in the VPC.
Gotcha
- One VPC can only have one IGW attached at a time.
- The IGW itself has no cost — charges come from data transfer through it.
NAT Gateway
What
A managed AWS service that allows resources in private subnets to initiate outbound internet connections (OS patches, external API calls) while blocking unsolicited inbound connections. The NAT Gateway itself lives in a public subnet and routes through the IGW. You put a route of 0.0.0.0/0 → nat-gw-xxx in the private subnet's route table.
Why
Your EC2 in a private subnet needs to run yum update or call a third-party API. NAT Gateway enables this without exposing the instance to inbound internet traffic.
Gotcha
- NAT Gateway costs money: ~$0.045/hr per gateway plus data processing charges. For a learning account, delete it when not in use to avoid surprise bills.
- For high availability, deploy one NAT Gateway per AZ and have each AZ's private subnets route to their local NAT Gateway. Routing all AZs through one NAT Gateway means an AZ failure takes down all outbound internet access.
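A rough estimate shows why an idle NAT Gateway is worth deleting. The hourly rate is the ~$0.045 quoted above; the per-GB data processing rate and the 730-hour month are assumptions for illustration, so check the current pricing page for your region before relying on the numbers.

```python
HOURLY_RATE_USD = 0.045       # from the text above; varies by region
DATA_RATE_USD_PER_GB = 0.045  # assumption for illustration; check current pricing
HOURS_PER_MONTH = 730         # assumption: average number of hours in a month


def nat_monthly_cost(gateways: int, gb_processed: float) -> float:
    """Estimate the monthly cost in USD for a number of NAT Gateways."""
    hourly_part = gateways * HOURLY_RATE_USD * HOURS_PER_MONTH
    data_part = gb_processed * DATA_RATE_USD_PER_GB
    return round(hourly_part + data_part, 2)


if __name__ == "__main__":
    # One gateway left running all month with no traffic at all.
    print(nat_monthly_cost(gateways=1, gb_processed=0))
    # One gateway per AZ across two AZs, processing 100 GB.
    print(nat_monthly_cost(gateways=2, gb_processed=100))
```

Under these assumptions a single forgotten gateway costs roughly $33 a month before any traffic, and the per-AZ high-availability layout doubles the fixed part.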
Security (IAM)
IAM User
What
An IAM User represents a person or application that needs long-term AWS credentials. There are two credential types: a username + password for the AWS Console, and an Access Key ID + Secret Access Key for CLI/API access. The root account (your signup email) is not an IAM User — it has unrestricted access to everything.
Why
You should never use the root account for daily work. Create an IAM User with only the permissions needed. If credentials are compromised, you can disable the user without losing the account.
Gotcha
- Access Keys are long-term credentials. If they are leaked (e.g., committed to GitHub), rotate them immediately and assume they were used maliciously.
- Never embed Access Keys in source code or Docker images. Use IAM Roles for EC2/Lambda and environment variables or a secrets manager for external systems.
IAM Group
What
A collection of IAM Users. You attach permission policies to the group once, and all users in the group inherit those permissions. Groups cannot contain other groups — they are flat.
Why
Managing permissions at the group level avoids attaching the same policies to dozens of individual users. When a new developer joins, add them to the "Developers" group and they get all required permissions immediately.
Gotcha
- A user can belong to multiple groups. Their effective permissions are the union of all policies from all groups they belong to, plus any policies attached directly to the user.
IAM Role
What
An IAM Role provides temporary credentials (via AWS STS) to whoever assumes it: AWS services (EC2, Lambda, ECS tasks), other AWS accounts, or federated users. A role has two kinds of policy: a Trust Policy (who is allowed to assume this role) and one or more Permission Policies (what the role can do). EC2 instances use a wrapper called an Instance Profile to assume a role.
Why
Roles are the correct way for AWS services to access other AWS services. An EC2 instance running your Spring Boot app should have an IAM Role with S3 read permission — no embedded access keys required. The AWS SDK automatically picks up the temporary credentials from the EC2 instance metadata.
Gotcha
- Temporary credentials from a role automatically expire (15 minutes to 12 hours) and are automatically rotated. A leaked set of temporary credentials has a built-in time limit — far safer than long-term Access Keys.
- If you SSH into an EC2 and run aws s3 ls without configuring credentials, it works because the Instance Profile role is used automatically. This is the intended behaviour.
IAM Policies
What
JSON documents that define what is allowed or denied. Each policy contains one or more Statements. A Statement has: Effect (Allow or Deny), Action (e.g., s3:GetObject), and Resource (e.g., arn:aws:s3:::my-bucket/*). Policies are attached to Users, Groups, or Roles.
Why
Policies enforce principle of least privilege — give identities only the minimum permissions they need to do their job, nothing more.
Gotcha
- An explicit Deny always overrides an Allow, regardless of which policy it comes from. If any policy denies an action, that action is denied even if another policy allows it.
- AWS Managed Policies (like AmazonS3ReadOnlyAccess) are maintained by AWS. Prefer these for standard roles. Avoid using AdministratorAccess for anything other than your admin user.
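The "explicit Deny always wins" rule is easier to remember once you see it as an evaluation loop. The following is a deliberately simplified sketch, not the real IAM engine: is_allowed is a hypothetical helper, each statement carries a single Action and Resource string (real policies also accept lists and Conditions), and wildcard matching is approximated with Python's fnmatchcase. The bucket name my-bucket comes from the text; the secret/ prefix is an assumption.

```python
from fnmatch import fnmatchcase


def is_allowed(statements, action, resource):
    """Simplified evaluation: explicit Deny > explicit Allow > implicit deny."""
    allowed = False
    for stmt in statements:
        matches = fnmatchcase(action, stmt["Action"]) and fnmatchcase(
            resource, stmt["Resource"]
        )
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return False  # an explicit Deny overrides any Allow
        allowed = True
    return allowed  # stays False when nothing matched (implicit deny)


# Statements gathered from every policy attached to the identity.
statements = [
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::my-bucket/secret/*"},
]

if __name__ == "__main__":
    print(is_allowed(statements, "s3:GetObject", "arn:aws:s3:::my-bucket/report.csv"))
    print(is_allowed(statements, "s3:GetObject", "arn:aws:s3:::my-bucket/secret/key.txt"))
    print(is_allowed(statements, "s3:PutObject", "arn:aws:s3:::my-bucket/report.csv"))
```

The third call is denied without any Deny statement: anything not explicitly allowed is refused by default, which is what makes least privilege workable.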
Security Groups
What
Virtual firewalls at the resource level (EC2 instances, RDS databases, ECS tasks, Lambda in VPC). Rules are stateful: if you allow inbound TCP port 8080, the response traffic is automatically allowed without a separate outbound rule. You can only add Allow rules — there is no explicit Deny at the Security Group level.
Why
Security Groups are your primary tool for controlling which traffic can reach which resource. Every resource in a VPC has at least one Security Group.
Gotcha
- Security Groups can reference other Security Groups as the traffic source. Example: allow port 5432 from "App-SG" — any EC2 in that Security Group can connect to RDS, regardless of its IP address. This is better than using IP-based rules.
- Security Group changes take effect immediately. No instance restart needed.
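A Security Group can be thought of as a list of allow rules whose source is either a CIDR block or another Security Group. The sketch below models only that idea; sg_allows is a hypothetical helper, and sg-app is a made-up ID standing in for the "App-SG" group from the example above.

```python
import ipaddress


def sg_allows(rules, port, source_ip, source_sgs=()):
    """Return True if any allow rule matches. There are no deny rules to check."""
    for rule in rules:
        if rule["port"] != port:
            continue
        source = rule["source"]
        if source.startswith("sg-"):
            # Group reference: match on membership, not on IP address.
            if source in source_sgs:
                return True
        elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(source):
            return True
    return False  # nothing matched, so the traffic is dropped


# Database group: PostgreSQL port open only to members of the app group.
db_sg = [{"port": 5432, "source": "sg-app"}]

if __name__ == "__main__":
    print(sg_allows(db_sg, 5432, "10.0.1.15", source_sgs=["sg-app"]))
    print(sg_allows(db_sg, 5432, "203.0.113.9"))
    print(sg_allows(db_sg, 22, "10.0.1.15", source_sgs=["sg-app"]))
```

Because the rule names a group rather than an address, app instances can be replaced or scaled out without touching the database rule.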
NACLs (Network Access Control Lists)
What
Stateless firewalls at the subnet level. Unlike Security Groups, NACLs: are stateless (you must explicitly allow both inbound AND return outbound traffic), support both Allow and Deny rules, and evaluate rules in numeric order (lowest number first, first match wins).
Why
NACLs are a second firewall layer at the subnet boundary. Useful for blanket-blocking a specific IP range across all resources in a subnet — something you cannot do with Security Groups (which only allow, never deny).
Gotcha
- Stateless means: if you allow inbound port 80, you must also allow outbound ephemeral ports 1024–65535 so the HTTP response can leave the subnet. Forgetting this is a classic misconfiguration.
- The default NACL allows all traffic in both directions. If you create a custom NACL, it denies everything by default — you must add explicit allow rules.
- In practice, most teams keep the default NACLs and rely on Security Groups. Know the difference for interviews but don't over-engineer NACLs in practice.
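The two NACL behaviours that trip people up, ordered evaluation and statelessness, can be sketched in a few lines. This is an illustrative model only: nacl_decision is a hypothetical helper, the rule numbers and IP ranges are made up, and protocol matching is left out.

```python
import ipaddress


def nacl_decision(rules, remote_ip, port):
    """Evaluate rules in ascending rule-number order; the first match wins."""
    ip = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if ip in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]
    return "deny"  # the implicit catch-all rule at the end of every NACL


# Inbound: block one range outright, then allow HTTP from everywhere else.
inbound = [
    {"number": 100, "cidr": "198.51.100.0/24", "ports": (0, 65535), "action": "deny"},
    {"number": 200, "cidr": "0.0.0.0/0", "ports": (80, 80), "action": "allow"},
]
# Outbound: stateless, so responses on ephemeral ports need their own rule.
outbound = [
    {"number": 100, "cidr": "0.0.0.0/0", "ports": (1024, 65535), "action": "allow"},
]

if __name__ == "__main__":
    print(nacl_decision(inbound, "203.0.113.7", 80))
    print(nacl_decision(inbound, "198.51.100.20", 80))
    print(nacl_decision(outbound, "203.0.113.7", 50000))
    print(nacl_decision([], "203.0.113.7", 50000))
```

The last call shows the classic misconfiguration: with no outbound rule, the HTTP response is denied even though the inbound request was allowed.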
Hands-on Tasks
Interview Q&A
What is a VPC and why does every AWS deployment need one?
A VPC (Virtual Private Cloud) is your own isolated virtual network within AWS. You define the IP address range using CIDR notation (e.g., 10.0.0.0/16), create subnets within that range, configure route tables to control traffic flow, and attach gateways to connect to the internet or other networks. Every resource you launch in AWS (EC2, RDS, Lambda in a VPC) runs inside a VPC. It is the networking foundation that provides isolation, routing control, and the ability to define security boundaries.
Why use private subnets? What does it actually protect against?
Private subnets have no direct route to the internet: there is no 0.0.0.0/0 → IGW entry in their route table. A database or internal service placed in a private subnet is network-unreachable from the internet, even if you accidentally open a Security Group rule. This is defense in depth: you are not relying on a single misconfigurable firewall rule to protect sensitive resources. The internet has no network path to the resource, period. In contrast, a public subnet resource could be exposed if a Security Group is misconfigured; private subnets eliminate that risk.
What is the difference between Security Groups and NACLs?
Security Groups are stateful firewalls at the resource level (EC2, RDS, ECS task). Stateful means: allow inbound port 8080 → the response traffic is automatically allowed. Security Groups support Allow rules only. They are the primary tool teams use.
NACLs are stateless firewalls at the subnet level. Stateless means: you must explicitly allow both inbound request traffic AND outbound response traffic (on ephemeral ports 1024–65535). NACLs support both Allow and Deny rules, evaluated in numeric order. They act as a second layer of defense at the subnet boundary.
In practice: customise Security Groups for every resource. NACLs are usually left at default (allow all) unless you need to block a specific IP range at the subnet level.
Why should databases never be publicly accessible?
A publicly accessible database is reachable from every IP address on the internet. This exposes it to: automated credential brute-force attacks, exploitation of known database engine vulnerabilities (CVEs), and accidental data exposure if credentials are weak or reused. Databases should only be reachable from your application tier (EC2, ECS, Lambda) within the same VPC, enforced by both private subnet placement (no internet route) and a Security Group that only allows connections from the application's Security Group on the DB port. There is no legitimate reason for a production database to have a public IP or be in a public subnet.
What is the difference between an IAM User and an IAM Role?
IAM User: long-term credentials. A person (or application) with a fixed username/password and/or Access Key ID + Secret. The credentials persist until explicitly rotated or deleted. If leaked, they remain valid indefinitely until you act.
IAM Role: temporary credentials issued by AWS STS when the role is assumed. They have an expiry (15 minutes to 12 hours) and are automatically renewed by the AWS SDK. Roles are assumed by AWS services (EC2 Instance Profile, Lambda execution role, ECS task role) or by federated users.
Best practice: EC2 instances should use an IAM Role via Instance Profile — never embed Access Keys in code or config files. The AWS SDK automatically uses the role credentials from the EC2 instance metadata endpoint. If the instance is compromised, the temporary credentials expire on their own.
2
Deploy Spring Boot on EC2
Topics
EC2 Core Concepts
EC2 (Elastic Compute Cloud)
What
Virtual machines in AWS. You choose an instance type (determines CPU and RAM), an AMI (the OS), storage (EBS), and which VPC/subnet to place it in. You manage everything from the OS upward — AWS manages the physical hardware underneath.
Why
EC2 is the most direct way to run a server in AWS. Full SSH access, install anything, configure however you want. The right mental model before moving to containers or serverless.
Gotcha
- Instance type naming: t3.micro, where t = burstable family, 3 = generation, micro = size. Free tier: t3.micro (2 vCPU, 1 GB RAM, 750 hrs/month). Common families: t (burstable), m (general), c (compute-optimised), r (memory-optimised).
- Burstable instances (t-series) earn CPU credits at idle and spend them under load. If credits run out, CPU is throttled to a baseline rate. Fine for dev; watch out for sustained load in production.
- Stop ≠ Terminate. Stopped: instance paused, EBS persists, compute billing stops (EBS still billed). Terminated: instance deleted, EBS deleted by default. You cannot un-terminate.
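The instance type naming scheme can be expressed as a small parser. This is a sketch for the common shapes only (for example t3.micro, m5.large, c6g.xlarge); the regular expression and the parse_instance_type helper are assumptions for illustration and will not cover every name AWS uses. The family descriptions are the ones listed above.

```python
import re

# family letters, generation digits, optional attribute letters, then ".size"
PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z\-]*)\.([a-z0-9]+)$")

# Family descriptions taken from the list above.
FAMILIES = {
    "t": "burstable",
    "m": "general",
    "c": "compute-optimised",
    "r": "memory-optimised",
}


def parse_instance_type(name: str) -> dict:
    """Split an instance type such as 't3.micro' into its naming components."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return {
        "family": family,
        "category": FAMILIES.get(family, "other"),
        "generation": int(generation),
        "attributes": attributes,  # e.g. 'g' in c6g; empty for t3
        "size": size,
    }


if __name__ == "__main__":
    print(parse_instance_type("t3.micro"))
    print(parse_instance_type("c6g.xlarge"))
```

Names that do not fit the pattern raise ValueError rather than being guessed at.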
AMI (Amazon Machine Image)
What
A pre-built OS image used to launch EC2 instances. Think of it as a disk snapshot that becomes your instance's root volume. AWS provides AMIs for Amazon Linux 2023, Ubuntu, Windows Server, and more. You can create custom AMIs from your own configured instances.
Why
Every EC2 instance starts from an AMI. The AMI determines the OS, pre-installed packages, and starting configuration. Custom AMIs let you bake in your Java runtime and config so new instances are ready immediately.
Gotcha
- AMIs are region-specific. An AMI from us-east-1 cannot be directly used in eu-west-1 — you must copy it first.
- Use Amazon Linux 2023 (AL2023), not Amazon Linux 2, which is approaching end-of-life. AL2023 uses dnf as its package manager (though yum still works as an alias).
EBS (Elastic Block Store)
What
Network-attached persistent storage for EC2. Every instance has a root EBS volume (the OS disk). Default root volume for AL2023 is 8 GB (gp3 type). EBS volumes can be detached from one instance and attached to another.
Why
Data written to EBS persists when you stop and restart the instance — unlike instance store (ephemeral) storage which is wiped. Your Spring Boot JAR and logs live on EBS.
Gotcha
- By default the root EBS volume is deleted when the instance is terminated. Change "Delete on Termination" to false if you want to keep the volume after termination.
- EBS volumes are AZ-specific. An EBS volume in us-east-1a cannot be attached to an instance in us-east-1b.
- You are billed for EBS storage even while the instance is stopped.
SSH Key Pairs
What
When launching an EC2 instance, AWS places the public key on the instance (in ~/.ssh/authorized_keys for the default user). You download the private key file (.pem) once. This is the only way to SSH in: there is no password login by default.
Why
SSH key auth is more secure than passwords. The private key never leaves your machine. AWS never stores it: if you lose the .pem file, the key cannot be recovered.
Gotcha
- Set permissions immediately after download: chmod 400 key.pem. Without this, SSH refuses to use the key: "WARNING: UNPROTECTED PRIVATE KEY FILE".
- Default usernames: ec2-user (Amazon Linux), ubuntu (Ubuntu), admin (Debian). Not root.
- Connect command: ssh -i /path/to/key.pem ec2-user@<public-ip>
Elastic IP
What
A static public IPv4 address you allocate to your account and associate with an EC2 instance. By default, a public IP assigned on launch changes every time the instance is stopped and restarted. An Elastic IP stays fixed regardless of instance state.
Why
Useful when you need a predictable IP — e.g., if you whitelist it in a firewall rule or point a DNS A record to it.
Gotcha
- Since February 2024, AWS charges about $0.005/hr for every public IPv4 address, including Elastic IPs, whether or not it is associated with a running instance. Release IPs you're not using.
Linux & Deployment
Linux Administration on Amazon Linux 2023
What
AL2023 uses dnf (RPM-based, like RHEL/Fedora). AWS provides Amazon Corretto, a free, production-ready OpenJDK build, in the AL2023 package repos.
Why
You need to install Java, Git, and other dependencies before deploying your app. Knowing basic Linux commands is essential for EC2 work.
Key commands
- Install Java 17: sudo dnf install java-17-amazon-corretto -y
- Install Git: sudo dnf install git -y
- Verify Java: java -version → should show Corretto-17.x.x
- Copy JAR from laptop: scp -i key.pem app.jar ec2-user@<ip>:~/
- Check disk space: df -h · Check memory: free -h · Check processes: ps aux | grep java
Running Spring Boot as a Background Service (systemd)
What
An app started directly with java -jar app.jar dies when your SSH session closes. systemd is the Linux init system that manages long-running services, starting them on boot and restarting them on failure.
Why
For any persistent deployment: your app must survive reboots and SSH disconnects, and recover automatically from crashes.
Service file
Create /etc/systemd/system/springapp.service:
[Unit]
Description=Spring Boot Application
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/usr/bin/java -jar /home/ec2-user/app.jar
SuccessExitStatus=143
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
Then: sudo systemctl daemon-reload && sudo systemctl enable springapp && sudo systemctl start springapp
Check logs: sudo journalctl -u springapp -f
Security Groups for a Spring Boot App
What
For a Spring Boot app exposed directly on EC2, the Security Group needs: inbound SSH (port 22) from your IP, inbound app traffic (port 8080) from wherever users connect, all outbound open. Spring Boot's default port is 8080; change with server.port in application.properties.
Why
Without adding port 8080, your app runs correctly on the instance but is completely unreachable from outside — the Security Group silently blocks it.
Gotcha
- Never open SSH (port 22) to 0.0.0.0/0. Use "My IP" in the console. Bots scan for open SSH ports constantly.
- Opening port 8080 to 0.0.0.0/0 is acceptable for initial testing, but the tasks below will have you lock this down behind an ALB, which is how production always looks.
Application Load Balancer (ALB)
ALB, Target Groups, and Listeners
What
An Application Load Balancer sits in public subnets and distributes incoming HTTP/HTTPS traffic to backend targets (EC2 instances, ECS tasks, Lambda). It operates at Layer 7 — it can route requests based on URL path, hostname, and HTTP headers. Key components: Listener (port + protocol the ALB listens on), Rules (how traffic is routed), Target Group (the set of backends with health checks).
Why
In production, application servers should never be directly internet-facing. The ALB lives in the public subnet; your EC2 instances live behind it. The ALB handles health checks (removing unhealthy instances from rotation automatically), SSL termination, and is the attachment point for WAF, access logs, and sticky sessions.
Gotcha
- An ALB requires subnets in at least two AZs. Even with one EC2 target, you must specify two subnets in different AZs when creating the ALB.
- Health check path matters. If your Spring Boot app has Spring Actuator enabled, use /actuator/health. If the path returns anything other than the expected success code (HTTP 200 by default), targets are marked unhealthy and receive no traffic: your app appears "down" even though it's running.
- The ALB has its own Security Group. Pattern: ALB SG allows 80/443 inbound from internet. EC2 SG allows 8080 inbound only from the ALB SG, not from 0.0.0.0/0. This is the correct production security model.
HTTPS with ACM (AWS Certificate Manager)
What
ACM provides free SSL/TLS certificates for use with AWS services (ALB, CloudFront, API Gateway). You request a certificate for your domain, verify ownership via DNS (recommended) or email, and attach it to an ALB HTTPS listener. ACM auto-renews certificates — no manual renewal or private key management required.
Why
All production traffic must be HTTPS. The ALB terminates TLS — your EC2 and Spring Boot app receive plain HTTP internally on port 8080, and the ALB handles the encryption layer. You do not need to configure SSL in Spring Boot or install certificates on EC2.
Gotcha
- ACM certificates are free when attached to ALB, CloudFront, or API Gateway. You cannot export or download them for use on your own EC2 server.
- DNS validation is preferred: it adds a CNAME record to your domain's DNS and auto-renews without human action. Email validation requires you to click a link before the cert expires.
- If you do not own a domain yet, skip the HTTPS task and use the ALB's default DNS name (my-alb-123.us-east-1.elb.amazonaws.com) with HTTP for now. The ALB + Target Group pattern is what matters here.
Hands-on Tasks
Interview Q&A
Walk me through provisioning an EC2 instance for a Spring Boot application.
Start by choosing an AMI: Amazon Linux 2023 for a Java app. Pick the instance type based on expected load; t3.micro for dev/learning, t3.small or m5.large
What's the difference between stopping and terminating an EC2 instance?
Stopping an instance pauses it. The EBS root volume is preserved and the instance can be restarted. Compute billing stops, but EBS storage billing continues. The public IP (if not an Elastic IP) is released and a new one is assigned on next start.
Terminating an instance deletes it. By default, the root EBS volume is also deleted ("Delete on Termination" = true). This action is irreversible — a terminated instance cannot be recovered. If you need the data, create a snapshot or change the Delete on Termination setting before terminating.
How do you run a Spring Boot app on EC2 so it starts automatically and restarts on failure?
Use systemd, the Linux service manager. Create a unit file at /etc/systemd/system/springapp.service that defines the app's start command, working directory, user, and restart policy (Restart=on-failure). Run sudo systemctl enable springapp to register it for automatic start on boot, and sudo systemctl start springapp to start it now. View logs with sudo journalctl -u springapp -f.
A quick alternative is nohup java -jar app.jar > app.log 2>&1 &, but systemd is correct for anything production-like because it handles restarts and boot integration.
What Security Group rules does a Spring Boot app on EC2 need?
Minimum rules: inbound SSH (port 22) restricted to your IP (never 0.0.0.0/0 in production), and inbound TCP on port 8080 (Spring Boot's default) from wherever users connect. Leave outbound traffic fully open so the instance can download packages, call external APIs, etc.
In a production setup with an Application Load Balancer in front, the EC2 Security Group should allow port 8080 only from the ALB's Security Group — not from the internet directly. This way, all traffic is routed through the ALB, which handles HTTPS termination and health checks.
What is an AMI and when would you create a custom one?
An AMI (Amazon Machine Image) is a snapshot of an EC2 instance's state that you use to launch new instances. It includes the OS, installed packages, and configuration baked in at the time the AMI was created.
You'd create a custom AMI when you want new instances to start already configured — for example, with Java, your application dependencies, and monitoring agents pre-installed. This speeds up Auto Scaling: new instances are ready in seconds rather than running a lengthy bootstrap script on every launch. Custom AMIs are also used for Golden Image pipelines — a security-hardened base image that teams build their application AMIs from.
3
RDS Integration
Topics
RDS Core Concepts
RDS (Relational Database Service)
What
A managed database service. AWS handles: OS patching, DB software installation and upgrades, automated backups, Multi-AZ failover, and storage scaling. You manage only the schema and queries. Supports PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora.
Why
Running PostgreSQL on a plain EC2 means you own backups, patching, HA, and failover yourself. In production, that operational burden is enormous. RDS trades cost for time.
Gotcha
- db.t3.micro is free tier eligible (750 hrs/month for MySQL, PostgreSQL, or MariaDB; not Aurora). Good enough for this phase.
- RDS must go in private subnets. It should never have a public IP. Access it only from within the VPC via Security Groups.
- The RDS endpoint looks like: my-db.abc123.us-east-1.rds.amazonaws.com. Use this as your JDBC host.
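As a quick illustration of "use this as your JDBC host", the endpoint slots into a JDBC URL as shown below. The jdbc_url helper and the database name appdb are hypothetical; port 5432 is the PostgreSQL default used earlier in this roadmap. The resulting string is what you would typically set as spring.datasource.url in a Spring Boot app.

```python
def jdbc_url(endpoint: str, database: str, port: int = 5432,
             engine: str = "postgresql") -> str:
    """Build a JDBC URL from an RDS endpoint hostname."""
    return f"jdbc:{engine}://{endpoint}:{port}/{database}"


if __name__ == "__main__":
    # Endpoint format from the text above; 'appdb' is a made-up database name.
    print(jdbc_url("my-db.abc123.us-east-1.rds.amazonaws.com", "appdb"))
    # MySQL listens on 3306 by default.
    print(jdbc_url("my-db.abc123.us-east-1.rds.amazonaws.com", "appdb",
                   port=3306, engine="mysql"))
```

Only the hostname comes from RDS; keep the username and password out of this string and out of source control.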
Multi-AZ Deployment
What
RDS maintains a synchronous standby replica in a different AZ. Every write to the primary is simultaneously written to the standby. If the primary fails, RDS automatically promotes the standby. The DNS endpoint (a CNAME) is repointed to the new primary, so your application's connection string does not change. Typical failover time: 60–120 seconds.
Why
Protects against an AZ outage or hardware failure. Without Multi-AZ, a failed DB instance means downtime until you restore from backup.
Gotcha
- The standby replica is not readable. It exists purely for failover. You cannot send read traffic to it to reduce load on the primary — that is what Read Replicas are for.
- Multi-AZ roughly doubles the RDS cost. In dev/learning environments, leave it off.
Read Replicas
What Asynchronous copies of the primary RDS instance. You can create up to 15 read replicas per source instance for MySQL, MariaDB, and PostgreSQL (5 for Oracle and SQL Server). Your application explicitly directs read-heavy queries (reports, analytics) to a replica's endpoint, reducing load on the primary. Replicas can live in the same region or a different region, and a replica can be promoted to a standalone DB.
Why Scale read throughput horizontally without upgrading the primary instance size.
Gotcha
- Replication is asynchronous — replicas may lag behind the primary by seconds. Never use a replica for anything requiring up-to-date data (e.g., immediately after a write).
- Multi-AZ ≠ Read Replica. Multi-AZ is for availability (automatic failover). Read Replicas are for scalability (read offloading). These are separate features and can both be enabled simultaneously.
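The routing decision described above (writes and freshness-sensitive reads to the primary, everything else to a replica) can be sketched in plain Java. EndpointRouter and its parameters are hypothetical names for illustration, not part of Spring or the AWS SDK, and the lag value is assumed to come from your own monitoring.

```java
import java.time.Duration;

/** Hypothetical helper: picks an endpoint for a query based on its freshness needs. */
final class EndpointRouter {
    private final String primaryUrl;
    private final String replicaUrl;
    private final Duration maxTolerableLag;

    EndpointRouter(String primaryUrl, String replicaUrl, Duration maxTolerableLag) {
        this.primaryUrl = primaryUrl;
        this.replicaUrl = replicaUrl;
        this.maxTolerableLag = maxTolerableLag;
    }

    /**
     * Writes, and reads that must see the latest data, go to the primary.
     * Other reads go to the replica unless its observed lag exceeds the tolerance.
     */
    String choose(boolean isWrite, boolean needsFreshData, Duration observedReplicaLag) {
        if (isWrite || needsFreshData) {
            return primaryUrl;
        }
        if (observedReplicaLag.compareTo(maxTolerableLag) > 0) {
            return primaryUrl; // replica is too far behind to be useful
        }
        return replicaUrl;
    }
}
```

In a Spring application the same decision is often wired into a routing DataSource (for example one built on AbstractRoutingDataSource) so that repositories do not need to know about the two endpoints.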
Automated Backups & Snapshots
What Automated backups: RDS takes a daily snapshot of the DB during the backup window (a separate setting from the maintenance window), plus continuously backs up transaction logs to S3 (not visible in your S3 bucket; managed by RDS). This enables point-in-time recovery to any second within the retention window (1–35 days; the console default is 7), up to the latest restorable time, which typically trails the present by a few minutes. Manual snapshots: you trigger these yourself and they persist until you delete them.
Why Automated backups are your safety net for data corruption and accidental deletion. Take a manual snapshot before any major schema migration.
Gotcha
- Automated backups are deleted when you delete the RDS instance unless you choose to retain them at deletion time. Manual snapshots, including the optional final snapshot, survive instance deletion.
- Restoring a snapshot creates a new RDS instance — it does not restore in place. Plan for the new endpoint in your app config.
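To make the point-in-time recovery window concrete, here is a minimal sketch that checks whether a target timestamp can be restored. PitrWindow is a hypothetical name, and the logShippingDelay parameter is an assumption standing in for the gap between now and the latest restorable time that RDS reports.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: models the recovery window as [now - retention, now - delay]. */
final class PitrWindow {
    /** Returns true if target lies inside the point-in-time recovery window. */
    static boolean isRestorable(Instant target, Instant now, int retentionDays, Duration logShippingDelay) {
        Instant earliest = now.minus(Duration.ofDays(retentionDays));
        Instant latest = now.minus(logShippingDelay);
        return !target.isBefore(earliest) && !target.isAfter(latest);
    }
}
```

With a 7-day retention window, a corruption noticed on day 8 is already outside the window, which is one reason to take a manual snapshot before risky changes.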
AWS Secrets Manager
What A managed service for securely storing, rotating, and retrieving secrets (database passwords, API keys, connection strings). Secrets are encrypted with KMS. Your application retrieves the secret at runtime via the AWS SDK — no plaintext credentials in config files, environment variables hardcoded in systemd units, or Docker images.
Why The most common AWS security mistake: DB credentials hardcoded in application.properties and committed to Git. Secrets Manager solves this: your EC2 IAM Role has secretsmanager:GetSecretValue permission, the app fetches the secret at startup, and credentials never appear in source control or process environment dumps.
Retrieving a secret via AWS SDK v2
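import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueResponse;

// Uses the default credential chain (the EC2 IAM Role) and the default region
SecretsManagerClient client = SecretsManagerClient.create();
GetSecretValueResponse r = client.getSecretValue(
    GetSecretValueRequest.builder()
        .secretId("prod/myapp/db")
        .build());
// r.secretString() → JSON: {"username":"dbadmin","password":"..."}
// Parse with ObjectMapper, inject into DataSource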
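Alternatively, Spring Cloud AWS (io.awspring.cloud:spring-cloud-aws-starter-secrets-manager) can load the secret into the Spring environment: set spring.config.import=aws-secretsmanager:prod/myapp/db in application.properties, and the keys of the JSON secret become ordinary properties you can reference as ${username} and ${password}, with no boilerplate SDK code.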
Gotcha
- Secrets Manager costs $0.40 per secret per month + $0.05 per 10,000 API calls. Negligible at learning scale.
- IAM permission required: secretsmanager:GetSecretValue on the specific secret ARN, not on *.
- For local development, use environment variables or a local .env file (excluded from Git). Never configure Secrets Manager with hardcoded access keys locally, because that defeats the purpose.
Connecting Spring Boot to RDS
What From Spring Boot's perspective, RDS PostgreSQL is just a PostgreSQL instance. The JDBC URL uses the RDS endpoint. The Security Group on the RDS instance must allow port 5432 from the EC2 instance's Security Group.
Why Understanding the connectivity model (SG → SG, not IP → IP) is important for troubleshooting and for interviews.
Config in application.properties
spring.datasource.url=jdbc:postgresql://my-db.abc123.us-east-1.rds.amazonaws.com:5432/mydb
spring.datasource.username=dbadmin
spring.datasource.password=${DB_PASSWORD}
spring.datasource.hikari.maximum-pool-size=10
Store the password in an environment variable or AWS Secrets Manager — never hardcode it in source files.
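The JDBC URL above can also carry the TLS setting as a query parameter. The sketch below assembles such a URL; RdsJdbcUrl is a hypothetical helper, and sslmode is the PostgreSQL JDBC driver's parameter (values such as require or verify-full).

```java
/** Hypothetical helper: assembles a PostgreSQL JDBC URL for an RDS endpoint. */
final class RdsJdbcUrl {
    /**
     * host is the RDS endpoint, port is normally 5432, and sslMode is passed
     * through to the driver unchanged (for example "require" or "verify-full").
     */
    static String postgres(String host, int port, String database, String sslMode) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database + "?sslmode=" + sslMode;
    }
}
```

Calling RdsJdbcUrl.postgres("my-db.abc123.us-east-1.rds.amazonaws.com", 5432, "mydb", "require") yields the URL from the config above with ?sslmode=require appended.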
Hands-on Tasks
Interview Q&A
Why use RDS instead of running PostgreSQL yourself on EC2?
With RDS, AWS handles automated backups, point-in-time recovery, OS and engine patching, Multi-AZ failover, and storage auto-scaling. On a self-managed EC2 PostgreSQL setup, your team owns all of that — which means writing backup scripts, managing cron jobs, handling failover manually, and staying on top of security patches. For most product teams, that operational burden is not the product they're building. RDS trades higher cost for lower operational overhead. The trade-off shifts when you have very specific PostgreSQL configuration requirements or extreme cost constraints at scale.
What's the difference between Multi-AZ and a Read Replica?
Multi-AZ is for availability. It maintains a synchronous standby replica in a different AZ. If the primary fails, RDS automatically fails over to the standby (60–120 seconds). The standby cannot serve read traffic — it exists solely for failover. Your app needs no changes; the endpoint CNAME switches automatically.
Read Replica is for scalability. It is an asynchronous copy of the primary. Your application must explicitly connect to the replica's separate endpoint for read queries. Replication lag means replicas may not have the most recent writes. Read Replicas can be promoted to standalone databases if needed.
You can (and often do) have both enabled at the same time on the same instance.
How do you connect Spring Boot to RDS securely?
Three layers. Network: RDS is in a private subnet with no public IP. The RDS Security Group allows port 5432 only from the application server's Security Group (not from an IP address range). Credentials: database username and password are stored in AWS Secrets Manager or SSM Parameter Store, not in application.properties or source code. The EC2 IAM Role grants permission to retrieve the secret at startup. Transport: enable SSL/TLS on the JDBC connection for encryption in transit (ssl=true&sslmode=require in the JDBC URL). RDS provides a CA certificate bundle; use sslmode=verify-full with it if the client should also verify the server's identity.
What is your backup and disaster recovery strategy for RDS?
Three components: Automated backups with a 7-day retention window provide point-in-time recovery to any second within that window — covers data corruption and accidental deletion. Manual snapshots taken before schema migrations or major releases persist indefinitely and can be used to create a new instance if a migration goes wrong. Multi-AZ handles the infrastructure failure case — if the primary AZ goes down, failover happens automatically without restoring from backup.
For critical systems, also consider cross-region read replicas which can be promoted during a regional outage, giving a lower RTO than restoring from a cross-region snapshot copy.
4
S3 Storage
Topics
S3 Fundamentals
Buckets and Objects
What S3 is object storage, not a file system. You store objects (any file up to 5 TB) inside buckets. A bucket is a top-level container. Objects are identified by a key (a string like uploads/2024/report.pdf); there are no real folders, just naming conventions. S3 is designed for 99.999999999% (11 nines) durability by redundantly storing data across multiple AZs.
Why S3 is the standard for file/blob storage in AWS. Infinitely scalable, no capacity planning, pay only for what you store.
Gotcha
- Bucket names are globally unique across all AWS accounts worldwide. If someone else already has my-app-uploads, you cannot use it. Use a prefix like your company name or account ID.
- Bucket names must be 3–63 characters, lowercase, no underscores.
- Buckets are created in a specific region. Data stays in that region unless you explicitly replicate it.
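The naming rules above are easy to check before calling AWS. This is a minimal sketch covering a subset of the rules (length, allowed characters, start and end characters, no adjacent dots, not shaped like an IP address); BucketNames is a hypothetical helper and does not replace the full rule list in the S3 documentation.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks a subset of the S3 bucket naming rules. */
final class BucketNames {
    // 3-63 chars; lowercase letters, digits, dots, hyphens; starts and ends with a letter or digit
    private static final Pattern SHAPE = Pattern.compile("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$");
    private static final Pattern IPV4_LIKE = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    static boolean isValid(String name) {
        if (name == null || !SHAPE.matcher(name).matches()) {
            return false;
        }
        if (name.contains("..")) {
            return false; // adjacent dots are not allowed
        }
        return !IPV4_LIKE.matcher(name).matches(); // must not look like 192.168.1.1
    }
}
```

A valid name can still be rejected by S3 if another account already owns it, so treat this as a fast local check only.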
Versioning
What When versioning is enabled on a bucket, every upload creates a new version of the object instead of overwriting the previous one. Each version has a unique version ID. Deleted objects get a delete marker — the object is not actually gone and can be recovered by removing the marker.
Why Protection against accidental overwrites and deletions. Essential for any bucket that stores important data. Also required for S3 replication.
Gotcha
- Once enabled, versioning cannot be fully disabled — only suspended. Suspended means new uploads no longer create versions, but existing versions are preserved.
- Old versions accumulate storage costs. Pair versioning with a lifecycle rule to expire non-current versions after N days.
Storage Classes
What S3 offers multiple storage tiers with different cost/access-speed trade-offs:
- Standard (~$0.023/GB): frequent access, lowest latency. Default.
- Standard-IA (Infrequent Access, ~$0.0125/GB + retrieval fee): for files accessed less than once a month.
- Glacier Instant Retrieval (~$0.004/GB): archival, millisecond retrieval.
- Glacier Flexible Retrieval (~$0.0036/GB): archival, minutes to hours retrieval.
- Glacier Deep Archive (~$0.00099/GB): cheapest, 12-hour retrieval. For compliance archives.
- Intelligent-Tiering: automatically moves objects between tiers based on access patterns. Monitoring fee per object.
Why Storage classes let you cut costs significantly for data that is not accessed frequently. A lifecycle policy can automate transitions.
Gotcha
- IA and Glacier classes have a minimum storage duration (30 days for Standard-IA, 90 days for Glacier Instant Retrieval and Glacier Flexible Retrieval, 180 days for Glacier Deep Archive). Objects deleted before the minimum are still billed for the full duration.
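A rough sketch of the arithmetic behind these tiers, using the approximate per-GB monthly prices quoted in the list above. S3Cost is a hypothetical helper; retrieval, request, and monitoring fees are deliberately left out, so the numbers are storage-only estimates.

```java
/** Hypothetical helper: storage-only cost estimates from the approximate prices above. */
final class S3Cost {
    static final double STANDARD = 0.023;       // USD per GB-month
    static final double STANDARD_IA = 0.0125;   // USD per GB-month
    static final double DEEP_ARCHIVE = 0.00099; // USD per GB-month

    /** Monthly storage cost for a given volume and price per GB-month. */
    static double monthlyStorageCost(double gigabytes, double pricePerGbMonth) {
        return gigabytes * pricePerGbMonth;
    }

    /** Days billed for an object deleted after daysStored, given the class minimum. */
    static int billedDays(int daysStored, int minimumDays) {
        return Math.max(daysStored, minimumDays);
    }
}
```

For 1,000 GB this gives about $23 per month in Standard versus about $12.50 in Standard-IA, and an object deleted from Standard-IA after 10 days is still billed for 30.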
Lifecycle Policies
What Rules that automatically transition objects between storage classes or delete them after a set number of days. Example: transition to Standard-IA after 30 days → Glacier after 90 days → delete after 365 days. Applied at the bucket or prefix level.
Why Lifecycle policies are the hands-off way to manage storage costs. Without them, old objects accumulate in Standard storage indefinitely.
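The example rule above (Standard-IA after 30 days, Glacier after 90, delete after 365) can be expressed as a small function of object age. LifecycleRule is a hypothetical model for reasoning about the rule; real lifecycle rules are evaluated by S3 asynchronously, so transitions do not happen at the exact moment an object crosses a threshold.

```java
/** Hypothetical model of the example lifecycle rule, keyed on object age in days. */
final class LifecycleRule {
    enum State { STANDARD, STANDARD_IA, GLACIER, EXPIRED }

    static State stateForAge(int ageDays) {
        if (ageDays >= 365) {
            return State.EXPIRED;      // deleted after one year
        }
        if (ageDays >= 90) {
            return State.GLACIER;      // archived after 90 days
        }
        if (ageDays >= 30) {
            return State.STANDARD_IA;  // infrequent access after 30 days
        }
        return State.STANDARD;
    }
}
```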
Presigned URLs
What A presigned URL is a time-limited URL generated server-side that grants temporary access to a private S3 object without making the bucket or object public. The URL contains an embedded signature with an expiry. Anyone with the URL can GET (download) or PUT (upload) the object until expiry — no AWS credentials needed.
Why Standard pattern for user file uploads and downloads: your backend generates the presigned URL and hands it to the client. The client transfers directly to/from S3 — your backend never touches the file bytes, saving bandwidth and compute.
Gotcha
- Presigned URLs inherit the permissions of the IAM identity that generated them. If your EC2 IAM Role has s3:GetObject on the bucket, it can generate presigned GET URLs that work. Signing happens locally, so a URL is produced even without the permission, but requests that use it fail with 403 Access Denied.
- Maximum expiry: 7 days (604800 seconds) when signing with long-term IAM user credentials. With temporary credentials (an assumed role, an EC2 instance profile, or any other STS session), the URL stops working no later than when the session expires, even if you set a longer duration.
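The expiry rule above can be sketched without the SDK: the URL is usable until the earlier of the requested expiry and the end of the signing session. PresignExpiry is a hypothetical helper for reasoning about this; it does not sign anything, and passing null for sessionExpiry is this sketch's way of saying "long-term credentials".

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: computes when a presigned URL actually stops working. */
final class PresignExpiry {
    static final Duration MAX_EXPIRY = Duration.ofDays(7); // 604800 seconds

    /**
     * signedAt is when the URL was generated, requested is the expiry asked for,
     * and sessionExpiry is the end of the temporary credentials (null if none).
     */
    static Instant effectiveExpiry(Instant signedAt, Duration requested, Instant sessionExpiry) {
        if (requested.compareTo(MAX_EXPIRY) > 0) {
            throw new IllegalArgumentException("expiry may not exceed 7 days");
        }
        Instant urlExpiry = signedAt.plus(requested);
        if (sessionExpiry != null && sessionExpiry.isBefore(urlExpiry)) {
            return sessionExpiry; // the session ends first, and the URL dies with it
        }
        return urlExpiry;
    }
}
```

On EC2 the instance profile credentials are temporary, so a URL requested for 12 hours may stop working sooner if the underlying session ends before then.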
Access Control — Block Public Access
What S3 has four "Block Public Access" settings at the account and bucket level. Enabling all four prevents any object in the bucket from being made publicly readable — regardless of bucket policy or object ACLs. This is the default for new buckets as of 2023.
Why Public S3 buckets have caused high-profile data breaches. "Block Public Access" is a safety net. Enable it on every bucket that is not intentionally serving public content (e.g., a static website).
Gotcha
- Bucket policies still control which IAM identities (your EC2 role, Lambda function) can access objects — Block Public Access only prevents public (unauthenticated) access.
- Never use bucket ACLs — they are a legacy mechanism. Use bucket policies and IAM policies instead.
Spring Boot Integration (AWS SDK v2)
Using AWS SDK v2 for Java with S3
What AWS SDK v2 (software.amazon.awssdk) is the current Java SDK. Use S3Client for synchronous operations or S3AsyncClient for async. The SDK automatically picks up credentials from the EC2 Instance Profile (IAM Role), so no hardcoded keys are needed.
pom.xml dependency
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>s3</artifactId>
  <version>2.25.x</version>
</dependency>
Key operations: s3Client.putObject(), s3Client.getObject(), s3Presigner.presignGetObject(). The SDK uses the default credential provider chain: on EC2 with an IAM Role, it reads temporary credentials from the instance metadata endpoint automatically.
Hands-on Tasks
Interview Q&A
Why store files in S3 instead of in the database?
Databases are optimised for structured data and queries — not binary blobs. Storing large files in a DB increases backup size, slows queries unrelated to those files, and doesn't scale economically. S3 is purpose-built for object storage: it costs a fraction of DB storage (~$0.023/GB vs ~$0.115/GB for RDS gp2), scales infinitely, and offers 11 nines of durability. Files stored in S3 can also be served directly to clients via presigned URLs, bypassing your application servers entirely and saving bandwidth and compute.
What are presigned URLs and when do you use them?
A presigned URL is a time-limited, signature-embedded URL that grants temporary access to a private S3 object. Your backend generates it using the AWS SDK (requires IAM permission on the bucket) and returns it to the client. The client then uploads or downloads directly from S3 using that URL — no AWS credentials needed, and your backend never handles the file bytes.
Use them for: user file uploads (presigned PUT URL → client uploads directly), file downloads (presigned GET URL → client downloads directly), and sharing private documents with external parties for a limited time. The key benefit is that large file transfers bypass your application servers completely.
How do you prevent an S3 bucket from being accidentally made public?
Two layers: first, enable all four "Block Public Access" settings on the bucket (and ideally at the account level). This acts as a guardrail that prevents any public bucket policies or public ACLs from taking effect, even if someone accidentally adds one. Second, use AWS Config rules to detect buckets where Block Public Access has been turned off, and AWS Organizations SCPs (Service Control Policies) to deny the API calls that would disable it. Never grant s3:PutBucketPolicy to application IAM roles; only administrators should be able to modify bucket policies.
5
Monitoring and Observability
Topics
CloudWatch
CloudWatch Metrics
What Time-series data points representing the health of your AWS resources. Metrics have a namespace (e.g., AWS/EC2), a name (e.g., CPUUtilization), and dimensions (e.g., InstanceId=i-xxx). CloudWatch keeps data at coarser resolution as it ages: 1-minute data points for 15 days, 5-minute for 63 days, and 1-hour for 15 months (sub-minute custom metrics are kept for 3 hours).
Why Metrics are the foundation of all monitoring. Without them you are flying blind: you cannot know if an instance is struggling until users report problems.
What EC2 sends by default (free, 5-min intervals)
- CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps
- NOT included by default: memory usage, disk space used. These require the CloudWatch Agent.
- Detailed monitoring (1-minute intervals): costs extra. Enable per instance in the console.
CloudWatch Agent
What A software agent you install on EC2 that collects metrics and logs beyond what AWS sends by default. Collects: memory usage (mem_used_percent), disk space (disk_used_percent), and any log files you point it at (e.g., your Spring Boot log file).
Why Memory and disk space are the two most common causes of production outages. Without the agent, CloudWatch has no visibility into either.
Setup on Amazon Linux 2023
sudo dnf install amazon-cloudwatch-agent -y
# Use the wizard to generate config:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start the agent:
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent
The EC2 IAM Role must have the CloudWatchAgentServerPolicy managed policy attached.
CloudWatch Logs
What Centralised log storage and search. Logs are organised into Log Groups (one per application or service) and Log Streams (one per instance or task). The CloudWatch Agent ships log files from EC2. Lambda, ECS, and EKS can push logs automatically.
Why Without centralised logging, you must SSH into each instance to read logs — impossible when you have multiple instances or when an instance has crashed.
Gotcha
- Log retention defaults to Never Expire. This accumulates storage costs indefinitely. Always set a retention policy (e.g., 30 or 90 days) on every log group.
- CloudWatch Logs Insights: SQL-like query language for searching and analysing logs. Example:
filter @message like /ERROR/ | stats count(*) by bin(5m)
CloudWatch Alarms
What An alarm watches a single metric over a time window and changes state when the metric crosses a threshold. Three states: OK, ALARM, INSUFFICIENT_DATA. When an alarm enters ALARM state, it can trigger: an SNS notification (→ email, Slack, PagerDuty), an EC2 action (stop, reboot), or an Auto Scaling action.
Why Alarms turn metrics into actionable notifications. Without alarms you would have to watch dashboards constantly — impractical at scale.
Gotcha
- Set the evaluation period and datapoints carefully. CPU > 80% for 1 out of 1 datapoints at 1-minute intervals will fire on transient spikes. 2 out of 3 datapoints reduces false positives.
- An alarm in INSUFFICIENT_DATA state means CloudWatch is not receiving metric data, which itself can indicate a problem (agent stopped, instance down).
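The "M out of N datapoints" rule from the first gotcha can be sketched as follows. AlarmEvaluator is a hypothetical model, not the CloudWatch implementation: it treats "too few datapoints" as not alarming, whereas real CloudWatch would report INSUFFICIENT_DATA or apply the alarm's missing-data setting.

```java
/** Hypothetical model of CloudWatch's "M out of N datapoints" alarm evaluation. */
final class AlarmEvaluator {
    /**
     * Returns true when at least datapointsToAlarm of the most recent
     * evaluationPeriods values are strictly above the threshold.
     */
    static boolean inAlarm(double[] recent, double threshold, int evaluationPeriods, int datapointsToAlarm) {
        if (recent.length < evaluationPeriods) {
            return false; // simplification: not enough data to evaluate
        }
        int breaching = 0;
        for (int i = recent.length - evaluationPeriods; i < recent.length; i++) {
            if (recent[i] > threshold) {
                breaching++;
            }
        }
        return breaching >= datapointsToAlarm;
    }
}
```

With CPU readings of 20, 30, 95 and a threshold of 80, a 1-out-of-1 alarm fires on the single spike, while a 2-out-of-3 alarm stays quiet until a second reading breaches.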
CloudWatch Dashboards
What Custom visualisations of metrics and alarm states on a single screen. Add widgets for line graphs, stacked area charts, numbers, and alarm status. Dashboards are shared across the team and are the first thing the on-call engineer checks during an incident.
Why At-a-glance system health. One dashboard should show the key health signals for your entire application stack.
Gotcha
- Dashboards cost $3/month per dashboard (first 3 are free). Not a budget concern at learning scale.
Hands-on Tasks
Interview Q&A
What metrics does EC2 send to CloudWatch by default, and what requires the CloudWatch Agent?
By default (free, 5-minute intervals), EC2 sends: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, and DiskReadBytes/DiskWriteBytes.
Not included by default: memory usage and disk space utilisation. These require the CloudWatch Agent installed on the instance. Memory and disk are the two metrics most responsible for production incidents, so installing the agent is a baseline operational requirement, not an optional extra.
How do you monitor a Spring Boot application on EC2?
Three layers: Infrastructure metrics via CloudWatch Agent (CPU, memory, disk — the instance health). Application logs via CloudWatch Agent shipping the Spring Boot log file to CloudWatch Logs — search with Logs Insights for errors and latency patterns. Application metrics via Spring Boot Actuator + Micrometer: expose custom metrics (request count, latency, DB pool size) and push them to CloudWatch using the Micrometer CloudWatch registry. Then build a dashboard combining all three and set alarms on the signals that indicate real user impact (high error rate, high latency) rather than just infrastructure metrics.
How would you investigate a production incident using CloudWatch?
Start with the time the alarm fired. Open the CloudWatch Dashboard and look for correlated metric spikes at that time — CPU, memory, network, error count. Then go to CloudWatch Logs Insights and query the application log group for ERROR-level messages in that time window. Check if a deployment happened around the same time (correlate with your CI/CD pipeline). Look at RDS metrics (CPU, database connections, read/write latency) if the issue could be DB-related. The goal is to narrow from "something went wrong" to "this specific component hit this specific limit at this time." Document findings and consider adding more targeted alarms for the root cause so the same issue triggers an alert faster next time.
6
Containers and ECS
Topics
Docker
Images and Containers
What A Docker image is a read-only, layered template built from a Dockerfile. It packages your OS base, runtime, and application into a single artefact. A container is a running instance of an image — an isolated process with its own filesystem, network, and process space. Images are immutable; you don't patch them, you build new ones.
Why Containers eliminate "it works on my machine" — the same image runs identically in dev, staging, and production. They start in seconds (unlike VMs), use less RAM, and pack many containers onto one host.
Gotcha
- Container filesystem is ephemeral. Anything written inside the container is lost when the container stops. Use volumes or external storage (S3, RDS) for persistence.
- Each layer in a Docker image is cached. Put infrequently-changing layers (base image, dependencies) early in the Dockerfile, and your app code last — this speeds up rebuilds significantly.
Dockerfile for Spring Boot
What A text file with instructions to build a Docker image. For Spring Boot, the recommended approach is a multi-stage build or a simple single-stage build using an official JRE base image.
Recommended Dockerfile
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY target/app.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
eclipse-temurin:17-jre-alpine is the Eclipse Foundation's OpenJDK build on Alpine Linux: small (~175 MB) and production-safe.
Build: docker build -t my-app .
Run: docker run -p 8080:8080 my-app
Gotcha
- Use a JRE image (not JDK) in production containers — JDK includes the compiler which you don't need at runtime and adds unnecessary image size.
- Spring Boot's layered JAR support (layers are extracted with java -Djarmode=layertools -jar app.jar extract, typically in a multi-stage build) lets you copy dependencies and app code into separate Docker layers, making rebuilds faster. Worth exploring after basics are solid.
AWS Container Services
ECR (Elastic Container Registry)
What AWS's private Docker image registry. Like Docker Hub but private, integrated with IAM, and in your AWS account. ECS and EKS pull images from ECR using IAM (the ECS task execution role, or the EKS node's IAM role), so there are no registry credentials to manage.
Why You need a place to store your Docker images that ECS can pull from. Public Docker Hub images should not be used in production (rate limits, supply chain risk).
Auth + push commands
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag my-app:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
ECS (Elastic Container Service)
What AWS's container orchestration service. You define what to run (Task Definition) and how many copies to keep running (Service). ECS handles placement, health checks, rolling deployments, and integration with load balancers. No Kubernetes knowledge required.
Why On raw EC2, you manually manage deployment scripts, restart logic, and load balancer registration. ECS handles all of that. It is significantly simpler than EKS for teams that do not need Kubernetes portability.
Gotcha
- There is no SSH into Fargate containers. Debugging uses CloudWatch Logs (stdout/stderr from the container) and ECS Exec (an optional feature that provides a shell into a running task).
- ECS tasks are not persistent. A stopped task is replaced by a new one. Any state must be in external storage (RDS, S3, ElastiCache).
Task Definitions
What The blueprint for your container in ECS. Defines: Docker image URI (ECR), CPU and memory allocation, port mappings, environment variables, log configuration, and the IAM Task Role (permissions the container has). Task Definitions are versioned — every change creates a new revision.
CPU and memory units
- Fargate CPU units: 256 (0.25 vCPU), 512 (0.5 vCPU), 1024 (1 vCPU), up to 16384 (16 vCPU)
- Memory: 512 MB minimum, must be compatible with chosen CPU. E.g., 512 CPU → 1–4 GB memory.
- Spring Boot typically needs at least 512 MB; 1 GB is comfortable for a small service.
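Because Fargate only accepts certain CPU and memory pairings, it can help to validate a task size before registering a Task Definition. The sketch below encodes the pairings as I understand them from the Fargate task size table (memory in MiB); FargateTaskSize is a hypothetical helper, and the table should be checked against the current AWS documentation before relying on it.

```java
/** Hypothetical helper: validates Fargate CPU units against memory in MiB. */
final class FargateTaskSize {
    static boolean isValid(int cpuUnits, int memoryMiB) {
        switch (cpuUnits) {
            case 256:   return memoryMiB == 512 || memoryMiB == 1024 || memoryMiB == 2048;
            case 512:   return inSteps(memoryMiB, 1024, 4096, 1024);     // 1-4 GB
            case 1024:  return inSteps(memoryMiB, 2048, 8192, 1024);     // 2-8 GB
            case 2048:  return inSteps(memoryMiB, 4096, 16384, 1024);    // 4-16 GB
            case 4096:  return inSteps(memoryMiB, 8192, 30720, 1024);    // 8-30 GB
            case 8192:  return inSteps(memoryMiB, 16384, 61440, 4096);   // 16-60 GB, 4 GB steps
            case 16384: return inSteps(memoryMiB, 32768, 122880, 8192);  // 32-120 GB, 8 GB steps
            default:    return false; // not a supported CPU value
        }
    }

    /** True if value lies in [min, max] and is a whole number of steps above min. */
    private static boolean inSteps(int value, int min, int max, int step) {
        return value >= min && value <= max && (value - min) % step == 0;
    }
}
```

For the small Spring Boot service discussed here, 512 CPU units with 1024 MiB is a valid and common starting point.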
ECS Fargate vs EC2 Launch Type
What Fargate: serverless compute for containers. AWS provisions and manages the underlying EC2 instances. You pay per vCPU-second and GB-second of memory while the task runs. No instances to manage, patch, or right-size. EC2 launch type: you manage a cluster of EC2 instances. ECS places containers on them. More control, potentially cheaper at high sustained utilisation, but more operational overhead.
Why Fargate removes EC2 management entirely. For most Spring Boot microservice deployments, Fargate is the right starting point.
Gotcha
- Fargate has a slightly slower cold start than EC2 launch type (~10–30 seconds to start a new task). Not usually a problem for long-running services, but relevant for burst scaling.
- Fargate does not support GPU workloads or privileged containers.
Hands-on Tasks
Interview Q&A
Why use containers instead of deploying a JAR directly on EC2?
Containers package the application with its runtime and all dependencies into a single immutable artefact. This eliminates environment drift — the same container runs identically in dev, CI, staging, and production. They enable faster deployments (push a new image, ECS rolls it out), easier rollbacks (redeploy the previous image tag), and consistent behaviour regardless of what else is installed on the host. Containers also allow multiple services to run on shared infrastructure with isolation, making better use of resources compared to one-app-per-EC2.
What's the difference between ECS Fargate and the EC2 launch type?
Fargate: serverless. AWS provisions the underlying compute invisibly. You pay per vCPU-second and GB-second of memory while tasks run. No instances to manage, patch, or monitor. Simpler operationally, slightly more expensive at high sustained load, and slightly slower to scale (task startup takes ~10–30 seconds).
EC2 launch type: you manage a cluster of EC2 instances that ECS places containers on. More operational overhead (patching, right-sizing, capacity management), but more control and potentially cheaper at consistently high utilisation. Also required for GPU workloads or containers that need privileged mode.
Starting point for most teams: Fargate. Move to EC2 launch type when you have a concrete cost or capability reason.
How does ECS handle a rolling deployment?
When you update an ECS Service (e.g., new Task Definition revision), ECS starts new tasks with the updated version while keeping old tasks running. The deployment is controlled by two parameters: minimumHealthyPercent (e.g., 100: never go below 100% of desired count) and maximumPercent (e.g., 200: allow up to double the tasks temporarily). ECS waits for new tasks to pass health checks before stopping old ones. If new tasks fail health checks, old tasks remain running while ECS keeps retrying; with the deployment circuit breaker enabled, ECS instead marks the deployment as failed and can roll back automatically. If you have a load balancer attached, traffic drains from old tasks before they are stopped.
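The two percentages translate into concrete task counts. A minimal sketch, assuming the rounding described in the ECS documentation (the minimum rounds up, the maximum rounds down); RollingDeployment is a hypothetical helper name.

```java
/** Hypothetical helper: turns ECS deployment percentages into task counts. */
final class RollingDeployment {
    /** Fewest running tasks ECS keeps during a deployment (percentage rounded up). */
    static int minHealthyTasks(int desiredCount, int minimumHealthyPercent) {
        return (desiredCount * minimumHealthyPercent + 99) / 100;
    }

    /** Most tasks, old plus new, allowed at once (percentage rounded down). */
    static int maxTotalTasks(int desiredCount, int maximumPercent) {
        return desiredCount * maximumPercent / 100;
    }
}
```

With a desired count of 4, the 100/200 settings keep at least 4 tasks healthy and allow up to 8 during the rollout. With 50/150 and a desired count of 3, ECS may drop to 2 healthy tasks and run at most 4 in total.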
7
Kubernetes with EKS
Topics
Kubernetes Core Objects
Pod
What The smallest deployable unit in Kubernetes. A pod wraps one or more containers that share the same network namespace (same IP, same localhost) and storage volumes. In practice, most pods contain exactly one container. Pods are ephemeral — they are created and destroyed constantly.
Why You never create pods directly in production. You define a Deployment and Kubernetes manages the pods for you, ensuring the desired number are always running.
Gotcha
- Pods have dynamic IPs. Never hardcode a pod's IP — use a Service to get a stable endpoint.
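- When a container in a pod crashes, the kubelet restarts it in place (according to the restartPolicy), and the pod keeps its name and IP. Only when the pod itself is replaced, for example after a node failure or during a rollout, does the replacement get a new name and IP.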
Deployment
What A Deployment declares the desired state: "run 3 replicas of this container image." Kubernetes continuously reconciles actual state to desired state. A Deployment manages a ReplicaSet, which manages the pods. Rolling updates and rollbacks are built in.
Minimal Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
Service
What A stable network endpoint that load-balances traffic to a set of pods matching a label selector. Three main types: ClusterIP (default, internal-only, for service-to-service traffic), NodePort (exposes on each node's IP at a static port, mostly for testing), LoadBalancer (provisions an AWS load balancer automatically for internet-facing exposure: a Network Load Balancer when the AWS Load Balancer Controller handles the Service, otherwise a legacy Classic Load Balancer).
Why Services decouple callers from pods. Even as pods are replaced during deployments, the Service endpoint stays stable.
Gotcha
- A LoadBalancer Service in EKS creates an AWS NLB, which costs money. For HTTP routing to multiple services, use an Ingress with the AWS Load Balancer Controller instead, which creates a single ALB shared across services.
ConfigMap and Secret
What ConfigMap: stores non-sensitive configuration (e.g., database URL, feature flags) as key-value pairs. Injected into pods as environment variables or mounted as files. Secret: same structure but for sensitive data (passwords, API keys). Base64-encoded (not encrypted) by default in etcd.
Gotcha
- Kubernetes Secrets are not encrypted at rest by default in etcd. For production, integrate with AWS Secrets Manager using the External Secrets Operator, or use the AWS Secrets and Configuration Provider (ASCP) with the Secrets Store CSI Driver.
- Base64 is encoding, not encryption. Anyone with kubectl access can decode a Secret trivially:
kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
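The same point can be shown from Java using only the JDK: Base64 is reversible by anyone, with no key involved. SecretEncodingDemo is a hypothetical class name, and the value "s3cr3t" is a made-up example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Demonstrates that Base64 is a reversible encoding, not encryption. */
final class SecretEncodingDemo {
    /** What ends up in the Secret's data field for a given plaintext. */
    static String encode(String plain) {
        return Base64.getEncoder().encodeToString(plain.getBytes(StandardCharsets.UTF_8));
    }

    /** What anyone who can read the Secret can recover, without any key. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

Because decoding needs nothing but the encoded string, access control on Secrets (RBAC) and encryption at rest have to carry the actual protection.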
Ingress
What HTTP/HTTPS routing rules that direct external traffic to different Services based on host or path. Example: api.example.com/users → user-service, api.example.com/orders → order-service. Requires an Ingress Controller to implement the rules. In EKS, the AWS Load Balancer Controller creates an ALB from Ingress resources.
Why Instead of one LoadBalancer per service (one NLB each = expensive), a single ALB can route to all services based on path, which is much more cost-efficient.
EKS (Elastic Kubernetes Service)
EKS Cluster and Node Groups
What EKS is a managed Kubernetes control plane. AWS runs and maintains etcd, the API server, controller manager, and scheduler. You manage (or let AWS manage) the worker nodes via Managed Node Groups — EC2 instances that EKS provisions, registers, and drains during updates automatically.
Cost warning
- EKS control plane: $0.10/hour (~$72/month) per cluster. Plus the EC2 cost of worker nodes. Delete the cluster when not actively learning — this is the biggest cost trap in this roadmap.
- Alternatively, use EKS with Fargate profiles to avoid managing EC2 worker nodes entirely. Pay per pod CPU/memory instead.
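The monthly figure follows directly from the hourly rate. A tiny sketch of the arithmetic, assuming a 30-day month and counting only the control plane (worker node cost is not included, and the helper name is hypothetical):

```java
// Back-of-the-envelope control plane cost; worker nodes are billed separately.
final class EksCostEstimate {
    // Cost for a cluster that exists hoursPerDay hours on each of the given days.
    static double controlPlaneCost(double hourlyRate, int hoursPerDay, int days) {
        return hourlyRate * hoursPerDay * days;
    }

    public static void main(String[] args) {
        // Cluster left running all month at $0.10/hour.
        System.out.printf("always on:      $%.2f%n", controlPlaneCost(0.10, 24, 30));
        // Cluster created for a 3-hour study session on 10 days, deleted afterwards.
        System.out.printf("study sessions: $%.2f%n", controlPlaneCost(0.10, 3, 10));
    }
}
```

Billing follows the time the cluster exists, not the time you spend using it, so deleting the cluster is what stops the charge.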
IRSA (IAM Roles for Service Accounts)
What By default, every pod on an EKS worker node can use the node's EC2 Instance Profile, so all pods on that node share the same AWS permissions, which breaks least privilege. IRSA is the solution: you annotate a Kubernetes Service Account with an IAM Role ARN. Only pods that reference that Service Account receive temporary credentials for that specific role. EKS injects the role ARN and the path to a projected web identity token as environment variables (AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE); the AWS SDK exchanges the token for temporary credentials and refreshes them automatically. No code changes are needed because the SDK picks this up via the standard credential provider chain.
Why This is the standard production pattern for giving AWS permissions to pods. Without IRSA, teams fall back to embedding Access Keys in Kubernetes Secrets or ConfigMaps, which is a serious security violation. IRSA is a day-1 requirement on any real EKS deployment and a common interview question.
Setup with eksctl
# Step 1: Associate OIDC provider with the cluster (once per cluster)
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster --region us-east-1 --approve

# Step 2: Create an IAM service account in your cluster namespace
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace default \
  --name my-app-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve

# Step 3: Reference it in your Deployment manifest
# spec.template.spec.serviceAccountName: my-app-sa
Gotcha
- The OIDC provider must be associated with the cluster before creating IRSA service accounts. Without this, the IAM role trust policy cannot validate the pod's identity token.
- The AWS SDK inside the pod automatically uses IRSA credentials. On EC2, it uses the Instance Profile. Both flow through the same credential provider chain — your Spring Boot code works identically in both environments without any changes.
- When a second team member tries to kubectl into your cluster and gets "Unauthorized", they need to be added to the aws-auth ConfigMap: run kubectl edit configmap aws-auth -n kube-system and add their IAM user ARN.
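The reason the same code works on EC2 and in an IRSA-enabled pod is that the SDK tries a list of credential sources in order and uses the first one that answers. The sketch below models that idea only; it is not the AWS SDK's implementation, the real chain has more steps, and every name in it is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Toy model of a credential provider chain (not the AWS SDK's real classes).
final class CredentialChainSketch {
    // A provider inspects the environment and returns a description of the credentials, or empty.
    record Provider(String name, Function<Map<String, String>, Optional<String>> lookup) {}

    static Optional<String> fromStaticKeys(Map<String, String> env) {
        return env.containsKey("AWS_ACCESS_KEY_ID") ? Optional.of("static access key") : Optional.empty();
    }

    static Optional<String> fromWebIdentity(Map<String, String> env) {
        boolean irsa = env.containsKey("AWS_ROLE_ARN") && env.containsKey("AWS_WEB_IDENTITY_TOKEN_FILE");
        return irsa ? Optional.of("role " + env.get("AWS_ROLE_ARN")) : Optional.empty();
    }

    static Optional<String> fromInstanceProfile(Map<String, String> env) {
        // Stand-in for the instance metadata lookup; assumed to always succeed in this sketch.
        return Optional.of("node instance profile");
    }

    static final List<Provider> CHAIN = List.of(
        new Provider("environment variables", CredentialChainSketch::fromStaticKeys),
        new Provider("web identity (IRSA)", CredentialChainSketch::fromWebIdentity),
        new Provider("instance profile", CredentialChainSketch::fromInstanceProfile));

    // Walk the chain and report which provider supplied credentials.
    static String resolve(Map<String, String> env) {
        for (Provider provider : CHAIN) {
            Optional<String> credentials = provider.lookup().apply(env);
            if (credentials.isPresent()) {
                return provider.name() + " -> " + credentials.get();
            }
        }
        throw new IllegalStateException("no credentials found");
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of()));
        System.out.println(resolve(Map.of(
            "AWS_ROLE_ARN", "arn:aws:iam::111122223333:role/my-app",
            "AWS_WEB_IDENTITY_TOKEN_FILE", "/var/run/secrets/token")));
    }
}
```

Because the web identity step sits ahead of the instance profile step, a pod with the IRSA variables set never falls through to the node's permissions.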
eksctl and kubectl
What eksctl: CLI tool by Weaveworks (officially supported by AWS) for creating and managing EKS clusters. Abstracts the complex CloudFormation stacks that EKS requires. kubectl: the standard Kubernetes CLI for all cluster operations — deploying, scaling, inspecting, and debugging.
Key commands
# Create cluster (takes ~15-20 minutes)
eksctl create cluster --name my-cluster --region us-east-1 \
  --nodegroup-name workers --node-type t3.small --nodes 2

kubectl get nodes                        # verify workers are Ready
kubectl get pods -A                      # all pods in all namespaces
kubectl apply -f app.yaml                # deploy from manifest
kubectl get svc                          # list services and their external IPs
kubectl logs pod-name -f                 # tail pod logs
kubectl rollout undo deployment/my-app   # rollback
Hands-on Tasks
Interview Q&A
When would you choose ECS over EKS, and vice versa?
Choose ECS when: your team is AWS-focused and doesn't need multi-cloud portability, you want simpler operations with less overhead, you're running a small number of services, or Kubernetes' learning curve isn't justified by your team's size and complexity.
Choose EKS when: you need Kubernetes-native tooling (Helm charts, Argo CD, Istio, Karpenter), your organisation is standardising on Kubernetes across clouds, you need fine-grained scheduling or custom operators, or you're migrating from an on-premises Kubernetes cluster.
EKS adds operational complexity (control plane cost, version upgrades, node group management) and a steeper learning curve. ECS is simpler and tightly integrated with AWS services but less portable. For a new Spring Boot project with a small team: ECS Fargate is often the pragmatic choice.
What happens when a Kubernetes pod crashes?
Kubernetes restarts the container within the pod according to the pod's restartPolicy (default: Always for Deployments). After repeated failures, the pod shows CrashLoopBackOff: the kubelet restarts the container with exponential backoff (10s, 20s, 40s... up to 5 minutes between restarts) to avoid thrashing.
The Deployment controller ensures the desired number of pods always runs. A crash-looping pod is restarted in place on the same node; a replacement is scheduled onto another node only if the pod is deleted or evicted, or if the node itself fails. Liveness probes detect unhealthy containers (e.g., a deadlocked app) and restart them. Readiness probes prevent traffic from reaching pods that are not yet ready to serve requests.
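The backoff schedule described above (start at 10 seconds, double each time, cap at 5 minutes) is easy to tabulate. A minimal sketch, with hypothetical names, that ignores the timer reset the kubelet applies after a container has run cleanly for a while:

```java
import java.util.ArrayList;
import java.util.List;

// Tabulates the restart delays quoted in the text: 10s, doubling, capped at 300s.
final class CrashLoopBackoff {
    static final int INITIAL_SECONDS = 10;
    static final int CAP_SECONDS = 300;

    // Delay before each of the first `restarts` restarts, in seconds.
    static List<Integer> delays(int restarts) {
        List<Integer> result = new ArrayList<>();
        int delay = INITIAL_SECONDS;
        for (int i = 0; i < restarts; i++) {
            result.add(delay);
            delay = Math.min(delay * 2, CAP_SECONDS);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(delays(7)); // [10, 20, 40, 80, 160, 300, 300]
    }
}
```

After the sixth restart the delay stays at the cap, which is why a pod that has been crash-looping for a while appears to retry only every five minutes.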
How does a Kubernetes rolling update work?
When you update a Deployment (e.g., change the image tag), Kubernetes creates a new ReplicaSet with the new version. It then scales up the new ReplicaSet and scales down the old one gradually, controlled by maxSurge (how many extra pods can exist during rollout, default 25%) and maxUnavailable (how many pods can be unavailable during rollout, default 25%).
New pods must pass readiness probes before old pods are terminated, ensuring no traffic goes to pods that aren't ready. If the new pods fail readiness probes, the rollout stalls and old pods remain running. Rollback is fast: kubectl rollout undo deployment/my-app scales the previous ReplicaSet back up without rebuilding anything.
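For the rolling update question, it helps to turn the 25% defaults into pod counts. Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down; the sketch below applies that rule. The class and method names are hypothetical.

```java
// Converts percentage-based maxSurge / maxUnavailable into pod counts.
final class RollingUpdateBounds {
    // maxSurge as a percentage is rounded up.
    static int surgePods(int replicas, int percent) {
        return (replicas * percent + 99) / 100;
    }

    // maxUnavailable as a percentage is rounded down.
    static int unavailablePods(int replicas, int percent) {
        return replicas * percent / 100;
    }

    // Most pods (old + new) that may exist at any moment during the rollout.
    static int maxTotalPods(int replicas, int surgePercent) {
        return replicas + surgePods(replicas, surgePercent);
    }

    // Fewest available pods allowed at any moment during the rollout.
    static int minAvailablePods(int replicas, int unavailablePercent) {
        return replicas - unavailablePods(replicas, unavailablePercent);
    }

    public static void main(String[] args) {
        int replicas = 10;
        System.out.println("max total pods:     " + maxTotalPods(replicas, 25));     // 13
        System.out.println("min available pods: " + minAvailablePods(replicas, 25)); // 8
    }
}
```

With 10 replicas and the defaults, the rollout never drops below 8 available pods and never exceeds 13 in total, which is a useful number to quote when sizing node capacity for deployments.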
8
Serverless
Topics
AWS Lambda
Lambda Function Lifecycle
What Lambda runs your code in response to events (HTTP requests, S3 uploads, DynamoDB streams, SQS messages, etc.) without you provisioning or managing servers. You upload code (ZIP or container image), configure a trigger, set memory (128 MB–10 GB), and pay per invocation + duration (GB-seconds). Maximum execution time: 15 minutes.
Why No servers to patch or scale. Lambda auto-scales from 0 to thousands of concurrent executions. Cost is zero when idle — you pay only for actual compute used, down to the millisecond.
Gotcha
- Default concurrent execution limit: 1000 per account per region (soft limit, can be increased). One Lambda function hitting the limit can throttle all other functions in the account. Use reserved concurrency to isolate critical functions.
- Lambda is not suitable for long-running processes (batch jobs, video encoding) or anything requiring persistent local state.
Writing a Java Lambda Handler (plain Java, no Spring)
What The handler is the entry point Lambda invokes. Implement RequestHandler<Input, Output> from the aws-lambda-java-core library. Input/output types are automatically serialised from/to JSON by the Lambda runtime. Keep the handler class small and initialise expensive objects (SDK clients, DB connections) as instance variables so they survive across warm invocations on the same container.
Minimal working handler (Maven dependencies: aws-lambda-java-core + aws-lambda-java-events, plus the AWS SDK v2 dynamodb module for DynamoDbClient)
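package com.example;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

public class UserHandler
        implements RequestHandler<APIGatewayProxyRequestEvent,
                                  APIGatewayProxyResponseEvent> {
    // Initialised ONCE per container, reused on warm starts
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    @Override
    public APIGatewayProxyResponseEvent handleRequest(
            APIGatewayProxyRequestEvent event, Context ctx) {
        // getPathParameters() returns null when the request has no path variables
        var params = event.getPathParameters();
        String userId = params == null ? null : params.get("id");
        if (userId == null) return new APIGatewayProxyResponseEvent().withStatusCode(400);
        // query dynamo, build response...
        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"userId\":\"" + userId + "\"}");
    }
}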
Handler config in Lambda console: com.example.UserHandler::handleRequest
Gotcha
- Do not use Spring Boot as a Lambda runtime. Spring's application context initialises in 3–8 seconds — that's a multi-second cold start on every scale-out event. Use plain Java for Lambda. If your team requires Spring, use Spring Cloud Function + SnapStart, which snapshots the initialised context to cut cold starts to ~1 second.
- Static initialisers (static { ... }) and the handler's field initialisers both run during the init phase, which counts as cold start time. Keep setup that every request needs there (it runs once per container), and lazy-init heavyweight objects that only some requests use on first invocation.
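Lazy initialisation on first use can be done with a small holder class. This is a minimal sketch in plain Java; Lazy and LazyInitDemo are hypothetical names, and the string stands in for an expensive client that only some requests need.

```java
import java.util.function.Supplier;

// Holds a value that is built on first access and reused afterwards.
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean initialised;

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    @Override
    public synchronized T get() {
        if (!initialised) {
            value = factory.get(); // the expensive work happens here, once
            initialised = true;
        }
        return value;
    }
}

final class LazyInitDemo {
    // Stand-in for a heavyweight client that only the /reports route uses.
    private static final Lazy<String> REPORT_CLIENT = new Lazy<>(() -> {
        System.out.println("building report client...");
        return "report-client";
    });

    static String handle(String path) {
        if (path.startsWith("/reports")) {
            return REPORT_CLIENT.get() + " handled " + path;
        }
        return "fast path handled " + path; // never pays for the report client
    }

    public static void main(String[] args) {
        System.out.println(handle("/users/1"));
        System.out.println(handle("/reports/weekly"));
        System.out.println(handle("/reports/monthly")); // client is reused, not rebuilt
    }
}
```

The trade-off is that the first request needing the lazy object pays its cost, so this suits rarely used dependencies rather than ones every request touches.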
Cold Starts and SnapStart
What A cold start happens when Lambda must provision a new container to handle an invocation — no warm container is available. The sequence: provision container → download code → initialise runtime → run static initialiser code → run handler. For Java, this takes 2–10 seconds. A warm start reuses an existing container and only runs the handler (~10–100ms).
Why Cold starts add latency for the first request and after periods of inactivity. Critical for customer-facing APIs.
Java-specific mitigations
- Lambda SnapStart (available for Java 11+ managed runtimes): AWS takes a snapshot of the initialised execution environment after the init phase, then restores from it on cold starts. Reduces Java cold starts from seconds to ~1 second. Enable in the function configuration → "Edit" → SnapStart.
- Provisioned Concurrency: keeps N containers pre-warmed and ready. Eliminates cold starts completely but you pay for the reserved capacity even when idle. Use for latency-sensitive APIs.
- Avoid heavy Spring Boot in Lambda — Spring's context initialisation is slow. Use plain Java, Quarkus (with native compilation), or Micronaut instead. If you need Spring, use Spring Cloud Function with GraalVM native image.
Memory and Performance
What Lambda CPU allocation scales proportionally with memory. A function at 1024 MB gets approximately 2x the CPU of a function at 512 MB. This means allocating more memory can make a function faster and potentially cheaper overall (faster execution = fewer GB-seconds billed).
Gotcha
- The right memory setting is not always the minimum. Use AWS Lambda Power Tuning (an open source Step Functions state machine) to find the optimal memory-to-cost ratio for your function.
- Ephemeral storage (/tmp): 512 MB by default, configurable up to 10 GB. Files in /tmp persist between warm invocations on the same container, which is useful for caching, but don't assume they are always available.
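The claim that more memory can cost the same or less comes down to GB-seconds, the unit Lambda bills duration in. The sketch below computes it for a few settings; the durations are assumed figures for illustration, not measurements, and no price per GB-second is applied.

```java
// Billed duration in GB-seconds = memory (GB) x duration (s) x invocations.
final class LambdaGbSeconds {
    static double gbSeconds(int memoryMb, long durationMs, long invocations) {
        return (memoryMb / 1024.0) * (durationMs / 1000.0) * invocations;
    }

    public static void main(String[] args) {
        long invocations = 1_000_000;
        // Assumed durations: doubling memory roughly halves a CPU-bound function's run time.
        System.out.printf("512 MB  @ 800 ms: %.0f GB-s%n", gbSeconds(512, 800, invocations));
        System.out.printf("1024 MB @ 400 ms: %.0f GB-s%n", gbSeconds(1024, 400, invocations));
        System.out.printf("1024 MB @ 350 ms: %.0f GB-s%n", gbSeconds(1024, 350, invocations));
    }
}
```

If doubling memory exactly halves the duration, the bill is unchanged and latency is better; anything faster than that is a net saving, which is what a tool like Lambda Power Tuning measures empirically.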
API Gateway
HTTP API vs REST API
What API Gateway offers two main API types for Lambda integrations: HTTP API (newer, simpler, ~70% cheaper, lower latency) and REST API (more features, more expensive). Both create managed HTTP endpoints backed by Lambda.
Feature comparison
- HTTP API: JWT authorisers, Lambda proxy integration, CORS, lower cost (~$1/million requests). Missing: caching, API keys/usage plans, per-method throttling, AWS WAF integration.
- REST API: all HTTP API features plus response caching, usage plans, API keys, resource policies, AWS WAF, mock integrations, request/response transformations. Costs ~$3.50/million requests.
- Default recommendation: use HTTP API unless you need a specific REST API feature.
Throttling and Security
What API Gateway throttles requests to protect your backend. Account-level default: 10,000 requests/second steady-state with a burst of 5,000 requests, shared across all APIs in a Region. Returns HTTP 429 (Too Many Requests) when exceeded. For REST API, you can also set per-method throttle limits and per-client limits via usage plans.
Gotcha
- The API Gateway endpoint is public by default. Anyone can call it. Add an authoriser (JWT/Cognito for HTTP API, Lambda authoriser for custom logic) or at minimum use an API key to prevent open access in production.
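The throttling behaviour described in this section (a steady-state rate plus a burst allowance) matches the token bucket algorithm that the API Gateway documentation refers to. The sketch below is a generic token bucket with a caller-supplied clock, not API Gateway's implementation; the class name and the small numbers are chosen for illustration.

```java
// Generic token bucket: refills at a steady rate, holds at most `burst` tokens.
final class TokenBucket {
    private final double ratePerSecond;
    private final double burst;
    private double tokens;
    private double lastSeconds;

    TokenBucket(double ratePerSecond, double burst) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
        this.tokens = burst; // start full, so an initial burst is allowed
        this.lastSeconds = 0.0;
    }

    // Returns true if the request is allowed; false corresponds to an HTTP 429.
    boolean tryAcquire(double nowSeconds) {
        double elapsed = nowSeconds - lastSeconds;
        tokens = Math.min(burst, tokens + elapsed * ratePerSecond);
        lastSeconds = nowSeconds;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(2.0, 3.0); // 2 requests/second, burst of 3
        for (int i = 1; i <= 4; i++) {
            System.out.println("t=0.0 request " + i + " allowed: " + bucket.tryAcquire(0.0));
        }
        System.out.println("t=1.0 request allowed: " + bucket.tryAcquire(1.0));
    }
}
```

The burst value caps how many requests can arrive at once, while the rate decides how quickly capacity comes back, which is why clients that receive 429s should retry with backoff rather than immediately.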
DynamoDB
Partition Keys and Sort Keys
What DynamoDB is a fully managed NoSQL key-value and document database. Every item has a primary key. Simple primary key: partition key only (must be unique per item). Composite primary key: partition key + sort key (the combination must be unique; the partition key alone can repeat). DynamoDB uses the partition key to determine which physical partition stores the item (sharding). The sort key enables range queries within a partition.
Why DynamoDB provides single-digit millisecond performance at any scale, with no schema to define beyond the key structure. Perfect for high-throughput, simple access patterns.
Gotcha
- Choose a high-cardinality partition key (e.g., user ID, order ID — values that are unique and evenly distributed). A low-cardinality partition key (e.g., status = "active"/"inactive") creates hot partitions that throttle performance.
- DynamoDB has no joins. Design your table access patterns upfront: think in queries, not entities. Single-table design (putting multiple entity types in one table) is the advanced pattern to learn once the basics are solid.
- Free tier: 25 GB storage + 25 RCUs + 25 WCUs per month — always free, not just 12 months.
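The hot partition problem can be seen with a toy hash. The sketch below uses Java's String.hashCode as a stand-in (DynamoDB's internal hash function and partition count are not public, so both are assumptions) and counts how many items land on each of 8 hypothetical partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of key-to-partition placement; not DynamoDB's real hashing.
final class PartitionSkew {
    static int partitionFor(String partitionKey, int partitions) {
        return Math.floorMod(partitionKey.hashCode(), partitions);
    }

    // Number of items that land on each partition.
    static int[] distribution(List<String> keys, int partitions) {
        int[] counts = new int[partitions];
        for (String key : keys) {
            counts[partitionFor(key, partitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> byUserId = new ArrayList<>();
        List<String> byStatus = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            byUserId.add("user-" + i);                      // high cardinality
            byStatus.add(i % 10 == 0 ? "inactive" : "active"); // two distinct values
        }
        System.out.println("user id key: " + Arrays.toString(distribution(byUserId, 8)));
        System.out.println("status key:  " + Arrays.toString(distribution(byStatus, 8)));
    }
}
```

With the status key, all traffic falls on at most two partitions no matter how many exist, so adding capacity does not relieve the hot ones.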
Capacity Modes
What On-Demand: pay per request. No capacity planning. Scales instantly to any traffic level. ~$1.25 per million write request units, $0.25 per million read request units. Provisioned: you specify Read Capacity Units (RCUs) and Write Capacity Units (WCUs). Cheaper at predictable, sustained load. Auto Scaling adjusts RCU/WCU automatically within min/max bounds.
Gotcha
- 1 RCU = 1 strongly consistent read per second for items up to 4 KB, or 2 eventually consistent reads per second. 1 WCU = 1 write per second for items up to 1 KB.
- On-Demand is ~5–7x more expensive than Provisioned at equivalent sustained throughput. Use On-Demand for unpredictable or new workloads; switch to Provisioned once traffic patterns are known.
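The RCU and WCU rules above translate into simple rounding arithmetic: item size rounds up to the next 4 KB for reads and the next 1 KB for writes. A minimal sketch with hypothetical names:

```java
// Provisioned capacity arithmetic from the RCU/WCU definitions.
final class CapacityUnits {
    // RCUs needed for the given read rate; eventually consistent reads cost half.
    static long readUnits(int itemBytes, long readsPerSecond, boolean stronglyConsistent) {
        long unitsPerRead = (itemBytes + 4095L) / 4096L; // round up to 4 KB blocks
        long strong = unitsPerRead * readsPerSecond;
        return stronglyConsistent ? strong : (strong + 1) / 2;
    }

    // WCUs needed for the given write rate.
    static long writeUnits(int itemBytes, long writesPerSecond) {
        long unitsPerWrite = (itemBytes + 1023L) / 1024L; // round up to 1 KB blocks
        return unitsPerWrite * writesPerSecond;
    }

    public static void main(String[] args) {
        int itemBytes = 6 * 1024; // a 6 KB item
        System.out.println(readUnits(itemBytes, 100, true));  // 200
        System.out.println(readUnits(itemBytes, 100, false)); // 100
        System.out.println(writeUnits(itemBytes, 50));        // 300
    }
}
```

The rounding is why keeping items small pays off: a 4.1 KB item costs twice as much to read as a 4 KB one.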
Global Secondary Index (GSI)
What A GSI lets you query a DynamoDB table on attributes other than the primary key. A GSI has its own partition key (and optional sort key) that can be any table attribute. Data is replicated asynchronously from the base table to the GSI. You can have up to 20 GSIs per table. You pay separately for GSI storage and throughput.
Gotcha
- GSI reads are eventually consistent — the GSI may lag behind the base table by milliseconds to seconds. Never use a GSI for reads that must reflect the very latest write.
Hands-on Tasks
Interview Q&A
What is a Lambda cold start and how do you mitigate it for Java?
A cold start occurs when Lambda provisions a new execution environment — it must download code, start the JVM, and run initialisation code before handling the request. For Java, this typically adds 2–10 seconds of latency. Subsequent requests to the same warm container run in milliseconds.
Mitigations in order of effectiveness: (1) Lambda SnapStart — enable on Java 11+ functions; AWS snapshots the initialised environment and restores from it, reducing cold starts to ~1 second. (2) Provisioned Concurrency — keeps N containers pre-initialised, eliminating cold starts at extra cost. (3) Avoid Spring Boot — Spring's context initialisation is heavy; use Quarkus with native compilation or plain Java. (4) Keep init code minimal — defer expensive initialisation until first use rather than in static blocks.
When would you choose DynamoDB over PostgreSQL (RDS)?
Choose DynamoDB when: you need single-digit millisecond latency at massive scale, your access patterns are simple and known upfront (get by ID, query by partition key), you need near-infinite scalability without manual sharding, or you're building session storage, leaderboards, IoT data, or gaming backends.
Stick with PostgreSQL/RDS when: your data has complex relationships requiring joins, you need ad-hoc queries or reporting, you have transactional requirements (ACID across multiple entities), or your team thinks in relational terms — the operational overhead of DynamoDB table design is significant for teams not familiar with NoSQL access patterns. DynamoDB requires you to design your data model around your queries upfront, not the other way around.
What's the difference between API Gateway REST API and HTTP API?
HTTP API is the newer, simpler, and cheaper option (~$1/million requests). It covers the majority of use cases: Lambda proxy integration, JWT authentication, CORS, and custom routes. Lower latency than REST API.
REST API (~$3.50/million requests) adds features not in HTTP API: response caching (reduces Lambda invocations and latency for cacheable responses), usage plans and API keys (for partner/developer APIs with rate limits per key), AWS WAF integration (web application firewall), request/response transformation without Lambda, and resource-level policies.
Default choice: HTTP API. Use REST API specifically when you need caching, usage plans, or WAF.
9
Infrastructure as Code
Topics
Terraform
Core Concepts: Providers, Resources, Variables
What Terraform is an open-source IaC tool by HashiCorp using HCL (HashiCorp Configuration Language). Providers are plugins that interact with APIs (the aws provider calls AWS APIs). Resources are the infrastructure you declare (aws_s3_bucket, aws_instance). Variables parameterise your configuration so the same code can provision dev, staging, and prod.
Minimal example
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  default = "us-east-1"
}

resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-app-uploads-${var.region}"
}

resource "aws_s3_bucket_versioning" "my_bucket" {
  bucket = aws_s3_bucket.my_bucket.id
  versioning_configuration { status = "Enabled" }
}
Terraform Workflow: init → plan → apply
What The standard Terraform workflow is three commands, plus a fourth for teardown:
- terraform init: download providers and modules, initialise the backend. Run once per project or after backend/provider changes.
- terraform plan: show what will be created, changed, or destroyed. Reads current state and compares to your code. Always review this output before applying.
- terraform apply: execute the plan. Prompts for confirmation. Writes results to state file.
- terraform destroy: destroys all resources managed by the current state. Use carefully.
Gotcha
- Never run terraform apply in CI without first reviewing the plan output. A -replace or -destroy flag in the wrong hands deletes production resources.
- terraform import: bring an existing manually-created resource under Terraform management without recreating it.
Remote State (S3 + DynamoDB)
What Terraform state (terraform.tfstate) is a JSON file that maps your code to real-world resources. By default it is stored locally. For teams, store it remotely in S3 (for shared access and versioning) with a DynamoDB table for state locking (prevents two people from applying simultaneously and corrupting state).
backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
Gotcha
- State files can contain sensitive values (RDS passwords, private keys). Enable S3 bucket encryption and restrict access with bucket policies.
- Never edit the state file manually. Use terraform state mv, terraform state rm, or terraform import for state surgery.
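State locking is essentially a "create this item only if it does not already exist" write. The sketch below models that with an in-memory map standing in for the lock table. It assumes the lock is taken with a conditional write, it is not Terraform's code, and all names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a lock table keyed by state path.
final class StateLockSketch {
    private final ConcurrentHashMap<String, String> lockTable = new ConcurrentHashMap<>();

    // Conditional write: succeeds only if no lock item exists for this state.
    boolean acquire(String stateKey, String owner) {
        return lockTable.putIfAbsent(stateKey, owner) == null;
    }

    // Only the current holder can release the lock.
    boolean release(String stateKey, String owner) {
        return lockTable.remove(stateKey, owner);
    }

    public static void main(String[] args) {
        StateLockSketch locks = new StateLockSketch();
        String state = "prod/terraform.tfstate";
        System.out.println("alice acquires: " + locks.acquire(state, "alice")); // true
        System.out.println("bob acquires:   " + locks.acquire(state, "bob"));   // false
        System.out.println("alice releases: " + locks.release(state, "alice")); // true
        System.out.println("bob acquires:   " + locks.acquire(state, "bob"));   // true
    }
}
```

The second apply fails fast instead of racing the first, which is the behaviour you see as a "state locked" error when a colleague is mid-apply.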
Modules
What Reusable groups of Terraform resources. A module is a directory of .tf files. Call it with a module block and pass variables. Modules enforce consistency: define your VPC pattern once and call it for dev, staging, and prod. The Terraform Registry has community modules for common patterns (AWS VPC, EKS, RDS).
Gotcha
- Avoid deeply nested modules — they make debugging hard. Flat module structures are more readable.
- Pin module versions (version = "5.5.0") to avoid unexpected breaking changes from upstream updates.
AWS CloudFormation
Templates and Stacks
What CloudFormation is AWS's native IaC service. You define infrastructure in YAML or JSON templates. Deploying a template creates a stack — a collection of AWS resources managed as a unit. AWS determines the correct creation order based on dependencies between resources. CloudFormation is free — you pay only for the resources it creates.
Template structure
AWSTemplateFormatVersion: '2010-09-09'
Description: My application stack
Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
Resources:  # required
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: ami-0abcdef1234567890
Outputs:
  InstanceId:
    Value: !Ref MyInstance
Gotcha
- If a stack update fails, CloudFormation automatically rolls back to the previous state. This is generally helpful but means you must investigate why the update failed before retrying.
- Deleting a stack deletes all resources in it (by default). Use DeletionPolicy: Retain on critical resources like RDS and S3 to protect them.
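Working out a creation order from dependencies is a topological sort. The sketch below uses Kahn's algorithm on a small hand-written dependency map; it illustrates the idea only and says nothing about how CloudFormation implements it. The resource names are examples.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Orders resources so that every dependency is created before its dependents.
final class CreationOrder {
    // dependsOn maps each resource to the resources it references.
    static List<String> order(Map<String, List<String>> dependsOn) {
        Map<String, Integer> unmet = new TreeMap<>();          // resource -> unmet dependency count
        Map<String, List<String>> dependents = new HashMap<>(); // dependency -> resources waiting on it
        for (var entry : dependsOn.entrySet()) {
            unmet.putIfAbsent(entry.getKey(), 0);
            for (String dependency : entry.getValue()) {
                unmet.putIfAbsent(dependency, 0);
                unmet.merge(entry.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (var entry : unmet.entrySet()) {
            if (entry.getValue() == 0) ready.add(entry.getKey());
        }
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String resource = ready.poll();
            result.add(resource);
            for (String waiting : dependents.getOrDefault(resource, List.of())) {
                if (unmet.merge(waiting, -1, Integer::sum) == 0) ready.add(waiting);
            }
        }
        if (result.size() != unmet.size()) {
            throw new IllegalArgumentException("circular dependency between resources");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(order(Map.of(
            "Instance", List.of("Subnet", "SecurityGroup"),
            "Subnet", List.of("Vpc"),
            "SecurityGroup", List.of("Vpc"),
            "Vpc", List.<String>of())));
    }
}
```

A cycle (A references B, B references A) has no valid order, which is the situation behind a circular dependency error in a template.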
Change Sets and Drift Detection
What A Change Set is a preview of what CloudFormation will do when you update a stack, equivalent to terraform plan. Always create a change set and review it before updating production stacks. Drift detection identifies resources that have been modified outside of CloudFormation (e.g., someone changed a Security Group in the console). Drift detection does not auto-remediate: it reports differences so you can decide what to do.
Gotcha
- Manual changes to CloudFormation-managed resources cause drift. If you update a resource manually and then run CloudFormation, it may overwrite your manual change or fail with a conflict. Always make changes through CloudFormation or Terraform — never manually in the console for IaC-managed resources.
Hands-on Tasks
Interview Q&A
Why use Infrastructure as Code instead of clicking in the console?
Four reasons: Reproducibility — run the same code and get identical environments. No manual steps that differ between dev, staging, and prod. Version control — infrastructure changes are committed to Git, code-reviewed in PRs, and have a full audit trail. You know what changed, when, and why. Disaster recovery — if an environment is destroyed, recreate it from code in minutes rather than days of manual work. Drift prevention — IaC is the source of truth; manual console changes are detected as drift and can be corrected. The alternative — clicking in the console — produces "snowflake servers" that are impossible to reproduce exactly.
What is Terraform state and why does it need to be stored remotely?
Terraform state (terraform.tfstate) is a JSON file mapping your code to real AWS resources. Without it, Terraform cannot know what it already created and would try to create everything again on the next apply. The state file also tracks resource attributes (IDs, ARNs) needed to create dependencies.
Remote state (in S3) is required for team use: if each engineer has their own local state file, there is no shared understanding of what is deployed and two people applying simultaneously can corrupt state or create duplicate resources. DynamoDB state locking prevents concurrent applies by creating a lock entry before any operation and releasing it after. Encryption on the S3 bucket is important because state files often contain sensitive values like database passwords.
What is a CloudFormation change set and why should you always use one?
A change set is CloudFormation's preview of what will happen when you update a stack. It lists which resources will be added, modified, or removed, and whether each modification requires replacement. Replacement is the critical case: some property changes require CloudFormation to create a new resource and delete the old one (e.g., changing an RDS instance's DBInstanceIdentifier, or renaming a resource's logical ID in the template). Without reviewing the change set, you could accidentally delete a production database by making what looks like a minor configuration change.
Always create and review a change set before updating any production stack. The workflow is: create change set → review (specifically look for any "Replacement: True" entries) → execute if safe, or modify the template if not. This is the IaC equivalent of code review before deployment.
★
Capstone Project
Full target architecture
Internet → ALB → EKS (Spring Boot Microservices) → RDS PostgreSQL (private subnet)
EKS → S3 (document storage) · EKS → ECR (image registry)
All services → CloudWatch (logs + metrics + alarms)
GitHub → GitHub Actions → ECR → EKS (CI/CD pipeline)
Infrastructure provisioned via Terraform
This is the architecture to draw on paper first. Use draw.io with AWS Architecture Icons. Drawing it forces you to think through the connectivity, security groups, and IAM roles before building.
Cost-conscious alternative: ECS Fargate
The target architecture uses EKS, which costs $0.10/hr (~$72/month) for the control plane before any EC2 worker nodes. If that's a concern, or if Phase 7 felt overwhelming, this entire capstone works equally well with ECS Fargate:
Replace EKS Deployments → ECS Task Definitions + ECS Services. Replace kubectl apply in CI/CD → aws ecs update-service --force-new-deployment. The architecture (VPC, private subnets, ALB, RDS, S3, CloudWatch, IAM roles) and CI/CD concepts are identical. EKS adds Kubernetes portability; ECS Fargate is simpler, cheaper, and AWS-native. The AWS skills you build are the same either way.
What you're building
User Service
Spring Boot service handling user registration and login. Stores users in RDS PostgreSQL. Issues JWTs. Deployed on EKS.
Member Service
Spring Boot CRUD REST API for member management. Connected to same RDS instance (separate schema). Deployed on EKS.
Document Service
Spring Boot service for file upload/download. Stores files in S3. Returns presigned URLs to clients. Deployed on EKS.
Observability
CloudWatch Container Insights for EKS metrics and logs. Dashboards covering all three services. Alarms on error rate and latency.
CI/CD Pipeline
GitHub Actions workflow: on merge to main → build JAR → build Docker image → push to ECR → update EKS Deployment image tag → rolling deploy.
Capstone Tasks
Interview Q&A — Architecture-level questions
Walk me through the production architecture you built in this project.
The architecture starts at the edge with an Application Load Balancer that routes HTTP/HTTPS traffic into an EKS cluster running inside a VPC. The cluster has worker nodes across two Availability Zones in private subnets for resilience. Three Spring Boot microservices run as Kubernetes Deployments — User Service (auth), Member Service (CRUD), and Document Service (file storage).
The data layer: RDS PostgreSQL in a private subnet, reachable only from the EKS pods via Security Group rules. The Document Service stores files in S3 and returns presigned URLs to clients so file transfers bypass the application servers entirely.
Observability: CloudWatch Container Insights collects pod metrics and logs, with alarms feeding into SNS for on-call notifications. Infrastructure is provisioned via Terraform with remote state in S3. CI/CD runs in GitHub Actions — merge to main triggers a build, ECR push, and rolling deployment to EKS automatically.
How would you handle a database migration with zero downtime?
Use an expand-contract pattern with Flyway (or Liquibase): first deploy a migration that adds the new column or table without removing anything (expand). The old application version ignores the new column; the new version uses it. After all instances are running the new version (confirmed via rolling deploy), deploy a second migration that removes the old column (contract).
For high-risk migrations: take a manual RDS snapshot immediately before the migration window. Enable RDS Multi-AZ if not already on. Have a tested rollback plan — Flyway supports undo scripts, and the RDS snapshot is the last resort. Run the migration during low-traffic hours. Monitor error rates in CloudWatch during and after. Never run a migration that locks a table for minutes in production without testing its impact on a clone of the production database first.
How would you scale this architecture to handle 10× traffic?
At the application layer: the Kubernetes Horizontal Pod Autoscaler (HPA) scales pods based on CPU or memory utilization, so pod-level scaling is already handled inside the EKS cluster (HPA needs the Kubernetes Metrics Server installed to read those metrics). For node-level scaling, add Karpenter (an open-source node autoscaler originally built by AWS) to provision new EC2 worker nodes automatically when pods are unschedulable.
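To make the HPA behaviour concrete, the sketch below reproduces the replica calculation described in the Kubernetes documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the configured minimum and maximum. The class and method names are illustrative, and the real controller also applies stabilization windows and pod-readiness rules that are left out here.

```java
public class HpaReplicaSketch {

    // Default tolerance documented for the HPA controller: ratios within 10% of 1.0 cause no scaling.
    static final double DEFAULT_TOLERANCE = 0.1;

    static int desiredReplicas(int currentReplicas, double currentMetric, double targetMetric,
                               int minReplicas, int maxReplicas) {
        double ratio = currentMetric / targetMetric;
        if (Math.abs(ratio - 1.0) <= DEFAULT_TOLERANCE) {
            return currentReplicas; // close enough to target, avoid flapping
        }
        int desired = (int) Math.ceil(currentReplicas * ratio);
        // Never go below minReplicas or above maxReplicas.
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 4 pods averaging 90% CPU against a 60% target.
        System.out.println(desiredReplicas(4, 90.0, 60.0, 2, 20));
        // A 10x spike is capped by maxReplicas, which is where node autoscaling takes over.
        System.out.println(desiredReplicas(4, 600.0, 60.0, 2, 20));
    }
}
```

The second call shows why a 10× answer cannot stop at HPA: once the replica count hits its maximum or the nodes run out of capacity, the maximum and the node pool both have to grow.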
At the database layer: add RDS read replicas for read-heavy traffic (direct reporting queries to replicas). If the single RDS instance becomes the bottleneck, consider Aurora PostgreSQL: it separates storage from compute and scales reads across up to 15 replicas with automatic failover. Replicas do not add write capacity, so write-heavy load still depends on the size of the single writer instance. RDS Proxy handles connection pooling if Lambda or many pods are hammering the DB.
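Routing reads to replicas has to happen somewhere in the application. The sketch below shows the core decision in plain Java, assuming hypothetical JDBC URLs and a caller that knows whether its work is read-only: writes go to the primary, read-only work is spread round-robin across replicas.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ReadWriteRouterSketch {

    private final String primaryUrl;
    private final List<String> replicaUrls;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouterSketch(String primaryUrl, List<String> replicaUrls) {
        this.primaryUrl = primaryUrl;
        this.replicaUrls = List.copyOf(replicaUrls);
    }

    // Writes always use the primary; read-only work rotates through the replicas.
    String urlFor(boolean readOnly) {
        if (!readOnly || replicaUrls.isEmpty()) {
            return primaryUrl;
        }
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), replicaUrls.size());
        return replicaUrls.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical endpoints, for illustration only.
        ReadWriteRouterSketch router = new ReadWriteRouterSketch(
            "jdbc:postgresql://primary.example.internal:5432/app",
            List.of("jdbc:postgresql://replica-1.example.internal:5432/app",
                    "jdbc:postgresql://replica-2.example.internal:5432/app"));
        System.out.println(router.urlFor(false));
        System.out.println(router.urlFor(true));
        System.out.println(router.urlFor(true));
    }
}
```

RDS read replicas replicate asynchronously, so a read that must see a write made moments earlier should still go to the primary. In a Spring Boot service the same decision is usually wired in through Spring's AbstractRoutingDataSource rather than hand-rolled URL selection.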
At the network edge: CloudFront in front of the ALB caches API responses where possible (rare for a CRUD API but useful for read-heavy public endpoints). Serve S3 files through CloudFront edge locations rather than directly from presigned S3 URLs, which reduces latency for users far from the bucket's Region.
For the Document Service specifically: large file uploads should use S3 multipart upload driven directly from the client. The service typically starts the upload and hands out one presigned URL per part, the client sends each part straight to S3, and the file bytes bypass the application entirely.
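Before presigning part URLs, the service has to decide how to split the file. The sketch below plans the parts under two documented S3 limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts, numbered from 1. The class, record, and method names are illustrative, and presigning and completing the upload (which need the AWS SDK) are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class MultipartPlanSketch {

    static final long MIB = 1024L * 1024L;
    static final long MIN_PART_SIZE = 5 * MIB; // S3 minimum for every part except the last
    static final int MAX_PARTS = 10_000;       // S3 maximum number of parts per upload

    // One part of the upload: its 1-based part number and the byte range it covers.
    record Part(int partNumber, long offset, long length) {}

    static List<Part> plan(long fileSize, long preferredPartSize) {
        if (fileSize <= 0) {
            throw new IllegalArgumentException("fileSize must be positive");
        }
        long partSize = Math.max(preferredPartSize, MIN_PART_SIZE);
        // Grow the part size if the file would otherwise need more than 10,000 parts.
        long minSizeForPartLimit = (fileSize + MAX_PARTS - 1) / MAX_PARTS;
        partSize = Math.max(partSize, minSizeForPartLimit);

        List<Part> parts = new ArrayList<>();
        int partNumber = 1;
        for (long offset = 0; offset < fileSize; offset += partSize) {
            long length = Math.min(partSize, fileSize - offset);
            parts.add(new Part(partNumber++, offset, length));
        }
        return parts;
    }

    public static void main(String[] args) {
        // A 12 MiB file with 5 MiB parts: two full parts and a 2 MiB tail.
        for (Part part : plan(12 * MIB, 5 * MIB)) {
            System.out.println(part);
        }
    }
}
```

Each planned part would then get its own presigned URL, and the client would report the ETag returned for each part so the upload can be completed.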