AWS Roadmap
Gain practical AWS experience to confidently discuss cloud architecture, deploy Spring Boot applications, and handle AWS-focused interview questions — without relying on memorized answers.
Work through the phases at your own pace. The PDF suggests 8–10 weeks, but move faster where you already have context.
EC2
IAM
VPC
RDS
S3
CloudWatch
Docker
ECR
ECS
EKS
Lambda
API Gateway
DynamoDB
CloudFormation
Terraform
Phase 1: AWS Foundations
Before You Start — Prerequisites
This roadmap assumes the following. If any are missing, cover them first — they are assumed throughout every phase.
- Linux command line basics: navigating directories (cd, ls), reading/writing files, running commands with sudo. You will SSH into EC2 instances from Phase 2 onward.
- Java 17 + Maven or Gradle: you should be able to build a runnable Spring Boot JAR locally before Phase 2.
- Git: committing and pushing to a remote repository. Required for the CI/CD pipeline in the capstone.
- Basic networking concepts: what an IP address and port number are, what TCP/IP means. You do not need to be a network engineer — Phase 1 teaches the AWS-specific networking layer on top of this.
- A brand-new AWS account: use a dedicated email address. Do not use a corporate or shared account — you need root access for initial setup and need free tier eligibility.
Architecture this phase
VPC → Public Subnet (has route to Internet Gateway) · VPC → Private Subnet (no internet route, outbound via NAT Gateway)
Draw this yourself for better retention: draw.io · Official AWS Icons
Topics
AWS Global Infrastructure
Regions
What
A Region is a geographic area that contains multiple Availability Zones. Examples: us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore). AWS has 33+ regions as of 2024.
Why
You pick a Region based on three things: data residency (legal requirements to keep data in a country), latency (pick closest to your users), and service availability (not all services launch in all regions simultaneously).
Gotcha
- Resources are region-scoped. An EC2 instance in us-east-1 has nothing to do with one in eu-west-1 — they're completely separate.
- us-east-1 (N. Virginia) is where AWS launches new services first. Good for learning, but for production choose a region close to your actual users.
Availability Zones (AZs)
What
An AZ consists of one or more physically separate data centres within a Region, with independent power, cooling, and network connectivity. The AZs in a Region are connected to each other via low-latency links. By AWS's current design, each Region has at least 3 AZs.
Why
Deploy your application across 2+ AZs so a data centre failure (fire, power outage, network issue) does not take down your service. This is the foundational pattern for high availability in AWS.
Gotcha
- AZ names are randomly mapped per AWS account. Your us-east-1a may point to a different physical data centre than someone else's us-east-1a. This is intentional to spread load.
- Each subnet lives in exactly one AZ. To span AZs, create one subnet per AZ.
Edge Locations
What
Edge locations are AWS Points of Presence (PoPs): servers positioned close to end users worldwide. There are 400+ edge locations globally, far more than the number of regions. They are used mainly by CloudFront (CDN) and Route 53 (DNS), along with a few other edge services such as AWS Global Accelerator.
Why
When a user in Tokyo requests content from your S3 bucket in us-east-1, CloudFront serves it from the nearest Tokyo edge location instead — dramatically reducing latency.
Gotcha
- Edge locations are not for general-purpose compute. You cannot run EC2 instances or containers here. Their main job is to cache CloudFront content and answer Route 53 DNS queries.
Networking
VPC (Virtual Private Cloud)
What
A VPC is your own isolated virtual network within AWS. You define an IP address range using CIDR notation (e.g., 10.0.0.0/16 gives you 65,536 IP addresses), then carve it into subnets, configure routing, and attach gateways. A VPC is regional: it spans all AZs in a region.
Why
Without a VPC, your resources would share a flat network with no isolation. The VPC is the foundation of AWS network security: EC2 instances, RDS databases, and most other network-attached resources run inside one.
Gotcha
- Every AWS account has a default VPC (172.31.0.0/16) in every region. Avoid deleting it casually; if you do, you can recreate it from the VPC console or with aws ec2 create-default-vpc.
- For production and learning, always create a custom VPC. The default VPC has public subnets by default, which is not ideal for production.
- Two VPCs that need to communicate via VPC Peering cannot have overlapping CIDR ranges. Plan CIDR blocks upfront.
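The CIDR arithmetic and the peering overlap rule above can be checked locally before you create anything in AWS. The following is a minimal sketch using Python's standard ipaddress module; the address plan and the can_peer helper are hypothetical names chosen for illustration, not part of any AWS tooling.

```python
import ipaddress

# Hypothetical address plan: one VPC carved into /24 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)  # 65536

# Split the /16 into /24 blocks and keep the first four as subnets.
subnets = list(vpc.subnets(new_prefix=24))[:4]
for subnet in subnets:
    # AWS reserves 5 addresses in every subnet, so usable = total - 5.
    print(subnet, subnet.num_addresses, "total,", subnet.num_addresses - 5, "usable")


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """Return True when two VPC CIDR blocks do not overlap."""
    a = ipaddress.ip_network(cidr_a)
    b = ipaddress.ip_network(cidr_b)
    return not a.overlaps(b)


print(can_peer("10.0.0.0/16", "10.1.0.0/16"))    # True
print(can_peer("10.0.0.0/16", "10.0.128.0/17"))  # False
```

Running a check like this against every VPC you plan to create is a cheap way to catch an overlap before it blocks a peering connection later.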
Public Subnet
What
A subnet whose associated route table has a route directing internet-bound traffic (0.0.0.0/0) to an Internet Gateway. Resources in this subnet with a public IP can send and receive internet traffic.
Why
Internet-facing resources belong here: Application Load Balancers, NAT Gateways, and bastion hosts. You generally do not put application servers or databases here.
Gotcha
- Being in a public subnet does not automatically expose a resource. The resource also needs a public IP address, and the Security Group must allow the traffic.
Private Subnet
What
A subnet with no direct route to the internet. There is no 0.0.0.0/0 → IGW route in its route table. Resources here cannot receive inbound connections from the internet. Outbound internet access (e.g., for OS updates) goes through a NAT Gateway that sits in a public subnet.
Why
Defense in depth. Your RDS database or internal service belongs here. Even if a Security Group rule is accidentally misconfigured, there is no network path from the internet to the resource.
Gotcha
- Private subnet resources can still initiate outbound connections (to download packages, call external APIs) via NAT Gateway. NAT is outbound-only — it does not allow inbound connections from the internet.
Route Tables
What
A route table is a set of rules that tells the VPC where to send network traffic. Every subnet is associated with exactly one route table. A route of 0.0.0.0/0 → igw-xxx means: "send all internet-bound traffic to the Internet Gateway."
Why
The route table is what makes a subnet "public" or "private." Adding the IGW route is the single configuration change that gives a subnet internet access.
Gotcha
- The local route (e.g., 10.0.0.0/16 → local) is automatically added and cannot be removed. This allows all resources within the VPC to communicate with each other.
- Multiple subnets can share one route table, but a subnet can only be associated with one route table at a time.
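To see why the local route and the 0.0.0.0/0 route can coexist, it helps to model route selection: the most specific matching route (longest prefix) wins. The sketch below is a simplified model in Python, not AWS's implementation; the route lists and the pick_target helper are hypothetical, and igw-xxx is the placeholder ID used above.

```python
import ipaddress

# Hypothetical route tables: (destination CIDR, target).
PUBLIC_ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "igw-xxx"),
]
PRIVATE_ROUTES_NO_NAT = [
    ("10.0.0.0/16", "local"),
]


def pick_target(routes, destination_ip):
    """Return the target of the most specific route matching the IP, or None."""
    ip = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no matching route: the traffic has nowhere to go
    # Longest prefix wins, so 10.0.0.0/16 beats 0.0.0.0/0 for VPC-internal IPs.
    return max(matches)[1]


print(pick_target(PUBLIC_ROUTES, "10.0.3.7"))               # local
print(pick_target(PUBLIC_ROUTES, "93.184.216.34"))          # igw-xxx
print(pick_target(PRIVATE_ROUTES_NO_NAT, "93.184.216.34"))  # None
```

The last call shows what "private" means in practice: with only the local route present, internet-bound traffic matches nothing.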
Internet Gateway (IGW)
What
A horizontally-scaled, redundant, highly-available gateway that enables communication between your VPC and the internet. You attach it to the VPC itself (not to a subnet). Then you reference it in a subnet's route table to make that subnet public.
Why
Without an IGW, your VPC is completely isolated. Nothing can reach the internet, and the internet cannot reach anything in the VPC.
Gotcha
- One VPC can only have one IGW attached at a time.
- The IGW itself has no cost — charges come from data transfer through it.
NAT Gateway
What
A managed AWS service that allows resources in private subnets to initiate outbound internet connections (OS patches, external API calls) while blocking unsolicited inbound connections. The NAT Gateway itself lives in a public subnet and routes through the IGW. You put a route of 0.0.0.0/0 → nat-gw-xxx in the private subnet's route table.
Why
Your EC2 in a private subnet needs to run yum update or call a third-party API. NAT Gateway enables this without exposing the instance to inbound internet traffic.
Gotcha
- NAT Gateway costs money: ~$0.045/hr per gateway plus data processing charges. For a learning account, delete it when not in use to avoid surprise bills.
- For high availability, deploy one NAT Gateway per AZ and have each AZ's private subnets route to their local NAT Gateway. Routing all AZs through one NAT Gateway means an AZ failure takes down all outbound internet access.
Security (IAM)
IAM User
What
An IAM User represents a person or application that needs long-term AWS credentials. There are two credential types: a username + password for the AWS Console, and an Access Key ID + Secret Access Key for CLI/API access. The root account (your signup email) is not an IAM User — it has unrestricted access to everything.
Why
You should never use the root account for daily work. Create an IAM User with only the permissions needed. If credentials are compromised, you can disable the user without losing the account.
Gotcha
- Access Keys are long-term credentials. If they are leaked (e.g., committed to GitHub), rotate them immediately and assume they were used maliciously.
- Never embed Access Keys in source code or Docker images. Use IAM Roles for EC2/Lambda and environment variables or a secrets manager for external systems.
IAM Group
What
A collection of IAM Users. You attach permission policies to the group once, and all users in the group inherit those permissions. Groups cannot contain other groups — they are flat.
Why
Managing permissions at the group level avoids attaching the same policies to dozens of individual users. When a new developer joins, add them to the "Developers" group and they get all required permissions immediately.
Gotcha
- A user can belong to multiple groups. Their effective permissions are the union of all policies from all groups they belong to, plus any policies attached directly to the user.
IAM Role
What
An IAM Role provides temporary credentials (via AWS STS) to whoever assumes it: AWS services (EC2, Lambda, ECS tasks), other AWS accounts, or federated users. A role has two kinds of policy: a Trust Policy (who is allowed to assume the role) and one or more Permission Policies (what the role can do). EC2 instances use a wrapper called an Instance Profile to assume a role.
Why
Roles are the correct way for AWS services to access other AWS services. An EC2 instance running your Spring Boot app should have an IAM Role with S3 read permission — no embedded access keys required. The AWS SDK automatically picks up the temporary credentials from the EC2 instance metadata.
Gotcha
- Temporary credentials from a role automatically expire (15 minutes to 12 hours) and are automatically rotated. A leaked set of temporary credentials has a built-in time limit — far safer than long-term Access Keys.
- If you SSH into an EC2 and run aws s3 ls without configuring credentials, it works (provided the attached role grants S3 list permission) because the Instance Profile role is used automatically. This is the intended behaviour.
IAM Policies
What
JSON documents that define what is allowed or denied. Each policy contains one or more Statements. A Statement has: Effect (Allow or Deny), Action (e.g., s3:GetObject), and Resource (e.g., arn:aws:s3:::my-bucket/*). Policies are attached to Users, Groups, or Roles.
Why
Policies enforce principle of least privilege — give identities only the minimum permissions they need to do their job, nothing more.
Gotcha
- An explicit Deny always overrides an Allow, regardless of which policy it comes from. If any policy denies an action, that action is denied even if another policy allows it.
- AWS Managed Policies (like AmazonS3ReadOnlyAccess) are maintained by AWS. Prefer these for standard roles. Avoid using AdministratorAccess for anything other than your admin user.
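The "explicit Deny always wins" rule is easier to remember once you have traced it by hand. Below is a deliberately simplified model of policy evaluation in Python. It is a sketch under stated assumptions, not AWS's real evaluation logic: it ignores conditions, principals, permission boundaries, and SCPs, and it uses case-sensitive shell-style wildcard matching as a stand-in for IAM's matching. The is_allowed function and both policy documents are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches a pattern (or any pattern in a list)."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def is_allowed(policies, action, resource):
    """Simplified IAM decision: explicit Deny > Allow > implicit deny."""
    allowed = False
    for policy in policies:
        for stmt in policy["Statement"]:
            if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
                if stmt["Effect"] == "Deny":
                    return False  # an explicit Deny ends evaluation immediately
                allowed = True
    return allowed  # False here means nothing allowed it (implicit deny)


# Hypothetical policies attached to the same identity.
read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::my-bucket/*"}
    ],
}
deny_secret_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::my-bucket/secret/*"}
    ],
}

policies = [read_policy, deny_secret_policy]
print(is_allowed(policies, "s3:GetObject", "arn:aws:s3:::my-bucket/report.csv"))       # True
print(is_allowed(policies, "s3:GetObject", "arn:aws:s3:::my-bucket/secret/keys.txt"))  # False
print(is_allowed(policies, "s3:PutObject", "arn:aws:s3:::my-bucket/report.csv"))       # False
```

The second call is denied even though read_policy allows it, and the third is denied simply because no statement allows it. Those are the two different kinds of "no" an interviewer may ask you to distinguish.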
STS (Security Token Service) & AssumeRole
What
AWS STS issues temporary security credentials. When EC2 uses an Instance Profile, when a Lambda function runs under its execution role, or when one account accesses another account's resources, STS is what generates the short-lived Access Key + Secret Key + Session Token combination behind the scenes. The core API call is sts:AssumeRole: a principal requests temporary credentials to act as a different role.
Why
Cross-account access, federated identity (SSO), and service-to-service calls all flow through STS. A senior engineer at Walmart or Apple needs to understand: how a Kafka consumer in Account A accesses S3 in Account B, and how the Trust Policy on the target role controls who may assume it.
Gotcha
- Temporary credentials from STS expire (15 min–12 hrs depending on role session duration). The AWS SDK's credential provider chain automatically refreshes them before expiry — your application code never needs to handle credential rotation.
- The External ID is an optional parameter used in cross-account scenarios to prevent the "confused deputy" problem — where a malicious actor tricks a trusted service into assuming a role on their behalf. When a third party (e.g., a vendor) assumes a role in your account, always require an External ID in the trust policy.
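For reference, this is roughly what a trust policy that requires an External ID looks like. The sketch builds the JSON document in Python so the structure is easy to inspect; build_trust_policy is a hypothetical helper, and the account ID and External ID values are placeholders, not real identifiers.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Trust policy letting another account assume the role only with the External ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Who may assume the role (placeholder vendor account).
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The caller must pass this exact External ID in the AssumeRole request.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


policy = build_trust_policy("111122223333", "example-external-id")
print(json.dumps(policy, indent=2))
```

Without the Condition block, any principal in the trusted account that is permitted to call sts:AssumeRole on this role could assume it, which is exactly the gap the confused deputy problem exploits.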
VPC Endpoints (Gateway & Interface)
What
VPC Endpoints let resources in your private subnet talk to AWS services (S3, DynamoDB, Secrets Manager, ECR, etc.) without going through the internet or NAT Gateway. Two types: Gateway Endpoints (only S3 and DynamoDB — free, added as a route table entry) and Interface Endpoints (most other services — powered by AWS PrivateLink, creates an ENI in your subnet, costs ~$0.01/hr per AZ).
Why
Cost and security. If a private-subnet EC2 reads from S3 without a Gateway Endpoint, traffic goes: EC2 → NAT Gateway ($0.045/hr + $0.045/GB) → Internet → S3. With a Gateway Endpoint: EC2 → S3 directly, no NAT cost, traffic stays on the AWS network. For teams with Kafka on AWS or Spark jobs writing to S3, this difference is significant.
Gotcha
- Add S3 Gateway Endpoint to every VPC immediately — it is free and private-subnet instances accessing S3 without it pay unnecessary NAT costs.
- Interface Endpoints use DNS resolution: when private DNS is enabled, the SDK resolves secretsmanager.us-east-1.amazonaws.com to a private IP inside your VPC instead of the public IP. Confirm enableDnsHostnames and enableDnsSupport are enabled on your VPC, or the private DNS won't resolve.
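To put numbers on the NAT cost argument, here is a small worked calculation. It is a rough sketch: the rates are the approximate figures quoted above (they vary by region and change over time), the 730-hour month is an assumption, and both function names are hypothetical.

```python
# Approximate rates quoted in this section; check current pricing for your region.
NAT_HOURLY_USD = 0.045
NAT_PER_GB_USD = 0.045
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def nat_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    """Hourly charge for each gateway plus the per-GB data processing charge."""
    hourly = NAT_HOURLY_USD * HOURS_PER_MONTH * gateways
    processing = NAT_PER_GB_USD * gb_processed
    return round(hourly + processing, 2)


def s3_gateway_endpoint_savings(s3_gb: float) -> float:
    """NAT processing charge avoided when S3 traffic uses a Gateway Endpoint."""
    return round(NAT_PER_GB_USD * s3_gb, 2)


# Example: 1,000 GB per month of S3 traffic from a private subnet.
print(nat_monthly_cost(1000))             # 77.85
print(s3_gateway_endpoint_savings(1000))  # 45.0
```

In this example more than half of the NAT bill is S3 data processing that a free Gateway Endpoint would remove.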
Security Groups
What
Virtual firewalls at the resource level (EC2 instances, RDS databases, ECS tasks, Lambda in VPC). Rules are stateful: if you allow inbound TCP port 8080, the response traffic is automatically allowed without a separate outbound rule. You can only add Allow rules — there is no explicit Deny at the Security Group level.
Why
Security Groups are your primary tool for controlling which traffic can reach which resource. Every resource in a VPC has at least one Security Group.
Gotcha
- Security Groups can reference other Security Groups as the traffic source. Example: allow port 5432 from "App-SG" — any EC2 in that Security Group can connect to RDS, regardless of its IP address. This is better than using IP-based rules.
- Security Group changes take effect immediately. No instance restart needed.
NACLs (Network Access Control Lists)
What
Stateless firewalls at the subnet level. Unlike Security Groups, NACLs: are stateless (you must explicitly allow both inbound AND return outbound traffic), support both Allow and Deny rules, and evaluate rules in numeric order (lowest number first, first match wins).
Why
NACLs are a second firewall layer at the subnet boundary. Useful for blanket-blocking a specific IP range across all resources in a subnet — something you cannot do with Security Groups (which only allow, never deny).
Gotcha
- Stateless means: if you allow inbound port 80, you must also allow outbound ephemeral ports 1024–65535 so the HTTP response can leave the subnet. Forgetting this is a classic misconfiguration.
- The default NACL allows all traffic in both directions. If you create a custom NACL, it denies everything by default — you must add explicit allow rules.
- In practice, most teams keep the default NACLs and rely on Security Groups. Know the difference for interviews but don't over-engineer NACLs in practice.
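The ordering and statelessness rules are easy to mix up, so here is a toy model of NACL evaluation. It is a simplified sketch: real NACL rules also match on protocol and CIDR range, which this version ignores. The NaclRule class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    port_from: int
    port_to: int
    action: str  # "allow" or "deny"


def evaluate(rules, port):
    """Check rules in ascending rule-number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"  # nothing matched: the implicit final rule denies


def round_trip_allowed(inbound, outbound, server_port, client_port):
    """Stateless: the request and the response are evaluated independently."""
    request_ok = evaluate(inbound, server_port) == "allow"
    response_ok = evaluate(outbound, client_port) == "allow"
    return request_ok and response_ok


inbound = [NaclRule(100, 80, 80, "allow")]
outbound_missing = []  # forgot the ephemeral port rule
outbound_ok = [NaclRule(100, 1024, 65535, "allow")]

print(round_trip_allowed(inbound, outbound_ok, 80, 50000))       # True
print(round_trip_allowed(inbound, outbound_missing, 80, 50000))  # False

# First match wins: rule 90 denies SSH before rule 100 can allow it.
ordered = [NaclRule(100, 0, 65535, "allow"), NaclRule(90, 22, 22, "deny")]
print(evaluate(ordered, 22))   # deny
print(evaluate(ordered, 443))  # allow
```

A Security Group would need only the inbound rule for the first scenario, because it tracks the connection and lets the response out automatically.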
Hands-on Tasks
Interview Q&A
What is a VPC and why does every AWS deployment need one?
A VPC (Virtual Private Cloud) is your own isolated virtual network within AWS. You define the IP address range using CIDR notation (e.g., 10.0.0.0/16), create subnets within that range, configure route tables to control traffic flow, and attach gateways to connect to the internet or other networks. Network-attached resources you launch in AWS, such as EC2 instances, RDS databases, and VPC-attached Lambda functions, run inside a VPC. It is the networking foundation that provides isolation, routing control, and the ability to define security boundaries.
Why use private subnets? What does it actually protect against?
Private subnets have no direct route to the internet: there is no 0.0.0.0/0 → IGW entry in their route table. A database or internal service placed in a private subnet is network-unreachable from the internet, even if you accidentally open a Security Group rule. This is defense in depth: you are not relying on a single misconfigurable firewall rule to protect sensitive resources. The internet has no network path to the resource, period. In contrast, a public subnet resource could be exposed if a Security Group is misconfigured; private subnets eliminate that risk.
What is the difference between Security Groups and NACLs?
Security Groups are stateful firewalls at the resource level (EC2, RDS, ECS task). Stateful means: allow inbound port 8080 → the response traffic is automatically allowed. Security Groups support Allow rules only. They are the primary tool teams use.
NACLs are stateless firewalls at the subnet level. Stateless means: you must explicitly allow both inbound request traffic AND outbound response traffic (on ephemeral ports 1024–65535). NACLs support both Allow and Deny rules, evaluated in numeric order. They act as a second layer of defense at the subnet boundary.
In practice: customise Security Groups for every resource. NACLs are usually left at default (allow all) unless you need to block a specific IP range at the subnet level.
Why should databases never be publicly accessible?
A publicly accessible database is reachable from every IP address on the internet. This exposes it to: automated credential brute-force attacks, exploitation of known database engine vulnerabilities (CVEs), and accidental data exposure if credentials are weak or reused. Databases should only be reachable from your application tier (EC2, ECS, Lambda) within the same VPC, enforced by both private subnet placement (no internet route) and a Security Group that only allows connections from the application's Security Group on the DB port. There is no legitimate reason for a production database to have a public IP or be in a public subnet.
What is the difference between an IAM User and an IAM Role?
IAM User: long-term credentials. A person (or application) with a fixed username/password and/or Access Key ID + Secret. The credentials persist until explicitly rotated or deleted. If leaked, they remain valid indefinitely until you act.
IAM Role: temporary credentials issued by AWS STS when the role is assumed. They have an expiry (15 minutes to 12 hours) and are automatically renewed by the AWS SDK. Roles are assumed by AWS services (EC2 Instance Profile, Lambda execution role, ECS task role) or by federated users.
Best practice: EC2 instances should use an IAM Role via Instance Profile — never embed Access Keys in code or config files. The AWS SDK automatically uses the role credentials from the EC2 instance metadata endpoint. If the instance is compromised, the temporary credentials expire on their own.
Phase 2: Deploy Spring Boot on EC2
Topics
EC2 Core Concepts
EC2 (Elastic Compute Cloud)
What
Virtual machines in AWS. You choose an instance type (determines CPU and RAM), an AMI (the OS), storage (EBS), and which VPC/subnet to place it in. You manage everything from the OS upward — AWS manages the physical hardware underneath.
Why
EC2 is the most direct way to run a server in AWS. Full SSH access, install anything, configure however you want. The right mental model before moving to containers or serverless.
Gotcha
- Instance type naming: t3.micro, where t = burstable family, 3 = generation, micro = size. Free tier: t3.micro (2 vCPU, 1 GB RAM, 750 hrs/month). Common families: t (burstable), m (general), c (compute-optimised), r (memory-optimised).
- Burstable instances (t-series) earn CPU credits at idle and spend them under load. In standard mode, if credits run out, CPU is throttled to a baseline rate; T3 instances default to unlimited mode, which bills for the extra CPU instead of throttling. Fine for dev; watch out for sustained load in production.
- Stop ≠ Terminate. Stopped: instance paused, EBS persists, compute billing stops (EBS still billed). Terminated: instance deleted, EBS deleted by default. You cannot un-terminate.
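If the naming scheme feels arbitrary, decoding a few instance types by hand helps it stick. The sketch below parses common names with a regular expression. It is a simplification: parse_instance_type and the FAMILIES table are hypothetical, only the four families listed above are described, and unusual names that do not follow the family-generation-size pattern are rejected.

```python
import re

# The four families named above; anything else is reported as "other".
FAMILIES = {
    "t": "burstable",
    "m": "general",
    "c": "compute-optimised",
    "r": "memory-optimised",
}

# family letters, generation digits, optional attribute suffix, then ".size"
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z0-9-]*)\.([a-z0-9-]+)$")


def parse_instance_type(name: str) -> dict:
    """Split a name such as 't3.micro' into family, generation, and size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return {
        "family": family,
        "family_description": FAMILIES.get(family, "other"),
        "generation": int(generation),
        "attributes": attributes,  # e.g. 'g' in c6g; empty for t3
        "size": size,
    }


print(parse_instance_type("t3.micro"))
print(parse_instance_type("m5.large"))
print(parse_instance_type("c6g.xlarge"))
```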
AMI (Amazon Machine Image)
What
A pre-built OS image used to launch EC2 instances. Think of it as a disk snapshot that becomes your instance's root volume. AWS provides AMIs for Amazon Linux 2023, Ubuntu, Windows Server, and more. You can create custom AMIs from your own configured instances.
Why
Every EC2 instance starts from an AMI. The AMI determines the OS, pre-installed packages, and starting configuration. Custom AMIs let you bake in your Java runtime and config so new instances are ready immediately.
Gotcha
- AMIs are region-specific. An AMI from us-east-1 cannot be directly used in eu-west-1 — you must copy it first.
- Use Amazon Linux 2023 (AL2023), not Amazon Linux 2, which is approaching end-of-life. AL2023 uses dnf as its package manager (though yum still works as an alias).
EBS (Elastic Block Store)
What
Network-attached persistent storage for EC2. Every instance has a root EBS volume (the OS disk). Default root volume for AL2023 is 8 GB (gp3 type). EBS volumes can be detached from one instance and attached to another.
Why
Data written to EBS persists when you stop and restart the instance — unlike instance store (ephemeral) storage which is wiped. Your Spring Boot JAR and logs live on EBS.
Gotcha
- By default the root EBS volume is deleted when the instance is terminated. Change "Delete on Termination" to false if you want to keep the volume after termination.
- EBS volumes are AZ-specific. An EBS volume in us-east-1a cannot be attached to an instance in us-east-1b.
- You are billed for EBS storage even while the instance is stopped.
SSH Key Pairs
What
When launching an EC2 instance, AWS places the public key on the instance (in ~/.ssh/authorized_keys for the default user). You download the private key file (.pem) once. This is the only way to SSH in: there is no password login by default.
Why
SSH key auth is more secure than passwords. The private key never leaves your machine. AWS never stores it: if you lose the .pem file, the key cannot be recovered.
Gotcha
- Set permissions immediately after download: chmod 400 key.pem. Without this, SSH refuses to use the key: "WARNING: UNPROTECTED PRIVATE KEY FILE".
- Default usernames: ec2-user (Amazon Linux), ubuntu (Ubuntu), admin (Debian). Not root.
- Connect command: ssh -i /path/to/key.pem ec2-user@<public-ip>
Elastic IP
What
A static public IPv4 address you allocate to your account and associate with an EC2 instance. By default, a public IP assigned on launch changes every time the instance is stopped and restarted. An Elastic IP stays fixed regardless of instance state.
Why
Useful when you need a predictable IP — e.g., if you whitelist it in a firewall rule or point a DNS A record to it.
Gotcha
- Elastic IPs are no longer free. Since February 2024, AWS charges about $0.005/hr for every public IPv4 address, including an Elastic IP, whether it is attached to a running instance or sitting idle. Release IPs you're not using.
Linux & Deployment
Linux Administration on Amazon Linux 2023
What
AL2023 uses dnf (RPM-based, like RHEL/Fedora). AWS provides Amazon Corretto, a free, production-ready OpenJDK build, in the AL2023 package repos.
Why
You need to install Java, Git, and other dependencies before deploying your app. Knowing basic Linux commands is essential for EC2 work.
Key commands
- Install Java 17: sudo dnf install java-17-amazon-corretto -y
- Install Git: sudo dnf install git -y
- Verify Java: java -version → should show Corretto-17.x.x
- Copy JAR from laptop: scp -i key.pem app.jar ec2-user@<ip>:~/
- Check disk space: df -h · Check memory: free -h · Check processes: ps aux | grep java
Running Spring Boot as a Background Service (systemd)
What
An app started directly with java -jar app.jar dies when your SSH session closes. systemd is the Linux init system that manages long-running services, starting them on boot and restarting them on failure.
Why
For any persistent deployment: your app must survive reboots and SSH disconnects, and recover automatically from crashes.
Service file
Create /etc/systemd/system/springapp.service:
[Unit]
Description=Spring Boot Application
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/usr/bin/java \
  -Xms512m -Xmx1g \
  -XX:+UseG1GC \
  -Dspring.application.name=springapp \
  -jar /home/ec2-user/app.jar
SuccessExitStatus=143
Restart=on-failure
RestartSec=10
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

Then: sudo systemctl daemon-reload && sudo systemctl enable springapp && sudo systemctl start springapp
Check logs: sudo journalctl -u springapp -f
Also add to application.properties: server.shutdown=graceful and spring.lifecycle.timeout-per-shutdown-phase=30s. This lets in-flight requests complete during a deploy or ALB deregistration before the JVM exits.
Security Groups for a Spring Boot App
What
For a Spring Boot app exposed directly on EC2, the Security Group needs: inbound SSH (port 22) from your IP, inbound app traffic (port 8080) from wherever users connect, all outbound open. Spring Boot's default port is 8080; change with server.port in application.properties.
Why
Without adding port 8080, your app runs correctly on the instance but is completely unreachable from outside — the Security Group silently blocks it.
Gotcha
- Never open SSH (port 22) to 0.0.0.0/0. Use "My IP" in the console. Bots scan for open SSH ports constantly.
- Opening port 8080 to 0.0.0.0/0 is acceptable for initial testing, but the tasks below will have you lock this down behind an ALB, which is how production always looks.
Application Load Balancer (ALB)
ALB, Target Groups, and Listeners
What
An Application Load Balancer sits in public subnets and distributes incoming HTTP/HTTPS traffic to backend targets (EC2 instances, ECS tasks, Lambda). It operates at Layer 7 — it can route requests based on URL path, hostname, and HTTP headers. Key components: Listener (port + protocol the ALB listens on), Rules (how traffic is routed), Target Group (the set of backends with health checks).
Why
In production, application servers should never be directly internet-facing. The ALB lives in the public subnet; your EC2 instances live behind it. The ALB handles health checks (removing unhealthy instances from rotation automatically), SSL termination, and is the attachment point for WAF, access logs, and sticky sessions.
Gotcha
- An ALB requires subnets in at least two AZs. Even with one EC2 target, you must specify two subnets in different AZs when creating the ALB.
- Health check path matters. If your Spring Boot app has Spring Actuator enabled, use /actuator/health. If the path returns anything other than the expected success code (HTTP 200 by default), targets are marked unhealthy and receive no traffic, so your app appears "down" even though it's running.
- The ALB has its own Security Group. Pattern: ALB SG allows 80/443 inbound from internet. EC2 SG allows 8080 inbound only from the ALB SG, not from 0.0.0.0/0. This is the correct production security model.
- Startup timing trap: If Spring Boot takes 15–20 seconds to start and the ALB health check fires every 5 seconds with an unhealthy threshold of 2, the target is marked unhealthy during boot and receives no traffic, even though the app is fine. ALB target groups have no initialDelaySeconds setting (that is a Kubernetes probe field). Instead, tune the health check interval and thresholds, and if the instance is managed by an Auto Scaling group or an ECS service, set its health check grace period to exceed your app's startup time. Check the ALB target health status when a deployment stalls.
- Graceful shutdown: Set the ALB's deregistration delay (default 300 seconds) appropriately. It is how long the ALB waits for in-flight requests to complete on a deregistering target before cutting it off; no new requests are sent to that target. Your Spring Boot app must have server.shutdown=graceful and a timeout that matches. Without this, rolling deploys cause 502s for active users.
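The startup timing trap comes down to simple arithmetic: a target is marked unhealthy after roughly interval × unhealthy threshold seconds of failed checks. The sketch below is a rough model only. It ignores the health check timeout and the exact moment the first check fires, and both function names are hypothetical.

```python
def seconds_until_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time of consecutive failed checks before a target is unhealthy."""
    return interval_s * unhealthy_threshold


def marked_unhealthy_during_boot(startup_s: int, interval_s: int, unhealthy_threshold: int) -> bool:
    """True if the app is still starting when the failure threshold is reached."""
    return startup_s > seconds_until_unhealthy(interval_s, unhealthy_threshold)


# The scenario above: 20 s startup, checks every 5 s, threshold of 2 failures.
print(seconds_until_unhealthy(5, 2))           # 10
print(marked_unhealthy_during_boot(20, 5, 2))  # True

# A gentler setting: checks every 15 s, threshold of 3 failures.
print(seconds_until_unhealthy(15, 3))           # 45
print(marked_unhealthy_during_boot(20, 15, 3))  # False
```

Measure your app's real startup time on the target instance type before choosing these values, since a small instance can start noticeably slower than your laptop.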
HTTPS with ACM (AWS Certificate Manager)
What
ACM provides free SSL/TLS certificates for use with AWS services (ALB, CloudFront, API Gateway). You request a certificate for your domain, verify ownership via DNS (recommended) or email, and attach it to an ALB HTTPS listener. ACM auto-renews certificates — no manual renewal or private key management required.
Why
All production traffic must be HTTPS. The ALB terminates TLS — your EC2 and Spring Boot app receive plain HTTP internally on port 8080, and the ALB handles the encryption layer. You do not need to configure SSL in Spring Boot or install certificates on EC2.
Gotcha
- ACM certificates are free when attached to ALB, CloudFront, or API Gateway. These standard public certificates cannot be exported or downloaded for use on your own EC2 server.
- DNS validation is preferred: it adds a CNAME record to your domain's DNS and auto-renews without human action. Email validation requires you to click a link before the cert expires.
- If you do not own a domain yet, skip the HTTPS task and use the ALB's default DNS name (my-alb-123.us-east-1.elb.amazonaws.com) with HTTP for now. The ALB + Target Group pattern is what matters here.
Hands-on Tasks
Interview Q&A
Walk me through provisioning an EC2 instance for a Spring Boot application.
Start by choosing an AMI: Amazon Linux 2023 for a Java app. Pick the instance type based on expected load; t3.micro for dev/learning, t3.small or m5.large for production. Place it in a public subnet (if internet-facing) within your VPC. Create or assign an SSH key pair. Configure a Security Group: SSH on port 22 from your IP, app port 8080 from the internet or from an ALB's Security Group. Attach an IAM Role if the app needs to access other AWS services (S3, Secrets Manager). After launch, SSH in, install Java (Amazon Corretto), copy the JAR via SCP, and run it as a systemd service.
What's the difference between stopping and terminating an EC2 instance?
Stopping an instance pauses it. The EBS root volume is preserved and the instance can be restarted. Compute billing stops, but EBS storage billing continues. The public IP (if not an Elastic IP) is released and a new one is assigned on next start.
Terminating an instance deletes it. By default, the root EBS volume is also deleted ("Delete on Termination" = true). This action is irreversible — a terminated instance cannot be recovered. If you need the data, create a snapshot or change the Delete on Termination setting before terminating.
How do you run a Spring Boot app on EC2 so it starts automatically and restarts on failure?
Use systemd, the Linux service manager. Create a unit file at /etc/systemd/system/springapp.service that defines the app's start command, working directory, user, and restart policy (Restart=on-failure). Run sudo systemctl enable springapp to register it for automatic start on boot, and sudo systemctl start springapp to start it now. View logs with sudo journalctl -u springapp -f.
A quick alternative is nohup java -jar app.jar > app.log 2>&1 &, but systemd is correct for anything production-like because it handles restarts and boot integration.
What Security Group rules does a Spring Boot app on EC2 need?
Minimum rules: inbound SSH (port 22) restricted to your IP (never 0.0.0.0/0 in production), and inbound TCP on port 8080 (Spring Boot's default) from wherever users connect. Leave outbound traffic fully open so the instance can download packages, call external APIs, etc.
In a production setup with an Application Load Balancer in front, the EC2 Security Group should allow port 8080 only from the ALB's Security Group — not from the internet directly. This way, all traffic is routed through the ALB, which handles HTTPS termination and health checks.
What is an AMI and when would you create a custom one?
An AMI (Amazon Machine Image) is a snapshot of an EC2 instance's state that you use to launch new instances. It includes the OS, installed packages, and configuration baked in at the time the AMI was created.
You'd create a custom AMI when you want new instances to start already configured — for example, with Java, your application dependencies, and monitoring agents pre-installed. This speeds up Auto Scaling: new instances are ready in seconds rather than running a lengthy bootstrap script on every launch. Custom AMIs are also used for Golden Image pipelines — a security-hardened base image that teams build their application AMIs from.
3
RDS Integration
Topics
RDS Core Concepts
RDS (Relational Database Service)
What A managed database service. AWS handles: OS patching, DB software installation and upgrades, automated backups, Multi-AZ failover, and storage scaling. You manage only the schema and queries. Supports PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora.
Why Running PostgreSQL on a plain EC2 means you own backups, patching, HA, and failover yourself. In production, that operational burden is enormous. RDS trades cost for time.
Gotcha
- db.t3.micro is free tier eligible (750 hrs/month for MySQL, PostgreSQL, or MariaDB, but not Aurora). Good enough for this phase.
- RDS must go in private subnets. It should never have a public IP. Access it only from within the VPC via Security Groups.
- The RDS endpoint looks like my-db.abc123.us-east-1.rds.amazonaws.com. Use this as your JDBC host.
Aurora vs Standard RDS
What Aurora is AWS's cloud-native relational database. It is MySQL- and PostgreSQL-compatible but has a fundamentally different architecture: storage is separated from compute and automatically replicated 6 ways across 3 AZs. Aurora offers: faster failover (typically around 30 seconds vs 60–120 for Multi-AZ RDS), up to 15 low-lag read replicas that share the cluster storage (standard RDS replicas each keep their own copy of the data; RDS allows up to 15 for MySQL, MariaDB, and PostgreSQL and up to 5 for Oracle and SQL Server), auto-scaling storage (up to 128 TB), and Global Database (low-latency reads from secondary regions). Aurora Serverless v2 auto-scales the compute tier in response to load.
Why For production systems requiring sub-30-second failover, many read replicas, or a massive single database, Aurora is the better choice despite being ~20% more expensive than RDS. Standard RDS is the right choice for dev/learning and smaller production workloads.
Gotcha
- Aurora is not free tier eligible. Standard RDS PostgreSQL is — use it for this phase's hands-on work.
- Aurora has its own endpoint types: cluster endpoint (always points to the primary, use for writes), reader endpoint (load-balances across read replicas). Your app should use both if running read replicas.
RDS Proxy
What RDS Proxy is a fully managed connection pooler that sits between your application and RDS. It maintains a pool of long-lived connections to the database and multiplexes thousands of application-side connections into far fewer database connections. Critical for Lambda → RDS scenarios: each concurrent Lambda execution environment opens its own DB connection, and with 1000 concurrent Lambdas you'd exhaust PostgreSQL's default 100-connection limit immediately.
Why PostgreSQL has a max connection limit (usually 100–400 depending on instance size). Serverless architectures (Lambda) and container scaling (ECS/EKS with many pods) can each open separate connections and exhaust the limit, causing errors such as FATAL: sorry, too many clients already. RDS Proxy solves this without you changing application code: just point the JDBC URL at the proxy endpoint instead of the DB endpoint.
Gotcha
- RDS Proxy is billed per vCPU-hour of the database instance it fronts (~$0.015/vCPU/hour). Worth it in Lambda-heavy architectures; overkill for a single EC2 app with HikariCP.
- RDS Proxy itself always reads the database credentials from Secrets Manager. Clients can authenticate to the proxy either with the database username and password or with IAM authentication. Requiring IAM authentication improves security because no password appears in the JDBC string.
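The connection arithmetic behind the RDS Proxy argument can be sketched in a few lines. This is only an illustration: the class and method names are hypothetical, and the numbers in the usage note come from the Lambda example above.

```java
/** Hypothetical helper: compares the connections clients may open with the database limit. */
final class ConnectionBudget {

    /** Worst case: every client holds its full pool open at the same time. */
    static long totalConnections(int clients, int connectionsPerClient) {
        if (clients < 0 || connectionsPerClient < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        return (long) clients * connectionsPerClient;
    }

    /** True when the worst case would exceed the database's max connection setting. */
    static boolean exceedsLimit(int clients, int connectionsPerClient, int maxConnections) {
        return totalConnections(clients, connectionsPerClient) > maxConnections;
    }
}
```

With 1000 concurrent Lambdas holding one connection each against a 100-connection limit, exceedsLimit(1000, 1, 100) is true, while four EC2 instances with a pool of 10 each stay within it.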
Multi-AZ Deployment
What RDS maintains a synchronous standby replica in a different AZ. Every write to the primary is simultaneously written to the standby. If the primary fails, RDS automatically promotes the standby. The CNAME endpoint automatically points to the new primary. Typical failover time: 60–120 seconds.
Why Protects against an AZ outage or hardware failure. Without Multi-AZ, a failed DB instance means downtime until you restore from backup.
Gotcha
- The standby replica is not readable. It exists purely for failover. You cannot send read traffic to it to reduce load on the primary — that is what Read Replicas are for.
- Multi-AZ roughly doubles the RDS cost. In dev/learning environments, leave it off.
Read Replicas
What Asynchronous copies of the primary RDS instance. You can create up to 15 read replicas per source instance for MySQL, MariaDB, and PostgreSQL (up to 5 for Oracle and SQL Server). Your application explicitly directs read-heavy queries (reports, analytics) to a replica's endpoint, reducing load on the primary. Replicas can be in the same region, a different region, or promoted to a standalone DB.
Why Scale read throughput horizontally without upgrading the primary instance size.
Gotcha
- Replication is asynchronous — replicas may lag behind the primary by seconds. Never use a replica for anything requiring up-to-date data (e.g., immediately after a write).
- Multi-AZ ≠ Read Replica. Multi-AZ is for availability (automatic failover). Read Replicas are for scalability (read offloading). These are separate features and can both be enabled simultaneously.
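One way to respect replication lag in application code is to route reads to the primary for a short time after the same caller has written. The sketch below assumes a hypothetical ReplicaRouter class and a window you would tune yourself; it is not an RDS or Spring API.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: chooses which RDS endpoint a query should use. */
final class ReplicaRouter {
    private final String primaryEndpoint;
    private final String replicaEndpoint;
    private final Duration readYourWritesWindow;

    ReplicaRouter(String primaryEndpoint, String replicaEndpoint, Duration readYourWritesWindow) {
        this.primaryEndpoint = primaryEndpoint;
        this.replicaEndpoint = replicaEndpoint;
        this.readYourWritesWindow = readYourWritesWindow;
    }

    /**
     * Writes always go to the primary. Reads go to the replica unless the caller
     * wrote within the window, in which case the replica may still be lagging.
     * Pass null for lastWrite when the caller has not written anything.
     */
    String endpointFor(boolean isWrite, Instant lastWrite, Instant now) {
        if (isWrite) {
            return primaryEndpoint;
        }
        boolean wroteRecently = lastWrite != null
                && Duration.between(lastWrite, now).compareTo(readYourWritesWindow) < 0;
        return wroteRecently ? primaryEndpoint : replicaEndpoint;
    }
}
```

The window is a guess at worst-case lag, not a guarantee. For data that must be current, reading from the primary is the safe choice.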
Automated Backups & Snapshots
What Automated backups: RDS takes a daily snapshot of the DB during the backup window, plus continuously backs up transaction logs to S3 (not visible in your S3 bucket, because RDS manages it). This enables point-in-time recovery to any second within the retention window (1–35 days, default 7). Manual snapshots: you trigger these yourself and they persist until you delete them.
Why Automated backups are your safety net for data corruption and accidental deletion. Take a manual snapshot before any major schema migration.
Gotcha
- Automated backups are deleted when you delete the RDS instance (unless you take a final snapshot). Manual snapshots survive instance deletion.
- Restoring a snapshot creates a new RDS instance — it does not restore in place. Plan for the new endpoint in your app config.
AWS Secrets Manager
What A managed service for securely storing, rotating, and retrieving secrets (database passwords, API keys, connection strings). Secrets are encrypted with KMS. Your application retrieves the secret at runtime via the AWS SDK — no plaintext credentials in config files, environment variables hardcoded in systemd units, or Docker images.
Why The most common AWS security mistake: DB credentials hardcoded in application.properties and committed to Git. Secrets Manager solves this: your EC2 IAM Role has secretsmanager:GetSecretValue permission, the app fetches the secret at startup, and credentials never appear in source control or process environment dumps.
Retrieving a secret via AWS SDK v2
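import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueResponse;

// Uses the default credential provider chain (the IAM Role when running on EC2)
SecretsManagerClient client = SecretsManagerClient.create();
GetSecretValueResponse r = client.getSecretValue(
GetSecretValueRequest.builder()
.secretId("prod/myapp/db")
.build());
// r.secretString() → JSON: {"username":"dbadmin","password":"..."}
// Parse with ObjectMapper, inject into DataSource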
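Alternatively, Spring Cloud AWS (io.awspring.cloud:spring-cloud-aws-starter-secrets-manager) can load the secret as a property source: set spring.config.import=aws-secretsmanager:prod/myapp/db in application.properties, and each key of the secret's JSON (for example password) becomes a property you can reference as ${password}. No boilerplate SDK code is needed.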
Gotcha
- Secrets Manager costs $0.40 per secret per month + $0.05 per 10,000 API calls. Negligible at learning scale.
- IAM permission required: secretsmanager:GetSecretValue on the specific secret ARN, not on *.
- For local development, use environment variables or a local .env file (excluded from Git). Never configure Secrets Manager with hardcoded access keys locally, because that defeats the purpose.
Connecting Spring Boot to RDS
What From Spring Boot's perspective, RDS PostgreSQL is just a PostgreSQL instance. The JDBC URL uses the RDS endpoint. The Security Group on the RDS instance must allow port 5432 from the EC2 instance's Security Group.
Why Understanding the connectivity model (SG → SG, not IP → IP) is important for troubleshooting and for interviews.
Config in application.properties
spring.datasource.url=jdbc:postgresql://my-db.abc123.us-east-1.rds.amazonaws.com:5432/mydb?ssl=true&sslmode=require
spring.datasource.username=dbadmin
spring.datasource.password=${DB_PASSWORD}
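# Pool sizing formula: (core_count × 2) + effective_spindle_count
# For db.t3.micro (2 vCPU): (2 × 2) + 1 = 5 connections as a starting point.
# The value below is HikariCP's default of 10; lower it toward 5 if the database shows contention.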
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=2
spring.datasource.hikari.connection-timeout=3000
spring.datasource.hikari.idle-timeout=600000
Store the password in an environment variable or AWS Secrets Manager. Never hardcode it in source files.
ssl=true&sslmode=require: RDS supports TLS by default. Enforce it in the JDBC URL to encrypt credentials and data in transit.
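The pool sizing rule of thumb from the config comments is simple enough to check by hand. The sketch below just encodes it; the class name is hypothetical, and the result is a starting point to tune under load rather than a fixed answer.

```java
/** Sketch of the pool sizing rule of thumb: (core_count x 2) + effective_spindle_count. */
final class PoolSizing {

    static int recommendedPoolSize(int coreCount, int effectiveSpindleCount) {
        if (coreCount < 1 || effectiveSpindleCount < 0) {
            throw new IllegalArgumentException("coreCount must be >= 1 and spindles >= 0");
        }
        return coreCount * 2 + effectiveSpindleCount;
    }

    public static void main(String[] args) {
        // db.t3.micro: 2 vCPU, one effective spindle
        System.out.println(recommendedPoolSize(2, 1)); // prints 5
    }
}
```

The core count here refers to the database server, so several application instances sharing one database should divide this budget between them.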
Hands-on Tasks
Interview Q&A
Why use RDS instead of running PostgreSQL yourself on EC2?
With RDS, AWS handles automated backups, point-in-time recovery, OS and engine patching, Multi-AZ failover, and storage auto-scaling. On a self-managed EC2 PostgreSQL setup, your team owns all of that — which means writing backup scripts, managing cron jobs, handling failover manually, and staying on top of security patches. For most product teams, that operational burden is not the product they're building. RDS trades higher cost for lower operational overhead. The trade-off shifts when you have very specific PostgreSQL configuration requirements or extreme cost constraints at scale.
What's the difference between Multi-AZ and a Read Replica?
Multi-AZ is for availability. It maintains a synchronous standby replica in a different AZ. If the primary fails, RDS automatically fails over to the standby (60–120 seconds). The standby cannot serve read traffic — it exists solely for failover. Your app needs no changes; the endpoint CNAME switches automatically.
Read Replica is for scalability. It is an asynchronous copy of the primary. Your application must explicitly connect to the replica's separate endpoint for read queries. Replication lag means replicas may not have the most recent writes. Read Replicas can be promoted to standalone databases if needed.
You can (and often do) have both enabled at the same time on the same instance.
How do you connect Spring Boot to RDS securely?
Three layers: Network: RDS is in a private subnet with no public IP. The RDS Security Group allows port 5432 only from the application server's Security Group (not from an IP address range). Credentials: database username and password are stored in AWS Secrets Manager or SSM Parameter Store, not in application.properties or source code. The EC2 IAM Role grants permission to retrieve the secret at startup. Transport: enable SSL/TLS on the JDBC connection for encryption in transit (ssl=true&sslmode=require in the JDBC URL). RDS provides a CA certificate for verification.
What is your backup and disaster recovery strategy for RDS?
Three components: Automated backups with a 7-day retention window provide point-in-time recovery to any second within that window — covers data corruption and accidental deletion. Manual snapshots taken before schema migrations or major releases persist indefinitely and can be used to create a new instance if a migration goes wrong. Multi-AZ handles the infrastructure failure case — if the primary AZ goes down, failover happens automatically without restoring from backup.
For critical systems, also consider cross-region read replicas which can be promoted during a regional outage, giving a lower RTO than restoring from a cross-region snapshot copy.
4
S3 Storage
Topics
S3 Fundamentals
Buckets and Objects
What S3 is object storage, not a file system. You store objects (any file up to 5 TB) inside buckets. A bucket is a top-level container. Objects are identified by a key (a string like uploads/2024/report.pdf). There are no real folders, just naming conventions. S3 is designed for 99.999999999% (11 nines) durability by redundantly storing data across multiple AZs.
Why S3 is the standard for file/blob storage in AWS. Infinitely scalable, no capacity planning, pay only for what you store.
Gotcha
- Bucket names are globally unique across all AWS accounts worldwide. If someone else already has my-app-uploads, you cannot use it. Use a prefix like your company name or account ID.
- Bucket names must be 3–63 characters, lowercase, no underscores.
- Buckets are created in a specific region. Data stays in that region unless you explicitly replicate it.
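A quick local check can catch invalid bucket names before an API call fails. The sketch below covers only the common rules (length, allowed characters, start and end character, no adjacent dots, not shaped like an IP address); it is not a complete implementation of every S3 naming rule, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

/** Partial validator for general-purpose S3 bucket names. */
final class BucketNames {
    // 3 to 63 characters: lowercase letters, digits, dots, hyphens; must start and end with a letter or digit
    private static final Pattern ALLOWED = Pattern.compile("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$");
    // Names formatted like an IPv4 address are not allowed
    private static final Pattern IP_LIKE = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    static boolean isValid(String name) {
        return name != null
                && ALLOWED.matcher(name).matches()
                && !name.contains("..")
                && !IP_LIKE.matcher(name).matches();
    }
}
```

Passing this check says nothing about global uniqueness. Only the create-bucket call can tell you whether the name is already taken.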
Versioning
What When versioning is enabled on a bucket, every upload creates a new version of the object instead of overwriting the previous one. Each version has a unique version ID. Deleted objects get a delete marker — the object is not actually gone and can be recovered by removing the marker.
Why Protection against accidental overwrites and deletions. Essential for any bucket that stores important data. Also required for S3 replication.
Gotcha
- Once enabled, versioning cannot be fully disabled — only suspended. Suspended means new uploads no longer create versions, but existing versions are preserved.
- Old versions accumulate storage costs. Pair versioning with a lifecycle rule to expire non-current versions after N days.
Storage Classes
What S3 offers multiple storage tiers with different cost/access-speed trade-offs:
- Standard (~$0.023/GB): frequent access, lowest latency. Default.
- Standard-IA (Infrequent Access, ~$0.0125/GB + retrieval fee): for files accessed less than once a month.
- Glacier Instant Retrieval (~$0.004/GB): archival, millisecond retrieval.
- Glacier Flexible Retrieval (~$0.0036/GB): archival, minutes to hours retrieval.
- Glacier Deep Archive (~$0.00099/GB): cheapest, 12-hour retrieval. For compliance archives.
- Intelligent-Tiering: automatically moves objects between tiers based on access patterns. Monitoring fee per object.
Why Storage classes let you cut costs significantly for data that is not accessed frequently. A lifecycle policy can automate transitions.
Gotcha
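- Infrequent-access and archive classes have a minimum storage duration (30 days for Standard-IA, 90 days for Glacier Instant Retrieval and Glacier Flexible Retrieval, 180 days for Glacier Deep Archive). Objects deleted before the minimum are still billed for the full duration.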
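The effect of a minimum storage duration is easiest to see with numbers. This is a rough sketch under two assumptions: a month is treated as 30 days, and retrieval and request fees are ignored. The class and method names are hypothetical.

```java
/** Rough illustration of minimum-storage-duration billing. */
final class StorageBilling {

    /** An object deleted early is still billed for the minimum duration. */
    static int billedDays(int storedDays, int minimumDays) {
        return Math.max(storedDays, minimumDays);
    }

    /** Approximate storage charge in USD, treating a month as 30 days. */
    static double storageCharge(double gigabytes, double pricePerGbMonth,
                                int storedDays, int minimumDays) {
        return gigabytes * pricePerGbMonth * billedDays(storedDays, minimumDays) / 30.0;
    }
}
```

For example, 100 GB kept in Standard-IA for only 10 days is billed as if it had stayed 30 days: storageCharge(100, 0.0125, 10, 30) comes to about $1.25, the same as a full month.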
Lifecycle Policies
What Rules that automatically transition objects between storage classes or delete them after a set number of days. Example: transition to Standard-IA after 30 days → Glacier after 90 days → delete after 365 days. Applied at the bucket or prefix level.
Why Lifecycle policies are the hands-off way to manage storage costs. Without them, old objects accumulate in Standard storage indefinitely.
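The example rule above (Standard-IA after 30 days, Glacier after 90, delete after 365) can be read as a simple function of object age. The sketch below models only that example with a hypothetical enum. It is not how lifecycle rules are configured, and S3 applies rules asynchronously, so real transitions can lag the exact day.

```java
/** Models the example lifecycle rule as a function of object age in days. */
final class LifecycleExample {

    enum State { STANDARD, STANDARD_IA, GLACIER, EXPIRED }

    static State stateAtAge(int ageDays) {
        if (ageDays < 0) {
            throw new IllegalArgumentException("ageDays must be >= 0");
        }
        if (ageDays >= 365) {
            return State.EXPIRED;      // deleted after 365 days
        }
        if (ageDays >= 90) {
            return State.GLACIER;      // archived after 90 days
        }
        if (ageDays >= 30) {
            return State.STANDARD_IA;  // infrequent access after 30 days
        }
        return State.STANDARD;
    }
}
```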
Presigned URLs
What A presigned URL is a time-limited URL generated server-side that grants temporary access to a private S3 object without making the bucket or object public. The URL contains an embedded signature with an expiry. Anyone with the URL can GET (download) or PUT (upload) the object until expiry — no AWS credentials needed.
Why Standard pattern for user file uploads and downloads: your backend generates the presigned URL and hands it to the client. The client transfers directly to/from S3 — your backend never touches the file bytes, saving bandwidth and compute.
Gotcha
- Presigned URLs inherit the permissions of the IAM identity that generated them. If your EC2 IAM Role has s3:GetObject on the bucket, it can generate presigned GET URLs. Without the permission, the URL will fail.
- Maximum expiry: 7 days (604800 seconds), and only when the URL is signed with long-lived IAM user credentials. When it is signed with temporary credentials (an EC2 instance role, or any STS session), the URL stops working when those credentials expire, even if you set a longer duration.
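The expiry rules combine into one question: how long will this URL actually work? A minimal sketch, assuming you know how much lifetime the signing credentials have left; the class and method names are hypothetical, and the sketch rejects requests above 7 days rather than silently shortening them.

```java
import java.time.Duration;

/** Estimates how long a presigned URL will really remain usable. */
final class PresignExpiry {
    static final Duration MAX_LIFETIME = Duration.ofDays(7);

    /**
     * @param requested              lifetime asked for when presigning
     * @param credentialLifetimeLeft time until the signing credentials expire,
     *                               or null for long-lived IAM user credentials
     */
    static Duration effectiveLifetime(Duration requested, Duration credentialLifetimeLeft) {
        if (requested.compareTo(MAX_LIFETIME) > 0) {
            throw new IllegalArgumentException("presigned URLs cannot exceed 7 days");
        }
        if (credentialLifetimeLeft != null && credentialLifetimeLeft.compareTo(requested) < 0) {
            return credentialLifetimeLeft; // the URL dies with the credentials
        }
        return requested;
    }
}
```

A URL requested for one day but signed with role credentials that expire in six hours is therefore only good for six hours.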
Access Control — Block Public Access
What S3 has four "Block Public Access" settings at the account and bucket level. Enabling all four prevents any object in the bucket from being made publicly readable — regardless of bucket policy or object ACLs. This is the default for new buckets as of 2023.
Why Public S3 buckets have caused high-profile data breaches. "Block Public Access" is a safety net. Enable it on every bucket that is not intentionally serving public content (e.g., a static website).
Gotcha
- Bucket policies still control which IAM identities (your EC2 role, Lambda function) can access objects — Block Public Access only prevents public (unauthenticated) access.
- Never use bucket ACLs — they are a legacy mechanism. Use bucket policies and IAM policies instead.
S3 Event Notifications & Advanced Patterns
S3 Event Notifications
What S3 can publish events (ObjectCreated, ObjectDeleted, ObjectRestore) to Lambda, SQS, or SNS when objects are uploaded or deleted. This is the foundation of file-processing pipelines: user uploads a CSV → S3 → Lambda → parse and load into RDS. Configure in bucket Properties → Event Notifications.
Why Your backend API never needs to poll for new files. The upload triggers processing automatically and asynchronously — the client's upload completes immediately, and the processing happens in the background. This pattern scales to millions of files without changing the application.
Gotcha
- S3 Event Notifications are at-least-once — a Lambda or SQS consumer may receive the same event twice on rare occasions. Make your processing idempotent (check if the object was already processed before doing work).
- Use an SQS queue between S3 and Lambda rather than invoking Lambda directly — SQS buffers events during Lambda throttling and provides a DLQ for failed processing.
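Idempotent processing usually comes down to recording an identifier for each event and skipping repeats. The sketch below keeps that record in memory, which only works inside a single process; a real consumer would more likely use a durable store. The choice of bucket, key, and the event's sequencer value as the identifier is an assumption of this example, and the class name is hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory sketch of an idempotent S3 event consumer. */
final class IdempotentEventHandler {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if the event was processed now, false if it was a duplicate.
     * Set.add is atomic here, so two threads cannot both claim the same event.
     */
    boolean handle(String bucket, String key, String sequencer) {
        String eventId = bucket + "/" + key + "#" + sequencer;
        if (!processed.add(eventId)) {
            return false; // already seen: skip the work
        }
        // ... parse the object and load it into the database here ...
        return true;
    }
}
```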
Multipart Upload
What S3 requires multipart upload for objects larger than 5 GB, and recommends it for objects larger than 100 MB. The file is split into parts (minimum 5 MB each, except the last) that are uploaded in parallel. Failed parts are retried individually without restarting the entire upload. AWS SDK v2's S3TransferManager handles multipart upload automatically.
S3TransferManager
import java.nio.file.Path;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.FileUpload;

// Requires the s3-transfer-manager dependency
S3TransferManager manager = S3TransferManager.create();
FileUpload upload = manager.uploadFile(b -> b
.putObjectRequest(r -> r.bucket("my-bucket").key("large-file.csv"))
.source(Path.of("/tmp/large-file.csv")));
// Block until the upload (including all parts) has finished
upload.completionFuture().join();
Gotcha
- Incomplete multipart uploads accumulate storage charges for the uploaded parts. Add a lifecycle rule: "Abort incomplete multipart uploads after 7 days."
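To get a feel for the part arithmetic the transfer manager handles for you, here is a small planning sketch. It assumes the 5 MB minimum part size described above and S3's limit of 10,000 parts per upload; the 8 MiB starting size and the doubling strategy are arbitrary choices for illustration, not what the SDK does.

```java
/** Sketch of multipart upload part planning. */
final class MultipartPlan {
    static final long MIB = 1024L * 1024L;
    static final long MIN_PART_SIZE = 5 * MIB;
    static final int MAX_PARTS = 10_000;

    /** Number of parts needed; the last part may be smaller than partSize. */
    static long partCount(long objectSize, long partSize) {
        if (objectSize <= 0 || partSize < MIN_PART_SIZE) {
            throw new IllegalArgumentException("objectSize must be > 0 and partSize >= 5 MiB");
        }
        return (objectSize + partSize - 1) / partSize; // ceiling division
    }

    /** Doubles a hypothetical 8 MiB starting size until the upload fits in 10,000 parts. */
    static long choosePartSize(long objectSize) {
        long partSize = 8 * MIB;
        while (partCount(objectSize, partSize) > MAX_PARTS) {
            partSize *= 2;
        }
        return partSize;
    }
}
```

A 100 MiB file with 5 MiB parts needs 20 parts, while a 5 TiB object needs parts of 1024 MiB under this doubling scheme to stay below the part limit.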
Spring Boot Integration (AWS SDK v2)
Using AWS SDK v2 for Java with S3
What AWS SDK v2 (software.amazon.awssdk) is the current Java SDK. Use S3Client for synchronous operations or S3AsyncClient for async. The SDK automatically picks up credentials from the EC2 Instance Profile (IAM Role), so no hardcoded keys are needed.
pom.xml dependency
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>s3</artifactId>
  <version>2.25.x</version>
</dependency>
Key operations: s3Client.putObject(), s3Client.getObject(), s3Presigner.presignGetObject(). The SDK uses the default credential provider chain. On EC2 with an IAM Role, it reads temporary credentials from the instance metadata endpoint automatically.
Hands-on Tasks
Interview Q&A
Why store files in S3 instead of in the database?
Databases are optimised for structured data and queries — not binary blobs. Storing large files in a DB increases backup size, slows queries unrelated to those files, and doesn't scale economically. S3 is purpose-built for object storage: it costs a fraction of DB storage (~$0.023/GB vs ~$0.115/GB for RDS gp2), scales infinitely, and offers 11 nines of durability. Files stored in S3 can also be served directly to clients via presigned URLs, bypassing your application servers entirely and saving bandwidth and compute.
What are presigned URLs and when do you use them?
A presigned URL is a time-limited, signature-embedded URL that grants temporary access to a private S3 object. Your backend generates it using the AWS SDK (requires IAM permission on the bucket) and returns it to the client. The client then uploads or downloads directly from S3 using that URL — no AWS credentials needed, and your backend never handles the file bytes.
Use them for: user file uploads (presigned PUT URL → client uploads directly), file downloads (presigned GET URL → client downloads directly), and sharing private documents with external parties for a limited time. The key benefit is that large file transfers bypass your application servers completely.
How do you prevent an S3 bucket from being accidentally made public?
Two layers: first, enable all four "Block Public Access" settings on the bucket (and ideally at the account level). This acts as a guardrail that prevents any public bucket policies or public ACLs from taking effect, even if someone accidentally adds one. Second, use AWS Config rules or AWS Organizations SCPs (Service Control Policies) to enforce that Block Public Access stays enabled across all buckets in the account. Never grant s3:PutBucketPolicy to application IAM roles; only administrators should be able to modify bucket policies.
5
Monitoring and Observability
Topics
Application Metrics (Spring Boot → CloudWatch)
Micrometer + CloudWatch (Custom Application Metrics)
What Micrometer is the metrics façade built into Spring Boot Actuator. Add the micrometer-registry-cloudwatch2 dependency and register a CloudWatch meter registry (Spring Cloud AWS can auto-configure one), and your application pushes metrics (HTTP request rate, latency percentiles, JVM heap, DB connection pool size, your custom counters/timers) to CloudWatch under a namespace you define. This is how your application tells you what it's doing, not just what the VM is doing.
Why CloudWatch without Micrometer only shows infrastructure metrics (CPU, disk). It cannot tell you: "95th percentile latency on the /checkout endpoint is 420ms" or "Hikari pool is at 90% capacity." These are the signals that matter for production incidents.
Setup
<!-- pom.xml -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-cloudwatch2</artifactId>
</dependency>
# application.properties
management.cloudwatch.metrics.namespace=MyApp
management.cloudwatch.metrics.step=1m
management.endpoints.web.exposure.include=health,info,metrics,prometheus
The exact property prefix depends on the library that auto-configures the registry, so check its documentation for your version.
Custom timer: @Timed("checkout.duration") on a service method, or registry.timer("payment.latency").record(duration). The EC2 IAM Role needs cloudwatch:PutMetricData permission.
Gotcha
- Micrometer pushes metrics in batches (default every 1 minute). CloudWatch charges per custom metric per month (~$0.30). With many fine-grained tags, costs can escalate — filter which metrics are exported.
- Use management.metrics.tags.application=${spring.application.name} to tag all metrics with the service name. This is essential for multi-service dashboards.
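The cost warning above follows from how CloudWatch counts metrics: each unique combination of metric name and dimension values is billed as its own metric. A back-of-the-envelope sketch with hypothetical names, using the ~$0.30 per metric figure quoted above and ignoring the fact that a single Micrometer timer publishes several statistics:

```java
/** Rough estimate of how tags multiply the number of billed custom metrics. */
final class CustomMetricCost {

    /** One metric per combination of tag values: the product of the tag cardinalities. */
    static long seriesCount(int... tagCardinalities) {
        long count = 1;
        for (int cardinality : tagCardinalities) {
            count *= cardinality;
        }
        return count;
    }

    static double monthlyCostUsd(long seriesCount, double pricePerMetric) {
        return seriesCount * pricePerMetric;
    }
}
```

One request metric tagged with 20 URIs, 4 HTTP methods, and 5 status codes can become 400 billed metrics, roughly $120 per month at $0.30 each.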
AWS X-Ray: Distributed Tracing
What X-Ray records every request as a trace — a tree of segments (one per service) and subsegments (one per downstream call: RDS query, S3 call, Lambda invocation). You see exactly how long each part took, which service caused a slowdown, and the full call graph for any individual request.
Why CloudWatch Logs tells you something went wrong. X-Ray tells you where and why — across service boundaries. Essential once you have more than one service. You cannot answer "why was this checkout request slow?" without distributed tracing.
Spring Boot Setup
<!-- AWS X-Ray SDK -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-xray-recorder-sdk-spring</artifactId>
</dependency>
# Alternatively (modern approach): AWS Distro for OpenTelemetry
# ADOT collector runs as a sidecar or agent, sends traces to X-Ray
# Add opentelemetry-spring-boot-starter + configure OTEL_EXPORTER_OTLP_ENDPOINT
IAM permission needed: xray:PutTraceSegments, xray:PutTelemetryRecords. X-Ray console shows a Service Map: a live topology of your system.
Gotcha
- X-Ray samples by default: the first request each second plus 5% of any additional requests. Don't panic when you can't find a specific trace, because it may not have been sampled. Increase the rate in the X-Ray sampling rules for debugging.
- OpenTelemetry (OTEL) with AWS Distro is the modern, vendor-neutral approach. X-Ray is AWS-specific. For teams planning multi-cloud or using Jaeger/Grafana Tempo, OTEL is the better investment.
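To see why a specific trace is often missing, it helps to estimate how many requests the default sampling rule keeps. The sketch assumes a reservoir of one request per second plus a fixed 5% of the remainder, and treats traffic as perfectly even; the class and parameter names are hypothetical.

```java
/** Estimates traces recorded per second under a reservoir-plus-rate sampling rule. */
final class XRaySampling {

    static double expectedTracesPerSecond(double requestsPerSecond,
                                          double reservoirPerSecond,
                                          double fixedRate) {
        if (requestsPerSecond < 0 || reservoirPerSecond < 0 || fixedRate < 0 || fixedRate > 1) {
            throw new IllegalArgumentException("invalid sampling parameters");
        }
        double fromReservoir = Math.min(requestsPerSecond, reservoirPerSecond);
        double remainder = Math.max(0, requestsPerSecond - reservoirPerSecond);
        return fromReservoir + remainder * fixedRate;
    }
}
```

At 100 requests per second this gives about 5.95 traces per second, so roughly 94% of requests leave no trace at all.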
Structured Logging (JSON) + Correlation IDs
What Instead of plain-text logs like INFO Processing order 123, structured logging emits JSON: {"level":"INFO","orderId":"123","traceId":"abc-def","duration_ms":45}. CloudWatch Logs Insights can then query individual fields. A correlation ID (or trace ID) is a unique identifier generated per request and attached to every log line via MDC (Mapped Diagnostic Context), allowing you to filter all log lines for one specific request across all service instances.
Spring Boot JSON logging
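# application.properties (Spring Boot 3.4+)
# Supported formats include ecs and logstash. Keep comments on their own line:
# text after the value on the same line becomes part of the value.
logging.structured.format.console=ecs
# For older versions, use logstash-logback-encoder:
# logback-spring.xml with <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
// Java: set a correlation ID in a servlet filter, and clear it when the request ends
MDC.put("traceId", UUID.randomUUID().toString());
// X-Ray or OTEL instrumentation can inject trace IDs into MDC automatically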
CloudWatch
CloudWatch Metrics
What Time-series data points representing the health of your AWS resources. Metrics have a namespace (e.g., AWS/EC2), a name (e.g., CPUUtilization), and dimensions (e.g., InstanceId=i-xxx). Retention depends on resolution: high-resolution (sub-minute) data points are kept for 3 hours, 1-minute data for 15 days, 5-minute data for 63 days, and 1-hour data for 15 months.
Why Metrics are the foundation of all monitoring. Without them you are flying blind: you cannot know if an instance is struggling until users report problems.
What EC2 sends by default (free, 5-min intervals)
- CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps
- NOT included by default: memory usage, disk space used. These require the CloudWatch Agent.
- Detailed monitoring (1-minute intervals): costs extra. Enable per instance in the console.
CloudWatch Agent
What A software agent you install on EC2 that collects metrics and logs beyond what AWS sends by default. Collects: memory usage (mem_used_percent), disk space (disk_used_percent), and any log files you point it at (e.g., your Spring Boot log file).
Why Memory and disk space are the two most common causes of production outages. Without the agent, CloudWatch has no visibility into either.
Setup on Amazon Linux 2023
sudo dnf install amazon-cloudwatch-agent -y
# Use the wizard to generate config:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start the agent:
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent
The EC2 IAM Role must have the CloudWatchAgentServerPolicy managed policy attached.
CloudWatch Logs
What Centralised log storage and search. Logs are organised into Log Groups (one per application or service) and Log Streams (one per instance or task). The CloudWatch Agent ships log files from EC2. Lambda, ECS, and EKS can push logs automatically.
Why Without centralised logging, you must SSH into each instance to read logs — impossible when you have multiple instances or when an instance has crashed.
Gotcha
- Log retention defaults to Never Expire. This accumulates storage costs indefinitely. Always set a retention policy (e.g., 30 or 90 days) on every log group.
- CloudWatch Logs Insights: SQL-like query language for searching and analysing logs. Example: filter @message like /ERROR/ | stats count(*) by bin(5m)
CloudWatch Alarms
What An alarm watches a single metric over a time window and changes state when the metric crosses a threshold. Three states: OK, ALARM, INSUFFICIENT_DATA. When an alarm enters ALARM state, it can trigger: an SNS notification (→ email, Slack, PagerDuty), an EC2 action (stop, reboot), or an Auto Scaling action.
Why Alarms turn metrics into actionable notifications. Without alarms you would have to watch dashboards constantly — impractical at scale.
Gotcha
- Set the evaluation period and datapoints carefully. CPU > 80% for 1 out of 1 datapoints at 1-minute intervals will fire on transient spikes. 2 out of 3 datapoints reduces false positives.
- An alarm in INSUFFICIENT_DATA state means CloudWatch is not receiving metric data, which itself can indicate a problem (agent stopped, instance down).
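The "M out of N datapoints" rule is a simple count over the most recent evaluation window. The sketch below models only that count with hypothetical names. Real alarms also have a setting for how missing data is treated, which is not modelled here.

```java
/** Sketch of "M out of N datapoints" alarm evaluation. */
final class AlarmEvaluation {

    /**
     * @param window            the N most recent datapoints
     * @param threshold         value a datapoint must exceed to count as breaching
     * @param datapointsToAlarm M: how many breaching datapoints trigger the alarm
     */
    static boolean inAlarm(double[] window, double threshold, int datapointsToAlarm) {
        int breaching = 0;
        for (double value : window) {
            if (value > threshold) {
                breaching++;
            }
        }
        return breaching >= datapointsToAlarm;
    }
}
```

With a CPU threshold of 80, the window {85, 60, 70} trips a 1-out-of-3 rule but not a 2-out-of-3 rule, which is the false-positive reduction described above.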
CloudWatch Dashboards
What Custom visualisations of metrics and alarm states on a single screen. Add widgets for line graphs, stacked area charts, numbers, and alarm status. Dashboards are shared across the team and are the first thing oncall checks during an incident.
Why At-a-glance system health. One dashboard should show the key health signals for your entire application stack.
Gotcha
- Dashboards cost $3/month per dashboard (first 3 are free). Not a budget concern at learning scale.
Hands-on Tasks
Interview Q&A
What metrics does EC2 send to CloudWatch by default, and what requires the CloudWatch Agent?
By default (free, 5-minute intervals), EC2 sends: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, and DiskReadBytes/DiskWriteBytes.
Not included by default: memory usage and disk space utilisation. These require the CloudWatch Agent installed on the instance. Memory and disk are the two metrics most responsible for production incidents, so installing the agent is a baseline operational requirement, not an optional extra.
How do you monitor a Spring Boot application on EC2?
Three layers: Infrastructure metrics via CloudWatch Agent (CPU, memory, disk — the instance health). Application logs via CloudWatch Agent shipping the Spring Boot log file to CloudWatch Logs — search with Logs Insights for errors and latency patterns. Application metrics via Spring Boot Actuator + Micrometer: expose custom metrics (request count, latency, DB pool size) and push them to CloudWatch using the Micrometer CloudWatch registry. Then build a dashboard combining all three and set alarms on the signals that indicate real user impact (high error rate, high latency) rather than just infrastructure metrics.
How would you investigate a production incident using CloudWatch?
Start with the time the alarm fired. Open the CloudWatch Dashboard and look for correlated metric spikes at that time — CPU, memory, network, error count. Then go to CloudWatch Logs Insights and query the application log group for ERROR-level messages in that time window. Check if a deployment happened around the same time (correlate with your CI/CD pipeline). Look at RDS metrics (CPU, database connections, read/write latency) if the issue could be DB-related. The goal is to narrow from "something went wrong" to "this specific component hit this specific limit at this time." Document findings and consider adding more targeted alarms for the root cause so the same issue triggers an alert faster next time.
6
Containers and ECS
Topics
Docker
Images and Containers
What A Docker image is a read-only, layered template built from a Dockerfile. It packages your OS base, runtime, and application into a single artefact. A container is a running instance of an image — an isolated process with its own filesystem, network, and process space. Images are immutable; you don't patch them, you build new ones.
Why Containers eliminate "it works on my machine": the same image runs identically in dev, staging, and production. They start in seconds (unlike VMs), use less memory, and many of them can share one host.
Gotcha
- Container filesystem is ephemeral. Anything written inside the container is lost when the container stops. Use volumes or external storage (S3, RDS) for persistence.
- Each layer in a Docker image is cached. Put infrequently-changing layers (base image, dependencies) early in the Dockerfile, and your app code last — this speeds up rebuilds significantly.
Dockerfile for Spring Boot
What A text file with instructions to build a Docker image. For Spring Boot, the recommended approach is a multi-stage build or a simple single-stage build using an official JRE base image.
Recommended Dockerfile
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY target/app.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
eclipse-temurin:17-jre-alpine is the Eclipse Foundation's OpenJDK build on Alpine Linux — small (~175 MB) and production-safe. Build: docker build -t my-app . Run: docker run -p 8080:8080 my-app
Gotcha
- Use a JRE image (not JDK) in production containers — JDK includes the compiler which you don't need at runtime and adds unnecessary image size.
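- Spring Boot's layered JAR support (on by default since Spring Boot 2.4; unpack the layers with java -Djarmode=layertools -jar app.jar extract, though the exact jarmode command varies by Spring Boot version) lets you copy dependencies and app code into separate Docker layers, making rebuilds faster. Note that spring-boot:build-info only generates build metadata and does not create layers. Worth exploring after basics are solid.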
AWS Container Services
ECR (Elastic Container Registry)
What AWS's private Docker image registry. Like Docker Hub but private, integrated with IAM, and in your AWS account. ECS and EKS pull images from ECR using IAM (the ECS task execution role, or the EKS node role), so there are no registry credentials to manage.
Why You need a place to store your Docker images that ECS can pull from. Public Docker Hub images should not be used in production (rate limits, supply chain risk).
Auth + push commands
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag my-app:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
ECS (Elastic Container Service)
What AWS's container orchestration service. You define what to run (Task Definition) and how many copies to keep running (Service). ECS handles placement, health checks, rolling deployments, and integration with load balancers. No Kubernetes knowledge required.
Why On raw EC2, you manually manage deployment scripts, restart logic, and load balancer registration. ECS handles all of that. It is significantly simpler than EKS for teams that do not need Kubernetes portability.
Gotcha
- There is no SSH into Fargate containers. Debugging uses CloudWatch Logs (stdout/stderr from the container) and ECS Exec (an optional feature that provides a shell into a running task).
- ECS tasks are not persistent. A stopped task is replaced by a new one. Any state must be in external storage (RDS, S3, ElastiCache).
Task Definitions
What The blueprint for your container in ECS. Defines: Docker image URI (ECR), CPU and memory allocation, port mappings, environment variables, log configuration, and the IAM Task Role (permissions the container has). Task Definitions are versioned — every change creates a new revision.
CPU and memory units
- Fargate CPU units: 256 (0.25 vCPU), 512 (0.5 vCPU), 1024 (1 vCPU), up to 16384 (16 vCPU)
- Memory: 512 MB minimum, must be compatible with chosen CPU. E.g., 512 CPU → 1–4 GB memory.
- Spring Boot typically needs at least 512 MB; 1 GB is comfortable for a small service.
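The CPU and memory pairing rule can be checked before registering a task definition. This is a small sketch, assuming the Fargate combinations as commonly listed in the ECS documentation at the time of writing (verify against the current docs). The class and method names are hypothetical.

```java
/** Illustrative check of Fargate CPU/memory pairs (hypothetical helper, not an AWS API). */
class FargateTaskSize {

    /** Returns true when memoryMiB is an allowed memory size for the given CPU units. */
    static boolean isValid(int cpuUnits, int memoryMiB) {
        switch (cpuUnits) {
            case 256:   return memoryMiB == 512 || memoryMiB == 1024 || memoryMiB == 2048;
            case 512:   return inRange(memoryMiB, 1024, 4096, 1024);      // 1-4 GB, 1 GB steps
            case 1024:  return inRange(memoryMiB, 2048, 8192, 1024);      // 2-8 GB
            case 2048:  return inRange(memoryMiB, 4096, 16384, 1024);     // 4-16 GB
            case 4096:  return inRange(memoryMiB, 8192, 30720, 1024);     // 8-30 GB
            case 8192:  return inRange(memoryMiB, 16384, 61440, 4096);    // 16-60 GB, 4 GB steps
            case 16384: return inRange(memoryMiB, 32768, 122880, 8192);   // 32-120 GB, 8 GB steps
            default:    return false;                                     // not a Fargate CPU size
        }
    }

    private static boolean inRange(int value, int min, int max, int step) {
        return value >= min && value <= max && (value - min) % step == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid(512, 1024)); // true: a comfortable small Spring Boot task
        System.out.println(isValid(512, 512));  // false: 0.5 vCPU needs at least 1 GB
        System.out.println(isValid(256, 1536)); // false: 0.25 vCPU allows only 0.5, 1, or 2 GB
    }
}
```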
CI/CD Pipeline: GitHub Actions → ECR → ECS
What In production, engineers never manually run docker build and docker push. A CI/CD pipeline (GitHub Actions, Jenkins, or CodePipeline) triggers automatically on Git commits, builds and pushes the image to ECR, and updates the ECS Service to deploy the new image.
GitHub Actions workflow (.github/workflows/deploy.yml)
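name: Deploy to ECS
on:
  push:
    branches: [main]
permissions:
  id-token: write   # required for GitHub OIDC role assumption
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/github-deploy-role
          aws-region: us-east-1
      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push
        env:
          # registry is an output of the login step; my-app is the ECR repository name
          ECR_URI: ${{ steps.ecr-login.outputs.registry }}/my-app
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_URI:$IMAGE_TAG -t $ECR_URI:latest .
          docker push $ECR_URI:$IMAGE_TAG
          docker push $ECR_URI:latest
      - name: Deploy to ECS
        # Assumes the task definition references the :latest tag
        run: |
          aws ecs update-service --cluster my-cluster \
            --service my-service --force-new-deployment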
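Use GitHub's OIDC provider with an IAM role (no long-lived access keys in GitHub secrets); the workflow needs the id-token: write permission for this. The role-to-assume needs ECR push permissions (ecr:GetAuthorizationToken plus the layer-upload and ecr:PutImage actions on the repository) and ecs:UpdateService. Avoid a blanket ecr:* outside a sandbox account.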
Gotcha
- Enable ECR image scanning on push: aws ecr put-image-scanning-configuration --repository-name my-app --image-scanning-configuration scanOnPush=true. It flags known CVEs in your base image layers. Block deployments on CRITICAL findings using the scan results in your pipeline.
ECS Fargate vs EC2 Launch Type
What Fargate: serverless compute for containers. AWS provisions and manages the underlying EC2 instances. You pay per vCPU-second and GB-second of memory while the task runs. No instances to manage, patch, or right-size. EC2 launch type: you manage a cluster of EC2 instances. ECS places containers on them. More control, potentially cheaper at high sustained utilisation, but more operational overhead.
Why Fargate removes EC2 management entirely. For most Spring Boot microservice deployments, Fargate is the right starting point.
Gotcha
- Fargate has a slightly slower cold start than EC2 launch type (~10–30 seconds to start a new task). Not usually a problem for long-running services, but relevant for burst scaling.
- Fargate does not support GPU workloads or privileged containers.
Hands-on Tasks
Interview Q&A
Why use containers instead of deploying a JAR directly on EC2?
Containers package the application with its runtime and all dependencies into a single immutable artefact. This eliminates environment drift — the same container runs identically in dev, CI, staging, and production. They enable faster deployments (push a new image, ECS rolls it out), easier rollbacks (redeploy the previous image tag), and consistent behaviour regardless of what else is installed on the host. Containers also allow multiple services to run on shared infrastructure with isolation, making better use of resources compared to one-app-per-EC2.
What's the difference between ECS Fargate and the EC2 launch type?
Fargate: serverless. AWS provisions the underlying compute invisibly. You pay per vCPU-second and GB-second of memory while tasks run. No instances to manage, patch, or monitor. Simpler operationally, slightly more expensive at high sustained load, and slightly slower to scale (task startup takes ~10–30 seconds).
EC2 launch type: you manage a cluster of EC2 instances that ECS places containers on. More operational overhead (patching, right-sizing, capacity management), but more control and potentially cheaper at consistently high utilisation. Also required for GPU workloads or containers that need privileged mode.
Starting point for most teams: Fargate. Move to EC2 launch type when you have a concrete cost or capability reason.
How does ECS handle a rolling deployment?
When you update an ECS Service (e.g., new Task Definition revision), ECS starts new tasks with the updated version while keeping old tasks running. The deployment is controlled by two parameters: minimumHealthyPercent (e.g., 100: never go below 100% of desired count) and maximumPercent (e.g., 200: allow up to double the tasks temporarily). ECS waits for new tasks to pass health checks before stopping old ones. If new tasks fail health checks, old tasks remain running while ECS keeps retrying; with the deployment circuit breaker enabled, the deployment is marked failed and can roll back automatically. If you have a load balancer attached, traffic drains from old tasks before they are stopped.
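To make the two percentages concrete, here is a small arithmetic sketch. It assumes the rounding described in the ECS documentation (minimum healthy rounds up, maximum rounds down). The class and method names are hypothetical.

```java
/** Illustrative arithmetic for ECS rolling-deployment bounds (hypothetical helper). */
class EcsRollingBounds {

    /** Fewest tasks that must stay running; the percentage is rounded up. */
    static int minHealthyTasks(int desiredCount, int minimumHealthyPercent) {
        return (desiredCount * minimumHealthyPercent + 99) / 100;
    }

    /** Most tasks allowed at once (old plus new); the percentage is rounded down. */
    static int maxTasks(int desiredCount, int maximumPercent) {
        return desiredCount * maximumPercent / 100;
    }

    public static void main(String[] args) {
        // desiredCount = 4 with the 100 / 200 settings from the answer above
        System.out.println(minHealthyTasks(4, 100)); // 4: no old task stops early
        System.out.println(maxTasks(4, 200));        // 8: all 4 new tasks can start at once
        // desiredCount = 3 with 50 / 150
        System.out.println(minHealthyTasks(3, 50));  // 2
        System.out.println(maxTasks(3, 150));        // 4
    }
}
```

The gap between the two numbers is the room ECS has to start new tasks and stop old ones. With 100/200 the service needs spare capacity for a full second copy during the rollout.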
7
Kubernetes with EKS
Topics
Kubernetes Core Objects
Pod
What The smallest deployable unit in Kubernetes. A pod wraps one or more containers that share the same network namespace (same IP, same localhost) and storage volumes. In practice, most pods contain exactly one container. Pods are ephemeral — they are created and destroyed constantly.
Why You never create pods directly in production. You define a Deployment and Kubernetes manages the pods for you, ensuring the desired number are always running.
Gotcha
- Pods have dynamic IPs. Never hardcode a pod's IP — use a Service to get a stable endpoint.
- When a container crashes, the kubelet restarts it in place (according to the restartPolicy), and the pod keeps its name and IP. A pod that is deleted or evicted is replaced by a new pod with a new name and IP.
Deployment
What A Deployment declares the desired state: "run 3 replicas of this container image." Kubernetes continuously reconciles actual state to desired state. A Deployment manages a ReplicaSet, which manages the pods. Rolling updates and rollbacks are built in.
Production-ready Deployment manifest (Spring Boot)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app-sa # for IRSA (AWS permissions)
      containers:
        - name: my-app
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          # readinessProbe: gate traffic until app is ready
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          # livenessProbe: restart if app is deadlocked
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 40
            periodSeconds: 10
Why probes matter: Without readinessProbe, Kubernetes sends traffic to pods still initialising their Spring context (5–10 seconds), causing failed requests during rolling deploys. Without livenessProbe, a deadlocked app stays in service indefinitely. Enable both Actuator endpoints: management.endpoint.health.probes.enabled=true in application.properties (Spring Boot also enables them automatically when it detects that it is running on Kubernetes).
Service
What A stable network endpoint that load-balances traffic to a set of pods matching a label selector. Three main types: ClusterIP (default, internal-only, for service-to-service traffic), NodePort (exposes on each node's IP at a static port, mostly for testing), LoadBalancer (provisions an AWS load balancer automatically for internet-facing exposure; a Network Load Balancer when the AWS Load Balancer Controller manages the Service).
Why Services decouple callers from pods. Even as pods are replaced during deployments, the Service endpoint stays stable.
Gotcha
- A LoadBalancer Service in EKS creates an AWS NLB, which costs money. For HTTP routing to multiple services, use an Ingress with the AWS Load Balancer Controller instead; it creates a single ALB shared across services.
ConfigMap and Secret
What ConfigMap: stores non-sensitive configuration (e.g., database URL, feature flags) as key-value pairs. Injected into pods as environment variables or mounted as files. Secret: same structure but for sensitive data (passwords, API keys). Base64-encoded (not encrypted) by default in etcd.
Gotcha
- Kubernetes Secrets are not encrypted at rest by default in etcd. For production, integrate with AWS Secrets Manager using the External Secrets Operator, or use the AWS Secrets and Configuration Provider (ASCP) with the Secrets Store CSI Driver.
- Base64 is encoding, not encryption. Anyone with kubectl access can decode a Secret trivially:
kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
Ingress
What HTTP/HTTPS routing rules that direct external traffic to different Services based on host or path. Example: api.example.com/users → user-service, api.example.com/orders → order-service. Requires an Ingress Controller to implement the rules. In EKS, the AWS Load Balancer Controller creates an ALB from Ingress resources.
Why Instead of one LoadBalancer per service (one NLB each = expensive), a single ALB can route to all services based on path, which is much more cost-efficient.
Helm & Autoscaling
Helm: Kubernetes Package Manager
What Helm is the package manager for Kubernetes. A chart is a packaged collection of YAML manifests with templating ({{ .Values.image.tag }}). Instead of maintaining separate YAML files per environment, you define one chart and override values per environment. Artifact Hub lists community charts for common infrastructure (nginx-ingress, cert-manager, prometheus-stack, aws-load-balancer-controller).
Why Most teams avoid hand-maintaining raw Kubernetes YAML per environment, and Helm is a common way to do that. Installing the AWS Load Balancer Controller (required for Ingress) is a Helm chart install. Many enterprise deployments use Helm for parameterisation and rollback history.
Key commands
helm install my-app ./my-chart --values values-prod.yaml
helm upgrade my-app ./my-chart --set image.tag=v1.2.3
helm rollback my-app 1   # rollback to revision 1
helm list                # list installed releases
HPA (Horizontal Pod Autoscaler)
What HPA automatically adjusts the number of pods in a Deployment based on observed metrics (CPU utilisation, memory, or custom metrics from Prometheus). You define a minimum replica count, maximum, and a target metric value. The HPA controller checks metrics every 15 seconds and scales up or down accordingly.
Minimal HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Node-level autoscaling (adding/removing EC2 worker nodes) is handled separately, by the Kubernetes Cluster Autoscaler or by Karpenter, the AWS-recommended node autoscaler that provisions nodes in response to unschedulable pods.
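How the HPA turns a metric reading into a replica count can be sketched with the formula from the Kubernetes documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The code below is an illustrative approximation, not the controller's source. It ignores stabilisation windows and pods that are not ready, and the class and method names are hypothetical.

```java
/** Illustrative HPA replica calculation (hypothetical helper, simplified). */
class HpaCalculator {

    static final double TOLERANCE = 0.10; // default tolerance around the target

    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        double ratio = currentUtilization / targetUtilization;
        if (Math.abs(ratio - 1.0) <= TOLERANCE) {
            return currentReplicas; // close enough to target: no scaling
        }
        int desired = (int) Math.ceil(currentReplicas * ratio);
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // Values match the manifest above: target 60%, min 2, max 10
        System.out.println(desiredReplicas(2, 90.0, 60.0, 2, 10));  // 3: scale up
        System.out.println(desiredReplicas(4, 30.0, 60.0, 2, 10));  // 2: scale down
        System.out.println(desiredReplicas(2, 63.0, 60.0, 2, 10));  // 2: within tolerance
        System.out.println(desiredReplicas(5, 240.0, 60.0, 2, 10)); // 10: capped at maxReplicas
    }
}
```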
EKS (Elastic Kubernetes Service)
EKS Cluster and Node Groups
What EKS is a managed Kubernetes control plane. AWS runs and maintains etcd, the API server, controller manager, and scheduler. You manage (or let AWS manage) the worker nodes via Managed Node Groups — EC2 instances that EKS provisions, registers, and drains during updates automatically.
Cost warning
- EKS control plane: $0.10/hour (~$72/month) per cluster. Plus the EC2 cost of worker nodes. Delete the cluster when not actively learning — this is the biggest cost trap in this roadmap.
- Alternatively, use EKS with Fargate profiles to avoid managing EC2 worker nodes entirely. Pay per pod CPU/memory instead.
IRSA (IAM Roles for Service Accounts)
What In EKS, pods should not rely on the node's EC2 Instance Profile: every pod on the node would share the same AWS permissions, completely breaking least privilege. IRSA is the solution: you annotate a Kubernetes Service Account with an IAM Role ARN. Only pods that reference that Service Account receive temporary credentials for that specific role. EKS injects the role ARN and a projected OIDC token file into the pod (AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE); the AWS SDK exchanges the token for temporary credentials and refreshes them automatically. No code changes are needed, because the SDK picks this up via the standard credential provider chain.
Why This is the standard production pattern for giving AWS permissions to pods. Without IRSA, teams fall back to embedding Access Keys in Kubernetes Secrets or ConfigMaps, a serious security violation. IRSA is a day-1 requirement on any real EKS deployment and a common interview question.
Setup with eksctl
# Step 1: Associate OIDC provider with the cluster (once per cluster)
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster --region us-east-1 --approve
# Step 2: Create an IAM service account in your cluster namespace
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace default \
  --name my-app-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve
# Step 3: Reference it in your Deployment manifest
# spec.template.spec.serviceAccountName: my-app-sa
Gotcha
- The OIDC provider must be associated with the cluster before creating IRSA service accounts. Without this, the IAM role trust policy cannot validate the pod's identity token.
- The AWS SDK inside the pod automatically uses IRSA credentials. On EC2, it uses the Instance Profile. Both flow through the same credential provider chain — your Spring Boot code works identically in both environments without any changes.
- When a second team member runs kubectl against your cluster and gets "Unauthorized", they need to be added to the aws-auth ConfigMap: run kubectl edit configmap aws-auth -n kube-system and add their IAM user ARN.
eksctl and kubectl
What eksctl: CLI tool by Weaveworks (officially supported by AWS) for creating and managing EKS clusters. Abstracts the complex CloudFormation stacks that EKS requires. kubectl: the standard Kubernetes CLI for all cluster operations — deploying, scaling, inspecting, and debugging.
Key commands
# Create cluster (takes ~15-20 minutes)
eksctl create cluster --name my-cluster --region us-east-1 \
  --nodegroup-name workers --node-type t3.small --nodes 2
kubectl get nodes                        # verify workers are Ready
kubectl get pods -A                      # all pods in all namespaces
kubectl apply -f app.yaml                # deploy from manifest
kubectl get svc                          # list services and their external IPs
kubectl logs pod-name -f                 # tail pod logs
kubectl rollout undo deployment/my-app   # rollback
Hands-on Tasks
Interview Q&A
When would you choose ECS over EKS, and vice versa?
Choose ECS when: your team is AWS-focused and doesn't need multi-cloud portability, you want simpler operations with less overhead, you're running a small number of services, or Kubernetes' learning curve isn't justified by your team's size and complexity.
Choose EKS when: you need Kubernetes-native tooling (Helm charts, Argo CD, Istio, Karpenter), your organisation is standardising on Kubernetes across clouds, you need fine-grained scheduling or custom operators, or you're migrating from an on-premises Kubernetes cluster.
EKS adds operational complexity (control plane cost, version upgrades, node group management) and a steeper learning curve. ECS is simpler and tightly integrated with AWS services but less portable. For a new Spring Boot project with a small team: ECS Fargate is often the pragmatic choice.
What happens when a Kubernetes pod crashes?
Kubernetes restarts the container within the pod according to the pod's restartPolicy (Always for Deployments). After repeated failures, the pod enters CrashLoopBackOff: the kubelet restarts the container with exponential backoff (10s, 20s, 40s... up to 5 minutes between restarts) to avoid thrashing.
The Deployment controller ensures the desired number of pods always runs. A crashing container is restarted on the same node; the pod moves to a different node only if it is deleted or evicted and its replacement is scheduled elsewhere. Liveness probes detect unhealthy containers (e.g., deadlocked app) and kill them for restart. Readiness probes prevent traffic from reaching pods that are not yet ready to serve requests.
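The backoff sequence in the answer above (10s, 20s, 40s, capped at 5 minutes) can be written as a short function. This is a sketch of the schedule only, not kubelet code. The class and method names are hypothetical, and it leaves out the reset that happens after a container has run cleanly for a while.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative CrashLoopBackOff delay schedule (hypothetical helper). */
class CrashLoopBackoff {

    static final int INITIAL_DELAY_SECONDS = 10;
    static final int MAX_DELAY_SECONDS = 300; // 5 minutes

    /** Delay before the next restart, given how many restarts already happened. */
    static int delaySeconds(int previousRestarts) {
        if (previousRestarts < 0) {
            throw new IllegalArgumentException("previousRestarts must be >= 0");
        }
        int doublings = Math.min(previousRestarts, 10); // keeps the shift from overflowing
        return Math.min(INITIAL_DELAY_SECONDS << doublings, MAX_DELAY_SECONDS);
    }

    /** Delays for the first n restarts, in order. */
    static List<Integer> schedule(int restarts) {
        List<Integer> delays = new ArrayList<>();
        for (int i = 0; i < restarts; i++) {
            delays.add(delaySeconds(i));
        }
        return delays;
    }

    public static void main(String[] args) {
        System.out.println(schedule(7)); // [10, 20, 40, 80, 160, 300, 300]
    }
}
```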
How does a Kubernetes rolling update work?
When you update a Deployment (e.g., change the image tag), Kubernetes creates a new ReplicaSet with the new version. It then scales up the new ReplicaSet and scales down the old one gradually, controlled by maxSurge (how many extra pods can exist during rollout, default 25%) and maxUnavailable (how many pods can be unavailable during rollout, default 25%).
New pods must pass readiness probes before old pods are terminated, ensuring no traffic goes to pods that aren't ready. If the new pods fail readiness probes, the rollout stalls and old pods remain running. Rollback is fast: kubectl rollout undo deployment/my-app scales the previous ReplicaSet back up without rebuilding anything.
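The 25% defaults behave differently at small replica counts because of rounding: per the Kubernetes documentation, maxSurge rounds up and maxUnavailable rounds down. A small sketch of that arithmetic follows; the class and method names are hypothetical.

```java
/** Illustrative rounding of Deployment rollout percentages (hypothetical helper). */
class RollingUpdateBounds {

    /** Extra pods allowed above the replica count; percentage rounds up. */
    static int maxSurgePods(int replicas, int maxSurgePercent) {
        return (replicas * maxSurgePercent + 99) / 100;
    }

    /** Pods allowed to be unavailable; percentage rounds down. */
    static int maxUnavailablePods(int replicas, int maxUnavailablePercent) {
        return replicas * maxUnavailablePercent / 100;
    }

    public static void main(String[] args) {
        // 10 replicas with the 25% defaults: between 8 available and 13 total pods
        System.out.println(maxSurgePods(10, 25));       // 3
        System.out.println(maxUnavailablePods(10, 25)); // 2
        // 2 replicas: one extra pod, and never fewer than 2 available
        System.out.println(maxSurgePods(2, 25));        // 1
        System.out.println(maxUnavailablePods(2, 25));  // 0
    }
}
```

For the two-replica Deployment shown earlier, this means the rollout replaces one pod at a time and needs room for a third pod on the cluster.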
8
Serverless
Topics
AWS Lambda
Lambda Function Lifecycle
What Lambda runs your code in response to events (HTTP requests, S3 uploads, DynamoDB streams, SQS messages, etc.) without you provisioning or managing servers. You upload code (ZIP or container image), configure a trigger, set memory (128 MB–10 GB), and pay per invocation + duration (GB-seconds). Maximum execution time: 15 minutes.
Why No servers to patch or scale. Lambda auto-scales from 0 to thousands of concurrent executions. Cost is zero when idle — you pay only for actual compute used, down to the millisecond.
Gotcha
- Default concurrent execution limit: 1000 per account per region (soft limit, can be increased). One Lambda function hitting the limit can throttle all other functions in the account. Use reserved concurrency to isolate critical functions.
- Lambda is not suitable for long-running processes (batch jobs, video encoding) or anything requiring persistent local state.
Writing a Java Lambda Handler (plain Java, no Spring)
What The handler is the entry point Lambda invokes. Implement RequestHandler<Input, Output> from the aws-lambda-java-core library. Input/output types are automatically serialised from/to JSON by the Lambda runtime. Keep the handler class small, and initialise expensive objects (SDK clients, DB connections) as instance variables so they survive across warm invocations on the same container.
Minimal working handler (Maven dependencies: aws-lambda-java-core + aws-lambda-java-events, plus the AWS SDK v2 dynamodb module for DynamoDbClient)
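package com.example;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

public class UserHandler
        implements RequestHandler<APIGatewayProxyRequestEvent,
                                  APIGatewayProxyResponseEvent> {
    // Initialised ONCE per container; reused on warm starts
    private final DynamoDbClient dynamo = DynamoDbClient.create();
    @Override
    public APIGatewayProxyResponseEvent handleRequest(
            APIGatewayProxyRequestEvent event, Context ctx) {
        // getPathParameters() is null when the request carries no path parameters
        Map<String, String> params = event.getPathParameters();
        String userId = (params == null) ? null : params.get("id");
        if (userId == null) {
            return new APIGatewayProxyResponseEvent().withStatusCode(400)
                    .withBody("{\"error\":\"missing id\"}");
        }
        // query dynamo, build response...
        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"userId\":\"" + userId + "\"}");
    }
}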
Handler config in Lambda console: com.example.UserHandler::handleRequest
Gotcha
- Do not use Spring Boot as a Lambda runtime. Spring's application context initialises in 3–8 seconds — that's a multi-second cold start on every scale-out event. Use plain Java for Lambda. If your team requires Spring, use Spring Cloud Function + SnapStart, which snapshots the initialised context to cut cold starts to ~1 second.
- Static initialisers (static { ... }) and instance field initialisers both run during the init phase, which counts as cold start time. Keep init work to what every invocation needs, and lazy-init rarely used heavyweight objects on first invocation.
Cold Starts and SnapStart
What A cold start happens when Lambda must provision a new container to handle an invocation — no warm container is available. The sequence: provision container → download code → initialise runtime → run static initialiser code → run handler. For Java, this takes 2–10 seconds. A warm start reuses an existing container and only runs the handler (~10–100ms).
Why Cold starts add latency for the first request and after periods of inactivity. Critical for customer-facing APIs.
Java-specific mitigations
- Lambda SnapStart (available for Java 11+ managed runtimes): AWS takes a snapshot of the initialised execution environment after the init phase, then restores from it on cold starts. Reduces Java cold starts from seconds to ~1 second. Enable in the function configuration → "Edit" → SnapStart. SnapStart applies to published function versions, not to $LATEST.
- Provisioned Concurrency: keeps N containers pre-warmed and ready. Eliminates cold starts completely but you pay for the reserved capacity even when idle. Use for latency-sensitive APIs.
- Avoid heavy Spring Boot in Lambda: Spring's context initialisation is slow. Use plain Java, Quarkus (with native compilation), or Micronaut instead. If you need Spring, use Spring Cloud Function with SnapStart or a GraalVM native image.
Memory and Performance
What Lambda CPU allocation scales proportionally with memory. A function at 1024 MB gets approximately 2x the CPU of a function at 512 MB. This means allocating more memory can make a function faster and potentially cheaper overall (faster execution = fewer GB-seconds billed).
Gotcha
- The right memory setting is not always the minimum. Use AWS Lambda Power Tuning (an open source Step Functions state machine) to find the optimal memory-to-cost ratio for your function.
- Ephemeral storage (/tmp): 512 MB by default, configurable up to 10 GB. Files in /tmp persist between warm invocations on the same container, which is useful for caching, but don't assume a file from a previous invocation is still there.
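The claim that more memory can cost the same or less follows from how duration is billed in GB-seconds. The sketch below computes billed GB-seconds only; it uses no price, since rates vary by Region and architecture. The durations in the example are assumed numbers for illustration, not measurements, and the class name is hypothetical.

```java
/** Illustrative GB-second arithmetic for Lambda duration billing (hypothetical helper). */
class LambdaBilling {

    /** Billed compute in GB-seconds for a number of invocations. */
    static double gbSeconds(int memoryMb, long durationMs, long invocations) {
        return (memoryMb / 1024.0) * (durationMs / 1000.0) * invocations;
    }

    public static void main(String[] args) {
        long invocations = 1_000_000L;
        // Assumed: doubling memory halves the duration of a CPU-bound function
        System.out.println(gbSeconds(512, 800, invocations));  // 400000.0
        System.out.println(gbSeconds(1024, 400, invocations)); // 400000.0, same cost but faster
        // Assumed: the function speeds up by more than 2x, so it gets cheaper
        System.out.println(gbSeconds(1024, 300, invocations)); // 300000.0
    }
}
```

Whether a given function actually speeds up this way depends on whether it is CPU-bound, which is what a tool like Lambda Power Tuning measures.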
Event Sources for Lambda
SQS, SNS, Kinesis, and DynamoDB Streams as Lambda Triggers
What Lambda is not just an HTTP backend — event-driven processing is its primary use case in enterprise systems. Four critical triggers:
- SQS: Lambda polls the queue and invokes your handler with a batch of messages. Built-in retry (message returns to queue on failure), DLQ support, batch window and batch size configuration.
- SNS: Fan-out pattern — SNS topic → multiple SQS queues → each with its own Lambda. Decouples producer from many consumers.
- Kinesis Data Streams: Lambda processes records in order within a shard. Use for real-time streaming (log processing, IoT data, event sourcing). Each shard gets one concurrent Lambda invocation.
- DynamoDB Streams: Every write to a DynamoDB table appears as an event. Lambda can react to inserts/updates/deletes for change data capture (CDC), cross-table sync, or event sourcing.
Why This is how senior engineers actually use Lambda at Walmart, Fifth Third Bank, and Boscov's — not primarily as HTTP handlers. Decoupled event-driven processing via SQS + Lambda is more resilient than synchronous chains: failures are retried automatically, messages don't block the producer, and processing can scale independently.
Gotcha
- Idempotency is mandatory for SQS and Kinesis consumers. Lambda may invoke your handler more than once for the same message (at-least-once delivery). Your handler must produce the same result on re-processing — use a DynamoDB conditional write or idempotency key table to detect duplicates.
- SQS batch failures: If one message in a batch fails, the entire batch is retried by default (returning all messages to the queue, including the ones that succeeded). Enable ReportBatchItemFailures on the event source mapping and return only the failed message IDs (batchItemFailures) in your Lambda response.
- A DLQ (Dead Letter Queue) captures messages that fail after all retries. Always configure one. Without it, poisoned messages are retried until the queue's retention period expires, consuming invocations and delaying other messages.
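The two ideas above, idempotent processing and reporting only the failed items, combine into one loop. The following is a plain-Java sketch with no AWS dependencies. The Message class, the in-memory set standing in for an idempotency table, and the method names are all hypothetical. In a real handler the returned IDs would populate the batchItemFailures field of the response.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Illustrative idempotent batch consumer with partial-failure reporting (hypothetical). */
class BatchProcessor {

    static final class Message {
        final String messageId;
        final String body;
        Message(String messageId, String body) {
            this.messageId = messageId;
            this.body = body;
        }
    }

    // Stand-in for a DynamoDB idempotency table keyed by message ID
    private final Set<String> processedIds = new HashSet<>();
    private final Consumer<String> businessLogic;

    BatchProcessor(Consumer<String> businessLogic) {
        this.businessLogic = businessLogic;
    }

    /** Processes a batch and returns the IDs of messages that should be retried. */
    List<String> handleBatch(List<Message> batch) {
        List<String> failedIds = new ArrayList<>();
        for (Message message : batch) {
            if (processedIds.contains(message.messageId)) {
                continue; // duplicate delivery: already handled, do nothing
            }
            try {
                businessLogic.accept(message.body);
                processedIds.add(message.messageId);
            } catch (RuntimeException e) {
                failedIds.add(message.messageId); // only this message goes back to the queue
            }
        }
        return failedIds;
    }

    public static void main(String[] args) {
        BatchProcessor processor = new BatchProcessor(body -> {
            if (body.equals("bad")) {
                throw new IllegalStateException("cannot process " + body);
            }
        });
        List<Message> batch = List.of(
                new Message("m1", "a"), new Message("m2", "bad"), new Message("m3", "c"));
        System.out.println(processor.handleBatch(batch)); // [m2]
    }
}
```

Redelivering the same batch reports m2 again but does not re-run the logic for m1 and m3, which is the behaviour the idempotency bullet asks for.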
API Gateway
HTTP API vs REST API
What API Gateway offers two main API types for Lambda integrations: HTTP API (newer, simpler, ~70% cheaper, lower latency) and REST API (more features, more expensive). Both create managed HTTP endpoints backed by Lambda.
Feature comparison
- HTTP API: JWT authorisers, Lambda proxy integration, CORS, lower cost (~$1/million requests). Missing: caching, API keys/usage plans (per-client throttling), AWS WAF integration.
- REST API: all HTTP API features plus response caching, usage plans, API keys, resource policies, AWS WAF, mock integrations, request/response transformations. Costs ~$3.50/million requests.
- Default recommendation: use HTTP API unless you need a specific REST API feature.
Throttling and Security
What API Gateway throttles requests to protect your backend. Account-level default per Region: 10,000 requests/second steady-state with a burst of 5,000. Returns HTTP 429 (Too Many Requests) when exceeded. For REST API, you can set per-method throttle limits on a stage and per-client limits via usage plans.
Gotcha
- The API Gateway endpoint is public by default. Anyone can call it. Add an authoriser (JWT/Cognito for HTTP API, Lambda authoriser for custom logic) to prevent open access in production. API keys (REST API only) identify callers for usage plans and are not a substitute for authentication.
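Steady-state rate and burst are the two parameters of a token bucket, which is the algorithm the API Gateway documentation names for throttling. The sketch below is a generic token bucket, not API Gateway's implementation. The class name is hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```java
/** Illustrative token bucket: rate = refill per second, burst = bucket capacity. */
class TokenBucket {

    private final double ratePerSecond;
    private final double burst;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSecond, double burst, long startNanos) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
        this.tokens = burst;          // bucket starts full
        this.lastRefillNanos = startNanos;
    }

    /** Returns true if the request is allowed; false maps to an HTTP 429 response. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(burst, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(2.0, 3.0, 0L); // 2 req/s, burst of 3
        for (int i = 0; i < 4; i++) {
            System.out.println(bucket.tryAcquire(0L)); // true, true, true, false
        }
        long oneSecondLater = 1_000_000_000L;
        System.out.println(bucket.tryAcquire(oneSecondLater)); // true: 2 tokens refilled
    }
}
```

A client that stays under the steady-state rate is never throttled, while a sudden spike is absorbed only up to the burst size.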
DynamoDB
Partition Keys and Sort Keys
What DynamoDB is a fully managed NoSQL key-value and document database. Every item has a primary key. Simple primary key: partition key only (must be unique per item). Composite primary key: partition key + sort key (the combination must be unique; the partition key alone can repeat). DynamoDB uses the partition key to determine which physical partition stores the item (sharding). The sort key enables range queries within a partition.
Why DynamoDB provides single-digit millisecond performance at any scale, with no schema to define beyond the key structure. Perfect for high-throughput, simple access patterns.
Gotcha
- Choose a high-cardinality partition key (e.g., user ID, order ID — values that are unique and evenly distributed). A low-cardinality partition key (e.g., status = "active"/"inactive") creates hot partitions that throttle performance.
- DynamoDB has no joins. Design your table access patterns upfront: think in queries, not entities. Single-table design (putting multiple entity types in one table) is an advanced pattern that lets one query fetch related items together.
- Free tier: 25 GB storage + 25 RCUs + 25 WCUs per month — always free, not just 12 months.
Capacity Modes
What On-Demand: pay per request. No capacity planning. Scales instantly to any traffic level. ~$1.25 per million write request units, $0.25 per million read request units. Provisioned: you specify Read Capacity Units (RCUs) and Write Capacity Units (WCUs). Cheaper at predictable, sustained load. Auto Scaling adjusts RCU/WCU automatically within min/max bounds.
Gotcha
- 1 RCU = 1 strongly consistent read per second for items up to 4 KB, or 2 eventually consistent reads per second. 1 WCU = 1 write per second for items up to 1 KB.
- On-Demand is ~5–7x more expensive than Provisioned at equivalent sustained throughput. Use On-Demand for unpredictable or new workloads; switch to Provisioned once traffic patterns are known.
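The capacity-unit rules above reduce to simple arithmetic. A small sketch, assuming item sizes are rounded up to the next 4 KB (reads) or 1 KB (writes); the class and method names are illustrative only.

```java
class DynamoCapacity {
    static final int READ_UNIT_BYTES = 4 * 1024;
    static final int WRITE_UNIT_BYTES = 1024;

    /** RCUs needed for a given item size and read rate. */
    static long readCapacityUnits(int itemSizeBytes, int readsPerSecond, boolean stronglyConsistent) {
        long unitsPerRead = ceilDiv(itemSizeBytes, READ_UNIT_BYTES);
        long strong = unitsPerRead * readsPerSecond;
        // Eventually consistent reads cost half as much.
        return stronglyConsistent ? strong : ceilDiv(strong, 2);
    }

    /** WCUs needed for a given item size and write rate. */
    static long writeCapacityUnits(int itemSizeBytes, int writesPerSecond) {
        return ceilDiv(itemSizeBytes, WRITE_UNIT_BYTES) * writesPerSecond;
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }
}
```

For example, 100 strongly consistent reads per second of a 6 KB item need 200 RCUs (each read rounds up to two 4 KB units), or 100 RCUs if eventual consistency is acceptable.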
DynamoDB Transactions & Conditional Writes
What TransactWriteItems makes up to 100 write operations atomic across multiple items or tables: all succeed or all fail, with no partial state. Conditional writes add optimistic locking: a write only applies if a specified condition on the item holds. Example: ConditionExpression: "attribute_not_exists(orderId)" prevents duplicate order creation even under concurrent Lambda invocations.
Why Distributed systems retry, so writes must be idempotent to achieve an exactly-once effect. Philip implements idempotent APIs at Boscov's; conditional writes in DynamoDB are the mechanism. Without them, race conditions create duplicate records when retries overlap.
Gotcha
- Transactions consume 2× the capacity units (reads and writes are doubled). Budget accordingly.
- TransactionConflictException occurs when two concurrent transactions touch the same item. Implement exponential backoff and retry in your Lambda handler.
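One way to sketch the backoff-and-retry advice in plain Java. `BackoffRetry` is a hypothetical helper, and for brevity it retries on any exception; a real handler would retry only on the conflict exception and rethrow everything else. The delay uses "full jitter": a random wait between zero and an exponentially growing, capped bound.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

class BackoffRetry {

    /** Upper bound of the wait before retry `attempt` (0-based): base * 2^attempt, capped. */
    static long maxDelayMillis(int attempt, long baseMillis, long capMillis) {
        long exponential = baseMillis << Math.min(attempt, 20); // clamp the shift to avoid overflow
        return Math.min(capMillis, exponential);
    }

    static <T> T callWithRetry(Callable<T> operation, int maxAttempts,
                               long baseMillis, long capMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts - 1) {
                    break; // out of attempts, rethrow below
                }
                long bound = maxDelayMillis(attempt, baseMillis, capMillis);
                // Full jitter: sleep a random time in [0, bound] so retries do not synchronise.
                Thread.sleep(ThreadLocalRandom.current().nextLong(bound + 1));
            }
        }
        throw last;
    }
}
```

A call such as `BackoffRetry.callWithRetry(() -> writeOrder(order), 5, 50, 1000)` would try up to five times, where `writeOrder` is whatever method performs the transactional write.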
Global Secondary Index (GSI)
What A GSI lets you query a DynamoDB table on attributes other than the primary key. A GSI has its own partition key (and optional sort key) that can be any table attribute. Data is replicated asynchronously from the base table to the GSI. You can have up to 20 GSIs per table. You pay separately for GSI storage and throughput.
Gotcha
- GSI reads are eventually consistent — the GSI may lag behind the base table by milliseconds to seconds. Never use a GSI for reads that must reflect the very latest write.
Hands-on Tasks
Interview Q&A
What is a Lambda cold start and how do you mitigate it for Java?
A cold start occurs when Lambda provisions a new execution environment — it must download code, start the JVM, and run initialisation code before handling the request. For Java, this typically adds 2–10 seconds of latency. Subsequent requests to the same warm container run in milliseconds.
Mitigations in order of effectiveness: (1) Lambda SnapStart — enable on Java 11+ functions; AWS snapshots the initialised environment and restores from it, reducing cold starts to ~1 second. (2) Provisioned Concurrency — keeps N containers pre-initialised, eliminating cold starts at extra cost. (3) Avoid Spring Boot — Spring's context initialisation is heavy; use Quarkus with native compilation or plain Java. (4) Keep init code minimal — defer expensive initialisation until first use rather than in static blocks.
When would you choose DynamoDB over PostgreSQL (RDS)?
Choose DynamoDB when: you need single-digit millisecond latency at massive scale, your access patterns are simple and known upfront (get by ID, query by partition key), you need near-infinite scalability without manual sharding, or you're building session storage, leaderboards, IoT data, or gaming backends.
Stick with PostgreSQL/RDS when: your data has complex relationships requiring joins, you need ad-hoc queries or reporting, you have transactional requirements (ACID across multiple entities), or your team thinks in relational terms — the operational overhead of DynamoDB table design is significant for teams not familiar with NoSQL access patterns. DynamoDB requires you to design your data model around your queries upfront, not the other way around.
What's the difference between API Gateway REST API and HTTP API?
HTTP API is the newer, simpler, and cheaper option (~$1/million requests). It covers the majority of use cases: Lambda proxy integration, JWT authentication, CORS, and custom routes. Lower latency than REST API.
REST API (~$3.50/million requests) adds features not in HTTP API: response caching (reduces Lambda invocations and latency for cacheable responses), usage plans and API keys (for partner/developer APIs with rate limits per key), AWS WAF integration (web application firewall), request/response transformation without Lambda, and resource-level policies.
Default choice: HTTP API. Use REST API specifically when you need caching, usage plans, or WAF.
9
Infrastructure as Code
Topics
Terraform
Core Concepts: Providers, Resources, Variables
What Terraform is an open-source IaC tool by HashiCorp using HCL (HashiCorp Configuration Language). Providers are plugins that interact with APIs (the aws provider calls AWS APIs). Resources are the infrastructure you declare (aws_s3_bucket, aws_instance). Variables parameterise your configuration so the same code can provision dev, staging, and prod.
Minimal example
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  type    = string
  default = "us-east-1"
}

# Bucket name is parameterised by region
resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-app-uploads-${var.region}"
}

resource "aws_s3_bucket_versioning" "my_bucket" {
  bucket = aws_s3_bucket.my_bucket.id
  versioning_configuration { status = "Enabled" }
}
Terraform Workflow: init → plan → apply
What The standard Terraform workflow is three commands, plus destroy for teardown:
- terraform init: download providers and modules, initialise the backend. Run once per project or after backend/provider changes.
- terraform plan: show what will be created, changed, or destroyed. Reads current state and compares to your code. Always review this output before applying.
- terraform apply: execute the plan. Prompts for confirmation. Writes results to state file.
- terraform destroy: destroys all resources managed by the current state. Use carefully.
Gotcha
- Never run terraform apply in CI without first reviewing the plan output. A -replace or -destroy flag in the wrong hands deletes production resources.
- terraform import: bring an existing manually-created resource under Terraform management without recreating it.
Remote State (S3 + DynamoDB)
What Terraform state (terraform.tfstate) is a JSON file that maps your code to real-world resources. By default it is stored locally. For teams, store it remotely in S3 (for shared access and versioning) with a DynamoDB table for state locking (prevents two people from applying simultaneously and corrupting state).
backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
Gotcha
- State files can contain sensitive values (RDS passwords, private keys). Enable S3 bucket encryption and restrict access with bucket policies.
- Never edit the state file manually. Use terraform state mv, terraform state rm, or terraform import for state surgery.
Modules
What Reusable groups of Terraform resources. A module is a directory of .tf files. Call it with a module block and pass variables. Modules enforce consistency: define your VPC pattern once and call it for dev, staging, and prod. The Terraform Registry has community modules for common patterns (AWS VPC, EKS, RDS).
Gotcha
- Avoid deeply nested modules — they make debugging hard. Flat module structures are more readable.
- Pin module versions (version = "5.5.0") to avoid unexpected breaking changes from upstream updates.
AWS CDK (for Java & TypeScript Teams)
AWS Cloud Development Kit (CDK)
What CDK lets you write infrastructure in a real programming language (Java, TypeScript, Python, or Go) instead of YAML or HCL. A CDK app synthesises to CloudFormation templates under the hood. You get loops, conditionals, functions, unit tests, and type safety. The CDK Constructs Library provides high-level abstractions: new ApplicationLoadBalancedFargateService(this, "Api", {...}) creates the ECS cluster, task definition, ALB, target group, and security groups in one call.
Why For engineers coming from a Java/Spring background, CDK is a natural fit: you think in code, not YAML. CDK is increasingly the preferred IaC approach at enterprise AWS shops because infrastructure can be tested with JUnit, reused as libraries, and maintained with the same tooling as application code.
Minimal CDK stack (Java)
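import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.s3.BlockPublicAccess;
import software.amazon.awscdk.services.s3.Bucket;
import software.constructs.Construct;

public class MyStack extends Stack {
    public MyStack(Construct scope, String id) {
        super(scope, id);
        // Versioned bucket with all public access blocked
        Bucket bucket = Bucket.Builder.create(this, "MyBucket")
            .versioned(true)
            .blockPublicAccess(BlockPublicAccess.BLOCK_ALL)
            .build();
    }
}
// Deploy: cdk synth → cdk deploy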
Gotcha
- CDK synthesises to CloudFormation — you get Change Sets and rollback protection for free. The downside: CDK's generated templates are verbose and hard to read directly.
- CDK is opinionated. It sets sensible defaults (encryption on S3, deletion protection on RDS). Sometimes you need to explicitly opt out of a sensible default, which requires reading the Construct docs carefully.
AWS CloudFormation
Templates and Stacks
What CloudFormation is AWS's native IaC service. You define infrastructure in YAML or JSON templates. Deploying a template creates a stack — a collection of AWS resources managed as a unit. AWS determines the correct creation order based on dependencies between resources. CloudFormation is free — you pay only for the resources it creates.
Template structure
AWSTemplateFormatVersion: '2010-09-09'
Description: My application stack
Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
Resources: # required
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: ami-0abcdef1234567890
Outputs:
  InstanceId:
    Value: !Ref MyInstance
Gotcha
- If a stack update fails, CloudFormation automatically rolls back to the previous state. This is generally helpful but means you must investigate why the update failed before retrying.
- Deleting a stack deletes all resources in it (by default). Use DeletionPolicy: Retain on critical resources like RDS and S3 to protect them.
Change Sets and Drift Detection
What A Change Set is a preview of what CloudFormation will do when you update a stack, equivalent to terraform plan. Always create a change set and review it before updating production stacks. Drift detection identifies resources that have been modified outside of CloudFormation (e.g., someone changed a Security Group in the console). Drift detection does not auto-remediate: it reports differences so you can decide what to do.
Gotcha
- Manual changes to CloudFormation-managed resources cause drift. If you update a resource manually and then run CloudFormation, it may overwrite your manual change or fail with a conflict. Always make changes through CloudFormation or Terraform — never manually in the console for IaC-managed resources.
Hands-on Tasks
Interview Q&A
Why use Infrastructure as Code instead of clicking in the console?
Four reasons: Reproducibility — run the same code and get identical environments. No manual steps that differ between dev, staging, and prod. Version control — infrastructure changes are committed to Git, code-reviewed in PRs, and have a full audit trail. You know what changed, when, and why. Disaster recovery — if an environment is destroyed, recreate it from code in minutes rather than days of manual work. Drift prevention — IaC is the source of truth; manual console changes are detected as drift and can be corrected. The alternative — clicking in the console — produces "snowflake servers" that are impossible to reproduce exactly.
What is Terraform state and why does it need to be stored remotely?
Terraform state (terraform.tfstate) is a JSON file mapping your code to real AWS resources. Without it, Terraform cannot know what it already created, so it would try to create everything again on the next apply. The state file also tracks resource attributes (IDs, ARNs) needed to create dependencies.
Remote state (in S3) is required for team use: if each engineer has their own local state file, there is no shared understanding of what is deployed and two people applying simultaneously can corrupt state or create duplicate resources. DynamoDB state locking prevents concurrent applies by creating a lock entry before any operation and releasing it after. Encryption on the S3 bucket is important because state files often contain sensitive values like database passwords.
What is a CloudFormation change set and why should you always use one?
A change set is CloudFormation's preview of what will happen when you update a stack: it lists which resources will be added, modified, or removed, and whether a modification requires replacement. Replacement is the critical case: some property changes require CloudFormation to delete and recreate a resource (e.g., changing an RDS instance's DBInstanceIdentifier, or changing a resource's logical ID in the template). Without reviewing the change set, you could accidentally delete a production database by making what looks like a minor configuration change.
Always create and review a change set before updating any production stack. The workflow is: create change set → review (specifically look for any "Replacement: True" entries) → execute if safe, or modify the template if not. This is the IaC equivalent of code review before deployment.
10
ElastiCache / Redis
Topics
ElastiCache Fundamentals
ElastiCache for Redis vs Memcached
What ElastiCache is AWS's managed in-memory caching service. It supports two engines: Redis and Memcached. Redis supports rich data structures (lists, sets, sorted sets, hashes), optional persistence (RDB snapshots and AOF), pub/sub messaging, Lua scripting, transactions, and cluster mode for horizontal sharding. Memcached is simpler, multi-threaded, and has no persistence or replication.
Why Use Redis for virtually every use case: caching, session storage, leaderboards (sorted sets), pub/sub, rate limiting, and distributed locks. Memcached's multi-threaded model can outperform Redis at extreme throughput for pure simple key-value operations, but for a Spring Boot e-commerce application Redis is the right default — it supports all the patterns you will need as requirements evolve.
Gotcha
- ElastiCache Redis does not support all Redis commands. In cluster mode, commands that operate across multiple keys (e.g., MGET with keys in different slots, KEYS *, Lua scripts touching multiple keys) are restricted or behave differently. Always test your Redis command usage against a cluster-mode instance before going to production.
- Memcached on ElastiCache does not support Multi-AZ automatic failover or replication. If a Memcached node fails, all data in that node is lost.
Cluster Mode Disabled vs Enabled
What Cluster Mode Disabled: a single shard with one primary node and up to 5 read replicas. All data lives on the one shard. Maximum data size is determined by the node type (e.g., cache.r6g.xlarge provides ~26 GB). Cluster Mode Enabled: data is sharded (partitioned by key hash slot) across up to 500 shards, each with its own primary and replicas — enabling both horizontal read and write scaling and datasets larger than a single node can hold.
Why For most Spring Boot applications — including Boscov's inventory caching — cluster mode disabled with 1–2 read replicas in separate Availability Zones is sufficient. It provides Multi-AZ failover (automatic promotion of a replica if the primary fails) without the complexity of cluster-mode key routing. Enable cluster mode only when your dataset genuinely exceeds a single large node or when write throughput requires sharding.
Gotcha
- Spring Boot's default Redis client is Lettuce (not Jedis). Lettuce can follow cluster topology changes (for example, a replica promoted after a primary failure) and reconnect without an application restart, but in cluster mode topology refresh is off by default: enable it with spring.data.redis.lettuce.cluster.refresh.adaptive=true and, optionally, a refresh period. Jedis requires more manual cluster configuration.
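- Switching between cluster mode disabled and enabled is not a routine operation. Older engine versions require recreating the cluster, and newer ones (Redis 7.0+) support only a one-way migration from disabled to enabled. Plan your topology before provisioning.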
- In cluster mode enabled, all keys in a MULTI/EXEC transaction or a Lua script must hash to the same slot. Use hash tags (e.g., {user:123}:cart and {user:123}:profile) to force related keys into the same slot.
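To see why hash tags work, here is a sketch of the slot calculation described in the Redis Cluster specification: CRC16 (XMODEM variant) of the key, modulo 16384, where only the text inside the first non-empty {...} is hashed when a tag is present. The class name `RedisHashSlot` is illustrative; client libraries already do this for you.

```java
import java.nio.charset.StandardCharsets;

class RedisHashSlot {

    /** CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF; // keep the register at 16 bits
            }
        }
        return crc;
    }

    /** Returns the hash tag content if the key has a non-empty {tag}, else the whole key. */
    static String hashedPart(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    static int slot(String key) {
        return crc16(hashedPart(key).getBytes(StandardCharsets.UTF_8)) % 16384;
    }
}
```

Because both `{user:123}:cart` and `{user:123}:profile` hash only `user:123`, they always land in the same slot and can share a transaction or Lua script.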
Caching Patterns
Cache-Aside Pattern (Lazy Loading)
What The application controls the cache explicitly. On a read: check cache → if hit, return cached value; if miss, fetch from database, write the result to cache with a TTL, return the result. The cache is populated lazily: only data that is actually requested gets cached. This is the most common caching pattern and the one @Cacheable in Spring implements.
Why Simple to implement and resilient: if the cache fails, the application can fall through to the database (provided cache errors are caught rather than propagated). Cache contains only data that has been requested, so memory is not wasted on data that is never read. TTL ensures stale entries eventually expire. This is what delivered the 60%+ read performance improvement on Boscov's inventory reads: product and inventory data is read far more often than it is written.
Gotcha
- TTL is mandatory. Without a TTL, cached entries never expire. If the underlying data changes in the database and the cache is not explicitly evicted, the application will serve stale data indefinitely. Always set a TTL appropriate to your data's rate of change.
- Cache stampede (thundering herd): when a popular key's TTL expires (or on a cold start), dozens or hundreds of simultaneous requests all miss the cache and fire concurrent queries to the database — which can overwhelm it. Mitigate with: (1) adding random jitter to TTLs (e.g., 55–65 seconds instead of exactly 60) so hot keys don't expire simultaneously; (2) probabilistic early expiry (PER) — start refreshing a key before it expires based on a probability function; (3) a single-flight/request-coalescing pattern where only one thread fetches from the DB and others wait for the result.
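Two of the stampede mitigations above, TTL jitter and single-flight loading, can be sketched with the JDK alone. `StampedeGuard` is a hypothetical helper, not a Spring or Redis API; the loader function stands in for the database query.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

class StampedeGuard<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** Base TTL plus or minus up to `spread`, so hot keys do not all expire together. */
    static Duration jitteredTtl(Duration base, Duration spread) {
        long spreadMs = spread.toMillis();
        long offset = ThreadLocalRandom.current().nextLong(-spreadMs, spreadMs + 1);
        return base.plusMillis(offset);
    }

    /** Single-flight: concurrent callers for the same key share one loader call. */
    V load(K key, Function<K, V> loader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            return existing.join(); // another thread is already loading this key
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e); // waiting callers see the failure too
            throw e;
        } finally {
            inFlight.remove(key, mine);
        }
    }
}
```

On a cache miss, the caller would invoke `guard.load(key, repositoryLookup)` and then write the result to Redis with `jitteredTtl(Duration.ofSeconds(60), Duration.ofSeconds(5))`, giving the 55 to 65 second range mentioned above.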
Write-Through Pattern
What On every database write, the application also writes the same data to the cache. The cache is always warm and consistent with the database. There is no cache-miss penalty for recently written data, and there is no stampede risk because the cache is pre-populated on write rather than on first read.
Why Best for workloads where data is written and then immediately read (e.g., a user updates their profile and then views it). Guarantees that the cache always reflects the latest database state without relying on TTL expiry or explicit eviction.
Gotcha
- Write penalty: every database write now requires a second write to Redis. Writes are slower, and the write path now has a dependency on Redis availability.
- Cache pollution: data written to the cache may never be read, wasting memory. For write-heavy datasets with unpredictable read access patterns (e.g., audit logs, bulk imports), write-through wastes cache space.
- Hybrid approach: use write-through for hot, frequently-read data (user sessions, current inventory counts for top products) and cache-aside for cold or unpredictable data. Many production systems combine both patterns by data category.
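A minimal in-memory sketch of write-through, with two maps standing in for the database and for Redis (both are hypothetical stand-ins, as is the `WriteThroughStore` name). It shows the defining property: the write path updates the cache, so the first read after a write is already a hit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class WriteThroughStore {
    // Stand-ins for the real database and the Redis cache.
    final Map<Long, String> database = new ConcurrentHashMap<>();
    final Map<Long, String> cache = new ConcurrentHashMap<>();

    /** Write-through: the cache is updated on the same code path as the database. */
    void save(long id, String value) {
        database.put(id, value); // system of record first
        cache.put(id, value);    // then the cache, so the next read is a hit
    }

    /** Reads still fall back to the database if an entry was evicted or never written through. */
    String read(long id) {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        String fromDb = database.get(id);
        if (fromDb != null) {
            cache.put(id, fromDb);
        }
        return fromDb;
    }
}
```

The fallback in `read` is the cache-aside half of the hybrid approach: data written before the cache existed, or evicted since, is still served correctly.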
Spring Boot Integration
Spring Boot Integration (spring-boot-starter-data-redis)
What Spring Boot's Redis starter auto-configures a Lettuce connection factory and a RedisTemplate. Add @EnableCaching to your main class to activate the Spring Cache abstraction. Annotate service methods with @Cacheable, @CacheEvict, and @CachePut to declaratively manage the cache without writing Redis commands manually.
application.properties configuration
# Fetch host from environment variable injected via Secrets Manager / EKS secret
spring.data.redis.host=${REDIS_HOST}
spring.data.redis.port=6379
spring.data.redis.password=${REDIS_AUTH_TOKEN}
# Required when TLS is enabled on the cluster.
# (.properties files have no inline comments, so keep comments on their own line.)
spring.data.redis.ssl.enabled=true
# Connection pool (Lettuce); requires commons-pool2 on the classpath
spring.data.redis.lettuce.pool.max-active=20
spring.data.redis.lettuce.pool.max-idle=10
spring.data.redis.lettuce.pool.min-idle=2
Cache configuration bean (TTL + JSON serializer)
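import java.time.Duration;
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.serializer.GenericJackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.RedisSerializationContext;
import org.springframework.stereotype.Service;

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public RedisCacheConfiguration redisCacheConfiguration() {
        // Stores type information in the JSON so @Cacheable can rebuild InventoryItem.
        // A Jackson2JsonRedisSerializer bound to Object.class would return a LinkedHashMap.
        GenericJackson2JsonRedisSerializer serializer = new GenericJackson2JsonRedisSerializer();
        return RedisCacheConfiguration.defaultCacheConfig()
            .entryTtl(Duration.ofSeconds(60))
            .serializeValuesWith(
                RedisSerializationContext.SerializationPair.fromSerializer(serializer));
    }
}

// Service layer (InventoryRepository and InventoryItem are the application's own types)
@Service
public class InventoryService {
    private final InventoryRepository inventoryRepository;

    public InventoryService(InventoryRepository inventoryRepository) {
        this.inventoryRepository = inventoryRepository;
    }

    @Cacheable(value = "inventory", key = "#id")
    public InventoryItem getInventoryById(Long id) {
        return inventoryRepository.findById(id).orElseThrow();
    }

    @CacheEvict(value = "inventory", key = "#id")
    public void updateInventory(Long id, InventoryItem item) {
        inventoryRepository.save(item);
    }
}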
Manual RedisTemplate usage (for complex operations)
// Requires a RedisTemplate<String, Object> bean with a JSON value serializer;
// Spring Boot only auto-configures RedisTemplate<Object, Object> and StringRedisTemplate.
@Autowired
private RedisTemplate<String, Object> redisTemplate;

// Manual cache-aside
public InventoryItem getWithManualCache(Long id) {
    String key = "inventory::" + id;
    InventoryItem cached = (InventoryItem) redisTemplate.opsForValue().get(key);
    if (cached != null) return cached;              // cache hit
    InventoryItem item = inventoryRepository.findById(id).orElseThrow();
    redisTemplate.opsForValue().set(key, item, Duration.ofSeconds(60)); // populate with TTL
    return item;
}
Gotcha
- Default JDK serialization is unreadable. Out of the box, Spring's RedisTemplate uses JDK serialization: cached values appear as binary gibberish in redis-cli and are tied to the exact class structure. Always configure Jackson2JsonRedisSerializer or GenericJackson2JsonRedisSerializer so cached objects are stored as human-readable JSON and tolerate compatible class changes such as added fields.
- When using GenericJackson2JsonRedisSerializer, the serialized JSON includes the fully qualified class name. This means renaming or moving a class will break deserialization of existing cached entries. Plan cache key versioning or use flushdb on deployment when class structure changes.
Security and Monitoring
Security — VPC, TLS, AUTH, Secrets Manager
What ElastiCache clusters run entirely inside your VPC — there is no public endpoint. Access is controlled by Security Groups. Enable in-transit encryption (TLS) and at-rest encryption at cluster creation time. Protect the cluster with a Redis AUTH token (a static password) stored in AWS Secrets Manager.
Why Redis was originally designed for trusted networks with no authentication. An unprotected ElastiCache cluster in a VPC is protected only by Security Groups — if another compromised resource in the same VPC can reach port 6379, it has full access to all cached data. Defence in depth: Security Group restriction + TLS in transit + AUTH token + at-rest encryption.
Security Group rule
# ElastiCache Security Group — inbound rule
Type: Custom TCP
Port: 6379
Source: [Application Security Group ID] # NOT 0.0.0.0/0, NOT the VPC CIDR
# Spring Boot: use rediss:// (double-s) when TLS is enabled
spring.data.redis.url=rediss://:${REDIS_AUTH_TOKEN}@${REDIS_HOST}:6379
Gotcha
- The AUTH token is a static string. IAM authentication is available only on newer engine versions (Redis 7.0+) using RBAC users over TLS; on older versions, or when you rely on an AUTH token, there is no way to use an IAM role to authenticate to Redis as you can with RDS IAM auth. Store the token in Secrets Manager, inject it into your application at startup via an environment variable or Spring Cloud AWS, and rotate it via Secrets Manager's rotation plus aws elasticache modify-replication-group --auth-token <new-token> --auth-token-update-strategy ROTATE.
- At-rest encryption must be enabled at creation time. In-transit encryption can be turned on for an existing cluster only on recent engine versions (Redis 7.0.5+); on older versions it must also be set at creation. Plan for this before provisioning your first cluster.
- Spring's rediss:// URL (double s) enables TLS. Using redis:// against a TLS-enabled cluster will result in a connection error that can be difficult to diagnose.
CloudWatch Metrics for ElastiCache
What ElastiCache publishes metrics to CloudWatch under the AWS/ElastiCache namespace. Key metrics:
- CacheHits and CacheMisses (calculate hit rate: CacheHits / (CacheHits + CacheMisses))
- CurrConnections (current client connections)
- Evictions (keys evicted because maxmemory was reached)
- EngineCPUUtilization (Redis is single-threaded for commands; high CPU here means your commands are slow)
- FreeableMemory (remaining free memory on the node)
- ReplicationLag (seconds the replica is behind the primary; relevant for read-replica reads)
Why These metrics are your window into cache health. A hit rate below 80% means the cache is not providing much benefit: investigate TTLs and access patterns. Any evictions mean Redis is running out of memory and ejecting valid cached data to make room for new data, so your application suddenly gets more DB read traffic without warning. CurrConnections growing toward the node's maxclients limit indicates a connection leak in the application.
Recommended alarms
- Evictions > 0 for 1 datapoint: any eviction means Redis is memory-constrained and is silently invalidating your cache. Action: upgrade the node type or reduce the TTL of lower-priority keys.
- CurrConnections approaching maxclients: the default maxclients is 65000 but is lower for smaller node types (e.g., cache.t3.micro has a lower limit). New connections will be refused when the limit is hit, causing application errors. Check for connection leaks (Lettuce connection pool misconfiguration).
- ReplicationLag > 1 second: reads from replicas may return stale data older than 1 second. Investigate write throughput, since the replica may be falling behind on replication.
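The hit-rate formula and the two simplest thresholds above (hit rate under 80%, any evictions) can be expressed as a small check. This is an illustrative sketch with a hypothetical `CacheHealth` class; in practice the same logic would live in CloudWatch metric math and alarms rather than in application code.

```java
class CacheHealth {

    /** Hit rate in [0, 1], or NaN when there was no traffic in the period. */
    static double hitRate(long cacheHits, long cacheMisses) {
        long total = cacheHits + cacheMisses;
        return total == 0 ? Double.NaN : (double) cacheHits / total;
    }

    /** True when the hit rate is below 80% or any key was evicted. */
    static boolean needsAttention(long cacheHits, long cacheMisses, long evictions) {
        double rate = hitRate(cacheHits, cacheMisses);
        boolean lowHitRate = !Double.isNaN(rate) && rate < 0.80;
        return lowHitRate || evictions > 0;
    }
}
```

A period with no traffic is treated as "no data" rather than as a 0% hit rate, so an idle cache does not raise a false alarm.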
Hands-on Tasks
Interview Q&A
When would you add Redis in front of RDS rather than just adding a Read Replica?
Read Replicas scale read throughput but each read still hits PostgreSQL — they help with concurrent read volume but not latency. Every query to a Read Replica still involves parsing SQL, planning, disk I/O, and network round-trips, producing typical latencies of 5–50ms. Redis serves cached responses from memory in under 1ms.
Use Redis when: the same data is read repeatedly with the same parameters (product catalogue, user profiles, inventory counts for popular items), read latency matters more than perfect consistency, or the database is CPU-bound on read queries that are expensive to compute repeatedly.
Read Replicas are the better choice when you need fresh data on every read (financial balances, live inventory at checkout), need SQL flexibility (ad-hoc queries, joins, aggregations), or your data access patterns are highly unpredictable and cache keys are difficult to define. In practice, most production e-commerce systems use both: Redis in front of RDS for the 80% of reads that are repetitive, and Read Replicas to handle the remaining query load that cannot be cached.
How do you handle cache invalidation — the hardest problem in distributed systems?
Three strategies, each with different consistency/complexity trade-offs:
1. TTL-based expiry: accept stale data for the duration of the TTL window and let entries expire naturally. This is simple, requires no coupling between the write path and Redis, and works well for data where brief staleness is acceptable — product prices, user preferences, category lists. The downside is that stale data can be served for up to the TTL duration after a write.
2. Event-driven eviction (
3. Write-through: on every write, update both the database and the cache atomically. The cache is always consistent with the database. Writes are slower and the cache may fill with data that is never re-read.
In practice for Boscov's inventory: use TTL (60 seconds) as the baseline for most inventory reads, combined with event-driven eviction on explicit updates — so a product update is reflected immediately rather than waiting 60 seconds. This hybrid approach is the standard production pattern.
1. TTL-based expiry: accept stale data for the duration of the TTL window and let entries expire naturally. This is simple, requires no coupling between the write path and Redis, and works well for data where brief staleness is acceptable — product prices, user preferences, category lists. The downside is that stale data can be served for up to the TTL duration after a write.
2. Event-driven eviction (@CacheEvict): on every database write, immediately evict the corresponding cache key. The next read misses the cache and repopulates it with fresh data. This keeps reads consistent as long as the eviction succeeds, but it adds write latency (an extra Redis call on every write) and couples the write path to Redis availability: if Redis is down during a write, the eviction may fail and the stale entry stays in the cache.
3. Write-through: on every write, update both the database and the cache as part of the same operation. The cache stays consistent with the database, but writes are slower and the cache may fill with data that is never re-read.
In practice for Boscov's inventory: use TTL (60 seconds) as the baseline for most inventory reads, combined with event-driven eviction on explicit updates — so a product update is reflected immediately rather than waiting 60 seconds. This hybrid approach is the standard production pattern.
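The hybrid described above (a TTL baseline plus eviction on explicit writes) can be sketched without Redis at all. This is a minimal in-memory illustration, not Spring's cache abstraction: `TtlCache`, its injectable clock, and the `loader` callback are hypothetical names standing in for Redis and the database query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Cache-aside store with a TTL baseline plus explicit eviction on writes (illustrative only). */
final class TtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be exercised without sleeping

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, or loads it from the source of truth on a miss or after expiry. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> cached = entries.get(key);
        if (cached != null && now < cached.expiresAtMillis()) {
            return cached.value(); // may be stale for up to ttlMillis
        }
        V fresh = loader.apply(key);
        entries.put(key, new Entry<>(fresh, now + ttlMillis));
        return fresh;
    }

    /** Event-driven eviction: call after a successful database write. */
    void evict(K key) {
        entries.remove(key);
    }
}
```

A write that skips `evict` stays invisible to readers until the TTL elapses; a write followed by `evict` is visible on the very next read. That is the trade-off the 60-second TTL plus explicit eviction is meant to balance.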
What happens if ElastiCache goes down? How does your Spring Boot app behave?
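By default, with spring-boot-starter-data-redis and @Cacheable, a Redis connection failure surfaces as a RedisConnectionFailureException (Spring Data's wrapper around the driver's connection error), which propagates up through your service method and returns a 500 error to the client. A Redis outage therefore takes down application reads entirely, even though the database is healthy. This is the wrong behaviour: Redis is a performance optimisation, not a critical dependency.
The correct fix is a custom CacheErrorHandler:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cache.Cache;
import org.springframework.cache.annotation.CachingConfigurerSupport;
import org.springframework.cache.interceptor.CacheErrorHandler;
import org.springframework.cache.interceptor.SimpleCacheErrorHandler;
import org.springframework.context.annotation.Configuration;

@Configuration
// CachingConfigurerSupport is deprecated since Spring Framework 6.0; there, implement CachingConfigurer instead.
public class CacheConfig extends CachingConfigurerSupport {
    private static final Logger log = LoggerFactory.getLogger(CacheConfig.class);

    @Override
    public CacheErrorHandler errorHandler() {
        return new SimpleCacheErrorHandler() {
            @Override
            public void handleCacheGetError(RuntimeException e, Cache cache, Object key) {
                log.warn("Redis GET failed for key {}, falling through to DB", key, e);
                // do not rethrow: Spring treats this as a cache miss and calls the underlying method
            }
            // override handleCachePutError, handleCacheEvictError similarly
        };
    }
}
With this handler, any Redis exception on a GET is logged and suppressed, and Spring calls the actual service method (which queries the DB) instead. The application degrades gracefully: slower but correct.
Also configure connection and command timeouts (spring.data.redis.connect-timeout and spring.data.redis.timeout, for example 500ms) and bound the connection pool so a slow Redis does not block application threads indefinitely.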
11
Prometheus + Grafana
Topics
Prometheus Fundamentals
Prometheus Pull Model and Metrics Types
What
Prometheus scrapes HTTP endpoints (your app's /actuator/prometheus) on a configurable interval. Unlike CloudWatch, which you push to, Prometheus pulls, meaning your app doesn't need AWS credentials or an SDK to emit metrics. Four metric types: Counter (monotonically increasing: requests served), Gauge (current value: active connections, heap used), Histogram (distribution with buckets: request latency), Summary (quantiles, less flexible than histogram).
Why
The pull model decouples your application from the metrics infrastructure: your app exposes a passive endpoint and Prometheus does the work of collecting. Your app has no outbound network dependency for metrics, no AWS credentials needed, and if Prometheus goes down your application keeps running unaffected.
Gotcha
- Counters only go up; never reset or decrement them in application code. If you need a rate, use rate() in PromQL. Never use a Gauge for something that only increases.
- Histograms require pre-defined buckets. If your p99 latency exceeds your highest finite bucket boundary, histogram_quantile() cannot see past that boundary: it returns the boundary itself and silently under-reports the true latency. Configure buckets that cover your expected latency range.
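To see why bucket coverage matters, here is a simplified sketch of the interpolation that histogram_quantile() performs on a classic histogram. It assumes cumulative bucket counts, a final +Inf bucket, at least one finite bucket, and 0 < q ≤ 1; the class and method names are hypothetical. As documented by Prometheus, a quantile that lands in the +Inf bucket is reported as the upper bound of the second-highest bucket.

```java
/** Simplified quantile estimation over cumulative histogram buckets (illustrative only). */
final class HistogramQuantile {
    /**
     * @param q                quantile in (0, 1]
     * @param upperBounds      ascending bucket upper bounds; the last must be Double.POSITIVE_INFINITY
     * @param cumulativeCounts cumulative observation count per bucket, same length as upperBounds
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations at all
        }
        double rank = q * total;
        int b = 0;
        while (b < last && cumulativeCounts[b] < rank) {
            b++; // find the first bucket whose cumulative count reaches the rank
        }
        if (b == last) {
            return upperBounds[last - 1]; // quantile fell in +Inf: capped at the highest finite bound
        }
        double start = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        // Linear interpolation inside the bucket, assuming observations are spread evenly.
        return start + (upperBounds[b] - start) * ((rank - below) / inBucket);
    }
}
```

With bounds of 0.1s, 0.5s, 1s and +Inf, a service whose slowest 5% of requests take several seconds would still show a p99 of exactly 1s, which is why the top finite bucket should sit above the latency you expect to see.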
kube-prometheus-stack Helm Chart
What
The community standard for Prometheus on Kubernetes. One helm install deploys: Prometheus Operator, Prometheus (metrics store), Grafana (dashboards), AlertManager (routing), node-exporter (node-level metrics), and kube-state-metrics (Kubernetes object metrics: pod count, deployment replicas, etc.).
Why
Running Prometheus manually on Kubernetes requires managing StatefulSets, RBAC, scrape configuration, and upgrade paths, which is a significant operational burden. kube-prometheus-stack bundles all best-practice configuration. The Prometheus Operator introduces custom resources (ServiceMonitor, PrometheusRule) that let you manage scraping and alerting as Kubernetes objects.
Gotcha
- The chart creates ~50 default Kubernetes alert rules, and some may fire immediately on first install. One of them, "Watchdog", fires continuously by design to prove alerting is working end-to-end.
- Review and tune the default rules for your cluster size. Some rules (e.g., "KubeMemoryOvercommit") require adjustment for small dev clusters.
Spring Boot Integration
Spring Boot → Prometheus: Micrometer Registry
What
Add the micrometer-registry-prometheus dependency (alongside spring-boot-starter-actuator) and expose /actuator/prometheus (add prometheus to management.endpoints.web.exposure.include). Spring Boot auto-configures JVM metrics (heap, GC pause duration, threads), Tomcat/Netty metrics, HikariCP pool metrics, and Spring MVC request latency histograms. Use @Timed("my.service.method") or inject MeterRegistry directly for custom metrics.
Why
Unlike micrometer-registry-cloudwatch2, which pushes metrics (requiring AWS credentials and network access), the Prometheus registry simply formats metrics as Prometheus text at the /actuator/prometheus endpoint. No credentials, no outbound connections, no cost per metric push.
pom.xml
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
application.properties
management.endpoints.web.exposure.include=health,info,metrics,prometheus
Gotcha
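- Histogram buckets must be configured for latency. By default Spring Boot publishes http.server.requests as count, sum, and max only, with no _bucket series, so histogram_quantile() has nothing to compute from. Either set management.metrics.distribution.percentiles-histogram.http.server.requests=true or define explicit SLO boundaries that cover your latency range, for example: management.metrics.distribution.slo.http.server.requests=50ms,100ms,200ms,500ms,1s,2s,5s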
ServiceMonitor: Telling Prometheus to Scrape Your App
What
The Prometheus Operator introduces the ServiceMonitor custom resource. Create one pointing at your app's Kubernetes Service on port 8080, path /actuator/prometheus. The Operator automatically updates Prometheus's scrape configuration; no manual Prometheus config file editing is needed.
Why
Without ServiceMonitor, adding a new service to Prometheus requires editing its config file and reloading. With ServiceMonitor, a developer deploys a YAML file alongside their app and Prometheus discovers it automatically, which is GitOps-friendly and self-service.
Example ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-spring-app
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-spring-app
  endpoints:
    - port: http   # name of the Service port, not its number
      path: /actuator/prometheus
      interval: 30s
Gotcha
- The ServiceMonitor must be in the same namespace as Prometheus, or the Prometheus resource must have serviceMonitorNamespaceSelector: {} to watch all namespaces. This is the most common reason scraping silently fails: the ServiceMonitor exists but Prometheus never picks it up.
- The label selector on the ServiceMonitor must match the labels on your Kubernetes Service object, not the Pod labels.
PromQL and Grafana
PromQL Basics
What
PromQL is Prometheus's query language. Essential queries for a Spring Boot app:
- Request rate: rate(http_server_requests_seconds_count{job="my-app"}[5m])
- Error rate: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) (aggregate both sides with sum(); dividing the raw series only matches 5xx series against themselves and always yields 1)
- p99 latency: histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m]))
- JVM heap used: jvm_memory_used_bytes{area="heap"}
- HikariCP pool usage: hikaricp_connections_active / hikaricp_connections_max
Why PromQL's power is in its label filtering and range vector functions. Slicing by URI, status code, or instance, and computing rates over rolling windows, enables questions that CloudWatch Metrics Insights requires complex metric math expressions to answer — PromQL answers them in one line.
Gotcha
- rate() requires a counter, not a gauge. Using rate() on a gauge gives nonsense. Use deriv() or delta() for gauges instead.
- histogram_quantile() requires the _bucket metric, not _count or _sum. Always use rate() on the bucket series inside the function.
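The first gotcha is easier to remember once you see what rate() assumes about its input. The sketch below is a simplified model: it treats any drop in value as a counter reset (a process restart) and divides the total increase by the sampled time span. Prometheus additionally extrapolates to the window boundaries, which is omitted here, and `CounterRate` is a hypothetical name. At least two samples with distinct timestamps are assumed.

```java
/** Simplified per-second rate over counter samples with reset handling (illustrative only). */
final class CounterRate {
    static double rate(long[] timestampsSeconds, double[] values) {
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A counter can only drop because it restarted from zero,
            // so after a drop the whole new value counts as increase.
            increase += (delta >= 0) ? delta : values[i];
        }
        long span = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return increase / span;
    }
}
```

Feeding a gauge into this logic shows the problem: a connection count that falls from 50 to 20 is interpreted as a reset followed by 20 new events, producing a positive "rate" for a value that actually went down.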
Grafana Dashboards
What
Grafana connects to Prometheus as a data source and renders PromQL results as panels (time-series graphs, gauges, stat tiles, tables). Import community dashboard ID 4701 (JVM Micrometer) for instant Spring Boot visibility. Key panels to build: request rate + error rate (two lines on one graph), p50/p95/p99 latency, active JVM threads, HikariCP pool saturation, pod count via kube_deployment_spec_replicas.
Why
Grafana's panel library and community dashboards (especially the JVM Micrometer dashboard) are more mature and flexible than CloudWatch dashboards. Grafana supports templating (dropdown to select service/namespace), annotations (mark deployments on graphs), and alerting from within dashboards.
Gotcha
- Grafana dashboards are stored in the Grafana pod's SQLite DB by default — it is ephemeral. If the pod restarts, you lose all custom dashboards. Configure a PVC for persistence or provision dashboards as ConfigMaps (GitOps-friendly, dashboards live in Git as JSON files).
AlertManager: Routing Alerts
What
AlertManager receives firing alerts from Prometheus rules and routes them to receivers (Slack, PagerDuty, email). Configure in values.yaml when installing kube-prometheus-stack. Define routes by label: severity: critical → PagerDuty, severity: warning → Slack. Alerting rules are defined as PrometheusRule custom resources.
Why
Prometheus evaluates alerting rules and fires alerts when conditions are met, but it does not send notifications directly. AlertManager handles deduplication, grouping, silencing, and routing to the right receiver. This separation keeps Prometheus focused on metrics and lets you change notification channels without touching alert rules.
Gotcha
- AlertManager has "inhibition" — a critical alert can suppress related warning alerts for the same component. Configure inhibition rules before going to production or one outage may page you 40 times for the same root cause.
- Test your alerting pipeline end-to-end with the "Watchdog" alert — it fires continuously and proves your full chain (Prometheus → AlertManager → receiver) works.
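The routing and inhibition behaviour described above can be modelled in a few lines. This is a conceptual sketch, not AlertManager's configuration format or matching engine: `AlertRouting`, the receiver names, and the `component` label are assumptions made for illustration, and real inhibit rules support arbitrary source and target matchers.

```java
import java.util.List;
import java.util.Map;

/** Conceptual model of severity-based routing and critical-over-warning inhibition. */
final class AlertRouting {
    record Alert(String name, Map<String, String> labels) {}

    /** Route by the severity label: critical → PagerDuty, warning → Slack. */
    static String receiverFor(Alert alert) {
        return switch (alert.labels().getOrDefault("severity", "")) {
            case "critical" -> "pagerduty";
            case "warning" -> "slack";
            default -> "default";
        };
    }

    /** Drops warnings while a critical alert with the same value for equalLabel is firing. */
    static List<Alert> applyInhibition(List<Alert> firing, String equalLabel) {
        return firing.stream()
            .filter(alert -> !isInhibited(alert, firing, equalLabel))
            .toList();
    }

    private static boolean isInhibited(Alert target, List<Alert> firing, String equalLabel) {
        if (!"warning".equals(target.labels().get("severity"))) {
            return false; // only warnings are inhibition targets in this sketch
        }
        String value = target.labels().get(equalLabel);
        return value != null && firing.stream().anyMatch(source ->
            "critical".equals(source.labels().get("severity"))
                && value.equals(source.labels().get(equalLabel)));
    }
}
```

The point of the equal-label check is scoping: a database outage should silence the warnings about that same database, not warnings about an unrelated service.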
Prometheus vs CloudWatch: When to Use Each
What CloudWatch: AWS-native, zero in-cluster infrastructure, best for EC2/Lambda/RDS/ECS infrastructure metrics, ALB access logs, VPC Flow Logs — anything AWS manages. Prometheus: best for application-level metrics on Kubernetes, PromQL is far more powerful for complex queries, Grafana is more flexible than CloudWatch dashboards.
Why Production answer: run both. CloudWatch for infra (node CPU, EKS control plane metrics, RDS), Prometheus+Grafana for app metrics (latency, error rate, pool saturation). Each tool is optimal in its domain.
Gotcha
- Prometheus requires cluster resources and operational care — StatefulSet storage, backup, capacity planning. CloudWatch is serverless. Factor operational overhead into the decision.
- You can send metrics to both by adding both micrometer-registry-prometheus and micrometer-registry-cloudwatch2. Micrometer fans out to all configured registries simultaneously.
Hands-on Tasks
Interview Q&A
What is Prometheus's pull model and why does it matter for security?
Prometheus polls your app's metrics endpoint on a schedule — your app doesn't push anything anywhere. This means: your app has no outbound network dependency for metrics (doesn't need AWS credentials, SDK, or internet access), and the metrics endpoint is inside the cluster, unreachable from outside by default. It also means the blast radius if Prometheus goes down is zero — your app keeps running, just without metrics collection.
The tradeoff: your app must be discoverable by Prometheus (via ServiceMonitor) and must be running to be scraped — short-lived batch jobs should use Pushgateway instead, which accepts pushed metrics and holds them until Prometheus scrapes.
When would you choose Prometheus + Grafana over CloudWatch for a Spring Boot service on EKS?
CloudWatch is excellent for AWS-managed resources (RDS CPU, ALB request count, node-level metrics from Container Insights) but PromQL is far more expressive for application-level analysis.
histogram_quantile(0.99, rate(http_server_requests_seconds_bucket{uri="/checkout"}[5m])), the p99 latency for a specific endpoint over a rolling 5-minute window, is a one-liner in PromQL and requires a complex metric math expression in CloudWatch.
Grafana's panel library and community dashboards (like the JVM Micrometer dashboard ID 4701) are also more mature than CloudWatch dashboards. Operationally: CloudWatch has zero in-cluster infrastructure cost; Prometheus requires cluster resources and operational care. Production answer: run both. Use CloudWatch for infra (node CPU, EKS control plane metrics) and Prometheus+Grafana for app metrics (latency, error rate, pool saturation).
A new engineer joins and can't find their service's metrics in Grafana. Walk through your debugging process.
Systematic five-step check:
1. Verify the pod is running and /actuator/prometheus returns data: kubectl exec -it <pod> -- curl localhost:8080/actuator/prometheus | head -5. If this fails, the app is the problem (missing dependency, endpoint not exposed).
2. Verify a ServiceMonitor exists and its label selector matches the labels on the Kubernetes Service (not the Pod).
3. Check the Prometheus Targets page (/targets): is the target listed? If it is listed but DOWN, the error message tells you why: wrong port, wrong path, or a network policy blocking the scrape.
4. If the target is not listed at all, check whether the ServiceMonitor's namespace is watched by Prometheus. The Prometheus resource needs serviceMonitorNamespaceSelector: {} to watch all namespaces.
5. Check RBAC: the Prometheus service account needs permission to list and watch Services, Endpoints, and Pods in that namespace in order to discover scrape targets.
Most failures are one of: wrong label selector, wrong namespace configuration, or missing RBAC.
12
ArgoCD / GitOps
Topics
GitOps Fundamentals
GitOps Principles
What
Git is the single source of truth for desired cluster state. Every change to the cluster happens through a Git commit: no manual kubectl apply, no --force in CI. The cluster continuously reconciles itself to match what's in Git. Benefits: full audit trail (git log is your change history), instant rollback (revert a commit), disaster recovery (recreate the entire cluster from a repo), and peer review via PR before production changes.
Why
Imperative deployments (kubectl apply from a CI step, aws ecs update-service) leave no durable record of what the cluster should look like right now. If a node is replaced or a namespace is accidentally deleted, there is no authoritative source to restore from. GitOps solves this: the Git repo is always the answer to "what should be running?"
Gotcha
- GitOps does not mean putting secrets in Git. Use External Secrets Operator (Git holds only a reference; the actual value lives in AWS Secrets Manager) or Sealed Secrets (Git holds an encrypted copy that only the cluster can decrypt). Committing a plaintext Secret manifest is a critical security failure.
- GitOps replaces cluster state management, not the entire pipeline. GitHub Actions still builds your image and pushes to ECR — GitOps takes over at the point of deploying to the cluster.
ArgoCD Core Concepts
ArgoCD Architecture
What
ArgoCD runs inside your EKS cluster (in the argocd namespace). Core components: API Server (web UI + CLI + API), Repository Server (clones and renders manifests from Git), Application Controller (continuously compares desired state from Git with live cluster state, detects drift, reconciles). An Application resource defines: source repo + path + target cluster + namespace. ArgoCD can manage multiple clusters from one control plane.
Why
Instead of your CI pipeline having kubectl apply at the end (imperative push), ArgoCD pulls from Git continuously (declarative pull). The cluster always self-heals to match the repo. This eliminates configuration drift: no more "what is actually running vs what was deployed last Tuesday?"
Gotcha
- ArgoCD itself is a workload in your cluster — if the cluster is down, ArgoCD is down. This is expected: Kubernetes keeps running whatever was last applied regardless of ArgoCD's state. Don't confuse an ArgoCD outage with an inability to serve traffic.
- ArgoCD stores application state in its own CRDs inside the cluster. Back up your ArgoCD Application resources or manage them via Git (App of Apps pattern) so they survive cluster recreation.
Application Resource and Sync Policies
What
The Application CRD is what you create to register a workload with ArgoCD. Key fields: repoURL, targetRevision (branch/tag/commit), path (folder in repo containing manifests), destination.server, destination.namespace. Sync policies: Manual (ArgoCD detects drift and shows it in the UI, but you press Sync to apply), Automatic (ArgoCD applies any detected diff immediately; add selfHeal: true to also revert manual kubectl changes).
Minimal Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      selfHeal: true
      prune: true # delete resources removed from Git
Gotcha
- Start with Manual sync for production, Automatic for dev. With Automatic + selfHeal, a misguided kubectl edit will be silently reverted, which is correct GitOps behaviour but surprises engineers used to imperative workflows.
- prune: true tells ArgoCD to delete cluster resources that no longer exist in Git. Without it, removed manifest files leave orphaned resources running in the cluster indefinitely.
Health Checks and Sync Status
What ArgoCD tracks two independent statuses per Application: Sync Status (Synced / OutOfSync — does Git match the cluster?) and Health Status (Healthy / Degraded / Progressing — are the resources actually working?). ArgoCD has built-in health checks for Deployments (checks rollout progress and available replicas), Services, Ingresses, etc. A Deployment is Healthy only when its desired replicas are all Ready.
Why These two axes give a precise picture of the cluster. Synced + Healthy is the target state. OutOfSync + Healthy means a Git change is pending. Synced + Degraded means the cluster matches Git but something is failing (e.g., a pod is crash-looping). OutOfSync + Degraded means both a pending change and a runtime failure — investigate Degraded first.
Gotcha
- OutOfSync does not mean broken — it means Git and the cluster differ. OutOfSync + Degraded together means something is both different AND not working. The combination matters more than either status alone.
- ArgoCD's health status for a Deployment is Progressing (not Healthy) during a rolling update. Alert on Degraded, not on Progressing — otherwise every deploy triggers an alert.
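The two status axes and the alerting advice above can be captured as a small decision table. This is only a reading aid: the enum and method names are hypothetical, and ArgoCD has further health states (for example Missing, Suspended, and Unknown) that this sketch leaves out.

```java
/** Reading aid for ArgoCD's two independent status axes (subset of the real states). */
final class ArgoStatus {
    enum Sync { SYNCED, OUT_OF_SYNC }
    enum Health { HEALTHY, PROGRESSING, DEGRADED }

    static String interpret(Sync sync, Health health) {
        if (health == Health.DEGRADED) {
            return sync == Sync.SYNCED
                ? "cluster matches Git but a resource is failing"
                : "pending change and a runtime failure: investigate Degraded first";
        }
        if (health == Health.PROGRESSING) {
            return "rollout in progress";
        }
        return sync == Sync.SYNCED ? "target state" : "Git change pending";
    }

    /** Page only on Degraded so routine rolling updates (Progressing) stay quiet. */
    static boolean shouldAlert(Health health) {
        return health == Health.DEGRADED;
    }
}
```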
Automation and Promotion
ArgoCD Image Updater
What
An ArgoCD addon that polls ECR (or Docker Hub) for new image tags and automatically commits the updated tag to the manifests repo. This closes the GitOps loop: your GitHub Actions CI builds and pushes a new image tag to ECR → Image Updater detects it → commits the new tag to Git → ArgoCD sees the Git change → syncs to the cluster. No human intervention, no kubectl set image in your pipeline.
Why
Without Image Updater, someone must manually update the image tag in the manifests repo every time a new image is pushed, which breaks the automation loop. Image Updater automates the "update manifests repo" step so the full cycle (code push → live deployment) is hands-off for dev and staging.
Gotcha
- Image Updater needs IAM permissions to read from ECR — use IRSA so no credentials are embedded in the cluster. It also needs write access to your Git repo (via SSH key or GitHub App token stored in a Kubernetes Secret).
- Configure the update strategy carefully: digest (always pull the latest digest of a tag, useful for latest tag workflows) vs semver (update to the latest version matching a semver constraint, e.g., ~1.2 updates to any 1.2.x). For production, semver with explicit constraints is safer than tracking latest.
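To make the ~1.2 example concrete, the sketch below picks the highest 1.2.x tag from a list of image tags. It is not Image Updater's implementation: `TildeConstraint` is a hypothetical helper that only understands plain major.minor.patch tags and ignores everything else (pre-release suffixes, a leading "v", latest, git SHAs).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Picks the newest tag allowed by a "~major.minor" constraint (illustrative only). */
final class TildeConstraint {
    record Version(int major, int minor, int patch) {
        static Optional<Version> parse(String tag) {
            String[] parts = tag.split("\\.");
            if (parts.length != 3) {
                return Optional.empty();
            }
            try {
                return Optional.of(new Version(
                    Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2])));
            } catch (NumberFormatException e) {
                return Optional.empty(); // e.g. "latest" or a git SHA
            }
        }
    }

    /** Highest tag of the form major.minor.x, compared numerically so 1.2.10 beats 1.2.9. */
    static Optional<String> pick(List<String> tags, int major, int minor) {
        return tags.stream()
            .map(Version::parse)
            .flatMap(Optional::stream)
            .filter(v -> v.major() == major && v.minor() == minor)
            .max(Comparator.comparingInt(Version::patch))
            .map(v -> v.major() + "." + v.minor() + "." + v.patch());
    }
}
```

Note that 1.3.0 is never selected even though it is newer: the constraint is what keeps an unreviewed minor release out of production.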
Environment Promotion (dev → staging → prod)
What
Two common patterns: (1) Directory-based: apps/my-app/dev/, apps/my-app/staging/, apps/my-app/prod/, each tracked by a separate ArgoCD Application resource pointing at the same repo but different paths. Promote by updating files and committing to the target path. (2) Kustomize overlays: a base/ directory with common manifests and per-environment overlays/dev/, overlays/prod/ that patch image tags and replica counts. ArgoCD renders Kustomize natively: point the Application's path at the overlay directory that contains kustomization.yaml.
Gotcha
- Never auto-sync production. Always require a manual PR approval → merge → manual ArgoCD sync for prod. Dev and staging can be auto-synced. This is the most important governance rule in a GitOps workflow.
- Promotion is not ArgoCD's job — it is a Git operation. Promote by opening a PR that updates the prod path. ArgoCD detects the merged commit and either auto-syncs (staging) or waits for a manual sync (prod).
Rollback in ArgoCD
What
ArgoCD maintains a history of all synced revisions (configurable depth, default 10). Rolling back is a UI click (History → select revision → Rollback) or argocd app rollback my-app <revision-id>. ArgoCD deploys the exact manifests from that Git commit, giving a deterministic rollback to a known-good state.
Gotcha
- Rollback in ArgoCD does not revert the Git repo: it deploys an older revision while the repo still has the newer commit. This creates a deliberate sync divergence (the cluster is behind Git), and ArgoCD will not roll back an Application while automated sync is enabled, since auto-sync would immediately re-apply the newer commit. After rollback, fix the root cause, commit a fix forward, and sync. Do not leave the cluster in a rolled-back state indefinitely.
- The correct production process: rollback ArgoCD to restore service → open a hotfix PR → merge → sync forward. Never leave the cluster and Git permanently diverged.
Secrets in GitOps
Managing Kubernetes Secrets Without Committing Them to Git
What
Three production-grade options: (1) External Secrets Operator (ESO): commit an ExternalSecret manifest (contains only the AWS Secrets Manager ARN reference, not the value). ESO's controller reads from Secrets Manager via IRSA and creates a real Kubernetes Secret inside the cluster. The actual secret value never touches Git. (2) Sealed Secrets: encrypt the secret with a cluster-specific public key using kubeseal, commit the encrypted SealedSecret manifest. Only the cluster's controller can decrypt it. (3) ArgoCD Vault Plugin: substitutes secret values from HashiCorp Vault or AWS Secrets Manager at sync time.
Why
ESO + IRSA is the recommended pattern for AWS-based EKS deployments. It integrates naturally with IAM, Secrets Manager rotation (secrets auto-refresh in the cluster when the Secrets Manager value changes), and the existing IRSA setup from Phase 7.
Gotcha
- If you accidentally commit a secret to Git — even briefly — rotate the secret immediately. Git history is permanent; the value is exposed even after deletion from the working tree.
- ESO creates a standard Kubernetes Secret from the external source. Configure ArgoCD to ignore ESO-managed secrets in its diff to avoid spurious OutOfSync status — ArgoCD did not create them, so they will always appear as drifted without this exclusion.
Hands-on Tasks
Interview Q&A
What is GitOps and why is Git the source of truth rather than the cluster state?
GitOps is an operational framework where the desired state of your infrastructure is declared in Git, and an automated agent continuously reconciles the actual system state to match it.
Git is the source of truth because: it provides a complete, immutable audit trail of every change (who, what, when, why via commit messages and PR reviews); it enables rollback by reverting commits; it enables disaster recovery by recreating the cluster from the repo; and it enforces peer review (PRs) before production changes.
The alternative — treating the cluster as the source of truth — means infrastructure state is tribal knowledge, rollback requires remembering what you changed, and disaster recovery is manual reconstruction. The key insight: the cluster's actual state is derived from Git, not the other way around.
Your CI pipeline builds a new image. Walk through the full GitOps flow to get it deployed.
1. GitHub Actions builds the JAR, builds the Docker image, and pushes it to ECR with a tag (e.g., the git SHA).
2. ArgoCD Image Updater polls ECR, detects the new tag, and commits an updated image.tag value to the manifests repository.
3. ArgoCD's Application Controller polls the manifests repo (every 3 minutes by default, or immediately via a webhook) and detects the new commit.
4. ArgoCD marks the Application as OutOfSync.
5. If auto-sync is enabled (or an engineer presses Sync), ArgoCD applies the updated Deployment manifest to the cluster.
6. Kubernetes performs a rolling update: new pods start (the readiness probe gates traffic) and old pods terminate gracefully.
No human ran kubectl: the entire flow was driven by a Git commit.
How do you handle Kubernetes Secrets in a GitOps workflow when you can't commit plaintext secrets to Git?
Three options in production:
(1) External Secrets Operator (ESO): store an ExternalSecret manifest in Git (it contains only the reference to an AWS Secrets Manager secret ARN, not the value). ESO's controller runs in the cluster, reads from Secrets Manager via IRSA, and creates a Kubernetes Secret. The actual secret value never touches Git.
(2) Sealed Secrets: encrypt the secret with a cluster-specific public key (using kubeseal) and commit the encrypted SealedSecret manifest. Only the cluster's controller can decrypt it.
(3) ArgoCD Vault Plugin: a templating plugin that substitutes secret values from HashiCorp Vault or AWS Secrets Manager at sync time.
ESO + IRSA is the recommended pattern for AWS-based EKS deployments — it integrates naturally with IAM, Secrets Manager rotation, and the existing IRSA setup from Phase 7.
★
Capstone Project
Full target architecture
Internet → ALB → EKS (Order, Inventory, Document services)
Order Service → RDS PostgreSQL (private subnet) + SNS → SQS → Inventory Service
Inventory Service → ElastiCache Redis (cache-aside) + RDS + DynamoDB (idempotency)
Document Service → S3 (presigned URL pattern) · All services → ECR (images)
All services → CloudWatch (logs + custom metrics) + AWS X-Ray (distributed traces)
GitHub → GitHub Actions → ECR → EKS rolling deploy (CI/CD)
Infrastructure provisioned via Terraform (VPC, RDS, ElastiCache, SQS/SNS, EKS)
This is the architecture to draw on paper first. Use draw.io with AWS Architecture Icons. Drawing it forces you to think through the connectivity, security groups, and IAM roles before building.
Cost-conscious alternative: ECS Fargate
The target architecture uses EKS, which costs $0.10/hr (~$72/month) for the control plane before any EC2 worker nodes. If that's a concern, or if Phase 7 felt overwhelming, this entire capstone works equally well with ECS Fargate:
Replace EKS Deployments → ECS Task Definitions + ECS Services. Replace kubectl apply in CI/CD → aws ecs update-service --force-new-deployment. The architecture (VPC, private subnets, ALB, RDS, S3, CloudWatch, IAM roles) and CI/CD concepts are identical. EKS adds Kubernetes portability; ECS Fargate is simpler, cheaper, and AWS-native. The AWS skills you build are the same either way.
What you're building
Order Service
Spring Boot service handling order creation and lifecycle. Stores orders in RDS PostgreSQL. On order creation, publishes an OrderCreated event to an SNS topic → SQS queue → downstream consumers. Implements a Transactional Outbox table to guarantee event publication even if the downstream queue is temporarily unavailable. Deployed on EKS with IRSA (no embedded AWS credentials).
Inventory Service
Consumes OrderCreated events from SQS. Checks and reserves inventory (stored in RDS). Implements idempotency: uses the orderId as an idempotency key with a DynamoDB conditional write to prevent double-deduction on message redelivery. Publishes InventoryReserved or InsufficientStock events. Caches frequently-queried stock levels in ElastiCache Redis (cache-aside pattern, 60-second TTL).
Document Service
Spring Boot service for file upload/download. Stores files in S3 using presigned URLs (clients upload directly to S3 — this service never handles file bytes). Returns presigned GET URLs. Deployed on EKS with IRSA granting S3 access to only this service's bucket prefix.
Observability
CloudWatch Container Insights for EKS pod metrics and logs. Micrometer → CloudWatch custom metrics (order throughput, inventory cache hit rate, SQS processing latency). AWS X-Ray distributed traces to correlate requests across all three services. Alarms on SQS queue depth (backpressure signal), error rate, and p95 latency.
CI/CD Pipeline
GitHub Actions workflow: on merge to main → build JAR → build Docker image → push to ECR → update EKS Deployment image tag → rolling deploy with readiness/liveness probes guarding against bad deploys. Infrastructure provisioned via Terraform (VPC, RDS, ElastiCache, SQS/SNS, S3, EKS).
Capstone Tasks
Interview Q&A — Architecture-level questions
How do you guarantee exactly-once order processing even if the Order Service crashes mid-transaction?
The Transactional Outbox pattern: when the Order Service creates an order, it writes both the order record AND an outbox event to the same local RDS transaction. A separate Outbox Poller reads unpublished events and publishes them to SNS, then marks them published. If the service crashes after committing the DB transaction but before publishing to SNS, the Outbox Poller will retry publishing on restart — the event is never lost.
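The outbox flow described above can be sketched in plain Java. This is a minimal in-memory illustration, not the project's actual code: the class OutboxSketch, its method names, and the use of synchronized in place of a real RDS transaction are all assumptions made for the example, and the publisher argument stands in for the SNS publish call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// In-memory stand-in for the order database (hypothetical; real code would use one RDS transaction).
class OutboxSketch {
    // One row of the hypothetical outbox table.
    static final class OutboxRow {
        final String orderId;
        boolean published = false;
        OutboxRow(String orderId) { this.orderId = orderId; }
    }

    private final List<String> orders = new ArrayList<>();
    private final List<OutboxRow> outbox = new ArrayList<>();

    // The order row and the outbox row are written together;
    // synchronized stands in for a single database transaction.
    synchronized void createOrder(String orderId) {
        orders.add(orderId);
        outbox.add(new OutboxRow(orderId));
    }

    // Publishes every unpublished row and returns how many were sent.
    // A row is marked published only after the publisher returns normally,
    // so a failed publish leaves it pending for the next poll.
    synchronized int pollAndPublish(Consumer<String> publisher) {
        int sent = 0;
        for (OutboxRow row : outbox) {
            if (row.published) {
                continue;
            }
            try {
                publisher.accept(row.orderId); // stand-in for the SNS publish call
                row.published = true;
                sent++;
            } catch (RuntimeException e) {
                // Leave the row unpublished so a later poll retries it.
            }
        }
        return sent;
    }

    synchronized long pendingCount() {
        return outbox.stream().filter(row -> !row.published).count();
    }
}
```

Note the ordering inside pollAndPublish: if the process dies after the publish call returns but before the row is marked, the next poll publishes the same event again. The outbox therefore gives at-least-once delivery, which is why the consumer has to tolerate duplicates.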
On the consumer side (Inventory Service), idempotency prevents double-deduction: before processing, the service writes the orderId to a DynamoDB deduplication table with a conditional expression (attribute_not_exists(orderId)). If the condition fails (orderId already processed), the message is discarded. Combined, these two patterns give you effectively exactly-once processing across an unreliable distributed system: delivery is at-least-once, and the idempotent consumer makes redeliveries harmless.
Your X-Ray trace shows the order endpoint has p95 latency of 1.8 seconds. Walk me through how you investigate.
In the X-Ray trace timeline, identify which subsegment is the slowest. If it's the RDS subsegment: check CloudWatch RDS metrics for CPU, IOPS, and connection count — a saturated connection pool (Hikari at 100%) is the most common cause. Query Performance Insights for the slow query. If it's an SQS publish subsegment: check if the SNS/SQS call is synchronous and blocking — consider making it async.
If Redis cache hit rate (visible in CloudWatch custom metrics) dropped recently, that would cascade into more RDS queries. Compare the trace segment timestamps against the cache hit metric to correlate. Filter X-Ray traces for the slowest 5% to find if there's a specific code path or input size causing it. The resolution might be adding a read replica, tuning a query, increasing the Hikari pool size, or pre-warming the Redis cache on startup.
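To make the link between cache hit rate and database load concrete, here is a small cache-aside sketch with a TTL and hit/miss counters. It is an illustration under assumptions: StockCacheSketch is a hypothetical class, a HashMap stands in for Redis, the database function stands in for the RDS query, and the clock is injected so expiry can be exercised without waiting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Cache-aside with a TTL; every miss turns into one database read.
class StockCacheSketch {
    private static final class Entry {
        final int value;
        final long expiresAtMillis;
        Entry(int value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>(); // stand-in for Redis
    private final Function<String, Integer> database;         // stand-in for the RDS query
    private final LongSupplier clock;                         // injected so tests control time
    private final long ttlMillis;
    private long hits;
    private long misses;

    StockCacheSketch(Function<String, Integer> database, LongSupplier clock, long ttlMillis) {
        this.database = database;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    int getStock(String sku) {
        long now = clock.getAsLong();
        Entry entry = cache.get(sku);
        if (entry != null && now < entry.expiresAtMillis) {
            hits++;
            return entry.value;
        }
        // Miss or expired entry: read from the database and repopulate the cache.
        misses++;
        int fresh = database.apply(sku);
        cache.put(sku, new Entry(fresh, now + ttlMillis));
        return fresh;
    }

    // Fraction of lookups served from the cache; this is the number you would export as a custom metric.
    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    long databaseReads() {
        return misses;
    }
}
```

In this model, databaseReads grows one-for-one with misses, so a falling hit rate shows up directly as extra load on RDS.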
Walk me through the production architecture you built in this project.
The architecture starts at the edge with an Application Load Balancer that routes HTTP/HTTPS traffic into an EKS cluster running inside a VPC. The cluster has worker nodes across two Availability Zones in private subnets for resilience. Three Spring Boot microservices run as Kubernetes Deployments — Order Service (order creation + Transactional Outbox), Inventory Service (SQS consumer, Redis cache-aside, DynamoDB idempotency), and Document Service (S3 presigned URLs). All pods use IRSA — IAM roles bound to Kubernetes ServiceAccounts — so no AWS credentials are embedded anywhere.
The event-driven layer: Order Service publishes OrderCreated events to an SNS topic; the Inventory Service consumes from an SQS queue subscribed to that topic, deduplicates via DynamoDB, and caches stock levels in ElastiCache Redis. The Document Service stores files in S3 and returns presigned URLs to clients so file transfers bypass the application servers entirely.
The data layer: RDS PostgreSQL in a private subnet reachable only from EKS pods via Security Group rules. ElastiCache Redis runs in private subnets for the Inventory Service. DynamoDB is a regional managed service rather than a subnet resource, so pods reach it through the AWS API (optionally over a VPC gateway endpoint to keep that traffic off the public internet).
Observability: CloudWatch Container Insights collects pod metrics and logs; Micrometer custom metrics track order throughput, cache hit rate, and SQS lag; AWS X-Ray provides distributed traces across all three services. Infrastructure is provisioned via Terraform with remote state in S3. CI/CD runs in GitHub Actions with OIDC federation — merge to main triggers a build, ECR push, and rolling deployment to EKS automatically.
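The consumer-side deduplication mentioned in this walkthrough can be sketched as follows. This is a simplified in-memory model, not the real service: IdempotentInventoryConsumer is a hypothetical class, a ConcurrentHashMap stands in for the DynamoDB deduplication table, and putIfAbsent plays the role of a conditional put guarded by attribute_not_exists(orderId). Stock-level checks are left out to keep the focus on deduplication.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Applies each OrderCreated message at most once, keyed by orderId.
class IdempotentInventoryConsumer {
    // Stand-in for the DynamoDB deduplication table.
    private final ConcurrentHashMap<String, Boolean> processedOrders = new ConcurrentHashMap<>();
    // Stand-in for the inventory rows stored in RDS.
    private final Map<String, Integer> stock = new HashMap<>();

    IdempotentInventoryConsumer(Map<String, Integer> initialStock) {
        stock.putAll(initialStock);
    }

    // Returns true if the message was applied, false if it was a redelivery.
    synchronized boolean handleOrderCreated(String orderId, String sku, int quantity) {
        // putIfAbsent succeeds only for the first writer, like a conditional put
        // with attribute_not_exists(orderId).
        if (processedOrders.putIfAbsent(orderId, Boolean.TRUE) != null) {
            return false; // already processed: discard the duplicate
        }
        stock.merge(sku, -quantity, Integer::sum); // insufficient-stock handling omitted
        return true;
    }

    synchronized int stockOf(String sku) {
        return stock.getOrDefault(sku, 0);
    }
}
```

With the real DynamoDB client, the failed condition surfaces as a ConditionalCheckFailedException rather than a return value, and the service would treat that exception as the "duplicate, skip it" signal.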
How would you handle a database migration with zero downtime?
Use an expand-contract pattern with Flyway (or Liquibase): first deploy a migration that adds the new column or table without removing anything (expand). The old application version ignores the new column; the new version uses it. After all instances are running the new version (confirmed via rolling deploy), deploy a second migration that removes the old column (contract).
For high-risk migrations: take a manual RDS snapshot immediately before the migration window. Enable RDS Multi-AZ if not already on. Have a tested rollback plan: Flyway undo migrations are a paid-edition feature, so on the free Community edition plan a forward-fix migration instead, and treat the RDS snapshot as the last resort. Run the migration during low-traffic hours. Monitor error rates in CloudWatch during and after. Never run a migration that locks a table for minutes in production without testing its impact on a clone of the production database first.
How would you scale this architecture to handle 10× traffic?
At the application layer: Kubernetes Horizontal Pod Autoscaler (HPA) scales pods based on CPU/memory — the EKS cluster already handles pod-level scaling. For node-level scaling, add Karpenter (AWS's node autoscaler) to provision new EC2 worker nodes automatically when pods are unschedulable.
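As a worked example of how the HPA reacts to load, the Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that rule; the class HpaMath, the method name, and the min/max clamping parameters are illustrative names, and it ignores details such as stabilization windows and pods that are not yet ready.

```java
// Simplified model of the Horizontal Pod Autoscaler's replica calculation.
class HpaMath {
    private static final double TOLERANCE = 0.1; // Kubernetes default tolerance

    static int desiredReplicas(int currentReplicas, double currentMetric, double targetMetric,
                               int minReplicas, int maxReplicas) {
        double ratio = currentMetric / targetMetric;
        int desired = currentReplicas;
        // Scale only when the metric is far enough from the target.
        if (Math.abs(ratio - 1.0) > TOLERANCE) {
            desired = (int) Math.ceil(currentReplicas * ratio);
        }
        // Clamp to the bounds configured on the autoscaler.
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }
}
```

For example, 4 pods averaging 90% CPU against a 60% target scale to 6. A 10× jump against the same target asks for 40 pods, so with a maximum of 20 the autoscaler stops at 20, and only as many pods as the nodes can fit will actually run. That is the gap node autoscaling is meant to close.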
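At the database layer: RDS read replicas for read-heavy traffic (direct reporting queries to replicas). If the primary instance itself becomes the bottleneck, consider Aurora PostgreSQL: it separates storage from compute and scales reads across up to 15 replicas with automatic failover. Writes still go through a single writer instance in a standard Aurora cluster, so write-heavy load is handled by scaling that instance up. RDS Proxy handles connection pooling if Lambda or many pods are hammering the DB.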
At the network edge: CloudFront CDN in front of the ALB caches API responses where possible (rare for a CRUD API but useful for read-heavy public endpoints). S3 files served via CloudFront edge locations rather than presigned S3 URLs directly — reduces latency globally.
For the Document Service specifically: large file uploads should use S3 multipart upload initiated directly from the client (presigned multipart upload URLs), bypassing the application entirely for file bytes.
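Before the service can hand out presigned multipart URLs, it has to decide how many parts a file needs, because each part is uploaded through its own presigned URL. The sketch below does that arithmetic using two documented S3 limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. MultipartPlan and its method names are hypothetical, and the 8 MiB preferred size in the usage note is just an example value.

```java
// Part-size arithmetic for an S3 multipart upload (no AWS SDK calls involved).
class MultipartPlan {
    static final long MIN_PART_BYTES = 5L * 1024 * 1024; // S3 minimum for every part except the last
    static final int MAX_PARTS = 10_000;                 // S3 limit on parts per upload

    // Smallest part size that is at least the preferred size, at least the S3 minimum,
    // and large enough to keep the upload within 10,000 parts.
    static long partSize(long fileBytes, long preferredPartBytes) {
        long size = Math.max(preferredPartBytes, MIN_PART_BYTES);
        long neededForLimit = (fileBytes + MAX_PARTS - 1) / MAX_PARTS; // ceiling division
        return Math.max(size, neededForLimit);
    }

    // Number of parts, which is also the number of presigned part URLs to generate.
    static int partCount(long fileBytes, long partBytes) {
        return (int) ((fileBytes + partBytes - 1) / partBytes); // ceiling division
    }
}
```

A 100 MiB file with a preferred part size of 8 MiB needs 13 parts (twelve full parts and one 4 MiB remainder). For very large files the part size grows automatically so the count never exceeds 10,000.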