From Java dev to AWS engineer. Hands-on, free, no fluff.
AWS Roadmap for Java Spring Boot Engineers
Build genuine, hands-on AWS experience across 12 phases, from VPC fundamentals to production-grade Kubernetes, event-driven serverless, observability, and GitOps. Each phase teaches the concepts, gives you tasks to build the real thing, and prepares you to talk about it in a senior-level interview.
12 phases + capstone · Move at your own pace; the hands-on tasks are the real measure of progress, not the clock.
Best for mid-to-senior Java engineers who've built applications but have never owned AWS infrastructure.
EC2 IAM VPC STS RDS Aurora RDS Proxy S3 ElastiCache CloudWatch X-Ray Prometheus Grafana Docker ECR ECS EKS IRSA Helm ArgoCD Lambda SQS SNS API Gateway DynamoDB Terraform CloudFormation CDK Secrets Manager
Phase 1: AWS Foundations
Before You Start: Prerequisites
This roadmap assumes the following. If any are missing, cover them first; every phase relies on them.
- Linux command line basics: navigating directories (cd, ls), reading and writing files, running commands with sudo. You will SSH into EC2 instances from Phase 2 onward.
- Java 21 + Maven or Gradle: you should be able to build a runnable Spring Boot JAR locally before Phase 2. Java 21 is the LTS baseline used throughout this roadmap.
- Git: committing and pushing to a remote repository. Required for the CI/CD pipeline in the capstone.
- Basic networking concepts: what an IP address and port number are, what TCP/IP means. You do not need to be a network engineer; Phase 1 teaches the AWS-specific networking layer on top of this.
- A brand-new AWS account: use a dedicated email address. Do not use a corporate or shared account, because you need root access for initial setup.
- A small budget, managed deliberately: AWS accounts created after July 15, 2025 use the credit-based Free Tier: $100 in credits at signup, up to $200 total by completing onboarding activities, and a choice between a free plan and a paid plan. The free plan closes the account after 6 months or when credits run out; the paid plan keeps the account open and consumes the credits first. The old 750-hours-per-month free tier no longer exists for new accounts, so treat every resource in this roadmap as billable. Done with discipline (run the teardown steps at the end of each phase, do the expensive phases 7, 11, 12 and the capstone in compressed sittings), the whole roadmap fits inside those credits. If you expect to take longer than 6 months, pick the paid plan. Check current terms at aws.amazon.com/free.
Architect this phase
VPC → Public Subnet (has route to Internet Gateway) · VPC → Private Subnet (no internet route, outbound via NAT Gateway)
Draw this yourself for better retention: draw.io · Official AWS Icons
Topics
AWS Global Infrastructure
Regions
What
A Region is a geographic area that contains multiple Availability Zones. Examples: us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore). AWS continuously adds Regions; check the AWS Global Infrastructure page for the current count.
Why
You pick a Region based on three things: data residency (legal requirements to keep data in a country), latency (pick closest to your users), and service availability (not all services launch in all regions simultaneously).
Gotcha
- Most resources are region-scoped. An EC2 instance in us-east-1 has nothing to do with one in eu-west-1; they are completely separate. A few services, such as IAM, are global.
- us-east-1 (N. Virginia) is usually among the first Regions to receive new services. Good for learning, but for production choose a region close to your actual users.
Availability Zones (AZs)
What
Each AZ consists of one or more physically separate data centres within a Region, with independent power, cooling, and network connectivity. The AZs in a Region are connected to each other via low-latency links. A Region typically has 3 or more AZs; every new Region launches with at least 3.
Why
Deploy your application across 2+ AZs so a data centre failure (fire, power outage, network issue) does not take down your service. This is the foundational pattern for high availability in AWS.
Gotcha
- AZ names are randomly mapped per AWS account. Your us-east-1a may point to a different physical data centre than someone else's us-east-1a. This is intentional to spread load.
- Each subnet lives in exactly one AZ. To span AZs, create one subnet per AZ.
Edge Locations
What
Edge locations are AWS Points of Presence (PoPs), servers positioned close to end users worldwide. There are 600+ edge locations globally, far more than the number of regions. Multiple services operate here: CloudFront (CDN), Route 53 (DNS), Lambda@Edge, CloudFront Functions, AWS WAF, AWS Shield, and AWS Global Accelerator.
Why
When a user in Tokyo requests content from your S3 bucket in us-east-1, CloudFront serves it from the nearest Tokyo edge location instead, dramatically reducing latency.
Gotcha
- Edge locations are not for full compute. You cannot run EC2 instances or RDS here. However, Lambda@Edge and CloudFront Functions can execute lightweight code at edge locations in response to CloudFront events, useful for auth, redirects, and header manipulation at the CDN layer.
Networking
VPC (Virtual Private Cloud)
What
A VPC is your own isolated virtual network within AWS. You define an IP address range using CIDR notation (e.g., 10.0.0.0/16 gives you 65,536 IP addresses), then carve it into subnets, configure routing, and attach gateways. A VPC is regional: it spans all AZs in a region.
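The 65,536 figure follows directly from the prefix length: a /16 leaves 32 - 16 = 16 host bits. A minimal sketch of that arithmetic, assuming IPv4 only; the class and method names are illustrative, and the subnet helper relies on the fact that AWS reserves five addresses in every subnet (the first four and the last).

```java
// Sketch: address counts for IPv4 CIDR prefixes. Names are hypothetical.
final class CidrMath {
    // AWS reserves the first four addresses and the last address of every subnet.
    static final int AWS_RESERVED_PER_SUBNET = 5;

    /** Total addresses in a block with the given prefix length (0-32). */
    static long totalAddresses(int prefixLength) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("prefix must be 0-32: " + prefixLength);
        }
        return 1L << (32 - prefixLength);
    }

    /** Addresses left for your own resources when the block is used as a subnet. */
    static long usableInSubnet(int prefixLength) {
        // AWS accepts IPv4 subnet sizes between /16 and /28.
        if (prefixLength < 16 || prefixLength > 28) {
            throw new IllegalArgumentException("subnet prefix must be 16-28: " + prefixLength);
        }
        return totalAddresses(prefixLength) - AWS_RESERVED_PER_SUBNET;
    }

    public static void main(String[] args) {
        System.out.println(totalAddresses(16)); // 65536
        System.out.println(usableInSubnet(24)); // 251
    }
}
```

A /24 subnet is a common choice for learning: 251 usable addresses is plenty, and a /16 VPC has room for 256 of them.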
Why
Without a VPC, your resources would share a flat network with no isolation. The VPC is the foundation of AWS network security: almost everything you launch, such as EC2 and RDS, runs inside one.
Gotcha
- Every AWS account has a default VPC (172.31.0.0/16) in every region. If you accidentally delete it, you can recreate it from the VPC console (Actions → Create default VPC) or via aws ec2 create-default-vpc, no support ticket required.
- For production and learning, always create a custom VPC. The default VPC's subnets are all public, which is not ideal for production.
- Two VPCs that need to communicate via VPC Peering cannot have overlapping CIDR ranges. Plan CIDR blocks upfront.
Public Subnet
What
A subnet whose associated route table has a route directing internet-bound traffic (0.0.0.0/0) to an Internet Gateway. Resources in this subnet with a public IP can send and receive internet traffic.
Why
Internet-facing resources belong here: Application Load Balancers, NAT Gateways, and bastion hosts. You generally do not put application servers or databases here.
Gotcha
- Being in a public subnet does not automatically expose a resource. The resource also needs a public IP address, and the Security Group must allow the traffic.
Private Subnet
What
A subnet with no direct route to the internet. There is no 0.0.0.0/0 → IGW route in its route table. Resources here cannot receive inbound connections from the internet. Outbound internet access (e.g., for OS updates) goes through a NAT Gateway that sits in a public subnet.
Why
Defence in depth. Your RDS database or internal service belongs here. Even if a Security Group rule is accidentally misconfigured, there is no network path from the internet to the resource.
Gotcha
- Private subnet resources can still initiate outbound connections (to download packages, call external APIs) via NAT Gateway. NAT is outbound-only; it does not allow inbound connections from the internet.
Route Tables
What
A route table is a set of rules that tells the VPC where to send network traffic. Every subnet is associated with exactly one route table. A route of 0.0.0.0/0 → igw-xxx means: "send all internet-bound traffic to the Internet Gateway."
Why
The route table is what makes a subnet "public" or "private." Adding the IGW route is the single configuration change that gives a subnet internet access.
Gotcha
- The local route (e.g., 10.0.0.0/16 → local) is automatically added and cannot be removed. This allows all resources within the VPC to communicate with each other.
- Multiple subnets can share one route table, but a subnet can only be associated with one route table at a time.
Elastic IP (EIP)
What
An Elastic IP is a static public IPv4 address you own in an AWS region until you explicitly release it. By default, a public EC2 instance receives a random public IP that changes every time the instance is stopped and started again. An EIP stays the same across stop/start cycles. You associate it with a specific EC2 instance (or ENI). NAT Gateways require an EIP.
Why
DNS records and firewall allowlists that reference your server's IP break every time the instance restarts without an EIP. Any service that external partners allowlist by IP needs a stable address.
Gotcha
- Since February 1, 2024, AWS charges $0.005/hr (~$3.65/month) for every public IPv4 address, including EIPs, whether or not the address is attached to a running instance. The charge exists to discourage hoarding the scarce IPv4 space. Release any EIP you are not using.
- An EIP associated with a running instance is therefore no longer free. Stopping the instance or disassociating the EIP does not end the charge; only releasing the EIP does.
- Each AWS account has a default limit of 5 EIPs per region. This is a soft limit; request an increase via Service Quotas if needed.
VPC Flow Logs
What
VPC Flow Logs capture metadata about IP traffic going to and from network interfaces in your VPC. Each record includes: source and destination IP, source and destination port, protocol, packet count, byte count, and whether the traffic was ACCEPTed or REJECTed by a Security Group or NACL. Logs are sent to CloudWatch Logs or an S3 bucket. You can enable them at the VPC, subnet, or individual ENI level. Aggregation interval is either 1 minute or 10 minutes.
Why
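Flow Logs are the primary tool for diagnosing "why can't X reach Y" questions at the network layer. A REJECT record tells you that a Security Group or NACL dropped the packet, and gives you the network interface, the source IP, and the destination port. It does not name the specific rule or group, so you still have to inspect the Security Groups and NACLs that apply to that interface. Without Flow Logs, network debugging is guesswork.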
Gotcha
- Flow Logs do not capture the packet payload, only metadata. They also do not capture traffic to/from the VPC DNS resolver (the base CIDR + 2 address) or DHCP traffic.
- Flow Logs require an IAM role with permission to publish to CloudWatch Logs (or an S3 bucket policy if logging to S3). Forgetting to create this role is the most common setup failure.
- A REJECT entry points to a Security Group or NACL rule. An ACCEPT entry followed by no response points to an application-level issue (the packet arrived but the process did not respond). Knowing this distinction saves significant debugging time.
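To make the record layout concrete, here is a minimal sketch that parses one line in the default (version 2) flow log format, which has 14 space-separated fields ending in action and log-status. The record type and the sample line are illustrative assumptions; custom log formats with different field orders are not handled.

```java
// Sketch: parse one VPC Flow Log line in the default format:
// version account-id interface-id srcaddr dstaddr srcport dstport
// protocol packets bytes start end action log-status
record FlowLogRecord(String interfaceId, String srcAddr, String dstAddr,
                     int srcPort, int dstPort, int protocol, String action) {

    static FlowLogRecord parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 14) {
            throw new IllegalArgumentException("expected 14 fields, got " + f.length);
        }
        // NODATA and SKIPDATA records carry "-" placeholders instead of traffic fields.
        if (!"OK".equals(f[13])) {
            throw new IllegalArgumentException("record has no traffic data: " + f[13]);
        }
        return new FlowLogRecord(f[2], f[3], f[4],
                Integer.parseInt(f[5]), Integer.parseInt(f[6]),
                Integer.parseInt(f[7]), f[12]);
    }

    /** True when a Security Group or NACL dropped the traffic. */
    boolean isRejected() {
        return "REJECT".equals(action);
    }

    public static void main(String[] args) {
        // Made-up sample: an internet host blocked from reaching PostgreSQL.
        FlowLogRecord r = parse("2 123456789012 eni-0abc 203.0.113.7 10.0.1.25 "
                + "49152 5432 6 3 180 1700000000 1700000060 REJECT OK");
        System.out.println(r.isRejected() + " port " + r.dstPort());
    }
}
```

Filtering parsed records on isRejected() and the destination port is usually enough to tell whether a connection attempt was dropped at the network layer.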
Internet Gateway (IGW)
What
A horizontally-scaled, redundant, highly-available gateway that enables communication between your VPC and the internet. You attach it to the VPC itself (not to a subnet). Then you reference it in a subnet's route table to make that subnet public.
Why
Without an IGW, your VPC is completely isolated. Nothing can reach the internet, and the internet cannot reach anything in the VPC.
Gotcha
- One VPC can only have one IGW attached at a time.
- The IGW itself has no cost; charges come from data transfer through it.
NAT Gateway
What
A managed AWS service that allows resources in private subnets to initiate outbound internet connections (OS patches, external API calls) while blocking unsolicited inbound connections. The NAT Gateway itself lives in a public subnet and routes through the IGW. You put a route of 0.0.0.0/0 → nat-gw-xxx in the private subnet's route table.
Why
Your EC2 in a private subnet needs to run yum update or call a third-party API. NAT Gateway enables this without exposing the instance to inbound internet traffic.
Gotcha
- NAT Gateway costs money: ~$0.045/hr per gateway plus data processing charges. For a learning account, delete it when not in use to avoid surprise bills.
- For high availability, deploy one NAT Gateway per AZ and have each AZ's private subnets route to their local NAT Gateway. If every AZ routes through a single NAT Gateway, a failure of that gateway's AZ takes down outbound internet access for all of them.
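A rough sketch of what those charges add up to over a month. The hourly rate is the figure quoted above; the $0.045/GB processing rate and the 730-hour month are assumptions for illustration, so check current pricing for your region before relying on the result.

```java
// Sketch: approximate monthly NAT Gateway cost. Rates are assumptions, not a price list.
final class NatGatewayCost {
    static final double HOURLY_USD = 0.045;   // per gateway, as quoted above
    static final double PER_GB_USD = 0.045;   // assumed data processing rate
    static final int HOURS_PER_MONTH = 730;   // assumed average month length

    /** Cost of running the given number of gateways all month plus data processing. */
    static double monthlyUsd(int gateways, double gbProcessed) {
        if (gateways < 0 || gbProcessed < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return gateways * HOURLY_USD * HOURS_PER_MONTH + gbProcessed * PER_GB_USD;
    }

    public static void main(String[] args) {
        // One idle gateway versus one gateway per AZ across three AZs with 100 GB of traffic.
        System.out.println(monthlyUsd(1, 0));
        System.out.println(monthlyUsd(3, 100));
    }
}
```

Under these assumptions a single idle gateway costs roughly $33 a month, which is why deleting it between study sessions matters on a learning account.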
VPC Peering
What
A point-to-point private network connection between two VPCs. Traffic routes through the AWS backbone, not the internet. Peering works within a region, across regions (inter-region peering), and across AWS accounts. Each peering connection is a dedicated pair: VPC A ↔ VPC B.
Why
A common pattern is a shared-services VPC (monitoring, artifact registry, internal tooling) peered with multiple application VPCs so they can reach shared resources over a private path without internet routing.
Gotcha
- VPC Peering is non-transitive. If VPC A peers with B, and B peers with C, A cannot reach C through B. You must create a direct A ↔ C peering connection.
- Peered VPCs cannot have overlapping CIDR blocks. Plan your IP address space before peering: two VPCs both using 10.0.0.0/16 cannot be peered.
- The peering connection itself does nothing. You must add routes in both VPCs' route tables: a route in VPC A pointing to VPC B's CIDR via the peering connection ID, and a matching route in VPC B. Forgetting the return route is the most common misconfiguration.
- Cross-region peering charges data transfer at ~$0.01-$0.02/GB depending on regions. Within a region, traffic between AZs across peered VPCs is billed at the standard cross-AZ rate ($0.01/GB each direction).
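The overlap rule above can be checked before you create anything. A minimal sketch, assuming IPv4 CIDR strings: each block is converted to its first and last address, and two blocks overlap when those ranges intersect. The class and method names are illustrative.

```java
// Sketch: detect overlapping IPv4 CIDR blocks before attempting VPC peering.
final class CidrOverlap {

    /** Returns {firstAddress, lastAddress} of the block as unsigned 32-bit values in longs. */
    static long[] range(String cidr) {
        String[] parts = cidr.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected a.b.c.d/len: " + cidr);
        }
        String[] octets = parts[0].split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + cidr);
        }
        long addr = 0;
        for (String octet : octets) {
            int value = Integer.parseInt(octet);
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("octet out of range: " + cidr);
            }
            addr = (addr << 8) | value;
        }
        int len = Integer.parseInt(parts[1]);
        if (len < 0 || len > 32) {
            throw new IllegalArgumentException("prefix out of range: " + cidr);
        }
        long size = 1L << (32 - len);
        long first = addr & ~(size - 1); // clear the host bits
        return new long[] { first, first + size - 1 };
    }

    /** True when the two blocks share at least one address. */
    static boolean overlaps(String a, String b) {
        long[] ra = range(a);
        long[] rb = range(b);
        return ra[0] <= rb[1] && rb[0] <= ra[1];
    }

    public static void main(String[] args) {
        System.out.println(overlaps("10.0.0.0/16", "10.0.0.0/16")); // true
        System.out.println(overlaps("10.0.0.0/16", "10.1.0.0/16")); // false
    }
}
```

Giving each VPC its own /16 out of 10.0.0.0/8 (10.0.0.0/16, 10.1.0.0/16, and so on) keeps every pair peerable later.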
Transit Gateway (TGW)
What
A managed regional hub-and-spoke network router. Each VPC, VPN connection, or Direct Connect Gateway attaches to the TGW once. The TGW routes traffic between all attached networks using its own TGW Route Tables. Unlike VPC Peering, routing through a TGW is transitive: VPC A and VPC C can communicate through the hub without a direct connection between them.
Why
VPC Peering does not scale. Fully connecting 10 VPCs requires 45 peering connections; 30 VPCs require 435. TGW replaces that mesh with one attachment per VPC. It also consolidates VPN and Direct Connect connectivity so your on-premises network peers with one gateway instead of every VPC individually.
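The connection counts come from the handshake formula n(n-1)/2: every VPC needs a link to every other VPC, and each link is shared by two of them. A minimal sketch, with an illustrative class name:

```java
// Sketch: number of peering connections needed to fully connect n VPCs.
final class PeeringMesh {

    /** Full mesh size: each of n VPCs pairs with the other n - 1, counted once per pair. */
    static int fullMeshConnections(int vpcs) {
        if (vpcs < 0) {
            throw new IllegalArgumentException("vpcs must be non-negative: " + vpcs);
        }
        return vpcs * (vpcs - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(fullMeshConnections(10)); // 45
        System.out.println(fullMeshConnections(30)); // 435
    }
}
```

A Transit Gateway needs only n attachments for the same connectivity, so the gap widens quadratically as VPCs are added.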
Gotcha
- TGW costs $0.05/hr per attachment plus $0.02/GB of data processed. For a learning account with 2-3 VPCs, VPC Peering is cheaper. Use TGW when the peering mesh is the actual problem.
- TGW route tables are separate from VPC route tables. A common misconfiguration: the TGW route table has the correct entries but the VPC's route table does not have a route whose target is the TGW. Both sides must be configured.
- TGW is regional. Cross-region connectivity requires inter-region TGW peering (a separate attachment between two TGWs), which adds latency and data transfer cost.
Security (IAM)
IAM User
What
An IAM User represents a person or application that needs long-term AWS credentials. There are two credential types: a username + password for the AWS Console, and an Access Key ID + Secret Access Key for CLI/API access. The root account (your signup email) is not an IAM User; it has unrestricted access to everything.
Why
You should never use the root account for daily work. Create an IAM User with only the permissions needed. If credentials are compromised, you can disable the user without losing the account.
Gotcha
- Access Keys are long-term credentials. If they are leaked (e.g., committed to GitHub), rotate them immediately and assume they were used maliciously.
- Never embed Access Keys in source code or Docker images. Use IAM Roles for EC2/Lambda and environment variables or a secrets manager for external systems.
IAM Group
What
A collection of IAM Users. You attach permission policies to the group once, and all users in the group inherit those permissions. Groups cannot contain other groups; they are flat.
Why
Managing permissions at the group level avoids attaching the same policies to dozens of individual users. When a new developer joins, add them to the "Developers" group and they get all required permissions immediately.
Gotcha
- A user can belong to multiple groups. Their effective permissions are the union of all policies from all groups they belong to, plus any policies attached directly to the user.
IAM Role
What
An IAM Role provides temporary credentials (via AWS STS) to whoever assumes it: AWS services (EC2, Lambda, ECS tasks), other AWS accounts, or federated users. A role has two kinds of policy: a Trust Policy (who is allowed to assume this role) and one or more Permission Policies (what the role can do). EC2 instances use a wrapper called an Instance Profile to assume a role.
Why
Roles are the correct way for AWS services to access other AWS services. An EC2 instance running your Spring Boot app should have an IAM Role with S3 read permission, with no embedded access keys required. The AWS SDK automatically picks up the temporary credentials from the EC2 instance metadata.
Config
This is the Trust Policy of a role that EC2 instances can assume. For a Lambda role, only the Principal changes (lambda.amazonaws.com):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
Gotcha
- Temporary credentials from a role automatically expire (15 minutes to 12 hours) and are automatically rotated. A leaked set of temporary credentials has a built-in time limit, far safer than long-term Access Keys.
- If you SSH into an EC2 and run aws s3 ls without configuring credentials, it works because the Instance Profile role is used automatically. This is the intended behaviour.
IAM Policies
What
JSON documents that define what is allowed or denied. Each policy contains one or more Statements. A Statement has: Effect (Allow or Deny), Action (e.g., s3:GetObject), and Resource (e.g., arn:aws:s3:::my-bucket/*). Policies are attached to Users, Groups, or Roles.
Why
Policies enforce the principle of least privilege: give identities only the minimum permissions they need to do their job, nothing more.
Gotcha
- An explicit Deny always overrides an Allow, regardless of which policy it comes from. If any policy denies an action, that action is denied even if another policy allows it.
- AWS Managed Policies (like AmazonS3ReadOnlyAccess) are maintained by AWS. Prefer these for standard roles. Avoid using AdministratorAccess for anything other than your admin user.
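The "explicit Deny always wins" rule is easier to remember as an algorithm. The sketch below is a deliberately simplified model, not the real IAM engine: it matches Action and Resource by exact string (no wildcards) and ignores conditions, permission boundaries, and organization-level policies. It also encodes the default that a request no statement allows is implicitly denied. All type names are illustrative.

```java
import java.util.List;

// Sketch: simplified IAM-style evaluation. Explicit deny beats allow; no match means deny.
final class PolicyEvaluator {
    enum Effect { ALLOW, DENY }
    enum Decision { ALLOWED, EXPLICIT_DENY, IMPLICIT_DENY }

    /** One policy statement; matching is exact-string here, unlike real IAM wildcards. */
    record Statement(Effect effect, String action, String resource) {
        boolean matches(String requestedAction, String requestedResource) {
            return action.equals(requestedAction) && resource.equals(requestedResource);
        }
    }

    /** Evaluates the statements gathered from every policy attached to the identity. */
    static Decision evaluate(List<Statement> statements, String action, String resource) {
        boolean allowed = false;
        for (Statement s : statements) {
            if (!s.matches(action, resource)) {
                continue;
            }
            if (s.effect() == Effect.DENY) {
                return Decision.EXPLICIT_DENY; // one matching deny ends the evaluation
            }
            allowed = true;
        }
        return allowed ? Decision.ALLOWED : Decision.IMPLICIT_DENY;
    }

    public static void main(String[] args) {
        String bucketObjects = "arn:aws:s3:::my-bucket/*";
        List<Statement> statements = List.of(
                new Statement(Effect.ALLOW, "s3:GetObject", bucketObjects),
                new Statement(Effect.DENY, "s3:GetObject", bucketObjects));
        System.out.println(evaluate(statements, "s3:GetObject", bucketObjects)); // EXPLICIT_DENY
    }
}
```

The order of statements never matters in this model: a single matching Deny anywhere produces the same result.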
STS (Security Token Service) & AssumeRole
What
AWS STS issues temporary security credentials. When EC2 uses an Instance Profile, when Lambda runs a function under its execution role, or when one account accesses another account's resources, STS is what generates the short-lived Access Key + Secret Key + Session Token combination behind the scenes. The core API call is sts:AssumeRole: a principal requests temporary credentials to act as a different role.
Why
Cross-account access, federated identity (SSO), and service-to-service calls all flow through STS. A Kafka consumer in Account A calls STS AssumeRole with the ARN of a role in Account B, receives temporary credentials, and uses them to read/write S3 in Account B. The Trust Policy attached to that Account B role specifies exactly which principals, such as the consumer's IAM role or Account A's root, are allowed to assume it.
Gotcha
- Temporary credentials from STS expire (15 min-12 hrs depending on role session duration). The AWS SDK's credential provider chain automatically refreshes them before expiry, so your application code never needs to handle credential rotation.
- The External ID is an optional parameter used in cross-account scenarios to prevent the "confused deputy" problem, where a malicious actor tricks a trusted service into assuming a role on their behalf. When a third party (e.g., a vendor) assumes a role in your account, always require an External ID in the trust policy.
VPC Endpoints, Interface Endpoints & AWS PrivateLink
What
VPC Endpoints let resources in your private subnet reach AWS services without going through the internet or NAT Gateway. Two types: Gateway Endpoints (S3 and DynamoDB only, free, added as a route table entry) and Interface Endpoints (most other services including Secrets Manager, ECR, SQS, and SSM, each creates an ENI in your subnet, costs ~$0.01/hr per AZ). Interface Endpoints are powered by AWS PrivateLink. PrivateLink is also the mechanism for exposing your own service to other VPCs or AWS accounts: you front your service with a Network Load Balancer, create a VPC Endpoint Service backed by that NLB, and consumers create an Interface Endpoint in their VPC to connect. No VPC peering, no internet routing required.
Why
Cost and security. Without a Gateway Endpoint, a private-subnet EC2 reading S3 goes: EC2 → NAT Gateway ($0.045/hr + $0.045/GB) → Internet → S3. With a Gateway Endpoint: EC2 → S3 directly, traffic stays on the AWS backbone, NAT cost eliminated. PrivateLink-based Interface Endpoints are how SaaS vendors (Snowflake, Datadog) deliver services privately into customer VPCs without requiring peering or public IPs.
Gotcha
- Add the S3 Gateway Endpoint to every VPC immediately. It is free, and private-subnet instances accessing S3 without it pay unnecessary NAT costs.
- Interface Endpoints use private DNS: when enabled, the SDK resolves secretsmanager.us-east-1.amazonaws.com to a private IP inside your VPC. Confirm enableDnsHostnames and enableDnsSupport are both enabled on the VPC, or the private DNS override won't work.
- When building a PrivateLink Endpoint Service, your backend sees the NLB's private IP as the source address rather than the consumer's IP, because PrivateLink terminates the connection at the NLB. If the backend needs the original client IP, enable Proxy Protocol v2 on the NLB target group.
Security Groups
What
Virtual firewalls at the resource level (EC2 instances, RDS databases, ECS tasks, Lambda in VPC). Rules are stateful: if you allow inbound TCP port 8080, the response traffic is automatically allowed without a separate outbound rule. You can only add Allow rules; there is no explicit Deny at the Security Group level.
Why
Security Groups are your primary tool for controlling which traffic can reach which resource. Every resource in a VPC has at least one Security Group.
Gotcha
- Security Groups can reference other Security Groups as the traffic source. Example: allow port 5432 from "App-SG", and any EC2 in that Security Group can connect to RDS, regardless of its IP address. This is better than using IP-based rules.
- Security Group changes take effect immediately. No instance restart needed.
NACLs (Network Access Control Lists)
What
Stateless firewalls at the subnet level. Unlike Security Groups, NACLs are stateless (you must explicitly allow both inbound AND return outbound traffic), support both Allow and Deny rules, and evaluate rules in numeric order (lowest number first, first match wins).
Why
NACLs are a second firewall layer at the subnet boundary. Useful for blanket-blocking a specific IP range across all resources in a subnet, something you cannot do with Security Groups (which only allow, never deny).
Gotcha
- Stateless means: if you allow inbound port 80, you must also allow outbound ephemeral ports 1024-65535 so the HTTP response can leave the subnet. Forgetting this is a classic misconfiguration.
- The default NACL allows all traffic in both directions. If you create a custom NACL, it denies everything by default; you must add explicit allow rules.
- In practice, most teams keep the default NACLs and rely on Security Groups. Know the difference for interviews but don't over-engineer NACLs in practice.
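The "lowest number first, first match wins" rule can be sketched in a few lines. This model is a simplification for one direction of traffic: rules match on destination port only, whereas real NACL rules also match protocol and CIDR. The rule and class names are illustrative, and the fallback mirrors the final catch-all rule that denies anything unmatched.

```java
import java.util.Comparator;
import java.util.List;

// Sketch: NACL-style evaluation in numeric order, first matching rule decides.
final class NaclEvaluator {

    /** Simplified rule: matches on a port range only. */
    record Rule(int number, int fromPort, int toPort, boolean allow) {
        boolean matches(int port) {
            return port >= fromPort && port <= toPort;
        }
    }

    /** Returns the verdict of the lowest-numbered matching rule; unmatched traffic is denied. */
    static boolean isAllowed(List<Rule> rules, int port) {
        return rules.stream()
                .sorted(Comparator.comparingInt(Rule::number))
                .filter(rule -> rule.matches(port))
                .findFirst()
                .map(Rule::allow)
                .orElse(false);
    }

    public static void main(String[] args) {
        List<Rule> inbound = List.of(
                new Rule(200, 0, 65535, true),  // allow everything...
                new Rule(100, 22, 22, false));  // ...but rule 100 is checked first and blocks SSH
        System.out.println(isAllowed(inbound, 22)); // false
        System.out.println(isAllowed(inbound, 80)); // true
    }
}
```

Because NACLs are stateless, a real setup needs the same kind of evaluation to pass in the outbound direction for the ephemeral return ports.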
Security Groups vs IAM Roles: When to Use Which
What
These two mechanisms control access in completely separate dimensions. A Security Group is a network firewall: it controls which TCP/UDP connections can reach a resource, identified by port and protocol. It has no knowledge of AWS services or identities. An IAM Role is an authorization control: it determines which AWS API calls an identity (EC2 instance, Lambda function, ECS task) is permitted to make. It has no knowledge of ports or network connections.
Why
They fail in distinguishably different ways. An IAM failure returns an immediate AccessDeniedException. A Security Group failure usually produces a connection timeout, because the blocked packet is silently dropped; a connection refused error more often means the packet arrived but nothing was listening on the port. Knowing which dimension to check first saves significant debugging time.
Gotcha
- RDS connection fails → check the Security Group, not IAM. Connecting to PostgreSQL on port 5432 is a TCP connection. IAM does not control TCP connections. The RDS Security Group must allow port 5432 inbound from the EC2's Security Group. If IAM is not involved, adding an IAM policy does nothing.
- S3 call fails → check IAM, not the Security Group. Calling S3 is an HTTPS request to an AWS API endpoint. Security Groups allow all outbound traffic by default, so port 443 is already open. The EC2 Instance Profile needs s3:PutObject (or equivalent) on the bucket. If the Security Group is not involved, adding an SG rule does nothing.
- Some resources require both simultaneously. EC2 in a private subnet calling Secrets Manager needs: (1) the Instance Profile with secretsmanager:GetSecretValue (IAM), and (2) either a NAT Gateway or a Secrets Manager VPC Interface Endpoint so the API request can leave the subnet (network). A missing IAM policy gives you an immediate AccessDeniedException. A missing network path gives you a connection timeout after ~20 seconds. Different error, different fix, both required.
- IAM Database Authentication is not a substitute for Security Groups on RDS. This opt-in feature lets your app authenticate to RDS using a short-lived IAM token instead of a password. It requires rds-db:connect in the IAM policy. The TCP connection to port 5432 still must be permitted by the Security Group. IAM handles the authentication step; the Security Group handles whether the packet can arrive at the database at all. Both are required regardless of which authentication method you use.
- The quick mental test: is your code opening a socket to a specific port on a specific host? That is a Security Group question. Is your code calling an AWS service API (S3, SQS, DynamoDB, Secrets Manager, etc.)? That is an IAM question.
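The same triage can be sketched as a helper that walks an exception's cause chain. This is a rough heuristic under stated assumptions, not an AWS SDK feature: it treats the JDK's SocketTimeoutException and ConnectException as network-layer symptoms and looks for the text "AccessDenied" in messages to flag IAM. The class, enum, and matching strategy are all illustrative.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Sketch: guess whether a failure belongs to the network layer or to IAM.
final class FailureTriage {
    enum Layer { IAM, NETWORK, UNKNOWN }

    /** Walks the cause chain and returns the first layer it can recognise. */
    static Layer classify(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            // Timeouts and failed connects: check Security Groups, routes, and endpoints.
            if (t instanceof SocketTimeoutException || t instanceof ConnectException) {
                return Layer.NETWORK;
            }
            // Assumption: the service's rejection surfaces "AccessDenied" in the message.
            String message = t.getMessage();
            if (message != null && message.contains("AccessDenied")) {
                return Layer.IAM;
            }
        }
        return Layer.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable timeout = new RuntimeException("Unable to execute HTTP request",
                new SocketTimeoutException("connect timed out"));
        System.out.println(classify(timeout)); // NETWORK
    }
}
```

In a real application you would match on the SDK's typed exceptions and error codes rather than on message text; the string check only keeps this sketch dependency-free.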
Hands-on Tasks
Interview Q&A
What is a VPC and why does every AWS deployment need one?
A VPC (Virtual Private Cloud) is your own isolated virtual network within AWS. You define the IP address range using CIDR notation (e.g., 10.0.0.0/16), create subnets within that range, configure route tables to control traffic flow, and attach gateways to connect to the internet or other networks. Most resources you launch in AWS, such as EC2 and RDS, run inside a VPC. Lambda runs in an AWS-managed environment by default and can optionally be connected to your VPC when it needs to reach private resources like RDS or ElastiCache. The VPC is the networking foundation that provides isolation, routing control, and security boundaries.
Why use private subnets? What do they protect against?
Private subnets have no direct route to the internet: there is no 0.0.0.0/0 → IGW entry in their route table. A database or internal service placed in a private subnet is network-unreachable from the internet, even if you accidentally open a Security Group rule. This is defence in depth: you are not relying on a single misconfigurable firewall rule to protect sensitive resources. The internet has no network path to the resource, period. In contrast, a public subnet resource could be exposed if a Security Group is misconfigured; private subnets eliminate that risk.
What is the difference between Security Groups and NACLs?
Security Groups are stateful firewalls at the resource level (EC2, RDS, ECS task). Stateful means: allow inbound port 8080 → the response traffic is automatically allowed. Security Groups support Allow rules only. They are the primary tool teams use.
NACLs are stateless firewalls at the subnet level. Stateless means: you must explicitly allow both inbound request traffic AND outbound response traffic (on ephemeral ports 1024-65535). NACLs support both Allow and Deny rules, evaluated in numeric order. They act as a second layer of defence at the subnet boundary.
In practice: customise Security Groups for every resource. NACLs are usually left at default (allow all) unless you need to block a specific IP range at the subnet level.
Why should databases never be publicly accessible?
A publicly accessible database is reachable from every IP address on the internet. This exposes it to: automated credential brute-force attacks, exploitation of known database engine vulnerabilities (CVEs), and accidental data exposure if credentials are weak or reused. Databases should only be reachable from your application tier (EC2, ECS, Lambda) within the same VPC, enforced by both private subnet placement (no internet route) and a Security Group that only allows connections from the application's Security Group on the DB port. There is no legitimate reason for a production database to have a public IP or be in a public subnet.
What is the difference between an IAM User and an IAM Role?
IAM User: long-term credentials. A person (or application) with a fixed username/password and/or Access Key ID + Secret. The credentials persist until explicitly rotated or deleted. If leaked, they remain valid indefinitely until you act.
IAM Role: temporary credentials issued by AWS STS when the role is assumed. They have an expiry (15 minutes to 12 hours) and are automatically renewed by the AWS SDK. Roles are assumed by AWS services (EC2 Instance Profile, Lambda execution role, ECS task role) or by federated users.
Best practice: EC2 instances should use an IAM Role via Instance Profile, never embed Access Keys in code or config files. The AWS SDK automatically uses the role credentials from the EC2 instance metadata endpoint. If the instance is compromised, the temporary credentials expire on their own.
When would you use VPC Peering vs Transit Gateway?
VPC Peering is a point-to-point connection between exactly two VPCs. It is free to create (data transfer is billed at standard rates). It is the right choice when you have a small, fixed set of VPCs that need to communicate: 2-3 VPCs with well-known relationships, such as a dev VPC peering to a shared-services VPC.
Transit Gateway is a managed regional hub. Every VPC, VPN, and Direct Connect Gateway attaches to the TGW once, and the TGW routes traffic transitively between all of them. It costs $0.05/hr per attachment plus $0.02/GB. The break-even point is roughly 3+ VPCs that all need to communicate with each other: at 4 VPCs, a full peering mesh requires 6 connections; TGW requires 4 attachments at a fraction of the operational complexity.
The other differentiator: TGW supports VPN and Direct Connect centralisation. If you have on-premises connectivity, a single TGW with a VPN attachment shares that connection across all attached VPCs. With peering, you'd need a VPN connection per VPC.
Key distinction for interviews: VPC Peering is non-transitive (A ↔ B and B ↔ C does not give A ↔ C). TGW is transitive by design.
My Spring Boot app on EC2 can't reach a resource. How do you decide whether to fix the Security Group or the IAM Role?
The error message tells you which layer to check.
An IAM failure returns an immediate AccessDeniedException (or in the S3 SDK, a 403 Forbidden). It arrives fast because the AWS service received the request and rejected it. Fix: add the required permission (e.g., s3:PutObject, secretsmanager:GetSecretValue) to the IAM Role attached through the EC2 Instance Profile.
A Security Group failure manifests as a connection timeout: Security Groups drop disallowed packets silently, so the client hangs and fails only after a delay (against a TLS endpoint this can surface as a handshake timeout). The packet never reaches its destination. An immediate "connection refused", by contrast, usually means the packet arrived but nothing is listening on that port. Fix: add an inbound rule to the target resource's Security Group allowing the source on the correct port (e.g., port 5432 for PostgreSQL from the EC2's SG).
The underlying distinction: Security Groups control TCP/UDP network connectivity (ports, protocols, source IPs or Security Groups). IAM controls AWS API authorization (which service API calls are permitted). A Security Group has no concept of S3 or DynamoDB. An IAM policy has no concept of port 5432.
For resources that require both (e.g., EC2 in a private subnet calling Secrets Manager): fix network connectivity first. If you have no route to the API endpoint, you will never see the IAM error because the request never leaves the subnet.
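The "fast rejection vs slow timeout" rule can be turned into a first-pass triage helper. This is a rough sketch under stated assumptions: the names are hypothetical, and the string checks for AccessDenied and 403 stand in for inspecting the AWS SDK's own exception types, which are not imported here.

```java
import java.net.SocketTimeoutException;

// Rough triage of a failure into "network layer" vs "IAM layer".
// The AccessDenied / 403 string checks are a stand-in for inspecting the
// AWS SDK's exception types, which this sketch does not depend on.
final class ConnectivityTriage {

    enum Layer { NETWORK, IAM, UNKNOWN }

    static Layer diagnose(Throwable error) {
        // Walk the cause chain: SDK and JDBC exceptions usually wrap the root cause.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SocketTimeoutException) {
                return Layer.NETWORK; // nothing answered: check Security Groups and routes
            }
            String message = t.getMessage();
            if (message != null
                    && (message.contains("AccessDenied") || message.contains("403"))) {
                return Layer.IAM; // the service answered and said no: check the role's policy
            }
        }
        return Layer.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable timeout = new RuntimeException("Unable to execute HTTP request",
                new SocketTimeoutException("connect timed out"));
        Throwable denied = new RuntimeException(
                "User is not authorized to perform: s3:PutObject (AccessDeniedException)");
        System.out.println(diagnose(timeout)); // NETWORK
        System.out.println(diagnose(denied));  // IAM
    }
}
```

Anything that falls through to UNKNOWN still needs a human to read the stack trace; the point is only to decide which console page to open first.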
2
Deploy Spring Boot on EC2
Topics
EC2 Core Concepts
EC2 (Elastic Compute Cloud)
What Virtual machines in AWS. You choose an instance type (determines CPU and RAM), an AMI (the OS), storage (EBS), and which VPC/subnet to place it in. You manage everything from the OS upward, AWS manages the physical hardware underneath.
Why EC2 is the most direct way to run a server in AWS. Full SSH access, install anything, configure however you want. The right mental model before moving to containers or serverless.
Gotcha
- Instance type naming: t3.micro means t = burstable family, 3 = generation, micro = size. Common families: t (burstable), m (general), c (compute-optimised), r (memory-optimised).
- Burstable instances (t-series) earn CPU credits at idle and spend them under load. In standard mode, once credits run out the CPU is throttled to a baseline rate. T3 and newer launch in unlimited mode by default, where sustained bursting is billed as surplus credits instead of being throttled. Fine for dev; watch out for sustained load in production.
- Stop ≠ Terminate. Stopped: instance paused, EBS persists, compute billing stops (EBS still billed). Terminated: instance deleted, EBS deleted by default. You cannot un-terminate.
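The naming scheme is regular enough to parse. Here is a minimal sketch, assuming the common family-generation-size form such as t3.micro or m6i.large. The class name and the regular expression are illustrative and do not cover every instance type name AWS publishes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits a common-form instance type name such as "t3.micro" or "m6i.large".
// The regex covers the everyday families only; it is not a full grammar of
// every name AWS has published.
final class InstanceTypeName {

    record Parts(String family, int generation, String attributes, String size) {}

    // family letters, generation digits, optional attribute letters, then ".size"
    private static final Pattern NAME =
            Pattern.compile("([a-z]+)(\\d+)([a-z-]*)\\.([a-z0-9]+)");

    static Parts parse(String instanceType) {
        Matcher m = NAME.matcher(instanceType);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised instance type: " + instanceType);
        }
        return new Parts(m.group(1), Integer.parseInt(m.group(2)), m.group(3), m.group(4));
    }

    public static void main(String[] args) {
        System.out.println(parse("t3.micro"));
        System.out.println(parse("m6i.large"));
    }
}
```

The attributes group captures the extra letters some types carry after the generation number, such as the i in m6i.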
AMI (Amazon Machine Image)
What A pre-built OS image used to launch EC2 instances. Think of it as a disk snapshot that becomes your instance's root volume. AWS provides AMIs for Amazon Linux 2023, Ubuntu, Windows Server, and more. You can create custom AMIs from your own configured instances.
Why Every EC2 instance starts from an AMI. The AMI determines the OS, pre-installed packages, and starting configuration. Custom AMIs let you bake in your Java runtime and config so new instances are ready immediately.
Gotcha
- AMIs are region-specific. An AMI from us-east-1 cannot be directly used in eu-west-1, you must copy it first.
- Use Amazon Linux 2023 (AL2023), not Amazon Linux 2, which reaches end of support on June 30, 2026. AL2023 uses dnf as its package manager (though yum still works as an alias).
EBS (Elastic Block Store)
What Network-attached persistent storage for EC2. Every instance has a root EBS volume (the OS disk). Default root volume for AL2023 is 8 GB (gp3 type). EBS volumes can be detached from one instance and attached to another.
Why Data written to EBS persists when you stop and restart the instance, unlike instance store (ephemeral) storage which is wiped. Your Spring Boot JAR and logs live on EBS.
Gotcha
- By default the root EBS volume is deleted when the instance is terminated. Change "Delete on Termination" to false if you want to keep the volume after termination.
- EBS volumes are AZ-specific. An EBS volume in us-east-1a cannot be attached to an instance in us-east-1b.
- You are billed for EBS storage even while the instance is stopped.
SSH Key Pairs
What When launching an EC2 instance, AWS places the public key on the instance (in ~/.ssh/authorized_keys for the default user). You download the private key file (.pem) once. This is the only way to SSH in; there is no password login by default.
Why SSH key auth is more secure than passwords. The private key never leaves your machine. AWS never stores it; if you lose the .pem file, the key cannot be recovered.
Gotcha
- Set permissions immediately after download: chmod 400 key.pem. Without this, SSH refuses to use the key: "WARNING: UNPROTECTED PRIVATE KEY FILE".
- Default usernames: ec2-user (Amazon Linux), ubuntu (Ubuntu), admin (Debian). Not root.
- Connect command: ssh -i /path/to/key.pem ec2-user@<public-ip>
Elastic IP
What A static public IPv4 address you allocate to your account and associate with an EC2 instance. By default, a public IP assigned on launch changes every time the instance is stopped and restarted. An Elastic IP stays fixed regardless of instance state.
Why Useful when you need a predictable IP, e.g., if you whitelist it in a firewall rule or point a DNS A record to it.
Gotcha
- Every public IPv4 address bills $0.005/hr, including Elastic IPs attached to running instances (a 2024 change; older tutorials still claim public IPs are free). Accounts created before July 15, 2025 keep 750 hours/month of free public IPv4 under the legacy Free Tier; newer accounts pay from the first hour, with credits absorbing it. Check aws.amazon.com/free for current terms, and release any public IPv4 addresses you are not actively using.
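To see what that hourly rate means over a month, here is a small sketch using the $0.005/hr figure quoted above. The class name is illustrative, and 730 hours is an average month (8,760 hours divided by 12).

```java
// Estimates the monthly charge for public IPv4 addresses at the hourly
// rate quoted above. 730 is an average month (8,760 hours / 12).
final class PublicIpv4Cost {

    static final double AVERAGE_HOURS_PER_MONTH = 730.0;

    static double monthlyCostUsd(int addresses, double hourlyRateUsd) {
        return addresses * hourlyRateUsd * AVERAGE_HOURS_PER_MONTH;
    }

    public static void main(String[] args) {
        System.out.printf("1 address:   $%.2f/month%n", monthlyCostUsd(1, 0.005));
        System.out.printf("3 addresses: $%.2f/month%n", monthlyCostUsd(3, 0.005));
    }
}
```

One address left attached for a month comes to about $3.65, which is small but adds up when forgotten addresses accumulate across phases.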
Linux & Deployment
Linux Administration on Amazon Linux 2023
What AL2023 uses dnf (RPM-based, like RHEL/Fedora). AWS provides Amazon Corretto, a free, production-ready OpenJDK build, in the AL2023 package repos.
Why You need to install Java, Git, and other dependencies before deploying your app. Knowing basic Linux commands is essential for EC2 work.
Key commands
- Install Java 21: sudo dnf install java-21-amazon-corretto -y
- Install Git: sudo dnf install git -y
- Verify Java: java -version → should show Corretto-21.x.x
- Copy JAR from laptop: scp -i key.pem app.jar ec2-user@<ip>:~/
- Check disk space: df -h · Check memory: free -h · Check processes: ps aux | grep java
Running Spring Boot as a Background Service (systemd)
What Running java -jar app.jar directly dies when your SSH session closes. systemd is the Linux init system that manages long-running services, starting them on boot and restarting them on failure.
Why For any persistent deployment: your app must survive reboots and SSH disconnects, and recover automatically from crashes.
Service file
Create /etc/systemd/system/springapp.service:
[Unit]
Description=Spring Boot Application
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/usr/bin/java \
  -Xms128m -Xmx256m \
  -XX:+UseG1GC \
  -Dspring.application.name=springapp \
  -jar /home/ec2-user/app.jar
StandardOutput=append:/home/ec2-user/app.log
StandardError=append:/home/ec2-user/app.log
SuccessExitStatus=143
Restart=on-failure
RestartSec=10
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
Then: sudo systemctl daemon-reload && sudo systemctl enable springapp && sudo systemctl start springapp
Check logs: tail -f /home/ec2-user/app.log
Add the following to application.properties before packaging the JAR: server.shutdown=graceful and spring.lifecycle.timeout-per-shutdown-phase=30s. This lets in-flight requests complete during a deploy or ALB deregistration before the JVM exits.
Gotcha
- -Xmx is not total JVM memory. The JVM's actual RSS is heap + metaspace (~80-150MB) + thread stacks (~1MB per thread) + GC overhead. With -Xmx512m, expect 650-750MB of physical RAM consumed. On a t3.micro (1GB total, ~916MB available), that leaves under 200MB for the OS, CloudWatch agent (~54MB), and SSM agent (~36MB). The OOM killer hits the JVM, the ALB health check fails, the CW agent stops collecting, and SSM loses connectivity. One oversized heap flag causes all of it. Use -Xmx256m on 1GB instances.
- Configure swap on any instance running multiple agents. Even 512MB prevents hard OOM kills on memory spikes: sudo dd if=/dev/zero of=/swapfile bs=128M count=4 && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile. Persist it by adding /swapfile swap swap defaults 0 0 to /etc/fstab.
- If OOM kills the JVM and you lose both SSH and SSM access, the EC2 system log is still readable from the console without any instance access: EC2 Console → select instance → Actions → Monitor and troubleshoot → Get system log. Look for oom-kill lines. This works even when the instance appears completely unresponsive.
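The heap-versus-RSS arithmetic above can be sketched as a small budget calculator. The available-memory and agent figures come from the text; the metaspace, thread count, and "other native" values passed in main are assumptions chosen for illustration, not measurements.

```java
// Back-of-the-envelope memory budget for a JVM on a small instance.
// All inputs are estimates; the metaspace, thread-count and native-overhead
// figures used in main() are assumptions for illustration.
final class JvmMemoryBudget {

    // Approximate resident set size: heap + metaspace + ~1 MB per thread stack
    // + other native overhead (GC structures, code cache, buffers).
    static int estimatedRssMb(int heapMb, int metaspaceMb, int threads, int otherNativeMb) {
        return heapMb + metaspaceMb + threads + otherNativeMb;
    }

    // What is left for the OS after the JVM and the agents take their share.
    static int headroomMb(int availableMb, int jvmRssMb, int agentsMb) {
        return availableMb - jvmRssMb - agentsMb;
    }

    public static void main(String[] args) {
        int available = 916;   // t3.micro, figure from the text above
        int agents = 54 + 36;  // CloudWatch agent + SSM agent
        for (int heap : new int[] {512, 256}) {
            int rss = estimatedRssMb(heap, 120, 60, 40);
            System.out.printf("-Xmx%dm -> ~%d MB RSS, %d MB left for the OS%n",
                    heap, rss, headroomMb(available, rss, agents));
        }
    }
}
```

With these assumed inputs, -Xmx512m leaves under 100 MB for the OS once the agents are counted, while -Xmx256m leaves roughly 350 MB. Measure your own app's RSS (for example with ps or free -h) before trusting any estimate.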
Security Groups for a Spring Boot App
What For a Spring Boot app exposed directly on EC2, the Security Group needs: inbound SSH (port 22) from your IP, inbound app traffic (port 8080) from wherever users connect, all outbound open. Spring Boot's default port is 8080; change with server.port in application.properties.
Why Without adding port 8080, your app runs correctly on the instance but is completely unreachable from outside; the Security Group silently blocks it.
Gotcha
- Never open SSH (port 22) to 0.0.0.0/0. Use "My IP" in the console. Bots scan for open SSH ports constantly.
- Opening port 8080 to 0.0.0.0/0 is acceptable for initial testing, but the tasks below will have you lock this down behind an ALB, which is how production always looks.
Application Load Balancer (ALB)
ALB, Target Groups, and Listeners
What An Application Load Balancer sits in public subnets and distributes incoming HTTP/HTTPS traffic to backend targets (EC2 instances, ECS tasks, Lambda). It operates at Layer 7, it can route requests based on URL path, hostname, and HTTP headers. Key components: Listener (port + protocol the ALB listens on), Rules (how traffic is routed), Target Group (the set of backends with health checks).
Why In production, application servers should never be directly internet-facing. The ALB lives in the public subnet; your EC2 instances live behind it. The ALB handles health checks (removing unhealthy instances from rotation automatically), SSL termination, and is the attachment point for WAF, access logs, and sticky sessions.
Gotcha
- An ALB requires subnets in at least two AZs. Even with one EC2 target, you must specify two subnets in different AZs when creating the ALB.
- Health check path matters. If your Spring Boot app has Spring Actuator enabled, use /actuator/health. If the path returns a status outside the target group's success codes (200 by default), targets are marked unhealthy and receive no traffic; your app appears "down" even though it's running.
- The ALB has its own Security Group. Pattern: ALB SG allows 80/443 inbound from internet. EC2 SG allows 8080 inbound only from the ALB SG, not from 0.0.0.0/0. This is the correct production security model.
- Startup timing trap: If Spring Boot takes 15-20 seconds to start and the ALB health check fires every 5 seconds with an unhealthy threshold of 2, the target is flagged unhealthy during boot, and once an ASG manages the fleet, the instance gets replaced mid-startup. Set the health check interval and thresholds so (interval × unhealthy threshold) exceeds your app's startup time, and pair it with the ASG health check grace period. Monitor ALB target health status when a deployment stalls.
- Graceful shutdown: Set the ALB's deregistration delay (default 300 seconds) appropriately. It is how long the ALB waits for in-flight requests on a deregistering target to complete before closing the connections. Your Spring Boot app must have server.shutdown=graceful and a timeout that matches. Without this, rolling deploys cause 502s for active users.
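The startup timing rule of thumb is easy to check before you apply a health check configuration. Here is a minimal sketch of it; the class and method names are illustrative, and the numbers in main are the example from the gotcha above plus one assumed alternative setting.

```java
// Checks the rule of thumb from the text: interval x unhealthy threshold
// should exceed the application's startup time.
final class HealthCheckTiming {

    // Seconds of consecutive failures tolerated before a target is marked unhealthy.
    static int toleranceSeconds(int intervalSeconds, int unhealthyThreshold) {
        return intervalSeconds * unhealthyThreshold;
    }

    static boolean survivesStartup(int intervalSeconds, int unhealthyThreshold, int startupSeconds) {
        return toleranceSeconds(intervalSeconds, unhealthyThreshold) > startupSeconds;
    }

    public static void main(String[] args) {
        // 5 s interval, threshold 2, 20 s startup: the example from the text.
        System.out.println(survivesStartup(5, 2, 20));   // false: only 10 s of tolerance
        // 15 s interval, threshold 3 (assumed alternative): 45 s of tolerance.
        System.out.println(survivesStartup(15, 3, 20));  // true
    }
}
```

Widening the tolerance also slows detection of a genuinely dead target, so pick the smallest product that comfortably covers your measured startup time.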
HTTPS with ACM (AWS Certificate Manager)
What ACM provides free SSL/TLS certificates for use with AWS services (ALB, CloudFront, API Gateway). You request a certificate for your domain, verify ownership via DNS (recommended) or email, and attach it to an ALB HTTPS listener. ACM auto-renews certificates, no manual renewal or private key management required.
Why All production traffic must be HTTPS. The ALB terminates TLS, your EC2 and Spring Boot app receive plain HTTP internally on port 8080, and the ALB handles the encryption layer. You do not need to configure SSL in Spring Boot or install certificates on EC2.
Gotcha
- ACM certificates are free when attached to ALB, CloudFront, or API Gateway. The free certificates cannot be exported for use on your own EC2 server (ACM does sell exportable public certificates as a separate paid option, added in June 2025).
- DNS validation is preferred: it adds a CNAME record to your domain's DNS and auto-renews without human action. Email validation requires you to click a link before the cert expires.
- If you do not own a domain yet, skip the HTTPS task and use the ALB's default DNS name (my-alb-123.us-east-1.elb.amazonaws.com) with HTTP for now. The ALB + Target Group pattern is what matters here.
Auto Scaling
Auto Scaling Group (ASG) and Launch Templates
What An Auto Scaling Group manages a fleet of EC2 instances defined by three integers: min (floor), desired (current target), and max (ceiling). The ASG launches instances from a Launch Template (a versioned EC2 configuration: AMI, instance type, key pair, Security Group, user data), registers them into a Target Group, and continuously monitors health. When an instance fails a health check, the ASG terminates it and launches a replacement. When a scaling policy fires, the ASG adjusts the desired count within the min/max bounds.
Why This is the core mechanism for running a resilient, scalable EC2-based service. Your Spring Boot fleet self-heals from instance failures and adjusts capacity to load without operator involvement. Every ECS, EKS, and managed service you encounter later uses the same model internally.
Gotcha
- Health check type defaults to EC2 (instance is running, not stopped or terminated). After attaching a Target Group, change the health check type to ELB. Without it, the ASG has no visibility into your application's health. A crashed Spring Boot app inside a running instance will never be replaced.
- Use Launch Templates, not Launch Configurations. Launch Configurations are deprecated and AWS does not accept new ones; a tutorial that teaches them is outdated. Launch Templates support versioning, mixed instance types, Graviton (Arm64) instances, and Spot capacity.
- Health check grace period: the ASG ignores ELB health check failures during this window after an instance launches. Set it to your app's worst-case startup time, typically 90-120 seconds for Spring Boot. Too short means the ASG terminates an instance that is still starting. Too long means a genuinely broken instance stays in the fleet.
- Max is the only cost guard. The ASG itself has no cost, but it will launch EC2 instances without hesitation. A misconfigured scaling policy with no upper bound can create hundreds of instances. Set a realistic max and enable AWS Billing alerts before enabling auto scaling in any account.
- Termination policy: the default first picks the AZ with the most instances (to rebalance), then terminates the instance launched from the oldest Launch Template version. Terminating the oldest instance specifically is the opt-in OldestInstance policy. Either way, scale-in terminates instances carrying live traffic: if your Spring Boot app holds in-memory session state, those sessions are silently dropped. Externalise session state to ElastiCache (covered in Phase 10) before enabling scale-in.
- On a scale-out event, new instances do not contribute to the fleet's CloudWatch metrics until their instance warm-up period expires. Set warm-up equal to your app's startup time. Without this, slow-starting JVMs drag down the average CPU metric and prematurely cancel further scale-out before the fleet has absorbed the load.
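The two selection steps of the default termination policy described above can be sketched as follows. This models only those two steps with hypothetical types; the real policy has further tie-breakers that are not modelled here.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the first two steps of the default termination policy as
// described above. Further tie-breakers in the real policy are not modelled.
final class TerminationChoice {

    record Instance(String id, String az, int templateVersion) {}

    static Instance pickForTermination(List<Instance> fleet) {
        // Step 1: find the AZ holding the most instances
        // (if two AZs tie, which one wins here is arbitrary).
        Map<String, Long> perAz = fleet.stream()
                .collect(Collectors.groupingBy(Instance::az, Collectors.counting()));
        String busiestAz = perAz.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();

        // Step 2: within that AZ, take the instance on the oldest template version.
        return fleet.stream()
                .filter(i -> i.az().equals(busiestAz))
                .min(Comparator.comparingInt(Instance::templateVersion))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Instance> fleet = List.of(
                new Instance("i-aaa", "us-east-1a", 3),
                new Instance("i-bbb", "us-east-1a", 2),
                new Instance("i-ccc", "us-east-1b", 1));
        System.out.println(pickForTermination(fleet).id()); // i-bbb
    }
}
```

Note that i-ccc runs the oldest template version overall but is not chosen, because AZ balance is evaluated first.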
Scaling Policies
What A scaling policy defines when and how the ASG adjusts capacity. Three types matter in practice: Target Tracking specifies a target metric value (e.g., average CPU at 50%) and the ASG creates and manages CloudWatch alarms automatically, adjusting capacity continuously to maintain the target. Step Scaling defines explicit tiers (CPU 70-90%: add 2 instances; CPU over 90%: add 4 instances), giving fine-grained control. Scheduled Scaling sets desired capacity at specific times, useful for predictable traffic peaks (business hours, batch windows).
Why Target tracking is the recommended default for most workloads. It handles both gradual ramps and sudden spikes without manual alarm configuration. Step scaling makes sense when you have load-tested your app and know exactly how it behaves under different traffic levels.
Gotcha
- Target tracking creates CloudWatch alarms automatically. Do not delete these alarms manually. The ASG owns them. Deleting them silently disables scaling without any error or warning.
- CPU is not always the right metric. A Spring Boot app doing heavy database I/O or waiting on downstream services can be saturated while CPU stays below 10%. Consider publishing custom CloudWatch metrics from your app: active HTTP connections, JVM thread pool queue depth, request latency p99.
- ALB RequestCountPerTarget is a built-in predefined metric available directly in target tracking configuration. For HTTP services, scaling on request count per instance is often more responsive than CPU because it tracks actual traffic load, not processing cost.
- Cooldown period (default 300 seconds) applies to simple scaling policies: after a scaling activity, the ASG waits before evaluating another. Step scaling does not pause between activities; it relies on instance warm-up instead, and target tracking manages its own stabilisation. If you do use simple scaling on a bursty workload, reduce the cooldown to 60-120 seconds or you will lag behind traffic spikes.
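Step scaling and the min/max bounds combine in a simple way. The sketch below encodes the example tiers from this section (CPU 70-90%: add 2; over 90%: add 4) and clamps the result. Names are illustrative, and it ignores warm-up and alarm evaluation periods.

```java
// Step scaling tiers from the example above, plus the min/max clamp that
// bounds every capacity change. Warm-up and alarm timing are not modelled.
final class StepScaling {

    // Instances to add for a given average CPU percentage.
    static int stepAdjustment(double cpuPercent) {
        if (cpuPercent > 90) return 4;
        if (cpuPercent >= 70) return 2;
        return 0;
    }

    // Desired capacity never leaves the [min, max] range.
    static int newDesired(int currentDesired, double cpuPercent, int min, int max) {
        int proposed = currentDesired + stepAdjustment(cpuPercent);
        return Math.max(min, Math.min(max, proposed));
    }

    public static void main(String[] args) {
        System.out.println(newDesired(2, 75, 1, 6)); // 4
        System.out.println(newDesired(2, 95, 1, 6)); // 6
        System.out.println(newDesired(4, 95, 1, 6)); // 6, capped by max
    }
}
```

The last case shows why max is the cost guard: the policy asks for 8 instances and the ASG stops at 6.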
Lifecycle Hooks
What A lifecycle hook pauses an instance at a state transition and fires a notification. Two transitions matter: autoscaling:EC2_INSTANCE_LAUNCHING (instance in Pending:Wait before it enters service) and autoscaling:EC2_INSTANCE_TERMINATING (instance in Terminating:Wait before it is terminated). The hook delivers a message to an SNS topic, SQS queue, or EventBridge. Your handler performs its work, then signals the ASG to proceed:
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name <hook-name> \
  --auto-scaling-group-name <asg-name> \
  --lifecycle-action-result CONTINUE \
  --instance-id <instance-id>
Use ABANDON instead of CONTINUE to abort a launch (the instance is terminated rather than entering service). On a terminating hook, both results let termination proceed; ABANDON just skips any remaining hooks.
Why A terminating hook is the correct way to ensure zero-downtime deployments with an ASG. It gives your Spring Boot app time to: drain in-flight HTTP requests, deregister from service discovery, flush buffered writes to S3 or a queue, and close database connections cleanly. Without a hook, the ASG terminates the instance on its own schedule regardless of in-flight traffic.
Gotcha
- The default heartbeat timeout is 3600 seconds (one hour). An instance stuck in Terminating:Wait continues to accrue EC2 billing for the full duration. Set the timeout to your expected drain time plus a margin, typically 60-180 seconds for Spring Boot apps.
- If your hook handler is not deployed or crashes, every instance termination hangs for the full timeout. Test hook handling explicitly in staging before enabling in production.
- Spring Boot's server.shutdown=graceful (configured in the systemd tasks above) handles in-flight HTTP requests at the application layer. A lifecycle hook operates at the ASG layer, one level up. Both are needed for a fully clean shutdown under load: the hook pauses instance termination, Spring Boot drains its request threads, then the hook signals CONTINUE.
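How the handler's result maps to what happens to the instance can be summarised in a few lines. This is a sketch with hypothetical type names. It reflects the documented behaviour that an abandoned launch is terminated, and that a terminating instance is terminated whichever result is sent.

```java
// Where an instance ends up after a lifecycle hook completes.
// Type names are illustrative; only the final outcome is modelled.
final class LifecycleOutcome {

    enum Transition { LAUNCHING, TERMINATING }
    enum Result { CONTINUE, ABANDON }
    enum Outcome { IN_SERVICE, TERMINATED }

    static Outcome resolve(Transition transition, Result result) {
        if (transition == Transition.LAUNCHING && result == Result.CONTINUE) {
            return Outcome.IN_SERVICE;
        }
        // An abandoned launch is terminated; a terminating instance is
        // terminated whichever result the handler sends.
        return Outcome.TERMINATED;
    }

    public static void main(String[] args) {
        for (Transition t : Transition.values()) {
            for (Result r : Result.values()) {
                System.out.println(t + " + " + r + " -> " + resolve(t, r));
            }
        }
    }
}
```

The practical consequence is that a terminating hook can delay a termination but cannot cancel it, so the handler's job is to finish draining within the heartbeat timeout.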
Hands-on Tasks
Interview Q&A
Walk me through provisioning an EC2 instance for a Spring Boot application.
Start by choosing an AMI: Amazon Linux 2023 for a Java app. Pick the instance type based on expected load: t3.micro for dev/learning, t3.small or m6i.large for production. Place it in a public subnet (if internet-facing) within your VPC. Create or assign an SSH key pair. Configure a Security Group: SSH on port 22 from your IP, app port 8080 from the internet or from an ALB's Security Group. Attach an IAM Role if the app needs to access other AWS services (S3, Secrets Manager). After launch, SSH in, install Java (Amazon Corretto), copy the JAR via SCP, and run it as a systemd service.
What's the difference between stopping and terminating an EC2 instance?
Stopping an instance pauses it. The EBS root volume is preserved and the instance can be restarted. Compute billing stops, but EBS storage billing continues. The public IP (if not an Elastic IP) is released and a new one is assigned on next start.
Terminating an instance deletes it. By default, the root EBS volume is also deleted ("Delete on Termination" = true). This action is irreversible, a terminated instance cannot be recovered. If you need the data, create a snapshot or change the Delete on Termination setting before terminating.
How do you run a Spring Boot app on EC2 so it starts automatically and restarts on failure?
Use systemd, the Linux service manager. Create a unit file at /etc/systemd/system/springapp.service that defines the app's start command, working directory, user, and restart policy (Restart=on-failure). Run sudo systemctl enable springapp to register it for automatic start on boot, and sudo systemctl start springapp to start it now. Logs go to /home/ec2-user/app.log via the StandardOutput and StandardError directives in the unit file. View them with tail -f /home/ec2-user/app.log.
A quick alternative is nohup java -jar app.jar > app.log 2>&1 &, but systemd is correct for anything production-like because it handles restarts and boot integration.
What Security Group rules does a Spring Boot app on EC2 need?
Minimum rules: inbound SSH (port 22) restricted to your IP (never 0.0.0.0/0 in production), and inbound TCP on port 8080 (Spring Boot's default) from wherever users connect. Leave outbound traffic fully open so the instance can download packages, call external APIs, etc.
In a production setup with an Application Load Balancer in front, the EC2 Security Group should allow port 8080 only from the ALB's Security Group, not from the internet directly. This way, all traffic is routed through the ALB, which handles HTTPS termination and health checks.
What is an AMI and when would you create a custom one?
An AMI (Amazon Machine Image) is a snapshot of an EC2 instance's state that you use to launch new instances. It includes the OS, installed packages, and configuration baked in at the time the AMI was created.
You'd create a custom AMI when you want new instances to start already configured, for example, with Java, your application dependencies, and monitoring agents pre-installed. This speeds up Auto Scaling: new instances are ready in seconds rather than running a lengthy bootstrap script on every launch. Custom AMIs are also used for Golden Image pipelines, a security-hardened base image that teams build their application AMIs from.
What is an Auto Scaling Group and what are the three capacity parameters?
An Auto Scaling Group is a managed fleet of EC2 instances that launches and terminates instances automatically based on health checks and scaling policies. The three capacity parameters are: min (the floor; the ASG never terminates below this count), desired (the current target count the ASG maintains), and max (the ceiling; the ASG never launches above this count).
In practice: set min=1 so the fleet always has at least one instance. Set desired=1 to start. Set max to whatever your budget and load testing says is reasonable. The ASG adjusts desired automatically when scaling policies fire, always staying within min and max.
The ASG launches instances using a Launch Template, a versioned EC2 configuration (AMI, instance type, Security Group, key pair). It then registers new instances with a Target Group and monitors their health. Unhealthy instances are terminated and replaced without operator involvement.
Why do you need to set the ASG health check type to ELB when using an Application Load Balancer?
The default health check type is EC2, which only checks whether the instance is in the running state, not stopped or terminated. It has no knowledge of your application.
With ELB health checks, the ASG delegates health determination to the ALB's Target Group. The Target Group sends HTTP requests to /actuator/health every N seconds and marks the target unhealthy if it gets non-2xx responses or times out. The ASG polls the Target Group's health status and, when a target is unhealthy, terminates the instance and launches a replacement.
Without ELB health checks: a Spring Boot instance whose JVM has crashed, run out of heap, or hung on a deadlock still appears healthy to the ASG and stays in the fleet indefinitely. The ALB does know it is unhealthy (via its own health checks) and stops routing new requests to it once the unhealthy threshold is reached, but the broken instance is never replaced. Over time, your effective capacity shrinks as instances fail and are never replaced.
What is the difference between the health check grace period, instance warm-up, and cooldown period?
These three parameters all involve waiting, but at different stages and for different reasons.
Health check grace period: set on the ASG itself. After a new instance launches, the ASG ignores ELB health check failures for this many seconds. Prevents the ASG from terminating an instance that is still booting and starting the JVM. Set this to your app's worst-case startup time (90-120 seconds for most Spring Boot apps).
Instance warm-up: set on the scaling policy. During a scale-out event, new instances are excluded from the ASG's CloudWatch metric calculations until this period expires. Prevents a slow-starting JVM from dragging down the average CPU and prematurely ending the scale-out before the fleet has absorbed the load. Set this equal to your app's startup time.
Cooldown period: applies to simple scaling policies. After a scaling activity completes, the ASG waits this many seconds before evaluating another scale event. Prevents thrashing. Step scaling does not pause between activities (it relies on instance warm-up), and target tracking manages its own stabilisation internally via alarm evaluation periods.
How would you achieve zero-downtime deployments with an Auto Scaling Group?
Zero-downtime deployments with an ASG require aligning three mechanisms: the ALB deregistration delay, the Spring Boot graceful shutdown, and the ASG instance replacement strategy.
At the application layer: server.shutdown=graceful in Spring Boot tells the server to stop accepting new requests and wait for in-flight requests to complete before the JVM exits. The wait timeout must be configured (
At the load balancer layer: the ALB's deregistration delay (default 300 seconds) is how long the ALB continues draining in-flight requests to a de-registering target before cutting it off. Set this to a value that covers your longest expected request duration.
At the ASG layer: a terminating lifecycle hook pauses instance termination in
For the instance replacement strategy itself: a rolling update replaces instances in batches. An instance refresh (the AWS-native rolling deployment feature on ASGs) handles this automatically: it replaces a configurable percentage of the fleet at a time, waits for replacements to pass health checks, and pauses if the healthy instance count drops below your threshold.
At the application layer: server.shutdown=graceful in Spring Boot tells the server to stop accepting new requests and wait for in-flight requests to complete before the JVM exits. The wait timeout is controlled by spring.lifecycle.timeout-per-shutdown-phase (for example 30s, which is also the default).
At the load balancer layer: the ALB's deregistration delay (default 300 seconds) is how long the ALB keeps draining in-flight requests to a deregistering target before cutting it off. Set this to a value that covers your longest expected request duration.
At the ASG layer: a terminating lifecycle hook pauses instance termination in the Terminating:Wait state, giving the app time to drain before the instance is stopped. Without the hook, the ASG terminates the instance on its own schedule and the ALB's deregistration delay may not complete cleanly.
For the instance replacement strategy itself: a rolling update replaces instances in batches. An instance refresh (the AWS-native rolling deployment feature on ASGs) handles this automatically: it replaces a configurable percentage of the fleet at a time, waits for replacements to pass health checks, and pauses if the healthy instance count drops below your threshold.
What scaling policy type would you choose for a Spring Boot REST API and what metric would you use?
Target tracking is the right default. It requires no CloudWatch alarm configuration, handles both gradual ramps and sudden spikes, and adjusts continuously rather than in discrete steps.
For the metric, CPU utilization is the most accessible but not always accurate. A Spring Boot app blocked on downstream I/O (RDS queries, S3 calls) can be saturated with CPU at 10%. ALBRequestCountPerTarget (a predefined target tracking metric) is often more representative: it measures the average number of requests each target received over the metric period and scales when that count exceeds your target, regardless of CPU state.
For production: publish a custom CloudWatch metric from your application, specifically the JVM thread pool queue depth or active request count from Spring Actuator's /actuator/metrics/http.server.requests. That is the most direct signal of whether the app is under pressure.
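As a rough mental model of target tracking, the arithmetic below scales capacity in proportion to how far the metric is from its target. This is an assumption for illustration, not AWS's published algorithm (the real policy also scales in more conservatively than it scales out); the method name and the sample numbers are hypothetical.

```java
final class TargetTrackingSketch {
    // Proportional model: the capacity at which the per-instance metric
    // would fall back to the target, clamped to the group's min and max size.
    static int desiredCapacity(int currentCapacity, double currentMetric,
                               double targetValue, int minSize, int maxSize) {
        int desired = (int) Math.ceil(currentCapacity * (currentMetric / targetValue));
        return Math.max(minSize, Math.min(maxSize, desired));
    }

    public static void main(String[] args) {
        // 4 instances each seeing 1500 requests against a target of 1000 per instance.
        System.out.println(desiredCapacity(4, 1500, 1000, 2, 20));
    }
}
```

The same function explains why a sensible maximum size matters: a sudden spike multiplies the desired capacity, and the clamp is the only thing that bounds the bill.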
3
Security, Encryption & Identity
Architect this phase
AWS Organization → Management Account → Member Account (your learning account) · CloudTrail → all API calls → S3 bucket (audit log) · KMS CMK → envelope-encrypts Secrets Manager, EBS, RDS (upcoming phases) · IAM Identity Center → SSO portal → developer → short-lived temporary credentials
Draw this yourself for better retention: draw.io · Official AWS Icons
Topics
Networking (continued from Phase 1)
DNS in VPC & Route 53 Resolver
What
Every VPC has a built-in DNS resolver reachable at the base CIDR + 2 address (e.g., 10.0.0.2 for a 10.0.0.0/16 VPC). Two VPC attributes control it: enableDnsSupport (enables the resolver; must be on) and enableDnsHostnames (assigns public DNS hostnames to instances with public IPs; required for Interface Endpoint private DNS). Route 53 Private Hosted Zones let you register custom DNS names (e.g., postgres.internal.company.com) that resolve only inside associated VPCs. Route 53 Resolver Endpoints extend DNS across hybrid networks: an Inbound Endpoint lets on-premises DNS servers forward queries into AWS; an Outbound Endpoint lets the VPC resolver forward queries from your instances to on-premises resolvers.
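The "base CIDR + 2" rule is plain address arithmetic, sketched here in Java. The class and method names are hypothetical, and the sketch assumes a well-formed IPv4 CIDR string; for a VPC with several CIDR blocks, the rule applies to the primary block.

```java
final class VpcResolverAddress {
    // Returns the network base address of the CIDR block plus two.
    static String forCidr(String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);

        // Pack the four octets into one 32-bit value held in a long.
        long ip = 0;
        for (String octet : parts[0].split("\\.")) {
            ip = (ip << 8) | Integer.parseInt(octet);
        }

        // Zero the host bits to get the network base, then add 2.
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        long resolver = (ip & mask) + 2;

        return ((resolver >> 24) & 0xFF) + "." + ((resolver >> 16) & 0xFF) + "."
                + ((resolver >> 8) & 0xFF) + "." + (resolver & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(forCidr("10.0.0.0/16"));
    }
}
```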
Why
Services in a VPC find each other by name, not by IP. Private Hosted Zones are how a Spring Boot app resolves postgres.internal to an RDS endpoint without hardcoding IPs. Resolver endpoints are required in hybrid deployments where on-premises systems must reach AWS-hosted services by name and vice versa.
Gotcha
- If enableDnsHostnames is off on your VPC, the Interface Endpoint private DNS override silently stops working. The SDK resolves the service endpoint to the public IP instead of the ENI in your subnet. This is the root cause when an Interface Endpoint exists but traffic still routes to the internet.
- A Private Hosted Zone must be explicitly associated with a VPC before its records resolve inside that VPC. Creating the zone is not enough.
- Route 53 Resolver Endpoints cost $0.125/hr per endpoint network interface (each endpoint needs at least two) plus $0.40 per million queries; check the Route 53 pricing page. For small deployments, consider this cost before adding endpoints purely for convenience.
IAM (Advanced)
Resource-based Policies
What
An IAM policy attached to an AWS resource rather than to an identity. Common examples: S3 bucket policies, SQS queue policies, SNS topic policies, KMS key policies, Lambda resource policies, and Secrets Manager resource policies. A resource-based policy specifies a Principal (who) and the actions they can perform on that specific resource. The principal can be an IAM role, an AWS account, an AWS service, or * (everyone).
Why
Resource-based policies enable cross-account access without an AssumeRole call. If Account B puts a bucket policy allowing Account A's role s3:GetObject, that role can read from the bucket using its own existing credentials, provided its own identity-based policy also allows the action. No extra STS call required. This is the standard pattern for SaaS services serving data to customer accounts.
Gotcha
- KMS key policies are mandatory. Unlike S3 bucket policies, a KMS key without an explicit key policy grants access to nobody. The default key policy AWS creates includes a root-account delegation statement; if you write a custom key policy from scratch and omit that statement, you can permanently lock yourself out of the key.
- For same-account access, identity-based policies alone are usually sufficient. Resource-based policies become necessary for cross-account access and for expressing conditions tied to the resource itself (e.g., deny delete unless MFA is present).
- When both policy types exist for the same account, AWS grants access if either allows it and no explicit Deny overrides it. Across accounts, both the identity-based policy on the caller and the resource-based policy on the target must allow the action.
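The same-account versus cross-account rule in the last bullet can be written as a small decision function. This is a deliberately reduced sketch: it models only one identity-based and one resource-based policy and ignores SCPs, permission boundaries, session policies, and service-specific exceptions such as KMS key policies. The `Effect` enum and method name are hypothetical.

```java
final class PolicyEvaluationSketch {
    // What a single policy says about the requested action.
    enum Effect { ALLOW, DENY, NONE } // NONE = the policy is silent

    static boolean isAllowed(Effect identityPolicy, Effect resourcePolicy, boolean sameAccount) {
        // An explicit Deny in either policy always wins.
        if (identityPolicy == Effect.DENY || resourcePolicy == Effect.DENY) {
            return false;
        }
        boolean identityAllows = identityPolicy == Effect.ALLOW;
        boolean resourceAllows = resourcePolicy == Effect.ALLOW;
        // Same account: one Allow is enough. Cross-account: both sides must allow.
        return sameAccount ? (identityAllows || resourceAllows)
                           : (identityAllows && resourceAllows);
    }

    public static void main(String[] args) {
        // A bucket policy allows the caller, but the caller's own policy is silent.
        System.out.println("same account:  " + isAllowed(Effect.NONE, Effect.ALLOW, true));
        System.out.println("cross-account: " + isAllowed(Effect.NONE, Effect.ALLOW, false));
    }
}
```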
IAM Identity Center (formerly AWS SSO)
What
A managed service for human access to AWS accounts in an organization. Instead of creating individual IAM users in each account, you connect IAM Identity Center to your identity provider (Okta, Microsoft Entra ID, Google Workspace, or the built-in directory), define Permission Sets (role templates with IAM policies), and assign users or groups to specific accounts. Developers get short-lived credentials via the AWS access portal or aws sso login on the CLI, not via long-lived Access Keys.
Why
A team with individual IAM users and Access Keys has as many long-lived credentials as employees, each needing rotation, each being a leak risk, each requiring individual revocation when someone leaves. IAM Identity Center centralises provisioning and revocation: remove someone from the IdP, and access across all AWS accounts disappears immediately.
Gotcha
- IAM Identity Center requires AWS Organizations to be enabled. It is a multi-account service by design.
- Permission Sets become IAM roles in each assigned account under the naming pattern AWSReservedSSO_PermissionSetName_xxx. Do not modify these roles directly in IAM; they are managed by Identity Center and changes get overwritten.
- aws sso login credentials are short-lived (typically 1-12 hours). Automated scripts that run unattended cannot use SSO credentials; they need a dedicated service account role with long-lived credentials stored in Secrets Manager or a dedicated IAM role assumed by the automation service (e.g., a Lambda execution role or a GitHub Actions OIDC role).
AWS Organizations & Service Control Policies (SCPs)
What
AWS Organizations lets you manage multiple AWS accounts under a single management account. Accounts are organized into Organizational Units (OUs). Service Control Policies (SCPs) are maximum-permission guardrails attached to OUs or accounts. An SCP does not grant permissions; it limits the ceiling. Even an IAM Role with AdministratorAccess cannot exceed what the SCP permits. SCPs apply to all identities in member accounts, including root users of those accounts.
Why
SCPs enforce compliance rules at the account boundary without modifying any IAM policy. Example: an SCP on all production OUs that denies any EC2 action outside us-east-1 and eu-west-1. No developer can accidentally spin up instances in a prohibited region regardless of their IAM permissions.
Gotcha
- SCPs do not apply to the management account. The management account has full access regardless of SCP configuration. This is why the management account should hold no workloads: it is a privileged account for organization administration only.
- Organizations consolidates billing: all charges across member accounts roll up to the management account, and volume discounts aggregate across the whole org. This benefit alone is worth enabling Organizations even for small teams.
- AWS service-linked roles are exempt from SCPs when they need specific permissions to function on your behalf. This prevents SCPs from accidentally breaking managed services like EKS or RDS.
Security Operations
AWS CloudTrail
What
CloudTrail records every AWS API call in your account: who called it (IAM user, role, or service), from where (IP address, console or CLI), when, what was requested, and what was returned. Management events (control-plane actions such as create, delete, or modify resources) are captured by default and viewable in Event History for 90 days at no cost. Data events (S3 object reads/writes, Lambda invocations) and Insights events (anomaly detection) are opt-in and billed separately. A Trail sends events to an S3 bucket for long-term retention and optionally to CloudWatch Logs for alerting.
Why
When a security incident happens ("who deleted that S3 bucket?", "which role changed this Security Group at 2am?", "why are there 1,000 STS calls in 60 seconds?"), CloudTrail is the first place you look. It is also a prerequisite for most compliance frameworks: SOC 2, PCI-DSS, HIPAA, and ISO 27001 all require API audit logs.
Gotcha
- Event History shows only management events for the last 90 days. For S3 object access logs or records older than 90 days, you must create a Trail. The first copy of management events per region is free; additional copies and data events are billed.
- CloudTrail logs land in S3 with up to a 15-minute delay. For near-real-time alerting on specific API calls (e.g., alert when anyone calls DeleteTrail), send the Trail to CloudWatch Logs and create a metric filter and alarm.
- Protect the Trail's S3 bucket. An attacker who compromises your account will attempt to delete the CloudTrail logs to cover their tracks. Enable S3 Object Lock on the Trail bucket and restrict cloudtrail:StopLogging and cloudtrail:DeleteTrail with an SCP or a deny policy on all non-admin roles.
AWS KMS (Key Management Service)
What
KMS manages cryptographic keys used to encrypt data at rest. A Customer Managed Key (CMK) is a key you create and control ($1/month + $0.03 per 10,000 API calls). An AWS Managed Key is created by AWS on behalf of a specific service (e.g., aws/s3, aws/rds) with no monthly fee. KMS uses envelope encryption: KMS generates a one-time data key (AES-256 symmetric), your application or the AWS service encrypts the data locally using that key, and the data key itself is encrypted with the CMK. The encrypted data key is stored alongside the ciphertext. To decrypt, KMS decrypts the data key and returns the plaintext key in memory for the calling application to use.
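The envelope flow can be tried locally with nothing but the JDK. In this sketch a locally generated AES key stands in for the CMK, which is the one thing that is not true of real KMS (a CMK never leaves the service, and the wrapping happens inside a GenerateDataKey or Decrypt call). The class, record, and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class EnvelopeEncryptionSketch {
    // What gets stored: the wrapped (encrypted) data key next to the ciphertext and its IV.
    record Envelope(byte[] wrappedDataKey, byte[] iv, byte[] ciphertext) {}

    static SecretKey newAesKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static Envelope encrypt(SecretKey masterKey, byte[] plaintext) {
        try {
            SecretKey dataKey = newAesKey();                 // one-time data key
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
            byte[] ciphertext = aes.doFinal(plaintext);      // data encrypted locally
            Cipher wrap = Cipher.getInstance("AESWrap");
            wrap.init(Cipher.WRAP_MODE, masterKey);
            byte[] wrappedKey = wrap.wrap(dataKey);          // data key encrypted with the master key
            return new Envelope(wrappedKey, iv, ciphertext); // stored together
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] decrypt(SecretKey masterKey, Envelope envelope) {
        try {
            Cipher unwrap = Cipher.getInstance("AESWrap");
            unwrap.init(Cipher.UNWRAP_MODE, masterKey);
            SecretKey dataKey = (SecretKey) unwrap.unwrap(
                    envelope.wrappedDataKey(), "AES", Cipher.SECRET_KEY);
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.DECRYPT_MODE, dataKey, new GCMParameterSpec(128, envelope.iv()));
            return aes.doFinal(envelope.ciphertext());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey masterKey = newAesKey(); // local stand-in for the CMK
        Envelope stored = encrypt(masterKey, "order-42".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(masterKey, stored), StandardCharsets.UTF_8));
    }
}
```

Note that only the small wrapped key ever needs the master key; the payload itself can be any size and is never sent anywhere.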
Why
Nearly every later phase uses KMS: RDS encrypts storage at rest, EBS volumes can be encrypted by default once you enable that setting for a region, S3 uses SSE-KMS for object encryption, and Secrets Manager encrypts secrets with a KMS key. Understanding envelope encryption explains why rotating a CMK does not require re-encrypting your data (KMS keeps the older key material, so existing encrypted data keys still decrypt, and only new encryptions use the new material), and why a CMK's raw key material cannot be exported from KMS hardware.
Gotcha
- Deleting a CMK has a mandatory 7-30 day waiting period and is irreversible. Any data encrypted with that key and not separately backed up becomes permanently unrecoverable after deletion. AWS cannot help you recover it. Before scheduling deletion, run a last-access audit in the KMS console.
- Cross-account EBS snapshot sharing requires the snapshot to be encrypted with a CMK whose key policy explicitly grants access to the target account. Snapshots encrypted with AWS Managed Keys cannot be shared cross-account.
- KMS is rate-limited; cryptographic operations share a per-region request quota (around 30,000 requests/second in some regions; the default varies, so check the KMS quotas page). High-throughput applications that call GenerateDataKey per database record can approach this limit. The correct pattern is to cache the plaintext data key in memory and call KMS only when the cache expires or the key is rotated.
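The caching pattern from the last bullet can be sketched as a small time-bounded cache. The `Supplier` here is a hypothetical stand-in for whatever code calls GenerateDataKey, and the clock is injected only so that expiry is easy to test; none of these names come from an AWS library.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class DataKeyCache<K> {
    private final Supplier<K> keySource;    // hypothetical stand-in for a GenerateDataKey call
    private final LongSupplier clockMillis; // injectable clock so expiry is easy to test
    private final long ttlMillis;
    private K cachedKey;
    private long fetchedAtMillis;
    private int fetchCount;

    DataKeyCache(Supplier<K> keySource, LongSupplier clockMillis, long ttlMillis) {
        this.keySource = keySource;
        this.clockMillis = clockMillis;
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached key, calling the key source only when the cache is empty or expired.
    synchronized K get() {
        long now = clockMillis.getAsLong();
        if (cachedKey == null || now - fetchedAtMillis >= ttlMillis) {
            cachedKey = keySource.get();
            fetchedAtMillis = now;
            fetchCount++;
        }
        return cachedKey;
    }

    synchronized int fetchCount() {
        return fetchCount;
    }

    public static void main(String[] args) {
        DataKeyCache<String> cache =
                new DataKeyCache<>(() -> "data-key", System::currentTimeMillis, 300_000);
        for (int i = 0; i < 10_000; i++) {
            cache.get(); // e.g. once per database record
        }
        System.out.println("key source calls: " + cache.fetchCount());
    }
}
```

The trade-off is the usual one: a longer TTL means fewer KMS calls but more data protected by a single data key, so the TTL is a security decision as much as a performance one.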
AWS Secrets Manager vs SSM Parameter Store
What
Both store sensitive configuration values so applications never embed credentials in code or config files. AWS Secrets Manager: purpose-built for secrets, $0.40/secret/month + $0.05 per 10,000 API calls, built-in automatic rotation for RDS, Redshift, and DocumentDB, plus Lambda-based custom rotation for any other secret. AWS Systems Manager Parameter Store: free for standard parameters (up to 4 KB, up to 10,000 parameters per account), $0.05/advanced-parameter/month, no automatic rotation. Both encrypt values using KMS and integrate with IAM for access control.
Why
Your Spring Boot application in Phase 2 must connect to Phase 4's RDS PostgreSQL. The database password must not be in application.properties, an environment variable, or baked into an AMI. The correct pattern: store it in Secrets Manager, grant the EC2 IAM Role secretsmanager:GetSecretValue on that specific secret ARN, and retrieve it at startup using the AWS SDK.
Gotcha
- When Secrets Manager rotates a secret, the value changes in place. Your application must re-fetch the credential on the next connection attempt rather than caching it indefinitely at startup. The AWS Secrets Manager JDBC driver wrapper for Java handles this automatically for database connections.
- SSM Parameter Store is throttled at 40 transactions/second by default (soft limit). Applications that fetch many parameters at startup can hit this. Use GetParametersByPath to batch fetches, or enable the higher-throughput setting (billed per API call).
- Use Secrets Manager for anything that rotates or is a credential (database passwords, API keys, TLS private keys). Use SSM Parameter Store for non-secret configuration (feature flags, service URLs, environment-specific config values). Mixing them per-secret type is fine; mixing them arbitrarily per-project creates confusion during incident response.
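The "re-fetch instead of caching forever" advice from the rotation gotcha looks roughly like this. It is a sketch under assumptions: the `Supplier` stands in for a Secrets Manager lookup, `SecurityException` stands in for whatever authentication error your driver throws, and the class name is hypothetical. For JDBC specifically, the driver wrapper mentioned above does this for you.

```java
import java.util.function.Function;
import java.util.function.Supplier;

final class RotationAwareCredentials {
    private final Supplier<String> secretSource; // hypothetical stand-in for a Secrets Manager lookup
    private String cachedPassword;

    RotationAwareCredentials(Supplier<String> secretSource) {
        this.secretSource = secretSource;
    }

    // Runs the attempt with the cached password. If it fails with an
    // authentication error, re-fetches the secret once and retries.
    synchronized <T> T withPassword(Function<String, T> attempt) {
        if (cachedPassword == null) {
            cachedPassword = secretSource.get();
        }
        try {
            return attempt.apply(cachedPassword);
        } catch (SecurityException authFailure) {
            cachedPassword = secretSource.get(); // the secret may have been rotated
            return attempt.apply(cachedPassword);
        }
    }

    public static void main(String[] args) {
        String[] currentPassword = {"v1"};
        RotationAwareCredentials credentials =
                new RotationAwareCredentials(() -> currentPassword[0]);
        Function<String, String> connect = password -> {
            if (!password.equals(currentPassword[0])) {
                throw new SecurityException("password authentication failed");
            }
            return "connected with " + password;
        };
        System.out.println(credentials.withPassword(connect));
        currentPassword[0] = "v2"; // simulate a rotation
        System.out.println(credentials.withPassword(connect));
    }
}
```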
Hands-on Tasks
Interview Q&A
What is the difference between an identity-based and a resource-based IAM policy?
An identity-based policy is attached to an IAM user, group, or role. It defines what that identity is allowed to do. It says nothing about who else can access a particular resource.
A resource-based policy is attached to an AWS resource (S3 bucket, KMS key, SQS queue, Lambda, etc.). It specifies a Principal: who can perform which actions on that specific resource. The principal can be in a different AWS account.
The key practical difference: resource-based policies enable cross-account access without an AssumeRole step. An IAM role in Account A can access an S3 bucket in Account B if the bucket policy grants Account A's role permission. With identity-based policies alone, Account A's role would need to first call sts:AssumeRole into Account B.
For same-account access, identity-based policies suffice. For cross-account access, use either a resource-based policy on the target or an STS AssumeRole into the target account (the latter is more auditable).
How should a team of 10 engineers access the AWS console and CLI in production?
Through IAM Identity Center, connected to the corporate identity provider (Okta, Active Directory, Google Workspace).
Engineers authenticate with their corporate credentials. IAM Identity Center issues short-lived temporary credentials (valid 1-12 hours). CLI access uses aws sso login; console access uses the AWS access portal.
Why this beats individual IAM users with Access Keys:
- No long-lived Access Keys to rotate, lose, or have stolen
- Offboarding is instant: remove the user from the IdP, and all AWS access disappears immediately across every account
- Permission Sets define role templates centrally; you assign a person to an account with a role, not a pile of policies
- All session activity in CloudTrail is tied to the individual's identity, not to a shared role ARN
In a solo learning account, an IAM user with an Access Key is acceptable. In any team setting, IAM Identity Center is the correct architecture.
What does CloudTrail capture and how do you use it to investigate an incident?
CloudTrail captures every AWS API call: the caller's identity (userIdentity.arn), source IP (sourceIPAddress), timestamp, service and action (eventSource, eventName), request parameters, and the response (including error codes).
Investigation workflow:
1. Identify the affected resource by its ARN or ID
2. Filter CloudTrail for the relevant action (e.g., DeleteBucket, AuthorizeSecurityGroupIngress, PutBucketPolicy)
3. Read userIdentity.arn to identify the caller; read sourceIPAddress to see whether it came from a known IP or an unexpected one
4. Widen the time window and search all API calls by that role ARN to trace what else was accessed
5. Check errorCode fields for failed access attempts that preceded the successful one
Logs land in S3 with up to a 15-minute delay. For real-time alerting (e.g., notify on DeleteTrail or ConsoleLogin from an unknown IP), pipe the Trail to CloudWatch Logs and configure metric filters.
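The investigation workflow is essentially a series of filters over event records. The sketch below runs those filters over an in-memory list; in practice the records would come from the Trail's S3 objects or a query tool. The `Event` record is a hand-picked subset of CloudTrail fields, and the ARNs and IP addresses are made up.

```java
import java.util.List;

final class CloudTrailTriage {
    // Subset of CloudTrail record fields; errorCode is null when the call succeeded.
    record Event(String eventName, String userArn, String sourceIp, String errorCode) {}

    // Step 2: find the calls to the action under investigation.
    static List<Event> byAction(List<Event> events, String eventName) {
        return events.stream().filter(e -> e.eventName().equals(eventName)).toList();
    }

    // Step 4: everything the same principal did in the time window.
    static List<Event> byCaller(List<Event> events, String userArn) {
        return events.stream().filter(e -> e.userArn().equals(userArn)).toList();
    }

    // Step 5: that principal's failed attempts.
    static List<Event> failedAttempts(List<Event> events, String userArn) {
        return byCaller(events, userArn).stream().filter(e -> e.errorCode() != null).toList();
    }

    public static void main(String[] args) {
        String suspect = "arn:aws:iam::111122223333:role/example-role"; // made-up ARN
        List<Event> events = List.of(
                new Event("GetBucketPolicy", suspect, "203.0.113.7", "AccessDenied"),
                new Event("PutBucketPolicy", suspect, "203.0.113.7", null),
                new Event("DeleteBucket", suspect, "203.0.113.7", null),
                new Event("DescribeInstances",
                        "arn:aws:iam::111122223333:role/other-role", "10.0.1.5", null));

        Event deletion = byAction(events, "DeleteBucket").get(0);
        // Step 3: who called it, and from where.
        System.out.println(deletion.userArn() + " from " + deletion.sourceIp());
        System.out.println(byCaller(events, deletion.userArn()).size() + " calls by that role, "
                + failedAttempts(events, deletion.userArn()).size() + " failed");
    }
}
```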
What is envelope encryption and why does KMS use it instead of encrypting data directly?
Envelope encryption is a two-layer scheme:
1. KMS generates a one-time symmetric data key (AES-256) via GenerateDataKey
2. Your application (or the AWS service) encrypts the data locally with the data key
3. KMS encrypts the data key itself with the CMK (GenerateDataKey returns this encrypted copy together with the plaintext key)
4. The encrypted data key is stored alongside the ciphertext
To decrypt: call KMS to decrypt the data key, then use the plaintext data key in memory to decrypt the data, then discard the plaintext key.
Why not let KMS encrypt data directly? Three reasons:
- Size: KMS direct encryption handles only up to 4 KB. Database records and files are larger
- Performance: every read or write would make a network call to KMS (1-5ms each). Local AES-256 is orders of magnitude faster
- Cost: KMS charges per API call; encrypting every database record individually would be expensive at scale
The rotation insight: rotating a CMK does not re-encrypt your data, or even your existing data keys. KMS retains the previous key material and uses it to decrypt data keys that were wrapped before the rotation, while new data keys are wrapped with the new material. This is why key rotation is cheap even when you have terabytes of encrypted data.
4
RDS Integration
Architect this phase
Spring Boot (EC2, Private Subnet, behind ALB) → port 5432 → RDS PostgreSQL (Private Subnet)
The diagram shows the production target. This phase continues using the Phase 2 EC2, which sits in a public subnet so you can SSH and SCP to it directly; in production the app server belongs in a private subnet behind the ALB and is never publicly addressable. Draw it: draw.io · AWS Icons
Topics
RDS Core Concepts
RDS (Relational Database Service)
What A managed database service. AWS handles: OS patching, DB software installation and upgrades, automated backups, Multi-AZ failover, and storage scaling. You manage only the schema and queries. Supports PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora.
Why Running PostgreSQL on a plain EC2 means you own backups, patching, HA, and failover yourself. In production, that operational burden is enormous. RDS trades cost for time.
Gotcha
- db.t3.micro is the right size for this phase. On accounts created before July 15, 2025 it is free tier eligible (750 hrs/month for MySQL, PostgreSQL, or MariaDB, not Aurora). Newer accounts pay by the hour (single-digit cents; check the RDS pricing page), drawn from the Free Tier credits.
- RDS must go in private subnets. It should never have a public IP. Access it only from within the VPC via Security Groups.
- The RDS endpoint looks like: my-db.abc123.us-east-1.rds.amazonaws.com. Use this as your JDBC host.
Aurora vs Standard RDS
What Aurora is AWS's cloud-native database, MySQL- and PostgreSQL-compatible but with a fundamentally different architecture: storage is separated from compute and automatically replicated 6 ways across 3 AZs. Aurora offers: faster failover (around 30 seconds vs 60-120 for Multi-AZ RDS), up to 15 Aurora Replicas that share the cluster storage volume (standard RDS read replicas each carry a full copy of the data), auto-scaling storage (up to 128 TB), and Global Database (one writer region plus read-only secondary regions). Aurora Serverless v2 auto-scales the compute tier in response to load.
Why For production systems requiring sub-30-second failover, many read replicas, or a massive single database, Aurora is the better choice despite being ~20% more expensive than RDS. Standard RDS is the right choice for dev/learning and smaller production workloads.
Gotcha
- Use standard RDS PostgreSQL for this phase's hands-on work: a db.t3.micro is the cheapest managed PostgreSQL you can run, and Aurora costs noticeably more at this scale.
- Aurora has its own endpoint types: cluster endpoint (always points to the primary, use for writes), reader endpoint (load-balances across read replicas). Your app should use both if running read replicas.
RDS Proxy
What RDS Proxy is a fully managed connection pooler that sits between your application and RDS. It maintains a pool of long-lived connections to the database and multiplexes thousands of application-side connections into far fewer database connections. Critical for Lambda → RDS scenarios: each Lambda invocation opens a new DB connection, and with 1000 concurrent Lambdas you'd exhaust PostgreSQL's default 100-connection limit immediately.
Why RDS PostgreSQL derives max_connections from instance memory: roughly 100 on a db.t3.micro, thousands on large instances (capped at 5000). Serverless architectures (Lambda) and container scaling (ECS/EKS with many pods) can each open separate connections and exhaust the limit, causing FATAL: sorry, too many clients already. RDS Proxy solves this without changing application code: point the JDBC URL at the proxy endpoint instead of the DB endpoint.
Gotcha
- RDS Proxy costs ~$0.015 per vCPU-hour of the target database's capacity, with a 2-vCPU minimum for small instances; check the RDS Proxy pricing page. Worth it in Lambda-heavy architectures; overkill for a single EC2 app with HikariCP.
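- RDS Proxy needs the database credentials stored in Secrets Manager so it can open its own connections to the database. Clients still authenticate to the proxy, either with the database username and password or with IAM authentication. Enforcing IAM authentication is a security improvement: no passwords in JDBC strings.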
Multi-AZ Deployment
What RDS maintains a synchronous standby replica in a different AZ. Every write to the primary is simultaneously written to the standby. If the primary fails, RDS automatically promotes the standby. The CNAME endpoint automatically points to the new primary. Typical failover time: 60-120 seconds.
Why Protects against an AZ outage or hardware failure. Without Multi-AZ, a failed DB instance means downtime until you restore from backup.
Gotcha
- The standby in a Multi-AZ instance deployment is not readable. It exists purely for failover. You cannot send read traffic to it to reduce load on the primary, that is what Read Replicas are for. A Multi-AZ DB cluster deployment is different: it runs two readable standbys in other AZs and typically fails over in around 35 seconds.
- Multi-AZ roughly doubles the RDS cost. In dev/learning environments, leave it off.
Read Replicas
What Asynchronous copies of the primary RDS instance. You can create up to 15 read replicas per source instance (PostgreSQL, MySQL, MariaDB). Your application explicitly directs read-heavy queries (reports, analytics) to a replica's endpoint, reducing load on the primary. Replicas can be in the same region, a different region, or promoted to a standalone DB.
Why Scale read throughput horizontally without upgrading the primary instance size.
Gotcha
- Replication is asynchronous, replicas may lag behind the primary by seconds. Never use a replica for anything requiring up-to-date data (e.g., immediately after a write).
- Multi-AZ ≠ Read Replica. Multi-AZ is for availability (automatic failover). Read Replicas are for scalability (read offloading). These are separate features and can both be enabled simultaneously.
Automated Backups & Snapshots
What Automated backups: RDS takes a daily snapshot of the DB during the backup window, plus continuously backs up transaction logs to S3 (not visible in your S3 bucket, managed by RDS). This enables point-in-time recovery to any second within the retention window (1-35 days, default 7). Manual snapshots: you trigger these yourself and they persist until you delete them.
Why Automated backups are your safety net for data corruption and accidental deletion. Take a manual snapshot before any major schema migration.
Gotcha
- Automated backups are deleted when you delete the RDS instance (unless you take a final snapshot). Manual snapshots survive instance deletion.
- Restoring a snapshot creates a new RDS instance, it does not restore in place. Plan for the new endpoint in your app config.
AWS Secrets Manager
What A managed service for securely storing, rotating, and retrieving secrets (database passwords, API keys, connection strings). Secrets are encrypted with KMS. Your application retrieves the secret at runtime via the AWS SDK, no plaintext credentials in config files, environment variables hardcoded in systemd units, or Docker images.
Why The most common AWS security mistake: DB credentials hardcoded in application.properties and committed to Git. Secrets Manager solves this: your EC2 IAM Role has secretsmanager:GetSecretValue permission, the app fetches the secret at startup, and credentials never appear in source control or process environment dumps.
Retrieving a secret via AWS SDK v2
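import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueResponse;

// Region and credentials come from the default provider chain (on EC2, the instance role).
try (SecretsManagerClient client = SecretsManagerClient.create()) {
    GetSecretValueResponse r = client.getSecretValue(
        GetSecretValueRequest.builder()
            .secretId("prod/myapp/db")
            .build());
    String secretJson = r.secretString(); // JSON: {"username":"dbadmin","password":"..."}
    // Parse secretJson with ObjectMapper, inject into DataSource
}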
Alternatively, Spring Cloud AWS (io.awspring.cloud:spring-cloud-aws-starter-secrets-manager) loads the secret into the Spring Environment: add spring.config.import=aws-secretsmanager:prod/myapp/db to application.properties (the part after the prefix is the secret name, so it must match the name used above), and the keys inside the secret JSON resolve as normal placeholders (${username}, ${password}). No boilerplate SDK code needed.
Gotcha
- Secrets Manager costs $0.40 per secret per month + $0.05 per 10,000 API calls. Negligible at learning scale.
- IAM permission required: secretsmanager:GetSecretValue on the specific secret ARN, not on *.
- For local development, use environment variables or a local .env file (excluded from Git). Never configure Secrets Manager with hardcoded access keys locally, that defeats the purpose.
Connecting Spring Boot to RDS
What From Spring Boot's perspective, RDS PostgreSQL is a standard PostgreSQL instance. The JDBC URL uses the RDS endpoint. The Security Group on the RDS instance must allow port 5432 from the EC2 instance's Security Group.
Why Understanding the connectivity model (SG → SG, not IP → IP) is important for troubleshooting and for interviews.
Config in application.properties
spring.datasource.url=jdbc:postgresql://my-db.abc123.us-east-1.rds.amazonaws.com:5432/mydb?ssl=true&sslmode=require
spring.datasource.username=dbadmin
spring.datasource.password=${DB_PASSWORD}
# Pool sizing formula: (core_count × 2) + effective_spindle_count
# For db.t3.micro (2 vCPU): (2 × 2) + 1 = 5; 10 is a safe ceiling for light workloads
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=2
spring.datasource.hikari.connection-timeout=3000
spring.datasource.hikari.idle-timeout=600000
Store the password in an environment variable or AWS Secrets Manager, never hardcode it in source files.
ssl=true&sslmode=require: RDS supports TLS by default; enforce it in the JDBC URL to encrypt credentials and data in transit.
Gotcha
- A connection timeout points to the Security Group (or routing); an authentication failure points to credentials. Diagnose by error type before changing anything.
- HikariCP defaults to 10 connections per app instance, and a db.t3.micro allows roughly 100 total. A few app instances plus stray psql sessions reach that ceiling sooner than you expect; size pools deliberately.
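The sizing comment in the config and the connection ceiling in the last bullet come down to a few lines of arithmetic. A minimal sketch, assuming the roughly-100-connection ceiling quoted above and a hypothetical reserve of 5 connections for admin sessions; the class and method names are illustrative and not part of HikariCP or Spring.

```java
// Hypothetical helper for reasoning about pool sizes; not part of HikariCP or Spring.
final class PoolSizing {

    // Formula from the config comment: (core_count x 2) + effective_spindle_count.
    static int formulaPoolSize(int dbCoreCount, int effectiveSpindleCount) {
        return dbCoreCount * 2 + effectiveSpindleCount;
    }

    // Largest per-instance pool that keeps all app instances under the database's
    // connection ceiling while leaving a few connections for psql/admin sessions.
    static int maxPoolPerInstance(int dbMaxConnections, int reservedForAdmin, int appInstances) {
        if (appInstances <= 0) {
            throw new IllegalArgumentException("appInstances must be positive");
        }
        return Math.max(0, (dbMaxConnections - reservedForAdmin) / appInstances);
    }

    public static void main(String[] args) {
        System.out.println(formulaPoolSize(2, 1));          // 5
        System.out.println(maxPoolPerInstance(100, 5, 4));  // 23
        System.out.println(maxPoolPerInstance(100, 5, 12)); // 7, below the Hikari default of 10
    }
}
```

With 12 app instances the per-instance budget drops below Hikari's default of 10, which is the point at which leaving maximum-pool-size unset starts to exhaust the database.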
Hands-on Tasks
Interview Q&A
Why use RDS instead of running PostgreSQL yourself on EC2?
With RDS, AWS handles automated backups, point-in-time recovery, OS and engine patching, Multi-AZ failover, and storage auto-scaling. On a self-managed EC2 PostgreSQL setup, your team owns all of that, which means writing backup scripts, managing cron jobs, handling failover manually, and staying on top of security patches. For most product teams, that operational burden is not the product they're building. RDS trades higher cost for lower operational overhead. The trade-off shifts when you have very specific PostgreSQL configuration requirements or extreme cost constraints at scale.
What's the difference between Multi-AZ and a Read Replica?
Multi-AZ is for availability. It maintains a synchronous standby replica in a different AZ. If the primary fails, RDS automatically fails over to the standby (60-120 seconds). The standby cannot serve read traffic, it exists solely for failover. Your app needs no changes; the endpoint CNAME switches automatically.
Read Replica is for scalability. It is an asynchronous copy of the primary. Your application must explicitly connect to the replica's separate endpoint for read queries. Replication lag means replicas may not have the most recent writes. Read Replicas can be promoted to standalone databases if needed.
You can (and often do) have both enabled at the same time on the same instance.
How do you connect Spring Boot to RDS securely?
Three layers: Network: RDS is in a private subnet with no public IP. The RDS Security Group allows port 5432 only from the application server's Security Group (not from an IP address range). Credentials: database username and password are stored in AWS Secrets Manager or SSM Parameter Store, not in application.properties or source code. The EC2 IAM Role grants permission to retrieve the secret at startup. Transport: enable SSL/TLS on the JDBC connection for encryption in transit (ssl=true&sslmode=require in the JDBC URL). require encrypts but does not verify the server's identity; to also verify, download the RDS CA bundle and use sslmode=verify-full.
What is your backup and disaster recovery strategy for RDS?
Three components: Automated backups with a 7-day retention window provide point-in-time recovery to any second within that window, covers data corruption and accidental deletion. Manual snapshots taken before schema migrations or major releases persist indefinitely and can be used to create a new instance if a migration goes wrong. Multi-AZ handles the infrastructure failure case, if the primary AZ goes down, failover happens automatically without restoring from backup.
For critical systems, also consider cross-region read replicas which can be promoted during a regional outage, giving a lower RTO than restoring from a cross-region snapshot copy.
5
S3 Storage
Topics
S3 Fundamentals
Buckets and Objects
What S3 is object storage, not a file system. You store objects (any file up to 5 TB) inside buckets. A bucket is a top-level container. Objects are identified by a key (a string like uploads/2024/report.pdf); there are no real folders, only naming conventions. S3 is designed for 99.999999999% (11 nines) durability by redundantly storing data across multiple AZs.
Why S3 is the standard for file/blob storage in AWS. Infinitely scalable, no capacity planning, pay only for what you store.
Gotcha
- Bucket names are globally unique across all AWS accounts worldwide. If someone else already has my-app-uploads, you cannot use it. Use a prefix like your company name or account ID.
- Bucket names must be 3-63 characters, lowercase, no underscores.
- Buckets are created in a specific region. Data stays in that region unless you explicitly replicate it.
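The naming rules above are easy to check before calling AWS. A minimal sketch that covers only the basic rules (length, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit, no consecutive dots); it does not cover every S3 rule, such as IP-address-like names or reserved prefixes. The class name BucketNames is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical pre-flight check for bucket names; covers the basic rules only.
final class BucketNames {

    // 3-63 characters; lowercase letters, digits, dots, hyphens;
    // must start and end with a letter or digit.
    private static final Pattern BASIC =
            Pattern.compile("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$");

    static boolean looksValid(String name) {
        return name != null
                && BASIC.matcher(name).matches()
                && !name.contains(".."); // consecutive dots are not allowed
    }

    public static void main(String[] args) {
        System.out.println(looksValid("acme-my-app-uploads")); // true
        System.out.println(looksValid("My_App_Uploads"));      // false
    }
}
```

Passing this check does not mean the name is available; global uniqueness can only be confirmed by attempting to create the bucket.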
Versioning
What When versioning is enabled on a bucket, every upload creates a new version of the object instead of overwriting the previous one. Each version has a unique version ID. Deleted objects get a delete marker, the object is not gone and can be recovered by removing the marker.
Why Protection against accidental overwrites and deletions. Essential for any bucket that stores important data. Also required for S3 replication.
Gotcha
- Once enabled, versioning cannot be fully disabled, only suspended. Suspended means new uploads no longer create versions, but existing versions are preserved.
- Old versions accumulate storage costs. Pair versioning with a lifecycle rule to expire non-current versions after N days.
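The delete-marker behaviour is easier to reason about with a toy model. The sketch below is an in-memory illustration of one key in a versioned bucket, not an S3 client; VersionedKey and its methods are hypothetical names, and real S3 identifies versions by version ID rather than by stack position.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Toy in-memory model of one key in a versioned bucket; not an S3 client.
final class VersionedKey {
    private static final Object DELETE_MARKER = new Object();
    private final Deque<Object> versions = new ArrayDeque<>(); // newest first

    // Every upload adds a new version instead of overwriting the previous one.
    void put(String content) {
        versions.push(content);
    }

    // A plain delete adds a marker on top; older versions stay stored (and billed).
    void delete() {
        versions.push(DELETE_MARKER);
    }

    // A read without a version ID sees only the newest entry.
    Optional<String> get() {
        Object top = versions.peek();
        return (top == null || top == DELETE_MARKER)
                ? Optional.empty()
                : Optional.of((String) top);
    }

    // Removing the marker makes the previous version current again.
    boolean removeDeleteMarker() {
        if (versions.peek() == DELETE_MARKER) {
            versions.pop();
            return true;
        }
        return false;
    }

    // Number of real object versions still held, ignoring delete markers.
    int storedVersions() {
        return (int) versions.stream().filter(v -> v != DELETE_MARKER).count();
    }
}
```

After put, put, delete, the model still holds two stored versions even though a read returns nothing, which is why non-current versions need a lifecycle rule.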
Storage Classes
What S3 offers multiple storage tiers with different cost/access-speed trade-offs:
- Standard (~$0.023/GB): frequent access, lowest latency. Default.
- Standard-IA (Infrequent Access, ~$0.0125/GB + retrieval fee): for files accessed less than once a month.
- Glacier Instant Retrieval (~$0.004/GB): archival, millisecond retrieval.
- Glacier Flexible Retrieval (~$0.0036/GB): archival, minutes to hours retrieval.
- Glacier Deep Archive (~$0.00099/GB): cheapest, 12-hour retrieval. For compliance archives.
- Intelligent-Tiering: automatically moves objects between tiers based on access patterns. Monitoring fee per object.
Why Storage classes let you cut costs significantly for data that is not accessed frequently. A lifecycle policy can automate transitions.
Gotcha
- IA classes have a minimum storage duration: 30 days for Standard-IA, 90 days for Glacier Instant Retrieval and Glacier Flexible Retrieval, 180 days for Glacier Deep Archive. Objects deleted before the minimum are still billed for the full duration.
Lifecycle Policies
What Rules that automatically transition objects between storage classes or delete them after a set number of days. Example: transition to Standard-IA after 30 days → Glacier after 90 days → delete after 365 days. Applied at the bucket or prefix level.
Why Lifecycle policies are the hands-off way to manage storage costs. Without them, old objects accumulate in Standard storage indefinitely.
Gotcha
- Objects must sit in Standard for 30 days before a lifecycle rule can transition them to Standard-IA or One Zone-IA. A "transition after 7 days" rule to IA is not possible; S3 rejects the lifecycle configuration.
- Transitions are billed per request (~$0.01 per 1,000 to IA, more to Glacier tiers). Transitioning millions of tiny objects can cost more than it saves; for small objects, expiry is often the better rule.
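The claim that tiny objects cost more to transition than they save can be checked with the approximate prices quoted in this section. A back-of-the-envelope sketch, assuming Standard at $0.023/GB, Standard-IA at $0.0125/GB and $0.01 per 1,000 transition requests; it deliberately ignores retrieval fees and any minimum billable object size, both of which make small objects look even worse. The class name is hypothetical.

```java
// Back-of-the-envelope arithmetic using the approximate prices quoted in this section.
final class TransitionBreakEven {
    static final double STANDARD_PER_GB_MONTH = 0.023;
    static final double STANDARD_IA_PER_GB_MONTH = 0.0125;
    static final double TRANSITION_PER_1000_REQUESTS = 0.01;

    // Months of storage savings needed to pay back the one-off transition request
    // for a single object of the given size.
    static double breakEvenMonths(long objectBytes) {
        double sizeGb = objectBytes / (1024.0 * 1024.0 * 1024.0);
        double monthlySaving = sizeGb * (STANDARD_PER_GB_MONTH - STANDARD_IA_PER_GB_MONTH);
        double transitionCost = TRANSITION_PER_1000_REQUESTS / 1000.0;
        return transitionCost / monthlySaving;
    }

    public static void main(String[] args) {
        System.out.printf("10 KB object: %.1f months%n", breakEvenMonths(10 * 1024L));
        System.out.printf("1 MB object:  %.1f months%n", breakEvenMonths(1024L * 1024L));
    }
}
```

Under these assumptions a 10 KB object needs on the order of 100 months to pay back its transition request, while a 1 MB object breaks even in about one month.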
Presigned URLs
What A presigned URL is a time-limited URL generated server-side that grants temporary access to a private S3 object without making the bucket or object public. The URL contains an embedded signature with an expiry. Anyone with the URL can GET (download) or PUT (upload) the object until expiry, no AWS credentials needed.
Why Standard pattern for user file uploads and downloads: your backend generates the presigned URL and hands it to the client. The client transfers directly to/from S3, your backend never touches the file bytes, saving bandwidth and compute.
Gotcha
- Presigned URLs inherit the permissions of the IAM identity that generated them. If your EC2 IAM Role has s3:GetObject on the bucket, it can generate presigned GET URLs. Without the permission, the URL will fail.
- Maximum expiry: 7 days (604800 seconds), and only when the URL is signed with long-lived IAM user credentials. A URL signed with temporary credentials (an EC2 instance role, ECS task role, Lambda execution role) dies when that session expires, typically within 6 hours, regardless of the expiry you request. If you need long-lived URLs, sign them with a dedicated IAM user.
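The expiry gotcha reduces to taking the shortest of three durations. The sketch below models the outcome only, not the SDK's behaviour (the class PresignLifetime is hypothetical, and a real presigner may reject an over-long expiry rather than cap it).

```java
import java.time.Duration;

// Models how long a presigned URL can actually work; not an AWS SDK class.
final class PresignLifetime {
    static final Duration MAX_EXPIRY = Duration.ofDays(7); // 604800 seconds

    // Shortest of: the requested expiry, the 7-day cap, and (for temporary
    // credentials) the time left on the session that signed the URL.
    // Pass null for sessionRemaining when signing with long-lived IAM user keys.
    static Duration effective(Duration requested, Duration sessionRemaining) {
        Duration capped = requested.compareTo(MAX_EXPIRY) > 0 ? MAX_EXPIRY : requested;
        if (sessionRemaining == null) {
            return capped;
        }
        return capped.compareTo(sessionRemaining) > 0 ? sessionRemaining : capped;
    }

    public static void main(String[] args) {
        // A 7-day URL signed by an instance role with 6 hours of session left.
        System.out.println(effective(Duration.ofDays(7), Duration.ofHours(6))); // PT6H
    }
}
```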
Access Control, Block Public Access
What S3 has four "Block Public Access" settings at the account and bucket level. Enabling all four prevents any object in the bucket from being made publicly readable, regardless of bucket policy or object ACLs. Block Public Access is enabled by default on all new buckets.
Why Public S3 buckets have caused high-profile data breaches. "Block Public Access" is a safety net. Enable it on every bucket that is not intentionally serving public content (e.g., a static website).
Gotcha
- Bucket policies still control which IAM identities (your EC2 role, Lambda function) can access objects, Block Public Access only prevents public (unauthenticated) access.
- Never use bucket ACLs, they are a legacy mechanism. Use bucket policies and IAM policies instead.
S3 Event Notifications & Advanced Patterns
S3 Event Notifications
What S3 can publish events (ObjectCreated, ObjectDeleted, ObjectRestore) to Lambda, SQS, or SNS when objects are uploaded or deleted. This is the foundation of file-processing pipelines: user uploads a CSV → S3 → Lambda → parse and load into RDS. Configure in bucket Properties → Event Notifications.
Why Your backend API never needs to poll for new files. The upload triggers processing automatically and asynchronously, the client's upload completes immediately, and the processing happens in the background. This pattern scales to millions of files without changing the application.
Gotcha
- S3 Event Notifications are at-least-once, a Lambda or SQS consumer may receive the same event twice on rare occasions. Make your processing idempotent (check if the object was already processed before doing work).
- Use an SQS queue between S3 and Lambda rather than invoking Lambda directly, SQS buffers events during Lambda throttling and provides a DLQ for failed processing.
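Idempotent processing, as described in the first bullet, usually means recording an identifier for each event before doing the work. A minimal sketch, assuming the identifier is built from bucket, key, and a version ID or ETag; the in-memory set stands in for a durable store (for example a database table with a unique constraint), and all names here are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: the in-memory set stands in for a durable "already processed" store.
final class IdempotentProcessor {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    private final AtomicInteger workDone = new AtomicInteger();

    // Returns true if the event was processed, false if it was a duplicate delivery.
    boolean handle(String bucket, String key, String versionOrETag) {
        String eventId = bucket + "/" + key + "#" + versionOrETag;
        if (!processed.add(eventId)) {
            return false; // seen before: skip the work
        }
        workDone.incrementAndGet(); // placeholder for the real parse-and-load step
        return true;
    }

    int workDone() {
        return workDone.get();
    }
}
```

Including the version ID or ETag in the identifier means a genuine re-upload of the same key is processed again, while a redelivered notification for the same upload is not.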
Multipart Upload
What S3 requires multipart upload for objects larger than 5 GB, and recommends it for objects larger than 100 MB. The file is split into parts (minimum 5 MB each, except the last) that are uploaded in parallel. Failed parts are retried individually without restarting the entire upload. AWS SDK v2's S3TransferManager handles multipart upload automatically.
Why For files over 100 MB, a single-part upload fails if the connection drops midway and the entire transfer must restart. Parts upload in parallel, saturating available bandwidth more efficiently. S3TransferManager handles chunking and per-part retry automatically; you call uploadFile() and it handles the rest.
S3TransferManager
// Requires the software.amazon.awssdk:s3-transfer-manager artifact (version managed by the BOM)
import java.nio.file.Path;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.FileUpload;

S3TransferManager manager = S3TransferManager.create();
FileUpload upload = manager.uploadFile(b -> b
    .putObjectRequest(r -> r.bucket("my-bucket").key("large-file.csv"))
    .source(Path.of("/tmp/large-file.csv")));
upload.completionFuture().join(); // blocks until every part has been uploaded
Gotcha
- Incomplete multipart uploads accumulate storage charges for the uploaded parts. Add a lifecycle rule: "Abort incomplete multipart uploads after 7 days."
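The part-size limits translate into simple arithmetic. The sketch below assumes the 5 MB minimum part size stated above plus S3's limit of 10,000 parts per upload; it only illustrates the constraints and is not how S3TransferManager chooses its part size. MultipartMath is a hypothetical name.

```java
// Arithmetic for multipart uploads; illustrates the limits, not the SDK's own strategy.
final class MultipartMath {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;
    static final long MIN_PART_BYTES = 5 * MIB; // minimum for every part except the last
    static final int MAX_PARTS = 10_000;        // S3 limit on parts per upload

    // Number of parts needed, rounding up for the final (possibly smaller) part.
    static long partCount(long objectBytes, long partBytes) {
        return (objectBytes + partBytes - 1) / partBytes;
    }

    // Smallest part size that respects both the 5 MB minimum and the 10,000-part limit.
    static long minPartSizeFor(long objectBytes) {
        long neededForPartLimit = (objectBytes + MAX_PARTS - 1) / MAX_PARTS;
        return Math.max(MIN_PART_BYTES, neededForPartLimit);
    }

    public static void main(String[] args) {
        System.out.println(partCount(100 * MIB, 8 * MIB)); // 13
        System.out.println(minPartSizeFor(GIB) / MIB);     // 5
    }
}
```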
Spring Boot Integration (AWS SDK v2)
Using AWS SDK v2 for Java with S3
What AWS SDK v2 (software.amazon.awssdk) is the current Java SDK. Use S3Client for synchronous operations or S3AsyncClient for async. The SDK automatically picks up credentials from the EC2 Instance Profile (IAM Role), no hardcoded keys needed.
Why The v1 SDK (com.amazonaws) reached end of support on December 31, 2025 and receives no patches of any kind; treat any tutorial or codebase importing it as legacy code that needs migrating. SDK v2 is async-first, ships non-blocking HTTP clients, and is the only supported version. Use the BOM to avoid pinning individual artifact versions that will become stale.
pom.xml dependency
<!-- Use the AWS SDK BOM, no individual version needed -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>bom</artifactId>
      <version>2.46.7</version> <!-- current at time of writing; check central.sonatype.com/artifact/software.amazon.awssdk/bom -->
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>s3</artifactId>
  <!-- version managed by BOM -->
</dependency>
Key operations: s3Client.putObject(), s3Client.getObject(), s3Presigner.presignGetObject(). The SDK uses the default credential provider chain, on EC2 with an IAM Role, it reads temporary credentials from the instance metadata endpoint automatically. Use the AWS SDK BOM and omit individual version numbers rather than pinning a specific version that will become stale.
Hands-on Tasks
Interview Q&A
Why store files in S3 instead of in the database?
Databases are optimised for structured data and queries, not binary blobs. Storing large files in a DB increases backup size, slows queries unrelated to those files, and doesn't scale economically. S3 is purpose-built for object storage: it costs a fraction of DB storage (~$0.023/GB vs ~$0.115/GB for RDS gp3), scales infinitely, and offers 11 nines of durability. Files stored in S3 can also be served directly to clients via presigned URLs, bypassing your application servers entirely and saving bandwidth and compute.
What are presigned URLs and when do you use them?
A presigned URL is a time-limited, signature-embedded URL that grants temporary access to a private S3 object. Your backend generates it using the AWS SDK (requires IAM permission on the bucket) and returns it to the client. The client then uploads or downloads directly from S3 using that URL, no AWS credentials needed, and your backend never handles the file bytes.
Use them for: user file uploads (presigned PUT URL → client uploads directly), file downloads (presigned GET URL → client downloads directly), and sharing private documents with external parties for a limited time. The key benefit is that large file transfers bypass your application servers completely.
How do you prevent an S3 bucket from being accidentally made public?
Two layers: first, enable all four "Block Public Access" settings on the bucket (and ideally at the account level), this acts as a guardrail that prevents any public bucket policies or public ACLs from taking effect, even if someone accidentally adds one. Second, use AWS Config or AWS Organizations Service Control Policies (SCPs) to enforce that Block Public Access stays enabled across all buckets in the account. Never grant s3:PutBucketPolicy to application IAM roles, only administrators should be able to modify bucket policies.
6
Monitoring and Observability
Topics
Application Metrics (Spring Boot → CloudWatch)
Micrometer + CloudWatch (Custom Application Metrics)
What Micrometer is the metrics façade built into Spring Boot Actuator. Add the micrometer-registry-cloudwatch2 dependency and wire a CloudWatchMeterRegistry bean. Micrometer then pushes metrics (HTTP request rate, latency percentiles, JVM heap, DB connection pool size, your custom counters/timers) to CloudWatch under a namespace you define. This is how your application tells you what it's doing, not only what the VM is doing.
Why CloudWatch without Micrometer only shows infrastructure metrics (CPU, disk). It cannot tell you: "95th percentile latency on the /checkout endpoint is 420ms" or "Hikari pool is at 90% capacity." These are the signals that matter for production incidents.
Setup (manual bean, works on any Boot version)
<!-- pom.xml -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-cloudwatch2</artifactId>
</dependency>
// MetricsConfig.java
// Spring Boot core does not auto-configure CloudWatch export;
// auto-configuration for it lives in Spring Cloud AWS (io.awspring.cloud).
// This manual bean needs no extra starter and works on any Boot version.
import io.micrometer.cloudwatch2.CloudWatchConfig;
import io.micrometer.cloudwatch2.CloudWatchMeterRegistry;
import io.micrometer.core.instrument.Clock; // NOT java.time.Clock
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

@Configuration
public class MetricsConfig {

    // Supplies the registry settings; any key not listed falls back to Micrometer's default.
    @Bean
    public CloudWatchConfig cloudWatchConfig() {
        return key -> switch (key) {
            case "cloudwatch.namespace" -> "MyApp";
            case "cloudwatch.step" -> "PT1M";
            default -> null;
        };
    }

    // Publishes all Micrometer meters to CloudWatch once per step interval.
    @Bean
    public CloudWatchMeterRegistry cloudWatchMeterRegistry(
            CloudWatchConfig config) {
        return new CloudWatchMeterRegistry(
                config, Clock.SYSTEM, CloudWatchAsyncClient.create());
    }
}
Custom timer: @Timed("checkout.duration") on a service method, or registry.timer("payment.latency").record(duration). The EC2 IAM Role needs cloudwatch:PutMetricData; that is the only permission the exporter calls. CloudWatchAsyncClient.create() picks up the EC2 instance profile credentials and region automatically via the default credential chain. To pin the region explicitly (safer when running outside EC2 or if IMDS is slow): CloudWatchAsyncClient.builder().region(Region.of("us-east-1")).build(). Add import software.amazon.awssdk.regions.Region;.
Gotcha
- CloudWatch export is not part of Spring Boot core. Unlike Prometheus or OTLP, CloudWatch is absent from Boot's auto-configured registry list, and management.cloudwatch.metrics.export.* properties take effect only when Spring Cloud AWS (io.awspring.cloud) is on the classpath providing the auto-configuration. Without it, the properties are silently ignored: the app starts fine, metrics never appear. Either add Spring Cloud AWS or wire the manual @Bean above. Diagnose via /actuator/beans (expose it first: management.endpoints.web.exposure.include=health,beans): if there is no cloudWatchMeterRegistry bean, no exporter is wired.
- Micrometer pushes metrics in batches (default every 1 minute). CloudWatch charges per custom metric per month (~$0.30). With many fine-grained tags, costs escalate. Filter which metrics are exported.
- Tag all metrics with the service name for multi-service dashboards. In the manual bean approach, add this to cloudWatchMeterRegistry: registry.config().commonTags("application", appName);
- cloudwatch:ListMetrics denied in CloudTrail or the IAM Policy Simulator looks alarming when debugging metric publishing. It is harmless: the Micrometer CloudWatch exporter only calls PutMetricData and never calls ListMetrics. Do not add cloudwatch:ListMetrics to the role policy based on this denial. It will not fix anything.
AWS X-Ray: Distributed Tracing
What X-Ray records every request as a trace, a tree of segments (one per service) and subsegments (one per downstream call: RDS query, S3 call, Lambda invocation). You see exactly how long each part took, which service caused a slowdown, and the full call graph for any individual request.
Why CloudWatch Logs tells you something went wrong. X-Ray tells you where and why, across service boundaries. Essential once you have more than one service. You cannot answer "why was this checkout request slow?" without distributed tracing.
Spring Boot Setup, recommended: AWS Distro for OpenTelemetry (ADOT)
<!-- Recommended: OpenTelemetry starter (ADOT-compatible) -->
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>2.16.0</version> <!-- current at time of writing; check Maven Central -->
</dependency>
<!-- Configure OTEL_EXPORTER_OTLP_ENDPOINT to point at the ADOT collector. -->
<!-- The ADOT collector runs as a sidecar or daemon and forwards traces to X-Ray. -->
The AWS X-Ray SDK for Java (com.amazonaws:aws-xray-recorder-sdk-spring) entered maintenance mode in February 2026. For new code, use OpenTelemetry (shown above).
IAM permission needed: xray:PutTraceSegments, xray:PutTelemetryRecords. X-Ray console shows a Service Map, a live topology of your system.
Gotcha
- X-Ray uses sampling by default (the first request each second, plus 5% of any additional requests). Don't panic when you can't find a specific trace, it may not have been sampled. Increase the rate in the X-Ray sampling rules for debugging.
- Instrument with OpenTelemetry (ADOT). The X-Ray SDKs and daemon entered maintenance mode in February 2026 and AWS directs all new tracing work to OpenTelemetry; ADOT is AWS's supported distribution of it, and it stays vendor-neutral (the same instrumentation feeds Jaeger or Grafana Tempo). Use the X-Ray SDK only in codebases already wired to it.
- CloudWatch Application Signals is AWS's lead APM offering: ADOT auto-instrumentation plus the CloudWatch agent's OTLP endpoint, surfacing service health, latency SLOs, and dependency maps without custom metrics code. If you are running EKS or ECS, check whether Application Signals covers your requirements before wiring a custom pipeline.
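To see why a specific trace is often missing, it helps to estimate how many requests the default sampling rule keeps. A rough sketch, assuming a reservoir of 1 request per second and a 5% fixed rate applied to the remainder; it gives an expected value, not what any single second will produce, and the class name is hypothetical.

```java
// Expected-value arithmetic for reservoir + fixed-rate sampling; not an X-Ray API.
final class SamplingEstimate {

    // The reservoir is filled first, then the fixed rate applies to what is left.
    static double expectedSampledPerSecond(int requestsPerSecond, int reservoir, double fixedRate) {
        int fromReservoir = Math.min(requestsPerSecond, reservoir);
        return fromReservoir + (requestsPerSecond - fromReservoir) * fixedRate;
    }

    public static void main(String[] args) {
        // At 101 requests per second, roughly 6 traces per second are kept.
        System.out.println(expectedSampledPerSecond(101, 1, 0.05));
    }
}
```

At low traffic nearly every request is traced; at 101 requests per second about 94% are not, so a missing trace for one slow request is the expected case rather than a fault.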
Structured Logging (JSON) + Correlation IDs
What Instead of plain-text logs like INFO Processing order 123, structured logging emits JSON: {"level":"INFO","orderId":"123","traceId":"abc-def","duration_ms":45}. CloudWatch Logs Insights can then query individual fields. A correlation ID (or trace ID) is a unique identifier generated per request and attached to every log line via MDC (Mapped Diagnostic Context), allowing you to filter all log lines for one specific request across all service instances.
Why Plain-text logs are unsearchable at scale. With JSON and a correlation ID, a single CloudWatch Logs Insights query returns every log line for one specific request across all instances. Without correlation IDs, tracing one request through multiple service instances means manually grep-ing unstructured text across dozens of log streams during an incident.
Spring Boot JSON logging
# application.properties (Spring Boot 3.4+)
# alternative format: logstash
logging.structured.format.console=ecs
# For older Boot versions, use logstash-logback-encoder:
# logback-spring.xml with <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
// In a servlet filter (Java), set the correlation ID per request:
MDC.put("traceId", UUID.randomUUID().toString());
// X-Ray or OTel auto-injects trace IDs into MDC when tracing is enabled
Gotcha
- logging.structured.format.console formats the console appender only. The Phase 2 systemd setup pipes console output to app.log, so it covers you there; if you later add a dedicated file appender, also set logging.structured.format.file.
- Put the traceId into MDC in a filter that runs before everything else (highest precedence), or the first log lines of each request will be missing the field. MDC values must be Strings.
CloudWatch
CloudWatch Metrics
What Time-series data points representing the health of your AWS resources. Metrics have a namespace (e.g., AWS/EC2), a name (e.g., CPUUtilization), and dimensions (e.g., InstanceId=i-xxx). CloudWatch ages data to coarser resolution automatically: sub-minute points are kept 3 hours, 1-minute points 15 days, 5-minute points 63 days, and 1-hour points 455 days. This is not configurable per metric.
Why Metrics are the foundation of all monitoring. Without them you are flying blind, you cannot know if an instance is struggling until users report problems.
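The retention schedule above can be written as a small lookup, which is handy when deciding how far back a dashboard or query can usefully go at a given resolution. A minimal sketch using only the figures quoted above; MetricRetention is a hypothetical name, not a CloudWatch API.

```java
import java.time.Duration;

// Lookup of the retention schedule quoted above; not a CloudWatch API.
final class MetricRetention {

    // How long datapoints of the given period remain available at that resolution.
    static Duration retentionFor(Duration period) {
        long seconds = period.getSeconds();
        if (seconds < 60) {
            return Duration.ofHours(3);   // sub-minute (high-resolution) points
        }
        if (seconds < 300) {
            return Duration.ofDays(15);   // 1-minute points
        }
        if (seconds < 3600) {
            return Duration.ofDays(63);   // 5-minute points
        }
        return Duration.ofDays(455);      // 1-hour points
    }

    public static void main(String[] args) {
        System.out.println(retentionFor(Duration.ofMinutes(1)).toDays()); // 15
    }
}
```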
What EC2 sends by default (free, 5-min intervals)
- CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps
- NOT included by default: memory usage, disk space used. These require the CloudWatch Agent.
- Detailed monitoring (1-minute intervals): costs extra. Enable per instance in the console.
CloudWatch Agent
What A software agent you install on EC2 that collects metrics and logs beyond what AWS sends by default. Collects: memory usage (mem_used_percent), disk space (disk_used_percent), and any log files you point it at (e.g., your Spring Boot log file).
Why Memory and disk space are the two most common causes of production outages. Without the agent, CloudWatch has no visibility into either.
Setup on Amazon Linux 2023
sudo dnf install amazon-cloudwatch-agent -y
# Use the wizard to generate config:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start the agent:
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent
The EC2 IAM Role must have the CloudWatchAgentServerPolicy managed policy attached.
Gotcha
- A single data point in the CWAgent namespace followed by silence means the agent started once and was then killed. On small instances with a Spring Boot app, the cause is almost always OOM: the kernel kills the largest process (the JVM), which cascades to kill the CW agent as well. Before restarting the agent, confirm there is enough free memory: free -h. If RAM is exhausted, fixing the JVM heap size (or resizing the instance) must come first, otherwise the agent will be killed again immediately.
- The EC2 system log is accessible from the EC2 Console without SSH or SSM: select the instance → Actions → Monitor and troubleshoot → Get system log. It shows kernel OOM kills (oom-kill: task=java), boot errors, and SSM authentication failures. Use it as the first diagnostic step when you cannot connect to an instance.
CloudWatch Logs
What Centralised log storage and search. Logs are organised into Log Groups (one per application or service) and Log Streams (one per instance or task). The CloudWatch Agent ships log files from EC2. Lambda, ECS, and EKS can push logs automatically.
Why Without centralised logging, you must SSH into each instance to read logs, impossible when you have multiple instances or when an instance has crashed.
Gotcha
- Log retention defaults to Never Expire. This accumulates storage costs indefinitely. Always set a retention policy (e.g., 30 or 90 days) on every log group.
- CloudWatch Logs Insights: SQL-like query language for searching and analysing logs. Example: filter @message like /ERROR/ | stats count(*) by bin(5m)
CloudWatch Alarms
What An alarm watches a single metric over a time window and changes state when the metric crosses a threshold. Three states: OK, ALARM, INSUFFICIENT_DATA. When an alarm enters ALARM state, it can trigger: an SNS notification (→ email, Slack, PagerDuty), an EC2 action (stop, reboot), or an Auto Scaling action.
Why Alarms turn metrics into actionable notifications. Without alarms you would have to watch dashboards constantly, impractical at scale.
Gotcha
- Set the evaluation period and datapoints carefully. CPU > 80% for 1 out of 1 datapoints at 1-minute intervals will fire on transient spikes. 2 out of 3 datapoints reduces false positives.
- An alarm in INSUFFICIENT_DATA state means CloudWatch is not receiving metric data, which itself can indicate a problem (agent stopped, instance down).
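The "2 out of 3 datapoints" rule is a sliding-window count. The sketch below is a simplified model of that evaluation, assuming a greater-than threshold and no missing datapoints; real CloudWatch alarms also have configurable missing-data handling, which this ignores. MOfNAlarm is a hypothetical class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of "M out of N datapoints" alarm evaluation; not a CloudWatch API.
final class MOfNAlarm {
    private final double threshold;
    private final int datapointsToAlarm; // M
    private final int evaluationPeriods; // N
    private final Deque<Double> window = new ArrayDeque<>();

    MOfNAlarm(double threshold, int datapointsToAlarm, int evaluationPeriods) {
        this.threshold = threshold;
        this.datapointsToAlarm = datapointsToAlarm;
        this.evaluationPeriods = evaluationPeriods;
    }

    // Adds one datapoint and returns the resulting alarm state.
    String record(double value) {
        window.addLast(value);
        if (window.size() > evaluationPeriods) {
            window.removeFirst(); // keep only the last N datapoints
        }
        if (window.size() < evaluationPeriods) {
            return "INSUFFICIENT_DATA";
        }
        long breaching = window.stream().filter(v -> v > threshold).count();
        return breaching >= datapointsToAlarm ? "ALARM" : "OK";
    }
}
```

Fed the CPU values 50, 95, 60, 90 with a threshold of 80, a 2-of-3 alarm stays out of ALARM through the single spike at 95 and fires only once 90 makes it two breaching datapoints in the window; a 1-of-1 alarm would have fired on the 95 alone.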
CloudWatch Dashboards
What Custom visualisations of metrics and alarm states on a single screen. Add widgets for line graphs, stacked area charts, numbers, and alarm status. Dashboards are shared across the team and are the first thing oncall checks during an incident.
Why At-a-glance system health. One dashboard should show the key health signals for your entire application stack.
Gotcha
- Dashboards cost $3/month per dashboard (first 3 are free). Not a budget concern at learning scale. Verify current pricing at aws.amazon.com/cloudwatch/pricing, the free tier and rates change.
Hands-on Tasks
Interview Q&A
What metrics does EC2 send to CloudWatch by default, and what requires the CloudWatch Agent?
By default (free, 5-minute intervals), EC2 sends: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, and DiskReadBytes/DiskWriteBytes.
Not included by default: memory usage and disk space utilisation. These require the CloudWatch Agent installed on the instance. Memory and disk are the two metrics most responsible for production incidents, so installing the agent is a baseline operational requirement, not optional.
How do you monitor a Spring Boot application on EC2?
Three layers: Infrastructure metrics via CloudWatch Agent (CPU, memory, disk, the instance health). Application logs via CloudWatch Agent shipping the Spring Boot log file to CloudWatch Logs, search with Logs Insights for errors and latency patterns. Application metrics via Spring Boot Actuator + Micrometer: expose custom metrics (request count, latency, DB pool size) and push them to CloudWatch using the Micrometer CloudWatch registry. Then build a dashboard combining all three and set alarms on the signals that indicate real user impact (high error rate, high latency) rather than infrastructure metrics alone.
How would you investigate a production incident using CloudWatch?
Start with the time the alarm fired. Open the CloudWatch Dashboard and look for correlated metric spikes at that time, CPU, memory, network, error count. Then go to CloudWatch Logs Insights and query the application log group for ERROR-level messages in that time window. Check if a deployment happened around the same time (correlate with your CI/CD pipeline). Look at RDS metrics (CPU, database connections, read/write latency) if the issue could be DB-related. The goal is to narrow from "something went wrong" to "this specific component hit this specific limit at this time." Document findings and consider adding more targeted alarms for the root cause so the same issue triggers an alert faster next time.
7
Containers and ECS
Topics
Docker
Images and Containers
What A Docker image is a read-only, layered template built from a Dockerfile. It packages your OS base, runtime, and application into a single artefact. A container is a running instance of an image, an isolated process with its own filesystem, network, and process space. Images are immutable; you don't patch them, you build new ones.
Why Containers eliminate "it works on my machine", the same image runs identically in dev, staging, and production. They start in seconds (unlike VMs), use less RAM, and pack many containers onto one host.
Gotcha
- Container filesystem is ephemeral. Anything written inside the container is lost when the container stops. Use volumes or external storage (S3, RDS) for persistence.
- Each layer in a Docker image is cached. Put infrequently-changing layers (base image, dependencies) early in the Dockerfile, and your app code last, this speeds up rebuilds significantly.
Dockerfile for Spring Boot
What A text file with instructions to build a Docker image. For Spring Boot, the recommended approach is a multi-stage build or a simple single-stage build using an official JRE base image.
Why The Dockerfile is the portable build specification for your application. Any environment with Docker produces an identical container image from the same Dockerfile. Using a JRE base image (not JDK) on Alpine keeps the image small (~175 MB), which reduces ECR storage cost and speeds up Fargate task start time.
Recommended Dockerfile
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
# Maven names the JAR target/<artifactId>-<version>.jar; the wildcard handles it
COPY target/*.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
eclipse-temurin:21-jre-alpine is the Eclipse Foundation's OpenJDK build on Alpine Linux, small (typically under 200 MB) and production-safe. The wildcard COPY works because Maven leaves exactly one JAR in target/. Gradle is different: build/libs/ contains both the boot JAR and a -plain.jar, so either disable the plain JAR (jar { enabled = false } in build.gradle) or COPY the explicit filename.
Build: docker build -t my-app .
Run: docker run -p 8080:8080 my-app
Gotcha
- Use a JRE image (not JDK) in production containers, JDK includes the compiler which you don't need at runtime and adds unnecessary image size.
- Alpine images use musl libc instead of glibc. This is fine for most Spring Boot apps but can cause issues with certain JNI-based libraries or specific DNS configurations. If you hit unexplained runtime issues (DNS failures, TLS oddities), switch to
eclipse-temurin:21-jre(Debian slim) as a baseline to rule out musl compatibility. - Spring Boot's layered JAR feature creates separate Docker layers for dependencies and app code, making rebuilds faster. The
spring-boot-maven-pluginrepackagegoal produces a layered JAR by default; you only add<layers>plugin config to customise the layer list. To benefit, use a multi-stage Dockerfile that extracts the layers (the JAR ships a jarmode for this) and copies dependencies before app code. Worth exploring after basics are solid.
AWS Container Services
ECR (Elastic Container Registry)
What AWS's private Docker image registry. Like Docker Hub but private, integrated with IAM, and in your AWS account. ECS and EKS pull images from ECR using IAM (the Task Execution Role on ECS, the node's IAM role on EKS), so there are no registry credentials to manage.
Why You need a place to store your Docker images that ECS can pull from. Public Docker Hub images should not be used in production (rate limits, supply chain risk).
Auth + push commands
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com

# Tag and push
docker tag my-app:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
ECS (Elastic Container Service)
What AWS's container orchestration service. You define what to run (Task Definition) and how many copies to keep running (Service). ECS handles placement, health checks, rolling deployments, and integration with load balancers. No Kubernetes knowledge required. Tasks run on one of two launch types: Fargate (serverless, AWS provisions and manages the underlying compute, you pay per task runtime) or EC2 launch type (you own and manage the cluster instances). For most Spring Boot services, use Fargate.
Why On raw EC2, you manually manage deployment scripts, restart logic, and load balancer registration. ECS handles all of that. It is significantly simpler than EKS for teams that do not need Kubernetes portability.
Gotcha
- There is no SSH into Fargate containers. Debugging uses CloudWatch Logs (stdout/stderr from the container) and ECS Exec (an optional feature that provides a shell into a running task).
- ECS tasks are not persistent. A stopped task is replaced by a new one. Any state must be in external storage (RDS, S3, ElastiCache).
Task Definitions
What The blueprint for your container in ECS. Defines: Docker image URI (ECR), CPU and memory allocation, port mappings, environment variables, log configuration, and the IAM Task Role (permissions the container has). Task Definitions are versioned, every change creates a new revision.
Why Versioning separates the definition from the running instance. Update a Task Definition with a new image or adjusted memory, and the ECS Service rolls it out gradually without downtime. Rollback is pointing the Service at a previous revision number.
CPU and memory units
- Fargate CPU units: 256 (0.25 vCPU), 512 (0.5 vCPU), 1024 (1 vCPU), up to 16384 (16 vCPU)
- Memory: 512 MB minimum, must be compatible with chosen CPU. E.g., 512 CPU → 1-4 GB memory; 1024 CPU → 2-8 GB memory.
- Spring Boot typically needs at least 512 MB; 1 GB is comfortable for a small service.
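The CPU/memory pairing rule above is easy to get wrong in a Task Definition, so here is a minimal sketch of it as a check. The class and method names are hypothetical, the table covers only sizes up to 4 vCPU, and the 256, 2048 and 4096 rows are added from the Fargate task-size documentation rather than from the list above; verify against the current docs before relying on it.

```java
/** Sketch: validates a Fargate CPU/memory pair for task sizes up to 4 vCPU. */
final class FargateSizing {

    /** Returns true when memoryMiB is an allowed value for the given CPU units. */
    static boolean isValid(int cpuUnits, int memoryMiB) {
        return switch (cpuUnits) {
            case 256 -> memoryMiB == 512 || memoryMiB == 1024 || memoryMiB == 2048;
            case 512 -> wholeGiBBetween(memoryMiB, 1, 4);
            case 1024 -> wholeGiBBetween(memoryMiB, 2, 8);
            case 2048 -> wholeGiBBetween(memoryMiB, 4, 16);
            case 4096 -> wholeGiBBetween(memoryMiB, 8, 30);
            default -> false; // 8192 and 16384 exist but are left out of this sketch
        };
    }

    /** Memory must be a whole number of GiB inside the inclusive range. */
    private static boolean wholeGiBBetween(int memoryMiB, int minGiB, int maxGiB) {
        int gib = memoryMiB / 1024;
        return memoryMiB % 1024 == 0 && gib >= minGiB && gib <= maxGiB;
    }

    public static void main(String[] args) {
        System.out.println("512 CPU + 1024 MiB valid? " + isValid(512, 1024));
        System.out.println("1024 CPU + 1024 MiB valid? " + isValid(1024, 1024));
    }
}
```

A pair that fails this check is rejected when you register the Task Definition, so catching it locally saves a round trip.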
Gotcha
- Two roles, constantly confused. The Task Execution Role is used by the ECS agent itself: pulling the ECR image and writing logs to CloudWatch. The Task Role is what your application code receives through the SDK credential chain, the container equivalent of the EC2 instance profile. s3:GetObject for your app belongs on the Task Role, never on the execution role. An app that gets AccessDenied despite "the role having the permission" usually has it on the wrong one of the two.
CI/CD Pipeline: GitHub Actions → ECR → ECS
What In production, engineers never manually run docker build and docker push. A CI/CD pipeline (GitHub Actions, Jenkins, or CodePipeline) triggers automatically on Git commits, builds and pushes the image to ECR, and updates the ECS Service to deploy the new image.
Why Manual builds from a developer laptop are not repeatable, not auditable, and don't scale past one person. Automating the build-push-deploy cycle through GitHub Actions means every merge to main produces a traceable, tested deployment with no manual steps.
GitHub Actions workflow (.github/workflows/deploy.yml)
on:
  push:
    branches: [main]
permissions:
  id-token: write   # required for OIDC; without it role assumption fails
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The single-stage Dockerfile copies target/*.jar, so build the JAR first (assumes Maven)
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'
      - name: Build JAR
        run: mvn -B package
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/github-deploy-role
          aws-region: us-east-1
      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push
        env:
          ECR_URI: ${{ steps.ecr-login.outputs.registry }}/my-app
        run: |
          IMAGE_TAG=${{ github.sha }}
          docker build -t $ECR_URI:$IMAGE_TAG -t $ECR_URI:latest .
          docker push $ECR_URI:$IMAGE_TAG
          docker push $ECR_URI:latest
      - name: Deploy to ECS
        run: |
          aws ecs update-service --cluster my-ecs-cluster \
            --service my-service --force-new-deployment
Three details make this work. The permissions block grants the OIDC token; omit it and role-to-assume fails with "Credentials could not be loaded". ECR_URI comes from the login step's registry output; it is not defined anywhere by default. And --force-new-deployment only redeploys whatever tag the Task Definition references, which is why the build also pushes :latest; the sha tag exists for traceability and rollback. Production pipelines register a new Task Definition revision pinned to the sha tag instead (see aws-actions/amazon-ecs-deploy-task-definition).
Use GitHub's OIDC provider with an IAM role (no long-lived access keys in GitHub secrets). Scope the role to the ECR push actions (ecr:BatchCheckLayerAvailability, ecr:InitiateLayerUpload, ecr:UploadLayerPart, ecr:CompleteLayerUpload, ecr:PutImage, ecr:BatchGetImage) on your repository ARN, plus ecs:UpdateService on the service. ecr:GetAuthorizationToken is the one exception: it does not support resource-level permissions, so it needs Resource "*". ecr:* on * violates the least-privilege rule this roadmap teaches everywhere else.
Gotcha
- Enable ECR image scanning on push: aws ecr put-image-scanning-configuration --repository-name my-app --image-scanning-configuration scanOnPush=true. It flags known CVEs in your base image layers. Block deployments on CRITICAL findings using the scan results in your pipeline.
ECS Fargate vs EC2 Launch Type
What Fargate: serverless compute for containers. AWS provisions and manages the underlying EC2 instances. You pay per vCPU-second and GB-second of memory while the task runs. No instances to manage, patch, or right-size. EC2 launch type: you manage a cluster of EC2 instances. ECS places containers on them. More control, potentially cheaper at high sustained utilisation, but more operational overhead.
Why Fargate removes EC2 management entirely. For most Spring Boot microservice deployments, Fargate is the right starting point.
Gotcha
- Fargate has a slightly slower cold start than EC2 launch type (~10-30 seconds to start a new task). Not usually a problem for long-running services, but relevant for burst scaling.
- Fargate does not support privileged containers or GPU workloads. For GPUs on ECS, use the EC2 launch type with GPU instances (e.g., g6); the same constraint pushes EKS users to GPU node groups.
Hands-on Tasks
Interview Q&A
Why use containers instead of deploying a JAR directly on EC2?
Containers package the application with its runtime and all dependencies into a single immutable artefact. This eliminates environment drift, the same container runs identically in dev, CI, staging, and production. They enable faster deployments (push a new image, ECS rolls it out), easier rollbacks (redeploy the previous image tag), and consistent behaviour regardless of what else is installed on the host. Containers also allow multiple services to run on shared infrastructure with isolation, making better use of resources compared to one-app-per-EC2.
What's the difference between ECS Fargate and the EC2 launch type?
Fargate: serverless. AWS provisions the underlying compute invisibly. You pay per vCPU-second and GB-second of memory while tasks run. No instances to manage, patch, or monitor. Simpler operationally, slightly more expensive at high sustained load, and slightly slower to scale (task startup takes ~10-30 seconds).
EC2 launch type: you manage a cluster of EC2 instances that ECS places containers on. More operational overhead (patching, right-sizing, capacity management), but more control and potentially cheaper at consistently high utilisation. Also required for GPU workloads or containers that need privileged mode.
Starting point for most teams: Fargate. Move to EC2 launch type when you have a concrete cost or capability reason.
How does ECS handle a rolling deployment?
When you update an ECS Service (e.g., new Task Definition revision), ECS starts new tasks with the updated version while keeping old tasks running. The deployment is controlled by two parameters: minimumHealthyPercent (e.g., 100, never go below 100% of desired count) and maximumPercent (e.g., 200, allow up to double the tasks temporarily). ECS waits for new tasks to pass health checks before stopping old ones. If new tasks fail health checks, ECS stops them and old tasks remain running; with the deployment circuit breaker enabled, the deployment is marked failed and can roll back automatically instead of retrying indefinitely. If you have a load balancer attached, traffic drains from old tasks before they are stopped.
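To make the two deployment parameters concrete, the sketch below turns the percentages into task counts. It assumes the rounding ECS documents (minimum healthy rounds up, maximum rounds down); the class and method names are illustrative only.

```java
/** Sketch: converts ECS deployment percentages into task-count bounds. */
final class EcsDeploymentBounds {

    /** Fewest running tasks ECS keeps during a deployment (rounded up). */
    static int minRunningTasks(int desiredCount, int minimumHealthyPercent) {
        return (desiredCount * minimumHealthyPercent + 99) / 100;
    }

    /** Most running tasks ECS allows during a deployment (rounded down). */
    static int maxRunningTasks(int desiredCount, int maximumPercent) {
        return desiredCount * maximumPercent / 100;
    }

    public static void main(String[] args) {
        int desired = 4;
        System.out.println("min running: " + minRunningTasks(desired, 100)); // prints "min running: 4"
        System.out.println("max running: " + maxRunningTasks(desired, 200)); // prints "max running: 8"
    }
}
```

With 100/200 the service never dips below its desired count, but you need headroom (and budget) for twice the tasks during the rollout. Lowering minimumHealthyPercent trades capacity during the deploy for a smaller footprint.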
8
Kubernetes with EKS
Topics
Kubernetes Core Objects
Pod
What The smallest deployable unit in Kubernetes. A pod wraps one or more containers that share the same network namespace (same IP, same localhost) and storage volumes. In practice, most pods contain exactly one container. Pods are ephemeral, they are created and destroyed constantly.
Why You never create pods directly in production. You define a Deployment and Kubernetes manages the pods for you, ensuring the desired number are always running.
Gotcha
- Pods have dynamic IPs. Never hardcode a pod's IP, use a Service to get a stable endpoint.
- When a container crashes, the kubelet restarts it in place (according to the restartPolicy); the pod keeps its name and IP. A new name and IP appear only when the pod itself is replaced, for example after an eviction, a node failure, or a rollout.
Deployment
What A Deployment declares the desired state: "run 3 replicas of this container image." Kubernetes continuously reconciles actual state to desired state. A Deployment manages a ReplicaSet, which manages the pods. Rolling updates and rollbacks are built in.
Why Without a Deployment, a bare pod that is evicted or loses its node is never replaced. The Deployment controller maintains your desired replica count automatically, handles rolling updates (new pods start before old ones stop), and supports one-command rollback to any previous revision.
Production-ready Deployment manifest (Spring Boot)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app-sa   # for IRSA (AWS permissions)
      containers:
        - name: my-app
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          # readinessProbe: gate traffic until app is ready
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          # livenessProbe: restart if app is deadlocked
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 40
            periodSeconds: 10
Why probes matter: Without readinessProbe, Kubernetes sends traffic to pods still initialising their Spring context (5-10 seconds), causing 5xx errors during rolling deploys. Without livenessProbe, a deadlocked app stays in service indefinitely. Enable both Actuator endpoints: management.endpoint.health.probes.enabled=true in application.properties.
Service
What A stable network endpoint that load-balances traffic to a set of pods matching a label selector. Three main types: ClusterIP (default, internal-only, for service-to-service traffic), NodePort (exposes on each node's IP at a static port, mostly for testing), LoadBalancer (provisions an AWS load balancer for internet-facing exposure; on EKS the in-tree default is a legacy Classic Load Balancer, and you get an NLB by installing the AWS Load Balancer Controller and annotating the Service with service.beta.kubernetes.io/aws-load-balancer-type: external).
Why Services decouple callers from pods. Even as pods are replaced during deployments, the Service endpoint stays stable.
Gotcha
- A LoadBalancer Service in EKS creates an AWS load balancer that bills ~$0.0225/hr (~$16/month) plus capacity units, per Service. For HTTP routing to multiple services, use an Ingress with the AWS Load Balancer Controller instead; it creates a single ALB shared across services.
ConfigMap and Secret
What ConfigMap: stores non-sensitive configuration (e.g., database URL, feature flags) as key-value pairs. Injected into pods as environment variables or mounted as files. Secret: same structure but for sensitive data (passwords, API keys). Base64-encoded (not encrypted) by default in etcd.
Why Separating config from the container image lets the same image run in dev and prod with different settings. Change a database URL or a feature flag by updating the ConfigMap, no image rebuild required.
Gotcha
- Kubernetes Secrets are not encrypted at rest by default in etcd. For production, integrate with AWS Secrets Manager using the External Secrets Operator, or use the AWS Secrets and Configuration Provider (ASCP) with the Secrets Store CSI Driver.
- Base64 is encoding, not encryption. Anyone with kubectl access can decode a Secret trivially:
kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
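The same point in Java: a minimal sketch showing that reading a Secret value needs nothing but a Base64 decoder, no key and no password. The class name and the sample value are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: decoding a Kubernetes Secret value requires no key at all. */
final class SecretDecoding {

    /** Decodes a value exactly as it appears under .data in a Secret manifest. */
    static String decode(String base64Value) {
        byte[] raw = Base64.getDecoder().decode(base64Value);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "aGVsbG8=" is simply the Base64 form of "hello"
        System.out.println(decode("aGVsbG8=")); // prints "hello"
    }
}
```

If the value were encrypted, this would be impossible without a key, which is why real secrets belong in Secrets Manager or behind envelope encryption.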
Ingress
What HTTP/HTTPS routing rules that direct external traffic to different Services based on host or path. Example: api.example.com/users → user-service, api.example.com/orders → order-service. Requires an Ingress Controller to implement the rules. In EKS, the AWS Load Balancer Controller creates an ALB from Ingress resources.
Why Instead of one LoadBalancer per service (one load balancer each = expensive), a single ALB can route to all services based on path, which is much more cost-efficient.
Gotcha
- An Ingress without ingressClassName: alb is ignored by the controller and provisions nothing, with no error. Without the alb.ingress.kubernetes.io/scheme: internet-facing annotation the ALB defaults to internal, so it is unreachable from the internet. Check the controller's logs (kubectl logs -n kube-system deploy/aws-load-balancer-controller) when no ALB appears.
- Default target type is instance (NodePort-based). Set alb.ingress.kubernetes.io/target-type: ip to route straight to pod IPs; it is also the only mode that works for Fargate pods.
Helm & Autoscaling
Helm: Kubernetes Package Manager
What Helm is the package manager for Kubernetes. A chart is a packaged collection of YAML manifests with templating ({{ .Values.image.tag }}). Instead of maintaining separate YAML files per environment, you define one chart and override values per environment. Community charts for common infrastructure (nginx-ingress, cert-manager, kube-prometheus-stack, aws-load-balancer-controller) are published across vendor repositories and indexed at Artifact Hub.
Why Most teams don't hand-maintain raw Kubernetes YAML in production; they template it with Helm. Installing the AWS Load Balancer Controller (required for Ingress) is a Helm chart install. Most enterprise deployments use Helm for parameterisation and rollback history.
Key commands
helm install my-app ./my-chart --values values-prod.yaml
helm upgrade my-app ./my-chart --set image.tag=v1.2.3
helm rollback my-app 1   # rollback to revision 1
helm list                # list installed releases
HPA (Horizontal Pod Autoscaler)
What HPA automatically adjusts the number of pods in a Deployment based on observed metrics (CPU utilisation, memory, or custom metrics from Prometheus). You define a minimum replica count, maximum, and a target metric value. The HPA controller checks metrics every 15 seconds and scales up or down accordingly.
Why Manual scaling means over-provisioning for normal traffic or under-provisioning during spikes. HPA keeps replica count matched to actual load, cutting cost during quiet periods and absorbing bursts automatically without human intervention.
Minimal HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
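How the controller turns that 60% target into a replica count is worth seeing once. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), clamped to the min/max, and skips scaling inside the default 10% tolerance. The class name is hypothetical, and stabilisation windows and scaling policies are deliberately left out.

```java
/** Sketch of the core HPA calculation; omits stabilisation windows and scaling policies. */
final class HpaMath {

    /** Default tolerance: ratios within 10% of 1.0 trigger no scaling. */
    static final double TOLERANCE = 0.10;

    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        double ratio = currentUtilization / targetUtilization;
        int desired = Math.abs(ratio - 1.0) <= TOLERANCE
                ? currentReplicas
                : (int) Math.ceil(currentReplicas * ratio);
        // Clamp to the bounds declared in the HPA manifest
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 2 pods averaging 90% CPU against a 60% target
        System.out.println(desiredReplicas(2, 90, 60, 2, 10)); // prints 3
    }
}
```

Utilisation here is measured against the pod's CPU request, which is why an HPA on a Deployment without resource requests has nothing to compute against.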
Node-level autoscaling (adding/removing EC2 worker nodes) is handled by Karpenter, the AWS-recommended node autoscaler that provisions nodes in response to unschedulable pods.
Gotcha
- HPA requires the Kubernetes Metrics Server to be running for CPU-based scaling. On EKS, install it first: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml. Without Metrics Server, kubectl get hpa shows <unknown>/60% in the TARGETS column and HPA never triggers a scale event.
EKS (Elastic Kubernetes Service)
EKS Cluster and Node Groups
What EKS is a managed Kubernetes control plane. AWS runs and maintains etcd, the API server, controller manager, and scheduler. You manage (or let AWS manage) the worker nodes via Managed Node Groups, EC2 instances that EKS provisions, registers, and drains during updates automatically.
Why Self-managing the Kubernetes control plane (etcd HA, API server upgrades, scheduler availability) is significant infrastructure work before a single application runs. EKS delegates all control plane operations to AWS; your team focuses on workloads, not cluster internals.
Cost warning
- EKS control plane: $0.10/hour (~$72/month) per cluster while it runs a Kubernetes version in standard support. A cluster left on a version past standard support bills $0.60/hour extended support, six times the price; keep the version current. Plus the EC2 cost of worker nodes. Delete the cluster when not actively learning, this is the biggest cost trap in this roadmap.
- Alternatively, use EKS with Fargate profiles to avoid managing EC2 worker nodes entirely. Pay per pod CPU/memory instead.
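To see why deleting the cluster matters, here is a small sketch of the control-plane arithmetic using only the hourly rates quoted above. It ignores worker nodes, load balancers, and data transfer; the class name is illustrative.

```java
/** Sketch: EKS control-plane cost from the hourly rates quoted in the cost warning. */
final class EksControlPlaneCost {

    static final int STANDARD_CENTS_PER_HOUR = 10;  // $0.10/hour, standard support
    static final int EXTENDED_CENTS_PER_HOUR = 60;  // $0.60/hour, extended support

    /** Cost in dollars for keeping one cluster up for the given number of hours. */
    static double dollars(int centsPerHour, int hours) {
        return centsPerHour * hours / 100.0;
    }

    public static void main(String[] args) {
        System.out.println("6-hour study session: $" + dollars(STANDARD_CENTS_PER_HOUR, 6));
        System.out.println("left running 30 days: $" + dollars(STANDARD_CENTS_PER_HOUR, 720));
        System.out.println("30 days on extended support: $" + dollars(EXTENDED_CENTS_PER_HOUR, 720));
    }
}
```

A focused session costs well under a dollar of control-plane time; a forgotten cluster eats most of the signup credits in a month.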
IRSA (IAM Roles for Service Accounts)
What By default, every pod on a node can reach the node's EC2 Instance Profile credentials through the instance metadata endpoint, which means all pods share whatever permissions the node role has: a least-privilege failure. IRSA fixes this: you annotate a Kubernetes Service Account with an IAM Role ARN, and only pods that reference that Service Account can obtain temporary credentials for that specific role. EKS injects a projected service-account token file plus the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables; the AWS SDK exchanges the token for short-lived credentials through STS and refreshes them automatically. No code changes are needed: the SDK picks this up via the standard credential provider chain.
Why Per-pod IAM identity is the production pattern for giving AWS permissions to pods (EKS Pod Identity, covered in the gotchas, is its successor for new clusters). Without it, pods either ride on the shared node role or teams embed Access Keys in Kubernetes Secrets, both serious security problems. Production clusters also block pod access to the node's metadata endpoint (IMDS hop limit 1) so the node role cannot be reached at all. This is a day-1 requirement on any real EKS deployment and a common interview question.
Setup with eksctl
# Step 1: Associate OIDC provider with the cluster (once per cluster)
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster --region us-east-1 --approve

# Step 2: Create an IAM service account in your cluster namespace
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace default \
  --name my-app-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve

# Step 3: Reference it in your Deployment manifest
# spec.template.spec.serviceAccountName: my-app-sa
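Once the service account is wired up, a quick way to confirm IRSA reached your pod is to look for the two environment variables EKS injects, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE. The sketch below is a plain-Java diagnostic with a hypothetical class name; it does not call the AWS SDK, it only reports what the SDK's web-identity provider would find.

```java
import java.util.Map;
import java.util.Optional;

/** Sketch: reports whether the IRSA environment variables are present. */
final class IrsaCheck {

    static final String ROLE_ARN = "AWS_ROLE_ARN";
    static final String TOKEN_FILE = "AWS_WEB_IDENTITY_TOKEN_FILE";

    /** Returns the role ARN when both IRSA variables are set, otherwise empty. */
    static Optional<String> irsaRole(Map<String, String> env) {
        String role = env.get(ROLE_ARN);
        String tokenFile = env.get(TOKEN_FILE);
        boolean configured = role != null && !role.isBlank()
                && tokenFile != null && !tokenFile.isBlank();
        return configured ? Optional.of(role) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(irsaRole(System.getenv())
                .map(role -> "IRSA role: " + role)
                .orElse("No IRSA variables found; the SDK falls through to the next provider"));
    }
}
```

If this prints no role inside a pod, the usual causes are a missing serviceAccountName in the Deployment or a Service Account without the role annotation.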
Gotcha
- The OIDC provider must be associated with the cluster before creating IRSA service accounts. Without this, the IAM role trust policy cannot validate the pod's identity token.
- The AWS SDK inside the pod automatically uses IRSA credentials. On EC2, it uses the Instance Profile. Both flow through the same credential provider chain, your Spring Boot code works identically in both environments without any changes.
- EKS Pod Identity is AWS's recommended approach for new clusters: no OIDC provider association, simpler setup. It requires the EKS Pod Identity Agent add-on running on the cluster, then: eksctl create podidentityassociation --cluster my-cluster --namespace default --service-account-name my-app-sa --role-arn arn:aws:iam::ACCOUNT:role/my-role. Use IRSA for existing clusters or tools whose docs still lead with it; this phase uses IRSA because the Load Balancer Controller setup does.
- When a team member gets "Unauthorized" on kubectl: grant access via EKS Access Entries (AWS Console → EKS cluster → Access → Create access entry), the API-driven approach and the default on current clusters. Clusters still using the legacy aws-auth ConfigMap: kubectl edit configmap aws-auth -n kube-system and add their IAM principal ARN.
eksctl and kubectl
What eksctl: an open-source CLI, officially endorsed by AWS, for creating and managing EKS clusters. Maintained at github.com/eksctl-io/eksctl. Abstracts the complex CloudFormation stacks that EKS requires. kubectl: the standard Kubernetes CLI for all cluster operations, deploying, scaling, inspecting, and debugging.
Why eksctl removes the need to hand-wire the CloudFormation stacks EKS requires for VPC, node groups, and IAM roles. kubectl is the universal Kubernetes interface; the same commands work on EKS, GKE, or any other distribution.
Key commands
# Create cluster (takes ~15-20 minutes)
eksctl create cluster --name my-cluster --region us-east-1 \
  --nodegroup-name workers --node-type t3.small --nodes 2

kubectl get nodes                        # verify workers are Ready
kubectl get pods -A                      # all pods in all namespaces
kubectl apply -f app.yaml                # deploy from manifest
kubectl get svc                          # list services and their external IPs
kubectl logs pod-name -f                 # tail pod logs
kubectl rollout undo deployment/my-app   # rollback
Hands-on Tasks
Interview Q&A
When would you choose ECS over EKS, and vice versa?
Choose ECS when: your team is AWS-focused and doesn't need multi-cloud portability, you want simpler operations with less overhead, you're running a small number of services, or Kubernetes' learning curve isn't justified by your team's size and complexity.
Choose EKS when: you need Kubernetes-native tooling (Helm charts, Argo CD, Istio, Karpenter), your organisation is standardising on Kubernetes across clouds, you need fine-grained scheduling or custom operators, or you're migrating from an on-premises Kubernetes cluster.
EKS adds operational complexity (control plane cost, version upgrades, node group management) and a steeper learning curve. ECS is simpler and tightly integrated with AWS services but less portable. For a new Spring Boot project with a small team: ECS Fargate is often the pragmatic choice.
What happens when a Kubernetes pod crashes?
Kubernetes restarts the container within the pod according to the pod's restartPolicy (default: Always for Deployments). After repeated failures, Kubernetes enters CrashLoopBackOff: it restarts the container with exponential backoff (10s, 20s, 40s... up to 5 minutes between restarts) to avoid thrashing.
The Deployment controller ensures the desired number of pods always exists. A crash-looping pod stays on its node; a replacement is created (possibly on another node) only when the pod is deleted or evicted, or its node fails. Liveness probes detect unhealthy containers (e.g., deadlocked app) and restart them. Readiness probes prevent traffic from reaching pods that are not yet ready to serve requests.
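The backoff sequence in that answer is simple doubling with a cap. A minimal sketch, assuming the 10-second start and 5-minute ceiling stated above (the class name is illustrative, and the kubelet's reset of the timer after a container runs cleanly for a while is not modelled):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: the restart delays a crash-looping container sees. */
final class CrashLoopBackoff {

    static final int INITIAL_SECONDS = 10;
    static final int CAP_SECONDS = 300; // 5 minutes

    /** Returns the delay before each of the first N restarts. */
    static List<Integer> delays(int restarts) {
        List<Integer> result = new ArrayList<>();
        int delay = INITIAL_SECONDS;
        for (int i = 0; i < restarts; i++) {
            result.add(delay);
            delay = Math.min(delay * 2, CAP_SECONDS); // double, but never past the cap
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(delays(7)); // prints [10, 20, 40, 80, 160, 300, 300]
    }
}
```

In practice this means a pod that has been crash-looping for a while only retries every five minutes, so a fixed config can take that long to show up as a healthy pod unless you delete the pod to reset it.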
How does a Kubernetes rolling update work?
When you update a Deployment (e.g., change the image tag), Kubernetes creates a new ReplicaSet with the new version. It then scales up the new ReplicaSet and scales down the old one gradually, controlled by maxSurge (how many extra pods can exist during rollout, default 25%) and maxUnavailable (how many pods can be unavailable during rollout, default 25%).
New pods must pass readiness probes before old pods are terminated, ensuring no traffic goes to pods that aren't ready. If the new pods fail readiness probes, the rollout stalls and old pods remain running. Rollback is fast: kubectl rollout undo deployment/my-app switches back to the previous ReplicaSet without rebuilding anything.
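The 25% defaults become whole pods through rounding, and the direction matters: Kubernetes rounds maxSurge up and maxUnavailable down. A minimal sketch of that arithmetic, with an illustrative class name:

```java
/** Sketch: converts rolling-update percentages into pod counts. */
final class RollingUpdateMath {

    /** Extra pods allowed above the replica count (percentage rounds up). */
    static int maxSurgePods(int replicas, int maxSurgePercent) {
        return (replicas * maxSurgePercent + 99) / 100;
    }

    /** Pods allowed to be unavailable during the rollout (percentage rounds down). */
    static int maxUnavailablePods(int replicas, int maxUnavailablePercent) {
        return replicas * maxUnavailablePercent / 100;
    }

    public static void main(String[] args) {
        int replicas = 10;
        int surge = maxSurgePods(replicas, 25);
        int unavailable = maxUnavailablePods(replicas, 25);
        // prints "at most 13 pods, at least 8 available"
        System.out.println("at most " + (replicas + surge) + " pods, at least "
                + (replicas - unavailable) + " available");
    }
}
```

With 2 replicas the defaults work out to a surge of 1 and 0 unavailable, so even a tiny Deployment rolls without dropping below its replica count.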
9
Serverless
Topics
AWS Lambda
Lambda Function Lifecycle
What Lambda runs your code in response to events (HTTP requests, S3 uploads, DynamoDB streams, SQS messages, etc.) without you provisioning or managing servers. You upload code (ZIP or container image), configure a trigger, set memory (128 MB-10 GB), and pay per invocation + duration (GB-seconds). Pricing: $0.20/million requests + $0.0000166667/GB-second. Always-free tier: 1 million requests and 400,000 GB-seconds per month (verify at aws.amazon.com/free). Maximum execution time: 15 minutes.
Why No servers to patch or scale. Lambda auto-scales from 0 to thousands of concurrent executions. Cost is zero when idle, you pay only for actual compute used, down to the millisecond.
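The pricing above is easier to reason about with numbers plugged in. The sketch below estimates a monthly bill from request count, average duration, and memory, using only the rates and free-tier figures quoted in this section. Treat those as assumptions to re-check against the pricing page (they vary by region and architecture); the class name is hypothetical.

```java
/** Sketch: rough monthly Lambda cost from the rates quoted in this section. */
final class LambdaCostEstimate {

    static final double PRICE_PER_MILLION_REQUESTS = 0.20;
    static final double PRICE_PER_GB_SECOND = 0.0000166667;
    static final long FREE_REQUESTS = 1_000_000L;
    static final double FREE_GB_SECONDS = 400_000.0;

    /** Compute consumed: invocations x duration in seconds x memory in GB. */
    static double gbSeconds(long requests, double avgDurationMs, int memoryMb) {
        return requests * (avgDurationMs / 1000.0) * (memoryMb / 1024.0);
    }

    /** Monthly cost in dollars after subtracting the free tier. */
    static double monthlyCost(long requests, double avgDurationMs, int memoryMb) {
        long billableRequests = Math.max(0L, requests - FREE_REQUESTS);
        double billableGbSeconds =
                Math.max(0.0, gbSeconds(requests, avgDurationMs, memoryMb) - FREE_GB_SECONDS);
        return billableRequests / 1_000_000.0 * PRICE_PER_MILLION_REQUESTS
                + billableGbSeconds * PRICE_PER_GB_SECOND;
    }

    public static void main(String[] args) {
        // 5 million requests a month, 200 ms average, 512 MB
        System.out.printf("estimated monthly cost: $%.2f%n", monthlyCost(5_000_000L, 200, 512));
    }
}
```

Doubling the memory setting doubles the GB-seconds for the same duration, but more memory also means more CPU, so the duration often drops. Measure before assuming the bigger setting costs more.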
Gotcha
- Default concurrent execution limit: 1000 per account per region (soft limit, can be increased). One Lambda function hitting the limit can throttle all other functions in the account. Use reserved concurrency to isolate critical functions.
- Scaling rate: each function scales by up to 1,000 concurrent executions every 10 seconds, independently per function, until the account concurrency limit is reached. (Older tutorials describe a 3,000-burst-plus-500-per-minute regional model; that model is gone.) A function can still throttle during an extreme spike, so watch the Throttles metric.
- Lambda is not suitable for long-running processes (batch jobs, video encoding) or anything requiring persistent local state.
- Default timeout is 3 seconds. A Java cold start alone can exceed that, so the first invocation of a freshly deployed function can fail before your code runs. Set the timeout to 30 seconds for API-backed functions (maximum 15 minutes) before debugging anything else.
Writing a Java Lambda Handler (plain Java, no Spring)
What The handler is the entry point Lambda invokes. Implement RequestHandler<Input, Output> from the aws-lambda-java-core library. Input/output types are automatically serialised from/to JSON by the Lambda runtime. Keep the handler class small, and initialise expensive objects (SDK clients, DB connections) as instance variables so they survive across warm invocations on the same container.
Why Lambda requires a specific entry point interface. Understanding the handler contract, including input/output serialisation and instance variable reuse across warm invocations, is required before writing any Lambda function regardless of the trigger type.
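Minimal working handler (Maven dependencies: aws-lambda-java-core + aws-lambda-java-events, plus software.amazon.awssdk:dynamodb for the client)
package com.example;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2HTTPEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2HTTPResponse;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

public class UserHandler
        implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse> {

    // Initialised ONCE per container, reused on warm starts
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    @Override
    public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context ctx) {
        // Assumes a route with an {id} path parameter; getPathParameters() is null without one
        String userId = event.getPathParameters().get("id");
        // query dynamo, build response...
        return APIGatewayV2HTTPResponse.builder()
                .withStatusCode(200)
                .withBody("{\"userId\":\"" + userId + "\"}")
                .build();
    }
}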
Handler config in Lambda console: com.example.UserHandler::handleRequest. The event types match the HTTP API (payload format 2.0) used in the tasks below; a REST API sends payload format 1.0, which uses APIGatewayProxyRequestEvent/APIGatewayProxyResponseEvent instead (see the Throttling and Security gotcha).
Gotcha
- Do not use Spring Boot as a Lambda runtime. Spring's application context initialises in 3-8 seconds, which means a multi-second cold start on every scale-out event. Use plain Java for Lambda. If your team requires Spring, use Spring Cloud Function + SnapStart, which snapshots the initialised context to cut cold starts to ~1 second.
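- Static initialisers (static { ... }) run during the init phase and count towards cold start time, as does the construction of handler instance fields. Keep init-phase work to what every invocation needs (it then runs once and is reused on warm starts), and lazy-init rarely used heavyweight objects on first use.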
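The "initialised once, reused on warm starts" behaviour can be demonstrated without deploying anything. The sketch below is a plain-Java simulation, not Lambda code: ExpensiveClient is a hypothetical stand-in for an SDK client, and creating the Handler once then calling it repeatedly mimics one execution environment serving several warm invocations.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: simulates one execution environment serving several warm invocations. */
final class WarmReuseDemo {

    /** Hypothetical stand-in for an SDK client that is slow to construct. */
    static final class ExpensiveClient {
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();

        ExpensiveClient() {
            CONSTRUCTED.incrementAndGet();
        }

        String lookup(String id) {
            return "user-" + id;
        }
    }

    /** Handler-shaped class: the field is set once, when the instance is created. */
    static final class Handler {
        private final ExpensiveClient client = new ExpensiveClient();

        String handle(String id) {
            return client.lookup(id);
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler(); // cold start: happens once per environment
        for (int i = 1; i <= 3; i++) {
            System.out.println(handler.handle(String.valueOf(i))); // warm invocations
        }
        System.out.println("clients constructed: " + ExpensiveClient.CONSTRUCTED.get());
    }
}
```

Had the client been created inside handle(), the counter would grow with every call, which on Lambda means paying connection setup on every request.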
Cold Starts and SnapStart
What A cold start happens when Lambda must provision a new container to handle an invocation, no warm container is available. The sequence: provision container → download code → initialise runtime → run static initialiser code → run handler. For Java, this takes 2-10 seconds. A warm start reuses an existing container and only runs the handler (~10-100ms).
Why Cold starts add latency for the first request and after periods of inactivity. Critical for customer-facing APIs.
Java-specific mitigations
- Lambda SnapStart (available for Java 11, Java 17, Java 21, Python 3.12+, and .NET 8+): AWS takes a snapshot of the initialised execution environment after the init phase, then restores from it on cold starts. Reduces Java cold starts from seconds to ~1 second. It applies only to published versions and aliases, never to $LATEST. Included at no extra cost for Java; Python and .NET bill for snapshot cache storage and per restore. Most impactful for functions initialising Spring Cloud Function or other heavyweight frameworks.
- Provisioned Concurrency: keeps N containers pre-warmed and ready. Eliminates cold starts completely but you pay for the reserved capacity even when idle. Use for latency-sensitive APIs.
- Avoid heavy Spring Boot in Lambda: Spring's context initialisation is slow. Use plain Java, Quarkus (with native compilation), or Micronaut instead. If you need Spring, use Spring Cloud Function with GraalVM native image.
Lambda Powertools for Java
What AWS Lambda Powertools for Java is an open-source library from AWS that provides production utilities via annotations: @Logging (structured JSON logs with correlation IDs injected automatically), @Tracing (X-Ray subsegments per method), @Metrics (CloudWatch EMF metrics without custom SDK calls), and @Idempotent (deduplication via a DynamoDB idempotency store). Add it to Maven: software.amazon.lambda:powertools-logging, powertools-tracing, powertools-metrics, powertools-idempotency.
Why Writing production-grade Lambda handlers without Powertools means manually injecting request IDs into logs, manually creating X-Ray subsegments, and writing your own idempotency table logic. Powertools eliminates all three in under 10 lines. It's the first thing you add to any Java Lambda that handles more than a trivial use case.
Gotcha
- @Idempotent requires a DynamoDB table to store idempotency keys. The annotation handles read-before-write and conditional write automatically, but you must provision the table and tell Powertools its name by configuring the DynamoDBPersistenceStore, typically reading the name from an environment variable such as IDEMPOTENCY_TABLE.
- @Metrics uses Embedded Metric Format (EMF): metrics are written to CloudWatch Logs as structured JSON and CloudWatch extracts them asynchronously. This avoids the PutMetricData API call cost, but metrics can lag behind log delivery by a few seconds.
Lambda Function URLs
What A Lambda Function URL is a dedicated HTTPS endpoint built directly into a Lambda function, with no API Gateway required. Format: https://<url-id>.lambda-url.<region>.on.aws. Supports two auth modes: AWS_IAM (callers sign requests with SigV4) or NONE (public). Configured per function, or per alias/version. Payload format is always 2.0 (same as HTTP API).
Why For functions that only need an HTTPS endpoint and nothing else (no custom domain, no caching, no usage plans, no WAF), Function URLs skip the API Gateway entirely and cost nothing beyond Lambda invocation pricing. Typical use cases: webhooks, single-endpoint microservices, internal service-to-service calls with IAM auth.
Gotcha
- Function URLs do not support custom domains natively. If you need a vanity domain (api.yourdomain.com), put CloudFront in front of the Function URL or use API Gateway.
- The auth mode NONE makes the URL publicly accessible with no authentication. Only use this for genuinely public endpoints or behind a trusted network boundary.
Memory and Performance
What Lambda CPU allocation scales proportionally with memory. A function at 1024 MB gets approximately 2x the CPU of a function at 512 MB. This means allocating more memory can make a function faster and potentially cheaper overall (faster execution = fewer GB-seconds billed).
Why Java's JVM is memory-heavy. At 256 MB, the JVM itself consumes most of the budget leaving almost nothing for the application. Tuning memory is the primary lever for Lambda performance and cost; at millions of invocations, even 100 MB of over-allocation adds measurable monthly spend.
Gotcha
- The right memory setting is not always the minimum. Use AWS Lambda Power Tuning (an open source Step Functions state machine) to find the optimal memory-to-cost ratio for your function.
- Ephemeral storage (/tmp): 512 MB by default, configurable up to 10 GB. Files in /tmp persist between warm invocations on the same container, which is useful for caching, but don't assume they are always there.
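The claim that more memory can be cheaper overall is easy to check with arithmetic. The sketch below assumes the x86 us-east-1 price of $0.0000166667 per GB-second and ignores the per-request charge; the durations in main are made-up illustrative numbers, not measurements.

```java
/** Back-of-envelope Lambda compute cost; the price constant and sample durations are assumptions. */
final class LambdaCostSketch {

    /** Assumed x86 price per GB-second in us-east-1; verify on the Lambda pricing page. */
    static final double PRICE_PER_GB_SECOND = 0.0000166667;

    /** Compute charge only: memory (GB) x duration (s) x invocations x price. */
    static double computeCost(int memoryMb, double avgDurationMs, long invocations) {
        double gbSeconds = (memoryMb / 1024.0) * (avgDurationMs / 1000.0) * invocations;
        return gbSeconds * PRICE_PER_GB_SECOND;
    }

    public static void main(String[] args) {
        long invocations = 10_000_000L;
        // Hypothetical measurements: doubling memory more than halves the duration.
        double small = computeCost(512, 800, invocations);
        double large = computeCost(1024, 350, invocations);
        System.out.printf("512 MB @ 800 ms:  $%.2f%n", small);
        System.out.printf("1024 MB @ 350 ms: $%.2f%n", large);
    }
}
```

In this hypothetical case the larger setting wins because the duration fell by more than half. If the duration had only dropped to 500 ms, 1024 MB would cost more, which is why measuring with a tool like Lambda Power Tuning beats guessing.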
Event Sources for Lambda
SQS, SNS, Kinesis, and DynamoDB Streams as Lambda Triggers
What Lambda's primary use case in enterprise systems is event-driven processing. HTTP is one trigger type among many. Four critical triggers:
- SQS: Lambda polls the queue and invokes your handler with a batch of messages (default batch size 10; standard queues go up to 10,000 if you also set a batching window, FIFO queues cap at 10). Built-in retry (message returns to queue on failure), DLQ support.
- SNS: Fan-out pattern, SNS topic → multiple SQS queues → each with its own Lambda. Decouples producer from many consumers.
- Kinesis Data Streams: Lambda processes records in order within a shard. Use for real-time streaming (log processing, IoT data, event sourcing). Each shard gets one concurrent Lambda invocation by default; the parallelisation factor setting can raise this to 10 per shard.
- DynamoDB Streams: Every write to a DynamoDB table appears as an event. Lambda can react to inserts/updates/deletes for change data capture (CDC), cross-table sync, or event sourcing.
Why In production, most Lambda invocations come from SQS, SNS, Kinesis, and DynamoDB Streams, not API Gateway. Decoupled event-driven processing via SQS + Lambda is more resilient than synchronous chains: failures are retried automatically, messages don't block the producer, and processing can scale independently.
Gotcha
- Idempotency is mandatory for SQS and Kinesis consumers. Lambda may invoke your handler more than once for the same message (at-least-once delivery). Your handler must produce the same result on re-processing: use a DynamoDB conditional write or idempotency key table to detect duplicates.
- SQS batch failures: If one message in a batch fails, the entire batch is retried by default (returning all messages to the queue, including the ones that succeeded). Use ReportBatchItemFailures in your Lambda response to return only the failed message IDs for retry.
- A DLQ (Dead Letter Queue) captures messages that fail after all retries. Always configure one. Without it, poison messages are retried until the queue's retention period expires, wasting invocations and, on FIFO queues, blocking the messages behind them in the same message group.
- Lambda Destinations vs DLQ: these are different mechanisms. A DLQ is configured on the SQS queue (or the Lambda function for async invocations) and captures raw failed messages. Lambda Destinations (configured on the Lambda function) route the full invocation record, including the request payload, response, and error details, to SQS, SNS, EventBridge, or another Lambda. Use Destinations when you need the original event payload plus failure context for debugging; use DLQ when the SQS queue owns retry and archiving.
- Event Source Mapping Filters: for SQS, Kinesis, and DynamoDB Streams triggers, you can add a filter pattern so Lambda is only invoked when messages match specific criteria (e.g., only process events where eventType = "ORDER_PLACED"). Filters evaluate before your handler runs. Unmatched messages are discarded (SQS), or skipped (Kinesis/DynamoDB Streams). Configure via "Additional settings" in the trigger or in the FilterCriteria field of the EventSourceMapping API.
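The idempotency and partial-batch-failure gotchas above combine naturally in one handler. The following is a plain-Java sketch, not SDK code: Message is a stand-in for the real SQS record type in aws-lambda-java-events, the in-memory set stands in for a DynamoDB idempotency table, and the "bad" prefix rule is invented to trigger a failure. The returned map mirrors the batchItemFailures / itemIdentifier response shape that Lambda expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an idempotent batch consumer that reports only the failed messages. */
final class BatchHandlerSketch {

    /** Minimal stand-in for one SQS record. */
    record Message(String messageId, String body) {}

    /** In-memory stand-in for an idempotency table keyed by message ID. */
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    /** Records which bodies were actually acted on, so duplicates are visible. */
    final List<String> sideEffects = new ArrayList<>();

    /** Hypothetical business logic: bodies starting with "bad" fail. */
    private void process(Message message) {
        if (message.body().startsWith("bad")) {
            throw new IllegalStateException("cannot process " + message.messageId());
        }
        sideEffects.add(message.body());
    }

    /** Returns the partial batch response: only failed message IDs go back for retry. */
    Map<String, List<Map<String, String>>> handle(List<Message> batch) {
        List<Map<String, String>> failures = new ArrayList<>();
        for (Message message : batch) {
            // add() returns false when the ID was already claimed: a duplicate delivery, so skip it.
            if (!processedIds.add(message.messageId())) {
                continue;
            }
            try {
                process(message);
            } catch (RuntimeException e) {
                // Release the claim so the redelivered message is processed, not skipped.
                processedIds.remove(message.messageId());
                failures.add(Map.of("itemIdentifier", message.messageId()));
            }
        }
        return Map.of("batchItemFailures", failures);
    }
}
```

In a real function the claim would be a DynamoDB conditional write rather than an in-memory set, since execution environments do not share memory. Lambda only honours the partial response when ReportBatchItemFailures is enabled on the event source mapping.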
API Gateway
HTTP API vs REST API
What API Gateway offers two main API types for Lambda integrations: HTTP API (simpler, ~70% cheaper, lower latency) and REST API (more features, more expensive). Both create managed HTTP endpoints backed by Lambda.
Why Most Spring Boot API use cases fit HTTP API at roughly 70% lower cost. REST API is only needed for specific features: response caching, WAF integration, or usage plans. Knowing the difference prevents defaulting to REST API because it superficially resembles a familiar controller layer.
Feature comparison
- HTTP API: JWT authorisers, Lambda proxy integration, CORS, per-route throttling, lower cost (~$1/million requests). Missing: caching, API keys/usage plans, AWS WAF integration.
- REST API: all HTTP API features plus response caching, usage plans, API keys, resource policies, AWS WAF, mock integrations, request/response transformations. Costs ~$3.50/million requests.
- Default recommendation: use HTTP API unless you need a specific REST API feature.
Throttling and Security
What API Gateway throttles requests to protect your backend. Account-level default: 10,000 requests/second steady-state rate, 5,000 burst (token bucket capacity). Returns HTTP 429 (Too Many Requests) when exceeded. For REST API, you can set per-route throttle limits via usage plans.
Why An unprotected public API endpoint is a cost risk as much as a security risk. A single traffic spike or scraper can trigger thousands of Lambda invocations per second. Configure an authoriser and throttle limits before sharing any API Gateway URL outside your team.
Gotcha
- The API Gateway endpoint is public by default. Anyone can call it. Add an authoriser (JWT/Cognito for HTTP API, Lambda authoriser for custom logic) or at minimum use an API key to prevent open access in production.
- Payload format version mismatch: HTTP API uses payload format version 2.0 by default; REST API uses 1.0. The Java event types are different. For HTTP API (format 2.0), use APIGatewayV2HTTPEvent (from aws-lambda-java-events): path is at event.getRawPath(). For REST API (format 1.0), use APIGatewayProxyRequestEvent: path is at event.getPath(). Using the wrong event class silently produces null path parameters and empty bodies.
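The throttling numbers in this section (steady-state rate plus burst) describe a token bucket. A minimal sketch of that algorithm follows, with time passed in explicitly so the behaviour is deterministic. It illustrates the general technique only; API Gateway's actual implementation is not public, and the class and method names are made up.

```java
/** Minimal token bucket: the rate refills tokens over time, the capacity bounds the burst. */
final class TokenBucket {
    private final double ratePerSecond;
    private final double capacity;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSecond, double capacity, long nowNanos) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity; // start full, so an initial burst up to capacity is allowed
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if the request is admitted; false corresponds to an HTTP 429. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        // Refill for the elapsed time, but never beyond the bucket capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With the account defaults, the bucket would hold 5,000 tokens and refill at 10,000 per second: a sudden spike of up to 5,000 requests is admitted at once, after which throughput is capped at the steady-state rate.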
DynamoDB
Partition Keys and Sort Keys
What DynamoDB is a fully managed NoSQL key-value and document database. Every item has a primary key. Simple primary key: partition key only (must be unique per item). Composite primary key: partition key + sort key (the combination must be unique; the partition key alone can repeat). DynamoDB uses the partition key to determine which physical partition stores the item (sharding). The sort key enables range queries within a partition.
Why DynamoDB provides single-digit millisecond performance at any scale, with no schema to define beyond the key structure. Perfect for high-throughput, simple access patterns.
Gotcha
- Choose a high-cardinality partition key (e.g., user ID, order ID, values that are unique and evenly distributed). A low-cardinality partition key (e.g., status = "active"/"inactive") creates hot partitions that throttle performance.
- DynamoDB has no joins. Design your access patterns upfront and think in queries, not entities. Single-table design (putting multiple entity types in one table) is the advanced but optimal pattern.
- Scan vs Query: a Scan reads every item in the table and applies a filter after reading. You are billed for every item read, regardless of how many match. On a 10-million-item table, a Scan that returns 1 item still bills for reading 10 million items. Always use Query with a partition key condition. Only use Scan for backfill/migration scripts or admin tooling that runs once.
- Item size limit: 400 KB. If an item exceeds 400 KB (e.g., storing user-uploaded content inline), the PutItem call fails. The standard pattern is to store large payloads in S3 and keep only the S3 object key in DynamoDB.
- Cost at this phase's scale is effectively zero: 25 GB of storage and a provisioned-capacity allowance sit in AWS's always-free offerings, and on-demand requests at learning volume cost cents. Verify current terms at aws.amazon.com/free before building cost assumptions.
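To see why partition key cardinality matters, here is a toy model that assigns keys to partitions by hash. It is only an illustration: DynamoDB's real hash function and partition count are internal, and the eight-partition figure and key formats below are assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.IntStream;

/** Toy model of hash partitioning; not DynamoDB's actual algorithm. */
final class PartitionSpreadSketch {

    /** Maps a partition key onto one of partitionCount buckets by hash. */
    static int partitionFor(String partitionKey, int partitionCount) {
        return Math.floorMod(partitionKey.hashCode(), partitionCount);
    }

    /** Counts how many distinct partitions a set of keys lands on. */
    static int distinctPartitions(List<String> keys, int partitionCount) {
        Set<Integer> used = new HashSet<>();
        for (String key : keys) {
            used.add(partitionFor(key, partitionCount));
        }
        return used.size();
    }

    public static void main(String[] args) {
        List<String> userKeys = IntStream.range(0, 1000).mapToObj(i -> "user-" + i).toList();
        List<String> statusKeys = List.of("active", "inactive");
        System.out.println("user IDs spread over partitions: " + distinctPartitions(userKeys, 8));
        System.out.println("status values spread over partitions: " + distinctPartitions(statusKeys, 8));
    }
}
```

However many items exist, two distinct key values can never occupy more than two partitions, so all traffic for "active" items lands on a single partition's throughput budget.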
Capacity Modes
What On-Demand: pay per request. No capacity planning. Scales instantly to any traffic level, and is the default mode for new tables. ~$0.625 per million write request units, ~$0.125 per million read request units (us-east-1, following the November 2024 price reduction; check aws.amazon.com/dynamodb/pricing/on-demand for your region). Provisioned: you specify Read Capacity Units (RCUs) and Write Capacity Units (WCUs). Cheaper at predictable, sustained load. Auto Scaling adjusts RCU/WCU automatically within min/max bounds.
Why At equivalent sustained throughput, On-Demand costs roughly 3-4x more than fully utilised Provisioned capacity; in practice provisioned capacity is rarely fully utilised, so the real gap is smaller. Start On-Demand. Move to Provisioned with Auto Scaling only when traffic is sustained and predictable enough that the numbers clearly favour it.
Gotcha
- 1 RCU = 1 strongly consistent read per second for items up to 4 KB, or 2 eventually consistent reads per second. 1 WCU = 1 write per second for items up to 1 KB.
- Provisioned mode bills for the capacity you reserve whether you use it or not; under-utilised provisioned tables often cost more than On-Demand would. Compare against your real utilisation before switching.
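The RCU/WCU definitions above translate directly into arithmetic. This sketch applies the 4 KB read unit, the 1 KB write unit, and the halving for eventually consistent reads; the method names are illustrative.

```java
/** Capacity-unit arithmetic from the RCU/WCU definitions; method names are hypothetical. */
final class CapacityUnitsSketch {

    /** RCUs needed: each read costs ceil(size / 4 KB) units; eventually consistent reads cost half. */
    static long readCapacityUnits(int itemSizeBytes, long readsPerSecond, boolean stronglyConsistent) {
        long unitsPerRead = ceilDiv(itemSizeBytes, 4 * 1024);
        long stronglyConsistentUnits = unitsPerRead * readsPerSecond;
        return stronglyConsistent ? stronglyConsistentUnits : ceilDiv(stronglyConsistentUnits, 2);
    }

    /** WCUs needed: each write costs ceil(size / 1 KB) units. */
    static long writeCapacityUnits(int itemSizeBytes, long writesPerSecond) {
        return ceilDiv(itemSizeBytes, 1024) * writesPerSecond;
    }

    /** Integer division rounding up, for positive operands. */
    private static long ceilDiv(long dividend, long divisor) {
        return (dividend + divisor - 1) / divisor;
    }

    public static void main(String[] args) {
        System.out.println("RCUs, 6000-byte items, 10 strong reads/s: " + readCapacityUnits(6000, 10, true));
        System.out.println("RCUs, same load, eventually consistent: " + readCapacityUnits(6000, 10, false));
        System.out.println("WCUs, 1500-byte items, 5 writes/s: " + writeCapacityUnits(1500, 5));
    }
}
```

Note the rounding: a 6,000-byte item costs two read units, not 1.46, so trimming items to just under a 4 KB or 1 KB boundary can halve the capacity a table needs.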
DynamoDB Transactions & Conditional Writes
What TransactWriteItems makes up to 100 write operations atomic across multiple items or tables: all succeed or all fail, with no partial state. Conditional writes add optimistic locking: a write only applies if a specified attribute matches a condition. Example: ConditionExpression: "attribute_not_exists(orderId)" prevents duplicate order creation even under concurrent Lambda invocations.
Why Retries and at-least-once delivery mean the same write can arrive twice, so idempotent API design is a requirement in production; conditional writes in DynamoDB are the mechanism. Without them, race conditions create duplicate records when retries overlap.
Gotcha
- Transactions consume 2× the capacity units (reads and writes are doubled). Budget accordingly.
- TransactionConflictException occurs when two concurrent transactions touch the same item. Implement exponential backoff and retry in your Lambda handler.
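The exponential backoff advice above can be sketched as a small retry helper. This is a generic illustration: ConflictException is a hypothetical stand-in for the SDK's conflict error, and the "full jitter" strategy (sleep a random time up to the exponential ceiling) is one common choice, not something the source prescribes.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Retry helper with exponential backoff and full jitter; names are hypothetical. */
final class BackoffRetry {

    /** Stand-in for a transaction conflict error raised by the database client. */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** Runs the operation, retrying on conflicts until maxAttempts is reached. */
    static <T> T callWithBackoff(Supplier<T> operation, int maxAttempts, long baseDelayMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (ConflictException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the conflict to the caller
                }
                // The ceiling doubles each attempt: base, 2x base, 4x base, ...
                long ceilingMillis = baseDelayMillis << (attempt - 1);
                // Full jitter: a random delay in [0, ceiling] spreads out competing retries.
                sleepQuietly(ThreadLocalRandom.current().nextLong(ceilingMillis + 1));
            }
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

Keep the attempt count and base delay small inside Lambda, because every millisecond spent sleeping is billed and counts against the function timeout.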
Global Secondary Index (GSI)
What A GSI lets you query a DynamoDB table on attributes other than the primary key. A GSI has its own partition key (and optional sort key) that can be any table attribute. Data is replicated asynchronously from the base table to the GSI. You can have up to 20 GSIs per table. You pay separately for GSI storage and throughput.
Why Without a GSI, every query on a non-key attribute requires a full table scan that reads and bills for every item in the table. GSIs let you support multiple access patterns on a single table without scan costs, the primary mechanism for keeping DynamoDB queries fast and cheap as the table grows.
Gotcha
- GSI reads are eventually consistent: the GSI may lag behind the base table by milliseconds to seconds. Never use a GSI for reads that must reflect the very latest write.
LSI vs GSI
What A Local Secondary Index (LSI) shares the same partition key as the base table but uses a different sort key. It must be created at table creation time; you cannot add it later. Maximum 5 LSIs per table. An LSI supports strongly consistent reads. A Global Secondary Index (GSI) can use any attribute as its partition key and can be added or deleted at any time. GSIs support only eventually consistent reads.
Why LSIs cannot be added or removed after table creation, so the LSI/GSI choice is a design-time constraint. Get it wrong and you rebuild the table. The practical rule: if you need to query within a partition on multiple sort keys and strong consistency matters, LSI. For every other alternate access pattern, GSI.
Gotcha
- LSIs share the provisioned throughput of the base table. Under heavy read load against an LSI, it competes directly with base table reads. GSIs have their own independent read/write capacity.
- LSIs have a 10 GB per-partition limit (the combined size of all items sharing the same partition key across the base table and all LSIs). Exceeding this cap causes ItemCollectionSizeLimitExceededException.
Single-Table Design
What Single-table design stores multiple entity types (users, orders, sessions) in one DynamoDB table by overloading the primary key. Pattern: set PK to an entity prefix + ID (e.g., USER#userId, ORDER#orderId) and SK to encode the relationship or access pattern (e.g., PROFILE, ORDER#2024-01-15). Different item types coexist in the same table and each GSI serves a specific access pattern.
Why DynamoDB charges per read/write, not per table. Multiple tables with cross-entity queries require application-side joins (multiple round trips). A single table with well-designed keys fetches all related entities in one Query call. The tradeoff: the schema is harder to read and onboard onto compared to relational models.
Gotcha
- Changing access patterns after launch means adding GSIs or migrating data. Single-table design front-loads the design effort: map every query your application will ever run before writing a line of code.
- Do not start with single-table design on a new project unless the team has DynamoDB experience. Multiple tables is a valid starting point and easier to reason about during initial development.
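The key-overloading pattern is easier to follow with a small in-memory model. This sketch mimics a table as a map of partitions, each holding items ordered by sort key, and implements the equivalent of a Query with a begins_with condition on the sort key. The helper names and sample data are illustrative, and the model ignores everything else DynamoDB does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory model of single-table key design; not a DynamoDB client. */
final class SingleTableSketch {

    // partition key -> (sort key -> payload); sort keys stay ordered, as within a partition.
    private final Map<String, TreeMap<String, String>> table = new HashMap<>();

    static String userPk(String userId) {
        return "USER#" + userId;
    }

    static String orderSk(String isoDate) {
        return "ORDER#" + isoDate;
    }

    void put(String pk, String sk, String payload) {
        table.computeIfAbsent(pk, key -> new TreeMap<>()).put(sk, payload);
    }

    /** Analogue of Query with PK = :pk AND begins_with(SK, :prefix). */
    List<String> query(String pk, String skPrefix) {
        List<String> results = new ArrayList<>();
        TreeMap<String, String> partition = table.getOrDefault(pk, new TreeMap<>());
        // Start at the first sort key >= prefix and stop once keys no longer match it.
        for (Map.Entry<String, String> entry : partition.tailMap(skPrefix, true).entrySet()) {
            if (!entry.getKey().startsWith(skPrefix)) {
                break;
            }
            results.add(entry.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        SingleTableSketch db = new SingleTableSketch();
        db.put(userPk("42"), "PROFILE", "profile of user 42");
        db.put(userPk("42"), orderSk("2024-01-15"), "order A");
        db.put(userPk("42"), orderSk("2024-02-01"), "order B");
        System.out.println(db.query(userPk("42"), "ORDER#")); // only the orders
        System.out.println(db.query(userPk("42"), ""));       // profile and orders in one call
    }
}
```

Because the profile and the orders share a partition key, one query returns both entity types, which is the round trip a multi-table design would need two calls for.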
DynamoDB TTL (Time to Live)
What TTL automatically deletes items from DynamoDB at a specified time at no extra cost. To enable: choose a Number attribute to use as the TTL attribute (e.g., expiresAt), enable TTL on the table, and set that attribute to a Unix timestamp (seconds since epoch) on each item you want to expire. DynamoDB typically deletes expired items within 48 hours of the expiry time, not exactly at expiry, and the window is not guaranteed.
Why Session tables, OTP tables, rate-limit counters, and idempotency tables accumulate rows indefinitely without TTL. A growing table with millions of stale items increases storage costs and scan costs. TTL is the standard mechanism for bounded-growth tables in serverless architectures.
Gotcha
- TTL deletion is not instantaneous: items can persist up to 48 hours past their expiry timestamp. If your application logic reads the table and must not see expired items, add a filter in your query: FilterExpression: "#exp > :now" using the current epoch time.
- TTL deletions appear in DynamoDB Streams as REMOVE events with a userIdentity of {"type": "Service", "principalId": "dynamodb.amazonaws.com"}. If your Lambda processes the stream, filter these out unless you want to react to expirations.
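The two halves of working with TTL (writing the expiry attribute, then ignoring items that have expired but not yet been deleted) come down to epoch-second arithmetic. A minimal sketch, with hypothetical method names:

```java
import java.time.Duration;
import java.time.Instant;

/** Epoch-second helpers for a TTL attribute such as expiresAt; names are hypothetical. */
final class TtlSketch {

    /** Value to store in the TTL attribute: seconds since epoch, not milliseconds. */
    static long expiresAt(Instant now, Duration timeToLive) {
        return now.plus(timeToLive).getEpochSecond();
    }

    /** Client-side equivalent of the filter "#exp > :now". */
    static boolean isLive(long expiresAtEpochSeconds, Instant now) {
        return expiresAtEpochSeconds > now.getEpochSecond();
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long expiry = expiresAt(now, Duration.ofMinutes(15));
        System.out.println("store expiresAt = " + expiry);
        System.out.println("live now? " + isLive(expiry, now));
        System.out.println("live in an hour? " + isLive(expiry, now.plus(Duration.ofHours(1))));
    }
}
```

A common mistake is storing System.currentTimeMillis() in the attribute: a millisecond value read as seconds lands tens of thousands of years in the future, so the item never expires.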
Hands-on Tasks
Interview Q&A
What is a Lambda cold start and how do you mitigate it for Java?
A cold start occurs when Lambda provisions a new execution environment: it must download code, start the JVM, and run initialisation code before handling the request. For Java, this typically adds 2-10 seconds of latency. Subsequent requests to the same warm container run in milliseconds.
Mitigations in order of effectiveness: (1) Lambda SnapStart, available on Java 11+, Python 3.12+, and .NET 8+ runtimes; AWS snapshots the initialised environment and restores from it, reducing cold starts to ~1 second. (2) Provisioned Concurrency, keeps N containers pre-initialised, eliminating cold starts at extra cost. (3) Avoid Spring Boot, Spring's context initialisation is heavy; use Quarkus with native compilation or plain Java. (4) Keep init code minimal, defer expensive initialisation until first use rather than in static blocks.
When would you choose DynamoDB over PostgreSQL (RDS)?
Choose DynamoDB when: you need single-digit millisecond latency at massive scale, your access patterns are simple and known upfront (get by ID, query by partition key), you need near-infinite scalability without manual sharding, or you're building session storage, leaderboards, IoT data, or gaming backends.
Stick with PostgreSQL/RDS when: your data has complex relationships requiring joins, you need ad-hoc queries or reporting, you have transactional requirements (ACID across multiple entities), or your team thinks in relational terms. The operational overhead of DynamoDB table design is significant for teams not familiar with NoSQL access patterns. DynamoDB requires you to design your data model around your queries upfront, not the other way around.
What's the difference between API Gateway REST API and HTTP API?
HTTP API is the simpler and cheaper option (~$1/million requests). It covers the majority of use cases: Lambda proxy integration, JWT authentication, CORS, and custom routes. Lower latency than REST API.
REST API (~$3.50/million requests) adds features not in HTTP API: response caching (reduces Lambda invocations and latency for cacheable responses), usage plans and API keys (for partner/developer APIs with rate limits per key), AWS WAF integration (web application firewall), request/response transformation without Lambda, and resource-level policies.
Default choice: HTTP API. Use REST API specifically when you need caching, usage plans, or WAF.
10
Infrastructure as Code
Topics
Terraform
Core Concepts: Providers, Resources, Variables
What Terraform is an IaC tool by HashiCorp using HCL (HashiCorp Configuration Language). It is source-available under the BUSL licence, free to use for almost everyone except HashiCorp's competitors; the Linux Foundation's OpenTofu is the open-source fork with compatible syntax. Providers are plugins that interact with APIs (the aws provider calls AWS APIs). Resources are the infrastructure you declare (aws_s3_bucket, aws_instance). Variables parameterise your configuration so the same code can provision dev, staging, and prod.
Why Every Terraform codebase in production uses this provider/resource/variable model. Understanding it is prerequisite to reading, writing, or reviewing any infrastructure code. Variables are what make the same code provision dev and prod without duplication or drift.
Minimal example
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 6.0" } # check registry.terraform.io for latest
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  default = "us-east-1"
}

resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-app-uploads-${var.region}"
}

resource "aws_s3_bucket_versioning" "my_bucket" {
  bucket = aws_s3_bucket.my_bucket.id
  versioning_configuration { status = "Enabled" }
}
Terraform Workflow: init → plan → apply
What The standard Terraform workflow is three commands, plus a fourth for teardown:
- terraform init: download providers and modules, initialise the backend. Run once per project or after backend/provider changes.
- terraform plan: show what will be created, changed, or destroyed. Reads current state and compares to your code. Always review this output before applying.
- terraform apply: execute the plan. Prompts for confirmation. Writes results to state file.
- terraform destroy: destroys all resources managed by the current state. Use carefully.
Why The plan step is what makes Terraform safe. Seeing exactly what will be created, changed, or destroyed before committing is the primary advantage over running AWS CLI commands directly, which execute immediately with no preview.
Gotcha
- Never run terraform apply in CI without first reviewing the plan output. A -replace or -destroy flag in the wrong hands deletes production resources.
- terraform import: bring an existing manually-created resource under Terraform management without recreating it.
Remote State (S3)
What Terraform state (terraform.tfstate) is a JSON file that maps your code to real-world resources. By default it is stored locally. For teams, store it remotely in S3 (for shared access and versioning). Since Terraform v1.10, state locking uses S3-native locking via use_lockfile = true, no DynamoDB table required. The DynamoDB-based locking approach (dynamodb_table) is officially deprecated.
Why Local state breaks the moment two engineers work on the same infrastructure. Two simultaneous apply runs corrupt each other's state. Remote state in S3 gives the team a shared, versioned, access-controlled source of truth, and the lockfile prevents concurrent writes.
backend.tf
terraform {
  backend "s3" {
    bucket       = "my-terraform-state"
    key          = "prod/terraform.tfstate"
    region       = "us-east-1"
    use_lockfile = true
    encrypt      = true
  }
}
Gotcha
- State files can contain sensitive values (RDS passwords, private keys). Enable S3 bucket encryption and restrict access with bucket policies.
- Never edit the state file manually. Use terraform state mv, terraform state rm, or terraform import for state surgery.
Modules
What Reusable groups of Terraform resources. A module is a directory of .tf files. Call it with a module block and pass variables. Modules enforce consistency: define your VPC pattern once and call it for dev, staging, and prod. The Terraform Registry has community modules for common patterns (AWS VPC, EKS, RDS).
Why Copy-pasting resource blocks across dev, staging, and prod environments leads to configuration drift: one environment gets a security group rule or a log retention setting the others don't. A module defines the pattern once; every environment calls it with different variables.
Gotcha
- Avoid deeply nested modules: they make debugging hard. Flat module structures are more readable.
- Pin module versions (version = "5.5.0") to avoid unexpected breaking changes from upstream updates.
AWS CDK (for Java & TypeScript Teams)
AWS Cloud Development Kit (CDK)
What CDK lets you write infrastructure in a real programming language (Java, TypeScript, Python, or Go) instead of YAML or HCL. A CDK app synthesises to CloudFormation templates under the hood. You get loops, conditionals, functions, unit tests, and type safety. The CDK Constructs Library provides high-level abstractions: new ApplicationLoadBalancedFargateService(this, "Api", {...}) creates the ECS cluster, task definition, ALB, target group, and security groups in one call.
Why For engineers coming from a Java/Spring background, CDK is a natural fit: you think in code, not YAML. CDK is increasingly the preferred IaC approach at enterprise AWS shops because infrastructure can be tested with JUnit, reused as libraries, and maintained with the same tooling as application code.
Minimal CDK stack (Java)
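import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.s3.BlockPublicAccess;
import software.amazon.awscdk.services.s3.Bucket;
import software.constructs.Construct;

public class MyStack extends Stack {
    public MyStack(Construct scope, String id) {
        super(scope, id);
        // Versioned bucket with all public access blocked
        Bucket bucket = Bucket.Builder.create(this, "MyBucket")
            .versioned(true)
            .blockPublicAccess(BlockPublicAccess.BLOCK_ALL)
            .build();
    }
}
// Deploy: cdk synth → cdk deploy
Gotcha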
- CDK synthesises to CloudFormation, so you get Change Sets and rollback protection for free. The downside: CDK's generated templates are verbose and hard to read directly.
- CDK is opinionated and sets non-obvious defaults that vary by construct and version (removal policies, encryption settings, generated IAM policies). Always run cdk synth and inspect the generated CloudFormation template before deploying: it reveals exactly what CDK will create. Use cdk diff to preview changes before every deploy.
AWS CloudFormation
Templates and Stacks
What CloudFormation is AWS's native IaC service. You define infrastructure in YAML or JSON templates. Deploying a template creates a stack: a collection of AWS resources managed as a unit. AWS determines the correct creation order based on dependencies between resources. CloudFormation is free; you pay only for the resources it creates.
Why CloudFormation requires no additional tooling: it is built into every AWS account. For teams that want AWS-native IaC without a separate backend to manage, it integrates directly with IAM, Service Catalog, and StackSets for multi-account deployments.
Template structure
AWSTemplateFormatVersion: '2010-09-09'
Description: My application stack
Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
  LatestAmiId: # resolves the current AL2023 AMI at deploy time; no hardcoded AMI ID
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64
Resources: # required
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !Ref LatestAmiId
Outputs:
  InstanceId:
    Value: !Ref MyInstance
The LatestAmiId parameter is the standard pattern for AMI selection: CloudFormation resolves the public SSM parameter to the current AL2023 AMI for your region at deploy time. Hardcoded ami-... IDs go stale and differ per region.
Gotcha
- If a stack update fails, CloudFormation automatically rolls back to the previous state. This is generally helpful but means you must investigate why the update failed before retrying.
- Deleting a stack deletes all resources in it (by default). Use DeletionPolicy: Retain on critical resources like RDS and S3 to protect them.
Change Sets and Drift Detection
What A Change Set is a preview of what CloudFormation will do when you update a stack, equivalent to terraform plan. Always create a change set and review it before updating production stacks. Drift detection identifies resources that have been modified outside of CloudFormation (e.g., someone changed a Security Group in the console). Drift detection does not auto-remediate; it reports differences so you can decide what to do.
Why Without reviewing a change set, a CloudFormation update can delete and recreate a production database if the resource's replacement policy triggers. Drift detection closes the gap when someone manually changes a resource in the console; without it, the next stack update silently overwrites that manual change.
Gotcha
- Manual changes to CloudFormation-managed resources cause drift. If you update a resource manually and then run CloudFormation, it may overwrite your manual change or fail with a conflict. Always make changes through CloudFormation or Terraform, never manually in the console for IaC-managed resources.
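The change-set review step can be partly automated by flagging the entries that deserve a human look. The sketch below works on a simplified, hand-built list rather than calling the CloudFormation API; the record mirrors the Action and Replacement fields a change set reports, and the resource names are invented.

```java
import java.util.List;

/** Flags destructive entries in a simplified change set; sample data is hypothetical. */
final class ChangeSetReviewSketch {

    /** Simplified view of one change: action is Add/Modify/Remove, replacement is True/False/Conditional. */
    record ResourceChange(String logicalResourceId, String action, String replacement) {}

    /** Returns the changes that remove a resource or may replace (delete and recreate) it. */
    static List<ResourceChange> risky(List<ResourceChange> changes) {
        return changes.stream()
            .filter(change -> "Remove".equals(change.action())
                || "True".equals(change.replacement())
                || "Conditional".equals(change.replacement()))
            .toList();
    }

    public static void main(String[] args) {
        List<ResourceChange> changes = List.of(
            new ResourceChange("AppBucket", "Add", "False"),
            new ResourceChange("OrdersDb", "Modify", "True"),
            new ResourceChange("ApiLogGroup", "Modify", "False"),
            new ResourceChange("OldQueue", "Remove", "False"));
        risky(changes).forEach(change ->
            System.out.println("REVIEW: " + change.logicalResourceId()
                + " (" + change.action() + ", replacement=" + change.replacement() + ")"));
    }
}
```

A check like this could fail a pipeline stage and require manual approval whenever the risky list is non-empty. It complements human review of the change set and does not replace it.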
Hands-on Tasks
Interview Q&A
Why use Infrastructure as Code instead of clicking in the console?
Four reasons. Reproducibility: run the same code and get identical environments, with no manual steps that differ between dev, staging, and prod. Version control: infrastructure changes are committed to Git, code-reviewed in PRs, and have a full audit trail, so you know what changed, when, and why. Disaster recovery: if an environment is destroyed, recreate it from code in minutes rather than days of manual work. Drift prevention: IaC is the source of truth; manual console changes are detected as drift and can be corrected. The alternative, clicking in the console, produces "snowflake servers" that are impossible to reproduce exactly.
What is Terraform state and why does it need to be stored remotely?
Terraform state (terraform.tfstate) is a JSON file mapping your code to real AWS resources. Without it, Terraform cannot know what it already created and would try to create everything again on the next apply. The state file also tracks resource attributes (IDs, ARNs) needed to create dependencies.
Remote state (in S3) is required for team use: if each engineer has their own local state file, there is no shared understanding of what is deployed and two people applying simultaneously can corrupt state or create duplicate resources. Since Terraform v1.10, state locking uses S3-native locking (use_lockfile = true in the backend config), no separate DynamoDB table required. The older DynamoDB-based locking approach is deprecated. Encryption on the S3 bucket is important because state files often contain sensitive values like database passwords.
What is a CloudFormation change set and why should you always use one?
A change set is CloudFormation's preview of what will happen when you update a stack: it lists which resources will be added, modified, or removed, and whether each modification requires replacement. Replacement is the critical one: some property changes require CloudFormation to delete and recreate a resource (e.g., renaming an RDS instance's DBInstanceIdentifier, or moving an EC2 instance to a different subnet). Without reviewing the change set, you could accidentally delete a production database by making what looks like a minor configuration change.
Always create and review a change set before updating any production stack. The workflow is: create change set → review (specifically look for any "Replacement: True" entries) → execute if safe, or modify the template if not. This is the IaC equivalent of code review before deployment.
11
ElastiCache / Redis
Architect this phase
Spring Boot (EKS/ECS) → Security Group → ElastiCache Redis (Private Subnet) → cache miss only → RDS PostgreSQL
The diagram shows where this wiring lands in container deployments. This phase's tasks run the app on the Phase 2 EC2 instance (rebuild JAR, scp, restart systemd); only the source Security Group differs. Draw it: draw.io · AWS Icons
Topics
ElastiCache Fundamentals
ElastiCache for Redis vs Memcached
What ElastiCache is AWS's managed in-memory caching service. It offers three engines: Valkey, Redis OSS, and Memcached. Redis supports rich data structures (lists, sets, sorted sets, hashes), optional persistence (RDB snapshots and AOF), pub/sub messaging, Lua scripting, transactions, and cluster mode for horizontal sharding. Valkey is the Linux Foundation fork of Redis that AWS promotes for new workloads: wire-compatible with Redis clients (Lettuce works unchanged) and priced lower per node-hour on ElastiCache. Memcached is simpler, multi-threaded, and has no persistence or replication.
Why Redis (or Valkey) is the right default: caching, session storage, leaderboards (sorted sets), pub/sub, rate limiting, and distributed locks. Memcached's multi-threaded model can outperform Redis at extreme throughput for pure simple key-value operations, but for a Spring Boot e-commerce application Redis covers all the patterns you will need as requirements evolve. This phase provisions Redis 7.x; choosing Valkey instead changes nothing in the Spring Boot code.
Gotcha
- ElastiCache Redis does not support all Redis commands. In cluster mode, commands that operate across multiple keys (e.g., MGET with keys in different slots, KEYS *, Lua scripts touching multiple keys) are restricted or behave differently. Always test your Redis command usage against a cluster-mode instance before going to production.
- Memcached on ElastiCache does not support Multi-AZ automatic failover or replication. If a Memcached node fails, all data in that node is lost.
- ElastiCache Serverless exists for Valkey and Redis: no node sizing, pay per GB stored and per request. Convenient for spiky workloads; for a steady learning workload, a cache.t3.micro node is the cheaper option.
Cluster Mode Disabled vs Enabled
What Cluster Mode Disabled: a single shard with one primary node and up to 5 read replicas. All data lives on the one shard. Maximum data size is determined by the node type (e.g., cache.r6g.xlarge provides ~26 GB). Cluster Mode Enabled: data is sharded (partitioned by key hash slot) across up to 500 shards, each with its own primary and replicas, enabling both horizontal read and write scaling and datasets larger than a single node can hold.
Why For most Spring Boot applications, cluster mode disabled with 1-2 read replicas in separate Availability Zones is sufficient. It provides Multi-AZ failover (automatic promotion of a replica if the primary fails) without the complexity of cluster-mode key routing. Enable cluster mode only when your dataset genuinely exceeds a single large node or when write throughput requires sharding.
Gotcha
- Spring Boot's default Redis client is Lettuce (not Jedis). Lettuce can follow cluster topology changes: when a primary fails and a replica is promoted, it refreshes the cluster view and reconnects without an application restart. In cluster mode, topology refresh is not on by default, so enable it (for example spring.data.redis.lettuce.cluster.refresh.adaptive=true). Jedis requires more manual cluster configuration.
- You cannot switch a cluster between cluster mode disabled and enabled without recreating it. Plan your topology before provisioning.
- In cluster mode enabled, all keys in a MULTI/EXEC transaction or a Lua script must hash to the same slot. Use hash tags (e.g., {user:123}:cart and {user:123}:profile) to force related keys into the same slot.
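To make the hash-tag rule concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 slots (CRC16-XMODEM of the key, or of the tag inside the first non-empty {...}). The class and method names (HashSlot, hashedPart, slot) are hypothetical and exist only for illustration; in a real application the client library (Lettuce) does this for you.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of Redis Cluster key-to-slot mapping; not a client implementation. */
final class HashSlot {
    private static final int SLOT_COUNT = 16384;

    /** CRC16-XMODEM: polynomial 0x1021, initial value 0, no bit reflection. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF; // keep the register at 16 bits
            }
        }
        return crc;
    }

    /** Returns the hash tag if the key has a non-empty {tag}, otherwise the whole key. */
    static String hashedPart(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) { // at least one character between the braces
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    static int slot(String key) {
        return crc16(hashedPart(key).getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    public static void main(String[] args) {
        // Both tagged keys hash only "user:123", so they print the same slot number.
        System.out.println(slot("{user:123}:cart"));
        System.out.println(slot("{user:123}:profile"));
        // Without a tag the whole key is hashed, so related keys can land in different slots.
        System.out.println(slot("user:123:cart"));
    }
}
```

Because only the text inside the braces is hashed, any two keys that share a tag are guaranteed to share a slot, which is what makes MULTI/EXEC and multi-key Lua scripts legal on them.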
Caching Patterns
Cache-Aside Pattern (Lazy Loading)
What The application controls the cache explicitly. On a read: check cache → if hit, return cached value; if miss, fetch from database, write the result to cache with a TTL, return the result. The cache is populated lazily: only data that has been requested gets cached. This is the most common caching pattern and the one @Cacheable in Spring implements.
Why Simple to implement and resilient: if the cache fails, the application can fall through to the database. The cache contains only data that has been requested, so memory is not wasted on data that is never read. TTL ensures stale entries eventually expire. This pattern can deliver 60%+ read performance improvement on high-read workloads, where product and inventory data is read far more often than it is written.
Gotcha
- TTL is mandatory. Without a TTL, cached entries never expire. If the underlying data changes in the database and the cache is not explicitly evicted, the application will serve stale data indefinitely. Always set a TTL appropriate to your data's rate of change.
- Cache stampede (thundering herd): when a popular key's TTL expires (or on a cold start), dozens or hundreds of simultaneous requests all miss the cache and fire concurrent queries to the database, which can overwhelm it. Mitigate with: (1) adding random jitter to TTLs (e.g., 55-65 seconds instead of exactly 60) so hot keys don't expire simultaneously; (2) probabilistic early expiry (PER), start refreshing a key before it expires based on a probability function; (3) a single-flight/request-coalescing pattern where only one thread fetches from the DB and others wait for the result.
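The cache-aside read path with two of the stampede mitigations above (TTL jitter and single-flight loading) can be sketched in plain Java. This is an in-process illustration under stated assumptions: a ConcurrentHashMap stands in for Redis, the loader function stands in for the database query, and the names (CacheAside, jitteredTtlNanos, freshValue) are hypothetical. It also assumes the loader never returns null.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

/** Cache-aside sketch with TTL jitter and single-flight loading (in-memory stand-in for Redis). */
final class CacheAside<K, V> {
    private record Entry<T>(T value, long expiresAtNanos) {}

    private final ConcurrentHashMap<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stand-in for the database query
    private final long baseTtlNanos;
    private final long maxJitterNanos;

    CacheAside(Function<K, V> loader, Duration baseTtl, Duration maxJitter) {
        this.loader = loader;
        this.baseTtlNanos = baseTtl.toNanos();
        this.maxJitterNanos = maxJitter.toNanos();
    }

    V get(K key) {
        V cached = freshValue(key);
        if (cached != null) return cached; // cache hit

        // Cache miss: only the thread that registers its future first loads from the DB.
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> leader = inFlight.putIfAbsent(key, mine);
        if (leader != null) return leader.join(); // someone else is loading; wait for their result

        try {
            V value = freshValue(key); // re-check: a previous leader may have just filled the cache
            if (value == null) {
                value = loader.apply(key);
                cache.put(key, new Entry<>(value, System.nanoTime() + jitteredTtlNanos()));
            }
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e); // waiters see the failure instead of hanging
            throw e;
        } finally {
            inFlight.remove(key, mine);
        }
    }

    /** Base TTL plus or minus a random offset, so hot keys do not all expire at the same instant. */
    long jitteredTtlNanos() {
        return baseTtlNanos + ThreadLocalRandom.current().nextLong(-maxJitterNanos, maxJitterNanos + 1);
    }

    private V freshValue(K key) {
        Entry<V> e = cache.get(key);
        return (e != null && System.nanoTime() - e.expiresAtNanos() < 0) ? e.value() : null;
    }
}
```

Constructed with a 60-second base TTL and 5 seconds of jitter, entries expire somewhere in the 55-65 second range used as the example above. The single-flight part only coalesces requests inside one JVM; across several application instances you would still see one load per instance.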
Write-Through Pattern
What On every database write, the application also writes the same data to the cache. The cache is always warm and consistent with the database. There is no cache-miss penalty for recently written data, and there is no stampede risk because the cache is pre-populated on write rather than on first read.
Why Best for workloads where data is written and then immediately read (e.g., a user updates their profile and then views it). Guarantees that the cache always reflects the latest database state without relying on TTL expiry or explicit eviction.
Gotcha
- Write penalty: every database write now requires a second write to Redis. Writes are slower, and the write path now has a dependency on Redis availability.
- Cache pollution: data written to the cache may never be read, wasting memory. For write-heavy datasets with unpredictable read access patterns (e.g., audit logs, bulk imports), write-through wastes cache space.
- Hybrid approach: use write-through for hot, frequently-read data (user sessions, current inventory counts for top products) and cache-aside for cold or unpredictable data. Many production systems combine both patterns by data category.
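A minimal write-through sketch, assuming two in-memory maps as stand-ins for RDS and Redis (the class name WriteThroughStore and its methods are hypothetical). In Spring terms the write path corresponds roughly to annotating the update method with @CachePut instead of @CacheEvict.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Write-through sketch: every write updates the database and then the cache. */
final class WriteThroughStore<K, V> {
    private final Map<K, V> database = new ConcurrentHashMap<>(); // stand-in for RDS
    private final Map<K, V> cache = new ConcurrentHashMap<>();    // stand-in for Redis
    private final AtomicInteger databaseReads = new AtomicInteger();

    void write(K key, V value) {
        // Source of truth first: if the database write fails, the cache is never
        // left holding a value the database does not have.
        database.put(key, value);
        cache.put(key, value); // the write penalty: a second write on every update
    }

    Optional<V> read(K key) {
        V cached = cache.get(key);
        if (cached != null) return Optional.of(cached); // warm thanks to write-through

        // Fallback for keys that were never written through this path (cache-aside fill).
        databaseReads.incrementAndGet();
        V stored = database.get(key);
        if (stored != null) cache.put(key, stored);
        return Optional.ofNullable(stored);
    }

    int databaseReads() {
        return databaseReads.get();
    }
}
```

A read that follows a write is served from the cache without touching the database, which is the benefit described above. A real Redis-backed version would still set a TTL on the cached value so that entries that are never read do not occupy memory forever.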
Spring Boot Integration
Spring Boot Integration (spring-boot-starter-data-redis)
What Spring Boot's Redis starter auto-configures a Lettuce connection factory and a RedisTemplate. Add @EnableCaching to your main class to activate the Spring Cache abstraction. Annotate service methods with @Cacheable, @CacheEvict, and @CachePut to declaratively manage the cache without writing Redis commands manually.
Why Spring's Cache abstraction lets you add caching to any service method with a single annotation, no Redis commands or connection management in business logic. The same @Cacheable annotation works with Redis in production and a simple in-process cache in unit tests, so caching does not affect testability.
application.properties configuration
# Fetch host from environment variable injected via Secrets Manager / EKS secret
spring.data.redis.host=${REDIS_HOST}
spring.data.redis.port=6379
spring.data.redis.password=${REDIS_AUTH_TOKEN}
# Required when TLS is enabled on the cluster.
# Keep comments on their own line: .properties files have no inline comments,
# so a trailing "# ..." would become part of the value.
spring.data.redis.ssl.enabled=true
# Connection pool (Lettuce); takes effect only when commons-pool2 is on the classpath
spring.data.redis.lettuce.pool.max-active=20
spring.data.redis.lettuce.pool.max-idle=10
spring.data.redis.lettuce.pool.min-idle=2
Cache configuration bean (TTL + JSON serializer)
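// CacheConfig.java
import java.time.Duration;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.serializer.GenericJackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.RedisSerializationContext;

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public RedisCacheConfiguration redisCacheConfiguration() {
        // Generic serializer: embeds the class name in the JSON so reads
        // deserialize back to InventoryItem, not LinkedHashMap
        GenericJackson2JsonRedisSerializer serializer =
                new GenericJackson2JsonRedisSerializer();
        return RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofSeconds(60))
                .serializeValuesWith(
                        RedisSerializationContext.SerializationPair.fromSerializer(serializer));
    }
}

// InventoryService.java (service layer)
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class InventoryService {
    // Repository type name is assumed; use your own Spring Data repository
    private final InventoryRepository inventoryRepository;

    public InventoryService(InventoryRepository inventoryRepository) {
        this.inventoryRepository = inventoryRepository;
    }

    // First call loads from the DB and caches the result for the configured TTL
    @Cacheable(value = "inventory", key = "#id")
    public InventoryItem getInventoryById(Long id) {
        return inventoryRepository.findById(id).orElseThrow();
    }

    // Evict on write so the next read repopulates the cache with fresh data
    @CacheEvict(value = "inventory", key = "#id")
    public void updateInventory(Long id, InventoryItem item) {
        inventoryRepository.save(item);
    }
}
Gotcha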
- Default JDK serialization is unreadable. Out of the box, Spring's RedisTemplate uses JDK serialization: cached values appear as binary gibberish in redis-cli and are tied to the exact class structure. Configure a JSON serializer, and pick the right one: Jackson2JsonRedisSerializer typed to Object.class writes JSON fine but reads it back as LinkedHashMap, producing a ClassCastException on the first cache hit. GenericJackson2JsonRedisSerializer embeds the class name in the JSON, so values round-trip to the original type; that is the one to use with @Cacheable on mixed value types.
- When using GenericJackson2JsonRedisSerializer, the serialized JSON includes the fully qualified class name. This means renaming or moving a class will break deserialization of existing cached entries. Plan cache key versioning or use flushdb on deployment when class structure changes.
Security and Monitoring
Security: VPC, TLS, AUTH, Secrets Manager
What ElastiCache clusters run entirely inside your VPC; there is no public endpoint. Access is controlled by Security Groups. Enable in-transit encryption (TLS) and at-rest encryption at cluster creation time. Protect the cluster with a Redis AUTH token (a static password) stored in AWS Secrets Manager.
Why Redis was originally designed for trusted networks with no authentication. An ElastiCache cluster without TLS or AUTH is protected only by Security Groups: if another compromised resource in the same VPC can reach port 6379, it has full access to all cached data. Defence in depth: Security Group restriction + TLS in transit + AUTH token + at-rest encryption.
Security Group rule
# ElastiCache Security Group, inbound rule
Type: Custom TCP
Port: 6379
Source: [Application Security Group ID] # NOT 0.0.0.0/0, NOT the VPC CIDR
# Spring Boot: use rediss:// (double-s) when TLS is enabled
spring.data.redis.url=rediss://:${REDIS_AUTH_TOKEN}@${REDIS_HOST}:6379
Gotcha
- ElastiCache supports IAM authentication on Redis 7.0+ and Valkey clusters, the preferred approach for new clusters: no static password to store or rotate. It has prerequisites: TLS must be enabled and you authenticate as an RBAC user, with short-lived (15-minute) IAM tokens that the application has to regenerate (typically through a credentials provider in the client library). Where IAM auth does not fit, use the AUTH token: store it in Secrets Manager and rotate it via aws elasticache modify-replication-group --replication-group-id <id> --auth-token <new-token> --auth-token-update-strategy ROTATE.
- At-rest encryption can only be enabled at creation time. In-transit encryption is also best enabled at creation: it can be turned on for an existing replication group only on Redis OSS 7.0+ and Valkey, by changing the transit encryption mode in stages. Plan for both before provisioning your first cluster.
- Spring's rediss:// URL (double s) enables TLS. Using redis:// against a TLS-enabled cluster will result in a connection error that can be difficult to diagnose.
CloudWatch Metrics for ElastiCache
What ElastiCache publishes metrics to CloudWatch under the AWS/ElastiCache namespace. Key metrics: CacheHits and CacheMisses (calculate hit rate: CacheHits / (CacheHits + CacheMisses)), CurrConnections (current client connections), Evictions (keys evicted because maxmemory was reached), EngineCPUUtilization (Redis is single-threaded for commands; high CPU here means your commands are slow), FreeableMemory (remaining free memory on the node), and ReplicationLag (seconds the replica is behind the primary, relevant for read-replica reads).
Why These metrics are your window into cache health. A hit rate below 80% means the cache is not providing much benefit: investigate TTLs and access patterns. Any evictions mean Redis is running out of memory and ejecting valid cached data to make room for new data, so your application suddenly gets more DB read traffic without warning. CurrConnections growing toward the node's maxclients limit indicates a connection leak in the application.
Recommended alarms
- Evictions > 0 for 1 datapoint: any eviction means Redis is memory-constrained and is silently invalidating your cache. Action: upgrade the node type or reduce the TTL of lower-priority keys.
- CurrConnections approaching maxclients: maxclients is 65000 on ElastiCache Redis and Valkey nodes. New connections are refused when the limit is hit, causing application errors, and steady growth toward it almost always means a connection leak (Lettuce connection pool misconfiguration), not real load.
- ReplicationLag > 1 second: reads from replicas may return stale data older than 1 second. Investigate write throughput; the replica may be falling behind on replication.
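The hit-rate formula and the alarm conditions above can be written down as a small evaluation function. This is a sketch under stated assumptions: the record and method names are hypothetical, the values would come from CloudWatch datapoints in practice, and the connection threshold is passed in as a fraction of maxclients because the text does not fix a number for "approaching".

```java
import java.util.ArrayList;
import java.util.List;

/** Evaluates one set of ElastiCache datapoints against the alarm conditions described above. */
final class CacheHealth {
    /** One datapoint per metric, as read from the AWS/ElastiCache namespace. */
    record Sample(long cacheHits, long cacheMisses, long evictions,
                  long currConnections, double replicationLagSeconds) {}

    /** Hit rate = CacheHits / (CacheHits + CacheMisses); NaN when there was no traffic. */
    static double hitRate(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? Double.NaN : (double) hits / total;
    }

    /** connectionFraction is an assumption, e.g. 0.8 to alarm at 80% of maxclients. */
    static List<String> breachedAlarms(Sample s, long maxClients, double connectionFraction) {
        List<String> breached = new ArrayList<>();
        // NaN compares false, so a period with no traffic does not raise the hit-rate alarm.
        if (hitRate(s.cacheHits(), s.cacheMisses()) < 0.80) breached.add("HitRateLow");
        if (s.evictions() > 0) breached.add("Evictions");
        if (s.currConnections() >= maxClients * connectionFraction) breached.add("CurrConnectionsHigh");
        if (s.replicationLagSeconds() > 1.0) breached.add("ReplicationLag");
        return breached;
    }
}
```

For example, 900 hits and 100 misses give a hit rate of 0.9, which stays above the 80% line; 700 hits and 300 misses give 0.7 and would be flagged.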
Hands-on Tasks
Interview Q&A
When would you add Redis in front of RDS rather than adding a Read Replica alone?
Read Replicas scale read throughput, but each read still hits PostgreSQL: they help with concurrent read volume but not latency. Every query to a Read Replica still involves parsing SQL, planning, disk I/O, and network round-trips, producing typical latencies of 5-50ms. Redis serves cached responses from memory, typically in under 1ms.
Use Redis when: the same data is read repeatedly with the same parameters (product catalogue, user profiles, inventory counts for popular items), read latency matters more than perfect consistency, or the database is CPU-bound on read queries that are expensive to compute repeatedly.
Read Replicas are the better choice when you need fresh data on every read (financial balances, live inventory at checkout), need SQL flexibility (ad-hoc queries, joins, aggregations), or your data access patterns are highly unpredictable and cache keys are difficult to define. In practice, most production e-commerce systems use both: Redis in front of RDS for the 80% of reads that are repetitive, and Read Replicas to handle the remaining query load that cannot be cached.
How do you handle cache invalidation, the hardest problem in distributed systems?
Three strategies, each with different consistency/complexity trade-offs:
1. TTL-based expiry: accept stale data for the duration of the TTL window and let entries expire naturally. This is simple, requires no coupling between the write path and Redis, and works well for data where brief staleness is acceptable: product prices, user preferences, category lists. The downside is that stale data can be served for up to the TTL duration after a write.
2. Event-driven eviction (@CacheEvict): on every database write, immediately evict the corresponding cache key. The next read will miss the cache and repopulate it with fresh data. This keeps the cache consistent as long as the eviction succeeds, but adds write latency (an extra Redis call on every write) and couples the write path to Redis availability: if Redis is down during a write, the eviction may fail.
3. Write-through: on every write, update both the database and the cache in the same code path. The cache stays consistent with the database as long as both writes succeed (they are two separate operations, not one atomic transaction). Writes are slower and the cache may fill with data that is never re-read.
In practice for inventory-style workloads: use TTL (60 seconds) as the baseline for most inventory reads, combined with event-driven eviction on explicit updates, so a product update is reflected immediately rather than waiting 60 seconds. This hybrid approach is the standard production pattern.
What happens if ElastiCache goes down? How does your Spring Boot app behave?
By default, with spring-boot-starter-data-redis and @Cacheable, a Redis connection failure throws a RedisConnectionFailureException (Spring's wrapper around Lettuce's RedisConnectionException), which propagates up through your service method and returns a 500 error to the client. This means a Redis outage takes down application reads entirely, even though the database is healthy. This is the wrong behaviour: Redis is a performance optimisation, not a critical dependency.
The correct fix is a custom CacheErrorHandler:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cache.Cache;
import org.springframework.cache.annotation.CachingConfigurer;
import org.springframework.cache.interceptor.CacheErrorHandler;
import org.springframework.cache.interceptor.SimpleCacheErrorHandler;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CacheConfig implements CachingConfigurer {
    private static final Logger log = LoggerFactory.getLogger(CacheConfig.class);

    @Override
    public CacheErrorHandler errorHandler() {
        return new SimpleCacheErrorHandler() {
            @Override
            public void handleCacheGetError(RuntimeException e, Cache cache, Object key) {
                log.warn("Redis GET failed for key {}, falling through to DB", key, e);
                // do not rethrow: Spring treats the failed lookup as a cache miss
                // and calls the underlying method
            }
            // override handleCachePutError, handleCacheEvictError similarly
        };
    }
}
With this handler, any Redis exception on a GET is logged and suppressed, and Spring calls the actual service method (which queries the DB) instead. The application degrades gracefully: slower but correct.
Also configure connection and command timeouts (spring.data.redis.connect-timeout and spring.data.redis.timeout, for example spring.data.redis.timeout=500ms) and pool exhaustion settings so a slow Redis does not block application threads indefinitely.
12
Prometheus + Grafana
Topics
Prometheus Fundamentals
Prometheus Pull Model and Metrics Types
What Prometheus scrapes HTTP endpoints (your app's /actuator/prometheus) on a configurable interval. Unlike CloudWatch which you push to, Prometheus pulls, meaning your app doesn't need AWS credentials or SDK to emit metrics. Four metric types: Counter (monotonically increasing: requests served), Gauge (current value: active connections, heap used), Histogram (distribution with buckets: request latency), Summary (quantiles, less flexible than histogram).
Why The pull model decouples your application from the metrics infrastructure: your app exposes a passive endpoint and Prometheus does the work of collecting. Your app has no outbound network dependency for metrics, no AWS credentials needed, and if Prometheus goes down your application keeps running unaffected.
Gotcha
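- Counters only go up; never reset them yourself. They drop back to zero only when the process restarts, and rate() in PromQL accounts for that, so use rate() whenever you need a per-second rate. Never use a Gauge for something that only increases.
- Histograms require pre-defined buckets. If your p99 latency exceeds your highest finite bucket, histogram_quantile() cannot see past it and returns that bucket's upper bound, so the reported p99 is an underestimate. Configure buckets that cover your expected latency range.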
kube-prometheus-stack Helm Chart
What The community standard for Prometheus on Kubernetes. One helm install deploys: Prometheus Operator, Prometheus (metrics store), Grafana (dashboards), AlertManager (routing), node-exporter (node-level metrics), and kube-state-metrics (Kubernetes object metrics: pod count, deployment replicas, etc.).
Why Running Prometheus manually on Kubernetes requires managing StatefulSets, RBAC, scrape configuration, and upgrade paths, a significant operational burden. kube-prometheus-stack bundles all best-practice configuration. The Prometheus Operator introduces custom resources (ServiceMonitor, PrometheusRule) that let you manage scraping and alerting as Kubernetes objects.
Gotcha
- The chart creates ~50 default Kubernetes alert rules. On first install some may fire immediately. The "Watchdog" alert, for example, always fires by design and proves alerting is working end-to-end.
- Review and tune the default rules for your cluster size. Some rules (e.g., "KubeMemoryOvercommit") require adjustment for small dev clusters.
Spring Boot Integration
Spring Boot → Prometheus: Micrometer Registry
What Add the micrometer-registry-prometheus dependency and expose /actuator/prometheus (add prometheus to management.endpoints.web.exposure.include). Spring Boot auto-configures JVM metrics (heap, GC pause duration, threads), Tomcat/Netty metrics, HikariCP pool metrics, and Spring MVC request latency histograms. Use @Timed("my.service.method") or inject MeterRegistry directly for custom metrics.
Why Unlike micrometer-registry-cloudwatch2, which pushes metrics (requiring AWS credentials and network access), the Prometheus registry formats metrics as Prometheus text at the /actuator/prometheus endpoint. No credentials, no outbound connections, no cost per metric push.
pom.xml
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
application.properties
management.endpoints.web.exposure.include=health,info,metrics,prometheus
Gotcha
- Histogram buckets are opt-in. By default Spring exports only _count and _sum for http.server.requests: no _bucket series exist, and histogram_quantile() silently returns nothing. Enable them with management.metrics.distribution.percentiles-histogram.http.server.requests=true, or define explicit SLO buckets: management.metrics.distribution.slo.http.server.requests=50ms,100ms,200ms,500ms,1s,2s,5s. If your p99 exceeds the highest bucket, the quantile clamps to that bucket's value.
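To see why bucket boundaries matter, here is a simplified sketch of the estimate histogram_quantile() makes from cumulative bucket counts: find the bucket the target rank falls into, interpolate linearly inside it, and fall back to the highest finite bound when the rank lands in the +Inf bucket. The class and method names are hypothetical, and the sketch assumes 0 < q < 1, at least two buckets, and a first bucket that starts at 0; it is an illustration of the idea, not Prometheus's source code.

```java
/** Simplified model of how a quantile is estimated from cumulative histogram buckets. */
final class HistogramQuantile {
    /**
     * @param q                quantile, 0 < q < 1
     * @param upperBounds      bucket upper bounds (the "le" label), ascending, last one +Infinity
     * @param cumulativeCounts observations with value <= the matching upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) return Double.NaN; // no observations in the window

        double rank = q * total;
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) b++;

        // Rank falls into the +Inf bucket: nothing is known beyond the highest
        // finite bound, so the estimate clamps to it.
        if (b == n - 1) return upperBounds[n - 2];

        double start = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        // Linear interpolation: assume observations are spread evenly inside the bucket.
        return start + (upperBounds[b] - start) * ((rank - below) / inBucket);
    }

    public static void main(String[] args) {
        double[] le = {0.05, 0.1, 0.2, 0.5, 1.0, Double.POSITIVE_INFINITY};
        // 100 requests, all faster than 1s: p99 is interpolated to roughly 0.5s.
        System.out.println(quantile(0.99, le, new double[] {50, 80, 95, 99, 100, 100}));
        // Half the requests slower than 1s: p99 clamps to the highest finite bound, 1.0.
        System.out.println(quantile(0.99, le, new double[] {10, 20, 30, 40, 50, 100}));
    }
}
```

The second call is the failure mode from the gotcha: the real p99 could be 3 seconds or 30 seconds, but with 1s as the highest finite bucket the dashboard shows 1.0 either way.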
ServiceMonitor: Telling Prometheus to Scrape Your App
What The Prometheus Operator introduces the ServiceMonitor custom resource. Create one pointing at your app's Kubernetes Service on port 8080, path /actuator/prometheus. The Operator automatically updates Prometheus's scrape configuration, no manual Prometheus config file editing needed.
Why Without ServiceMonitor, adding a new service to Prometheus requires editing its config file and reloading. With ServiceMonitor, a developer deploys a YAML file alongside their app and Prometheus discovers it automatically: GitOps-friendly and self-service.
Example ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: default
  labels:
    release: monitoring # must match your Helm release name; see gotcha below
spec:
  selector:
    matchLabels:
      app: my-app # matches the labels on your Phase 8 Service
  endpoints:
    - port: http # the Service port NAME, not a number
      path: /actuator/prometheus
      interval: 30s
Gotcha
- The release label is required. kube-prometheus-stack configures Prometheus to select only ServiceMonitors labelled with the Helm release name (release: monitoring if you installed with helm install monitoring ...). A ServiceMonitor without it is silently ignored: no error, no target.
- The ServiceMonitor must be in a namespace Prometheus watches (serviceMonitorNamespaceSelector: {} watches all). Another silent failure mode: the ServiceMonitor exists but Prometheus never picks it up.
- The label selector on the ServiceMonitor must match the labels on your Kubernetes Service object, not the Pod labels. And port: refers to the Service port's name; if your Service defines the port without a name, add name: http to it.
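Because all of these failure modes are silent, it can help to treat them as an explicit checklist. The sketch below models the checks as a plain Java function. It is a simplified mental model and not the Operator's actual logic: the record types and names are hypothetical, and an empty watchedNamespaces set is an assumption standing in for serviceMonitorNamespaceSelector: {} (watch everything).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Checklist model of why a ServiceMonitor may never produce a scrape target. */
final class ServiceMonitorCheck {
    record ServiceMonitor(String namespace, Map<String, String> labels,
                          Map<String, String> selectorMatchLabels, String endpointPort) {}

    record Service(Map<String, String> labels, Set<String> portNames) {}

    /** matchLabels semantics: every selector entry must be present on the object with the same value. */
    static boolean selects(Map<String, String> selector, Map<String, String> labels) {
        return labels.entrySet().containsAll(selector.entrySet());
    }

    static List<String> problems(ServiceMonitor sm, Service svc,
                                 String helmRelease, Set<String> watchedNamespaces) {
        List<String> problems = new ArrayList<>();
        if (!helmRelease.equals(sm.labels().get("release"))) {
            problems.add("missing or wrong release label");
        }
        // Empty set models "watch all namespaces" (assumption of this sketch).
        if (!watchedNamespaces.isEmpty() && !watchedNamespaces.contains(sm.namespace())) {
            problems.add("namespace not watched");
        }
        if (!selects(sm.selectorMatchLabels(), svc.labels())) {
            problems.add("selector does not match Service labels");
        }
        if (!svc.portNames().contains(sm.endpointPort())) {
            problems.add("Service has no port named " + sm.endpointPort());
        }
        return problems;
    }
}
```

An empty result means none of the misconfigurations listed above applies; anything else names the item to fix before looking further down the pipeline.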
PromQL and Grafana
PromQL Basics
What PromQL is Prometheus's query language. Essential queries for a Spring Boot app:
- Request rate: sum(rate(http_server_requests_seconds_count{job="my-app"}[5m]))
- Error rate: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) (aggregate both sides with sum(); dividing the raw series matches label-for-label, so each 5xx series would only be divided by itself)
- p99 latency: histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))
- JVM heap used: jvm_memory_used_bytes{area="heap"}
- HikariCP pool usage: hikaricp_connections_active / hikaricp_connections_max
Why PromQL's power is in its label filtering and range vector functions. Slicing by URI, status code, or instance and computing rates over rolling windows takes one line of PromQL; answering the same questions in CloudWatch usually requires complex metric math expressions.
Gotcha
- rate() requires a counter, not a gauge. Using rate() on a gauge gives nonsense. Use deriv() or delta() for gauges instead.
- histogram_quantile() requires the _bucket metric, not _count or _sum. Always use rate() on the bucket series inside the function.
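The reason rate() only makes sense on counters is visible in a simplified model of what it computes: the total increase across the samples in the window, with any drop in value treated as a counter reset, divided by the elapsed time. The sketch below uses hypothetical names and deliberately leaves out the extrapolation to the window boundaries that the real function also performs.

```java
import java.util.List;

/** Simplified model of a per-second rate over counter samples, with reset handling. */
final class CounterRate {
    record Sample(double timestampSeconds, double value) {}

    static double ratePerSecond(List<Sample> window) {
        if (window.size() < 2) return Double.NaN; // need at least two samples

        double increase = 0;
        for (int i = 1; i < window.size(); i++) {
            double previous = window.get(i - 1).value();
            double current = window.get(i).value();
            // A counter never decreases, so a drop means the process restarted
            // and the counter began again from zero.
            increase += (current >= previous) ? current - previous : current;
        }
        double elapsed = window.get(window.size() - 1).timestampSeconds()
                - window.get(0).timestampSeconds();
        return increase / elapsed;
    }

    public static void main(String[] args) {
        // 120 requests over 60 seconds.
        System.out.println(ratePerSecond(List.of(
                new Sample(0, 100), new Sample(30, 160), new Sample(60, 220))));
        // Same traffic pattern with a restart between t=30 and t=60.
        System.out.println(ratePerSecond(List.of(
                new Sample(0, 100), new Sample(30, 160), new Sample(60, 20), new Sample(90, 80))));
    }
}
```

Feed the same function a gauge such as heap usage and every normal decrease is misread as a restart, which is the "nonsense" the gotcha warns about.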
Grafana Dashboards
What Grafana connects to Prometheus as a data source and renders PromQL results as panels (time-series graphs, gauges, stat tiles, tables). Import community dashboard ID 4701 (JVM Micrometer) for instant Spring Boot visibility. Key panels to build: request rate + error rate (two lines on one graph), p50/p95/p99 latency, active JVM threads, HikariCP pool saturation, pod count via kube_deployment_spec_replicas.
Why Grafana's panel library and community dashboards (especially the JVM Micrometer dashboard) are more mature and flexible than CloudWatch dashboards. Grafana supports templating (dropdown to select service/namespace), annotations (mark deployments on graphs), and alerting from within dashboards.
Gotcha
- Grafana stores dashboards in a SQLite database on an emptyDir volume by default, so all custom dashboards are lost when the pod restarts. Fix this with one of: (a) configure a PersistentVolumeClaim for Grafana storage, or (b) provision dashboards as Kubernetes ConfigMaps with grafana.sidecar.dashboards.enabled: true in the Helm values. Option (b) is GitOps-friendly: dashboards live in Git as JSON files and survive pod restarts automatically.
AlertManager: Routing Alerts
What AlertManager receives firing alerts from Prometheus rules and routes them to receivers (Slack, PagerDuty, email). Configure in values.yaml when installing kube-prometheus-stack. Define routes by label: severity: critical → PagerDuty, severity: warning → Slack. Alerting rules are defined as PrometheusRule custom resources.
Why Prometheus evaluates alerting rules and fires alerts when conditions are met, but it does not send notifications directly: AlertManager handles deduplication, grouping, silencing, and routing to the right receiver. This separation keeps Prometheus focused on metrics and lets you change notification channels without touching alert rules.
Gotcha
- AlertManager supports inhibition: a critical alert can suppress related warning alerts for the same component. Configure inhibition rules before going to production, or one outage may page you 40 times for the same root cause.
- Test your alerting pipeline end-to-end with the "Watchdog" alert: it fires continuously and proves your full chain (Prometheus → AlertManager → receiver) works.
Prometheus vs CloudWatch: When to Use Each
What CloudWatch: AWS-native, zero in-cluster infrastructure, best for EC2/Lambda/RDS/ECS infrastructure metrics, ALB access logs, VPC Flow Logs, anything AWS manages. Prometheus: best for application-level metrics on Kubernetes, PromQL is far more powerful for complex queries, Grafana is more flexible than CloudWatch dashboards.
Why Production answer: run both. CloudWatch for infra (node CPU, EKS control plane metrics, RDS), Prometheus+Grafana for app metrics (latency, error rate, pool saturation). Each tool is optimal in its domain.
Gotcha
- Prometheus requires cluster resources and operational care: StatefulSet storage, backup, capacity planning. CloudWatch is serverless. Factor operational overhead into the decision.
- You can send metrics to both by adding both micrometer-registry-prometheus and micrometer-registry-cloudwatch2; Micrometer fans out to all configured registries simultaneously.
Hands-on Tasks
Interview Q&A
What is Prometheus's pull model and why does it matter for security?
Prometheus polls your app's metrics endpoint on a schedule; your app doesn't push anything anywhere. This means your app has no outbound network dependency for metrics (it doesn't need AWS credentials, an SDK, or internet access), and the metrics endpoint is inside the cluster, unreachable from outside by default. It also means the blast radius if Prometheus goes down is zero: your app keeps running, just without metrics collection.
The tradeoff: your app must be discoverable by Prometheus (via ServiceMonitor) and must be running to be scraped, short-lived batch jobs should use Pushgateway instead, which accepts pushed metrics and holds them until Prometheus scrapes.
When would you choose Prometheus + Grafana over CloudWatch for a Spring Boot service on EKS?
CloudWatch is excellent for AWS-managed resources (RDS CPU, ALB request count, node-level metrics from Container Insights) but PromQL is far more expressive for application-level analysis. histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{uri="/checkout"}[5m]))), the p99 latency for a specific endpoint over a rolling 5-minute window, is a one-liner in PromQL and requires a complex metric math expression in CloudWatch.
Grafana's panel library and community dashboards (like the JVM Micrometer dashboard ID 4701) are also more mature than CloudWatch dashboards. Operationally: CloudWatch has zero in-cluster infrastructure cost; Prometheus requires cluster resources and operational care. Production answer: run both, CloudWatch for infra (node CPU, EKS control plane metrics), Prometheus+Grafana for app metrics (latency, error rate, pool saturation).
A new engineer joins and can't find their service's metrics in Grafana. Walk through your debugging process.
Systematic five-step check: (1) Verify the pod is running and /actuator/prometheus returns data: kubectl exec -it <pod> -- curl localhost:8080/actuator/prometheus | head -5. If this fails, the app is the problem (missing dependency, endpoint not exposed). (2) Verify a ServiceMonitor exists and its label selector matches the labels on the Kubernetes Service (not the Pod). (3) Check the Prometheus Targets page (/targets): is the target listed? If listed but DOWN, the error message tells you why: wrong port, wrong path, or a network policy blocking the scrape. (4) If the target is not listed at all, check whether the ServiceMonitor's namespace is watched by Prometheus; the Prometheus resource needs serviceMonitorNamespaceSelector: {} to watch all namespaces. (5) Check RBAC: the Prometheus service account needs permission (typically via a ClusterRole) to list and watch Services, Endpoints, and Pods in that namespace, because that is how it discovers the targets a ServiceMonitor selects. Most failures are one of: wrong label selector, wrong namespace configuration, or missing RBAC.
13
ArgoCD / GitOps
Architect this phase
GitHub (manifests repo) → ArgoCD (watches + diffs) → EKS (reconciles) | GitHub Actions (builds image) → ECR → ArgoCD Image Updater → commits new tag to manifests repo → ArgoCD syncs
Draw it: draw.io · AWS Icons
This phase requires a running EKS cluster with your app deployed. If you deleted it after Phase 7 or 11, the recreate command is in Phase 11's first task.
Topics
GitOps Fundamentals
GitOps Principles
What Git is the single source of truth for desired cluster state. Every change to the cluster happens through a Git commit: no manual kubectl apply, no --force in CI. The cluster continuously reconciles itself to match what's in Git. Benefits: full audit trail (git log is your change history), instant rollback (revert a commit), disaster recovery (recreate the entire cluster from a repo), and peer review via PR before production changes.
Why Imperative deployments (kubectl apply from a CI step, aws ecs update-service) leave no durable record of what the cluster should look like right now. If a node is replaced or a namespace is accidentally deleted, there is no authoritative source to restore from. GitOps solves this: the Git repo is always the answer to "what should be running?"
Gotcha
- GitOps does not mean putting secrets in Git. Use External Secrets Operator or Sealed Secrets to store encrypted secret references in Git while the actual values live in AWS Secrets Manager. Committing a plaintext Secret manifest is a critical security failure.
- GitOps replaces cluster state management, not the entire pipeline. GitHub Actions still builds your image and pushes to ECR, GitOps takes over at the point of deploying to the cluster.
ArgoCD Core Concepts
ArgoCD Architecture
What ArgoCD runs inside your EKS cluster (in the argocd namespace). Core components: API Server (web UI + CLI + API), Repository Server (clones and renders manifests from Git), Application Controller (continuously compares desired state from Git with live cluster state, detects drift, reconciles). An Application resource defines: source repo + path + target cluster + namespace. ArgoCD can manage multiple clusters from one control plane.
Why Instead of your CI pipeline having kubectl apply at the end (imperative push), ArgoCD pulls from Git continuously (declarative pull). The cluster always self-heals to match the repo. This eliminates configuration drift: no more "what is running in the cluster vs what was deployed last Tuesday?"
Gotcha
- ArgoCD itself is a workload in your cluster, if the cluster is down, ArgoCD is down. This is expected: Kubernetes keeps running whatever was last applied regardless of ArgoCD's state. Don't confuse an ArgoCD outage with an inability to serve traffic.
- ArgoCD stores application state in its own CRDs inside the cluster. Back up your ArgoCD Application resources or manage them via Git (App of Apps pattern) so they survive cluster recreation.
Application Resource and Sync Policies
What The Application CRD is what you create to register a workload with ArgoCD. Key fields: repoURL, targetRevision (branch/tag/commit), path (folder in repo containing manifests), destination.server, destination.namespace. Sync policies: Manual (ArgoCD detects drift and shows it in the UI, but you press Sync to apply), Automatic (ArgoCD applies any detected diff immediately; add selfHeal: true to also revert manual kubectl changes).
Why The Application resource is the bridge between a Git repository and a live cluster state. Without it, ArgoCD has nothing to watch. Sync policy is the key operational choice: Automatic sync suits dev and staging; Manual sync with PR review is the standard for production changes.
Minimal Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      selfHeal: true
      prune: true # delete resources removed from Git
Gotcha
- Start with Manual sync for production, Automatic for dev. With Automatic + selfHeal, a misguided kubectl edit will be silently reverted, which is correct GitOps behaviour but surprises engineers used to imperative workflows.
- prune: true tells ArgoCD to delete cluster resources that no longer exist in Git. Without it, removed manifest files leave orphaned resources running in the cluster indefinitely.
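For contrast with the auto-synced manifest, a production Application under Manual sync might look like the sketch below. The name `my-app-prod`, the `apps/my-app/prod` path, and the `prod` namespace are illustrative assumptions. Manual sync is expressed by leaving out the `automated` block rather than by setting a flag.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-prod            # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: apps/my-app/prod     # assumed per-environment directory
  destination:
    server: https://kubernetes.default.svc
    namespace: prod            # assumed namespace
  # No syncPolicy.automated block: ArgoCD reports OutOfSync when Git changes
  # and waits for an engineer to press Sync (or run `argocd app sync`).
```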
Health Checks and Sync Status
What ArgoCD tracks two independent statuses per Application: Sync Status (Synced / OutOfSync, does Git match the cluster?) and Health Status (Healthy / Degraded / Progressing, are the resources working?). ArgoCD has built-in health checks for Deployments (checks rollout progress and available replicas), Services, Ingresses, etc. A Deployment is Healthy only when its desired replicas are all Ready.
Why These two axes give a precise picture of the cluster. Synced + Healthy is the target state. OutOfSync + Healthy means a Git change is pending. Synced + Degraded means the cluster matches Git but something is failing (e.g., a pod is crash-looping). OutOfSync + Degraded means both a pending change and a runtime failure, investigate Degraded first.
Gotcha
- OutOfSync does not mean broken, it means Git and the cluster differ. OutOfSync + Degraded together means something is both different AND not working. The combination matters more than either status alone.
- ArgoCD's health status for a Deployment is Progressing (not Healthy) during a rolling update. Alert on Degraded, not on Progressing, otherwise every deploy triggers an alert.
Automation and Promotion
ArgoCD Image Updater
What An ArgoCD addon that polls ECR (or Docker Hub) for new image tags and automatically commits the updated tag to the manifests repo. This closes the GitOps loop: your GitHub Actions CI builds and pushes a new image tag to ECR → Image Updater detects it → commits the new tag to Git → ArgoCD sees the Git change → syncs to the cluster. No human intervention, no kubectl set image in your pipeline.
Why Without Image Updater, someone must manually update the image tag in the manifests repo every time a new image is pushed, breaking the automation loop. Image Updater automates the "update manifests repo" step so the full cycle (code push → live deployment) is hands-off for dev and staging.
Gotcha
- Image Updater needs IAM permissions to read from ECR, use IRSA so no credentials are embedded in the cluster. It also needs write access to your Git repo (via SSH key or GitHub App token stored in a Kubernetes Secret).
- Know what you are adopting: Image Updater is an argoproj-labs project whose own README advises against critical production workloads. It is well suited to this lab and to dev/staging automation; production teams often have CI commit the manifest bump instead, which achieves the same loop with tooling they already trust.
- Configure the update strategy carefully: digest (always pull the latest digest of a tag, useful for latest tag workflows) vs semver (update to the latest version matching a semver pattern, e.g., ~1.2 updates to any 1.2.x). For production, semver with explicit constraints is safer than tracking latest.
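To make the strategy choice concrete, here is a sketch of how the semver strategy could be declared on an Application. It assumes the annotation-based configuration used by Image Updater's 0.x releases (newer releases may configure this differently, so check the docs for the version you install). The alias `app`, the account ID, the region, and the repository name are placeholders.

```yaml
# Fragment of an ArgoCD Application: only the metadata that Image Updater reads.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
  annotations:
    # <alias>=<image>:<constraint>; "~1.2" allows any 1.2.x release
    argocd-image-updater.argoproj.io/image-list: app=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:~1.2
    # Pick the highest version that satisfies the constraint above
    argocd-image-updater.argoproj.io/app.update-strategy: semver
    # Commit the new tag to the manifests repo instead of patching the live Application
    argocd-image-updater.argoproj.io/write-back-method: git
```

With `write-back-method: git`, the change shows up as a commit, so the audit trail and rollback story described in this phase still hold.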
Environment Promotion (dev → staging → prod)
What Two common patterns: (1) Directory-based: apps/my-app/dev/, apps/my-app/staging/, apps/my-app/prod/, each tracked by a separate ArgoCD Application resource pointing at the same repo but different paths. Promote by updating files and committing to the target path. (2) Kustomize overlays: a base/ directory with common manifests and per-environment overlays/dev/, overlays/prod/ that patch image tags and replica counts. ArgoCD renders Kustomize natively: point the Application's path at an overlay directory containing a kustomization.yaml and it is detected automatically.
Why A single ArgoCD Application tracks one path in one repository. Without a promotion strategy, deploying to multiple environments requires separate tooling outside GitOps. Directory-based and Kustomize overlay patterns keep promotion as a Git commit: no separate pipelines, no environment-specific CI steps.
Gotcha
- Never auto-sync production. Always require a manual PR approval → merge → manual ArgoCD sync for prod. Dev and staging can be auto-synced. This is the most important governance rule in a GitOps workflow.
- Promotion is not ArgoCD's job, it is a Git operation. Promote by opening a PR that updates the prod path. ArgoCD detects the merged commit and either auto-syncs (staging) or waits for a manual sync (prod).
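A minimal sketch of what a production overlay could contain under the Kustomize pattern. The file path, image name, ECR registry, tag, and replica count are all hypothetical; the point is that promoting to prod becomes a one-line change to `newTag` in a reviewed PR.

```yaml
# overlays/prod/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared Deployment, Service, etc.
images:
  - name: my-app               # image name as written in the base Deployment
    newName: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app  # placeholder registry
    newTag: "1.2.3"            # promotion = a PR that bumps this value
replicas:
  - name: my-app               # Deployment name in the base
    count: 3                   # prod runs more replicas than dev
```

The dev overlay would be the same shape with a different tag and a lower replica count, each tracked by its own Application.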
Rollback in ArgoCD
What ArgoCD maintains a history of all synced revisions (configurable depth, default 10). Rolling back is a UI click (History → select revision → Rollback) or argocd app rollback my-app <revision-id>. ArgoCD deploys the exact manifests from that Git commit: a deterministic rollback to a known-good state.
Why When a bad deployment hits production, recovery speed matters. ArgoCD rollback is one CLI command referencing an entry in the Application's sync history (each entry maps to a Git commit), faster than re-triggering a CI pipeline and safer than manually locating and re-applying an older manifest.
Gotcha
- Rollback in ArgoCD does not revert the Git repo, it deploys an older revision while the repo still has the newer commit. This creates a deliberate sync divergence (the cluster is behind Git). After rollback, fix the root cause, commit a fix forward, and sync, do not leave the cluster in a rolled-back state indefinitely.
- The correct production process: rollback ArgoCD to restore service → open a hotfix PR → merge → sync forward. Never leave the cluster and Git permanently diverged.
Secrets in GitOps
Managing Kubernetes Secrets Without Committing Them to Git
What Three production-grade options: (1) External Secrets Operator (ESO): commit an ExternalSecret manifest (contains only the AWS Secrets Manager ARN reference, not the value). ESO's controller reads from Secrets Manager via IRSA and creates a real Kubernetes Secret inside the cluster. The actual secret value never touches Git. (2) Sealed Secrets: encrypt the secret with a cluster-specific public key using kubeseal, commit the encrypted SealedSecret manifest. Only the cluster's controller can decrypt it. (3) ArgoCD Vault Plugin: substitutes secret values from HashiCorp Vault or AWS Secrets Manager at sync time.
Why ESO + IRSA is the recommended pattern for AWS-based EKS deployments. It integrates naturally with IAM, Secrets Manager rotation (ESO re-reads Secrets Manager on each ExternalSecret's refresh interval, so rotated values propagate to the Kubernetes Secret), and the existing IRSA setup from Phase 8.
Gotcha
- If you accidentally commit a secret to Git, even briefly, rotate the secret immediately. Git history is permanent; the value is exposed even after deletion from the working tree.
- ESO creates a standard Kubernetes Secret from the external source. Configure ArgoCD to ignore ESO-managed secrets in its diff to avoid spurious OutOfSync status, ArgoCD did not create them, so they will always appear as drifted without this exclusion.
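A sketch of the kind of ExternalSecret manifest that is safe to commit. It assumes a ClusterSecretStore named `aws-secrets-manager` already exists and authenticates via IRSA; the secret names and the Secrets Manager key are hypothetical. The `apiVersion` depends on the ESO release you install (older releases serve `external-secrets.io/v1beta1`).

```yaml
apiVersion: external-secrets.io/v1      # assumption: use v1beta1 on older ESO releases
kind: ExternalSecret
metadata:
  name: order-db-credentials            # hypothetical name
  namespace: default
spec:
  refreshInterval: 1h                   # how often ESO re-reads Secrets Manager
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager           # assumed to exist, backed by IRSA
  target:
    name: order-db-credentials          # the Kubernetes Secret that ESO creates
  data:
    - secretKey: password               # key inside the Kubernetes Secret
      remoteRef:
        key: capstone/order-service/db  # Secrets Manager secret name (placeholder)
        property: password              # JSON field inside that secret
```

Nothing in this file is sensitive: it names where the value lives, and the value itself only ever exists in Secrets Manager and in the cluster.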
Hands-on Tasks
Interview Q&A
What is GitOps and why is Git the source of truth rather than the cluster state?
GitOps is an operational framework where the desired state of your infrastructure is declared in Git, and an automated agent continuously reconciles the actual system state to match it.
Git is the source of truth because: it provides a complete, immutable audit trail of every change (who, what, when, why via commit messages and PR reviews); it enables rollback by reverting commits; it enables disaster recovery by recreating the cluster from the repo; and it enforces peer review (PRs) before production changes.
The alternative, treating the cluster as the source of truth, means infrastructure state is tribal knowledge, rollback requires remembering what you changed, and disaster recovery is manual reconstruction. The cluster's actual state is derived from Git, never the other way around.
Your CI pipeline builds a new image. Walk through the full GitOps flow to get it deployed.
(1) GitHub Actions builds the JAR, builds the Docker image, pushes to ECR with a tag (e.g., the git SHA). (2) ArgoCD Image Updater polls ECR, detects the new tag, and commits an updated image.tag value to the manifests repository. (3) ArgoCD's Application Controller polls the manifests repo (every 3 minutes by default, or immediately via a webhook) and detects the new commit. (4) ArgoCD marks the Application as OutOfSync. (5) If auto-sync is enabled (or an engineer presses Sync), ArgoCD applies the updated Deployment manifest to the cluster. (6) Kubernetes performs a rolling update: new pods start (readiness probe gates traffic), old pods terminate gracefully.
No human ran kubectl; the entire flow was driven by a Git commit.
How do you handle Kubernetes Secrets in a GitOps workflow when you can't commit plaintext secrets to Git?
Three options in production:
(1) External Secrets Operator (ESO): store an ExternalSecret manifest in Git (it contains only the reference to an AWS Secrets Manager secret ARN, not the value). ESO's controller runs in the cluster, reads from Secrets Manager via IRSA, and creates a Kubernetes Secret. The actual secret value never touches Git.
(2) Sealed Secrets: encrypt the secret with a cluster-specific public key (using kubeseal), commit the encrypted SealedSecret manifest. Only the cluster's controller can decrypt it.
(3) ArgoCD Vault Plugin: templating plugin that substitutes secret values from HashiCorp Vault or AWS Secrets Manager at sync time.
ESO + IRSA is the recommended pattern for AWS-based EKS deployments: it integrates naturally with IAM, Secrets Manager rotation, and the existing IRSA setup from Phase 8.
★
Capstone Project
Full target architecture
Internet → ALB → EKS (Order, Inventory, Document services)
Order Service → RDS PostgreSQL (private subnet) + SNS → SQS → Inventory Service
Inventory Service → ElastiCache Redis (cache-aside) + RDS + DynamoDB (idempotency)
Document Service → S3 (presigned URL pattern) · All services → ECR (images)
All services → CloudWatch (logs + custom metrics) + AWS X-Ray (distributed traces)
GitHub → GitHub Actions → ECR → EKS rolling deploy (CI/CD)
Infrastructure provisioned via Terraform (VPC, RDS, ElastiCache, SQS/SNS, EKS)
This is the architecture to draw on paper first. Use draw.io with AWS Architecture Icons. Drawing it forces you to think through the connectivity, security groups, and IAM roles before building.
Cost-conscious alternative: ECS Fargate
The target architecture uses EKS, which costs $0.10/hr (~$72/month) for the control plane before any EC2 worker nodes. The full stack (control plane + 2 worker nodes + NAT Gateway + RDS + ElastiCache + ALB) runs roughly $0.25-0.30/hr, about $200/month if left up around the clock. Build it in a compressed burst and tear it down at the end: 30 focused hours of running stack costs under $10 of your credits. If EKS is a concern, or if Phase 7 felt overwhelming, this entire capstone works equally well with ECS Fargate:
Replace EKS Deployments → ECS Task Definitions + ECS Services. Replace kubectl apply in CI/CD → aws ecs update-service --force-new-deployment. The architecture (VPC, private subnets, ALB, RDS, S3, CloudWatch, IAM roles) and CI/CD concepts are identical. EKS adds Kubernetes portability; ECS Fargate is simpler, cheaper, and AWS-native. The AWS skills you build are the same either way.
What you're building
Order Service
What Spring Boot service handling order creation and lifecycle. Stores orders in RDS PostgreSQL. On order creation, publishes an OrderCreated event to an SNS topic, which fans out to SQS queues for downstream consumers. Deployed on EKS with IRSA (no embedded AWS credentials).
Why Demonstrates the Transactional Outbox pattern: the service writes the order row and an outbox event in the same RDS transaction, and a @Scheduled poller then publishes unpublished events to SNS and marks them sent. This prevents message loss even if SNS or SQS is temporarily unavailable.
Inventory Service
What Consumes OrderCreated events from SQS, reserves inventory in RDS, and publishes InventoryReserved or InsufficientStock events. Caches frequently queried stock levels in ElastiCache Redis (cache-aside, 60-second TTL).
Why Demonstrates idempotent SQS consumption: uses the orderId as an idempotency key with a DynamoDB conditional write to prevent double-deduction on message redelivery.
Document Service
What Spring Boot service for file upload and download. Clients get a presigned PUT URL from this service, upload directly to S3, then retrieve content via presigned GET URLs. This service never handles file bytes.
Why Demonstrates the presigned URL pattern: offloads transfer cost and latency to S3 directly. Deployed on EKS with IRSA scoping S3 access to only this service's bucket prefix.
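On the Kubernetes side, IRSA for the Document Service comes down to an annotated ServiceAccount that the Deployment references. The sketch below uses placeholder values for the account ID, role name, and namespace; the S3 prefix scoping itself lives in the IAM policy attached to that role, which is not shown here.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: document-service            # hypothetical name
  namespace: default
  annotations:
    # IAM role whose policy allows S3 access only under this service's prefix (placeholder ARN)
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/document-service-s3
---
# Fragment of the Deployment: pods pick up the role by using the ServiceAccount.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-service
  namespace: default
spec:
  selector:
    matchLabels:
      app: document-service
  template:
    metadata:
      labels:
        app: document-service
    spec:
      serviceAccountName: document-service   # binds the pod to the IAM role above
      containers:
        - name: document-service
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/document-service:latest  # placeholder
```

The AWS SDK inside the pod then resolves credentials from the injected web identity token, so no access keys appear in the manifest or the image.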
Observability
What CloudWatch Container Insights for EKS pod metrics and logs, Micrometer custom metrics (order throughput, inventory cache hit rate, SQS processing latency), and AWS X-Ray distributed traces across all three services.
Why Ties together every observability layer from Phases 5 and 11: infrastructure metrics, application metrics, and cross-service trace correlation. Alarms on SQS queue depth, error rate, and p95 latency reflect what oncall actually pages on.
CI/CD Pipeline
What GitHub Actions workflow triggered on merge to main: build JAR → build Docker image → push to ECR → rolling EKS deploy with readiness/liveness probes guarding against bad deploys. Infrastructure provisioned via Terraform (VPC, RDS, ElastiCache, SQS/SNS, DynamoDB, S3, ECR, EKS).
Why Demonstrates OIDC-federated CI/CD: no long-lived AWS credentials stored in GitHub secrets. The GitOps variant from Phase 12 replaces the kubectl set image step with an ArgoCD Image Updater commit, the production standard.
Capstone Tasks
Interview Q&A, Architecture-level questions
How do you guarantee exactly-once order processing even if the Order Service crashes mid-transaction?
The Transactional Outbox pattern: when the Order Service creates an order, it writes both the order record AND an outbox event to the same local RDS transaction. A separate Outbox Poller reads unpublished events and publishes them to SNS, then marks them published. If the service crashes after committing the DB transaction but before publishing to SNS, the Outbox Poller will retry publishing on restart, the event is never lost.
On the consumer side (Inventory Service), idempotency prevents double-deduction: before processing, the service claims the orderId in a DynamoDB deduplication table with a conditional write (attribute_not_exists(orderId)) and a status of IN_PROGRESS, marks it COMPLETED after processing, and deletes the claim on failure so redelivery can retry. The naive version, write the key then discard on conflict, loses orders when a consumer crashes mid-processing: the key exists but the work never happened. Combined, these two patterns give you at-least-once delivery with effectively-once processing across an unreliable distributed system (true exactly-once delivery does not exist; idempotency is how you make that not matter).
Your X-Ray trace shows the order endpoint has p95 latency of 1.8 seconds. Walk me through how you investigate.
In the X-Ray trace timeline, identify which subsegment is the slowest. If it's the RDS subsegment: check CloudWatch RDS metrics for CPU, IOPS, and connection count, a saturated connection pool (Hikari at 100%) is the most common cause. Query Performance Insights for the slow query. If it's an SQS publish subsegment: check if the SNS/SQS call is synchronous and blocking, consider making it async.
If Redis cache hit rate (visible in CloudWatch custom metrics) dropped recently, that would cascade into more RDS queries. Compare the trace segment timestamps against the cache hit metric to correlate. Filter X-Ray traces for the slowest 5% to find if there's a specific code path or input size causing it. The resolution might be adding a read replica, tuning a query, increasing the Hikari pool size, or pre-warming the Redis cache on startup.
Walk me through the production architecture you built in this project.
The architecture starts at the edge with an Application Load Balancer that routes HTTP/HTTPS traffic into an EKS cluster running inside a VPC. The cluster has worker nodes across two Availability Zones in private subnets for resilience. Three Spring Boot microservices run as Kubernetes Deployments, Order Service (order creation + Transactional Outbox), Inventory Service (SQS consumer, Redis cache-aside, DynamoDB idempotency), and Document Service (S3 presigned URLs). All pods use IRSA, IAM roles bound to Kubernetes ServiceAccounts, so no AWS credentials are embedded anywhere.
The event-driven layer: Order Service publishes OrderCreated events to an SNS topic; the Inventory Service consumes from an SQS queue subscribed to that topic, deduplicates via DynamoDB, and caches stock levels in ElastiCache Redis. The Document Service stores files in S3 and returns presigned URLs to clients so file transfers bypass the application servers entirely.
The data layer: RDS PostgreSQL in a private subnet, reachable only from EKS pods via Security Group rules. ElastiCache Redis sits in private subnets for the Inventory Service; DynamoDB is a regional service with no subnet placement, so the Inventory Service reaches it over its AWS endpoint using IRSA credentials.
Observability: CloudWatch Container Insights collects pod metrics and logs; Micrometer custom metrics track order throughput, cache hit rate, and SQS lag; AWS X-Ray provides distributed traces across all three services. Infrastructure is provisioned via Terraform with remote state in S3. CI/CD runs in GitHub Actions with OIDC federation: merge to main triggers a build, ECR push, and rolling deployment to EKS automatically.
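The CI/CD part of that answer can be sketched as a GitHub Actions workflow. This is an illustration under assumptions, not the roadmap's reference pipeline: the role ARN, region, cluster name, repository name, and Maven build are placeholders. The `id-token: write` permission is what allows the job to exchange a GitHub OIDC token for short-lived AWS credentials.

```yaml
# .github/workflows/deploy.yml (hypothetical path and names throughout)
name: build-and-deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # allow the job to request a GitHub OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      AWS_REGION: us-east-1
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "21"
      - name: Build the JAR
        run: mvn -B package

      # Exchange the OIDC token for temporary credentials; no access keys in GitHub secrets.
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy  # placeholder
          aws-region: ${{ env.AWS_REGION }}

      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push the image, tagged with the commit SHA
        env:
          IMAGE: ${{ steps.ecr.outputs.registry }}/order-service:${{ github.sha }}
        run: |
          docker build -t "$IMAGE" .
          docker push "$IMAGE"

      - name: Rolling deploy to EKS
        env:
          IMAGE: ${{ steps.ecr.outputs.registry }}/order-service:${{ github.sha }}
        run: |
          aws eks update-kubeconfig --name capstone-cluster --region "$AWS_REGION"
          kubectl set image deployment/order-service order-service="$IMAGE"
          kubectl rollout status deployment/order-service --timeout=5m
```

In the GitOps variant, the last step disappears: the pipeline stops at the ECR push and the manifests repo drives the deploy.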
How would you handle a database migration with zero downtime?
Use an expand-contract pattern with Flyway (or Liquibase): first deploy a migration that adds the new column or table without removing anything (expand). The old application version ignores the new column; the new version uses it. After all instances are running the new version (confirmed via rolling deploy), deploy a second migration that removes the old column (contract).
For high-risk migrations: take a manual RDS snapshot immediately before the migration window. Enable RDS Multi-AZ if not already on. Have a tested rollback plan: Flyway's undo migrations (a paid-edition feature, not available in the Community edition) can reverse a change, and the RDS snapshot is the last resort. Run the migration during low-traffic hours. Monitor error rates in CloudWatch during and after. Never run a migration that locks a table for minutes in production without testing its impact on a clone of the production database first.
How would you scale this architecture to handle 10× traffic?
At the application layer: Kubernetes Horizontal Pod Autoscaler (HPA) scales pods based on CPU/memory, the EKS cluster already handles pod-level scaling. For node-level scaling, add Karpenter (AWS's node autoscaler) to provision new EC2 worker nodes automatically when pods are unschedulable.
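A minimal HorizontalPodAutoscaler sketch for the application-layer scaling described above. The Deployment name, replica bounds, and the 70% CPU target are illustrative assumptions, and the manifest presumes metrics-server is installed and the pods declare CPU requests (utilization is measured relative to requests).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service            # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2                 # keep one pod per AZ at minimum
  maxReplicas: 20                # assumed ceiling for 10x traffic
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # add pods when average CPU exceeds 70% of requests
```

Once the HPA asks for more pods than the current nodes can hold, the pending pods are what trigger the node autoscaler to add capacity.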
At the database layer: RDS read replicas for read-heavy traffic (direct reporting queries to replicas). For write-heavy load, consider Aurora PostgreSQL, it separates storage from compute and scales reads across up to 15 replicas with automatic failover. RDS Proxy handles connection pooling if Lambda or many pods are hammering the DB.
At the network edge: CloudFront CDN in front of the ALB caches API responses where possible (rare for a CRUD API but useful for read-heavy public endpoints). S3 files served via CloudFront edge locations rather than presigned S3 URLs directly, reduces latency globally.
For the Document Service specifically: large file uploads should use S3 multipart upload initiated directly from the client (presigned multipart upload URLs), bypassing the application entirely for file bytes.