AWS Certified Solutions Architect - Associate (SAA-C03) Comprehensive Study Guide
Complete Learning Path for Certification Success
Overview
This study guide provides a structured learning path from fundamentals to exam readiness for the AWS Certified Solutions Architect - Associate (SAA-C03) certification. Designed for complete novices, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.
About This Certification
Exam Code: SAA-C03 Exam Duration: 130 minutes Number of Questions: 65 (50 scored + 15 unscored) Passing Score: 720 out of 1000 Question Types: Multiple choice (one correct answer) and multiple response (two or more correct answers) Exam Format: Scenario-based questions testing real-world architecture decisions
Target Candidate: Individuals with at least 1 year of hands-on experience designing cloud solutions using AWS services, though this guide is designed to teach complete beginners from the ground up.
What This Guide Covers
This comprehensive study guide covers all four domains of the SAA-C03 exam:
Domain 1: Design Secure Architectures (30% of exam)
Secure access to AWS resources
Secure workloads and applications
Data security controls
Domain 2: Design Resilient Architectures (26% of exam)
Scalable and loosely coupled architectures
Highly available and fault-tolerant architectures
Domain 3: Design High-Performing Architectures (24% of exam)
High-performing storage solutions
Elastic compute solutions
High-performing database solutions
Scalable network architectures
Data ingestion and transformation solutions
Domain 4: Design Cost-Optimized Architectures (20% of exam)
Cost-optimized storage solutions
Cost-optimized compute solutions
Cost-optimized database solutions
Cost-optimized network architectures
Section Organization
Study Sections (read in order):
Overview (this section) - How to use the guide and study plan
Fundamentals - Section 0: Essential background and prerequisites
02_domain1_secure_architectures - Section 1: Security (30% of exam)
03_domain2_resilient_architectures - Section 2: Resilience (26% of exam)
04_domain3_high_performing_architectures - Section 3: Performance (24% of exam)
05_domain4_cost_optimized_architectures - Section 4: Cost Optimization (20% of exam)
ā Select appropriate AWS services for different scenarios
ā Explain architectural decisions using AWS best practices
ā Score 75%+ on practice tests consistently
ā Feel confident on exam day
Skills You'll Develop:
Architecture design and evaluation
Service selection and comparison
Security best practices implementation
Cost optimization strategies
Performance tuning techniques
Disaster recovery planning
Troubleshooting and problem-solving
Getting Help
If You're Stuck:
Review the relevant section in the chapter
Study the associated diagrams
Check 99_appendices for quick reference
Review practice question explanations
Revisit 01_fundamentals for foundational concepts
Additional Resources (After Completing This Guide):
AWS Documentation (official reference)
AWS Whitepapers (Well-Architected Framework)
AWS Training and Certification portal
AWS re:Invent videos (for deeper dives)
Ready to Begin?
Start with Fundamentals to build your foundation, then progress through each domain chapter. Remember: this is a marathon, not a sprint. Consistent daily study is more effective than cramming.
Your journey to AWS Solutions Architect - Associate certification starts now!
Last Updated: October 2025 Exam Version: SAA-C03 Study Guide Version: 1.0
Quick Start Guide
For Complete Beginners (6-10 weeks):
Week 1: Read 01_fundamentals + take notes
Week 2-3: Read 02_domain1_secure_architectures + practice Domain 1 questions
Week 4-5: Read 03_domain2_resilient_architectures + practice Domain 2 questions
Week 6: Read 04_domain3_high_performing_architectures + practice Domain 3 questions
Week 7: Read 05_domain4_cost_optimized_architectures + practice Domain 4 questions
Week 8: Read 06_integration + take full practice tests
Week 9: Review weak areas + retake practice tests (target: 80%+)
Files are numbered for sequential reading (00, 01, 02, etc.)
Each domain chapter is self-contained but builds on previous knowledge
Diagrams are in the diagrams/ folder, referenced in text
Quick reference cards at end of each chapter for rapid review
Reading Strategy:
Read chapters in order (01 ā 02 ā 03 ā 04 ā 05 ā 06)
Don't skip ahead - concepts build progressively
Use 99_appendices as quick reference during study
Return to 08_final_checklist in your last week
Review 07_study_strategies before taking practice tests
Visual Learning:
173 Mermaid diagrams throughout the guide
Each diagram has detailed text explanation
Diagrams show architecture, flows, decisions, and comparisons
Study diagrams carefully - they simplify complex concepts
Practice Integration:
Practice questions are organized by difficulty and domain
Start with beginner questions after reading each chapter
Progress to intermediate and advanced as confidence grows
Review explanations for ALL questions, not just incorrect ones
Legend
Throughout this guide, you'll see these markers:
ā Must Know: Critical for exam success - memorize these
š” Tip: Helpful insight or shortcut to remember concepts
ā ļø Warning: Common mistake to avoid - exam traps
š Connection: Related to other topics - cross-reference
š Practice: Hands-on exercise to reinforce learning
šÆ Exam Focus: Frequently tested concept - high priority
š Diagram: Visual representation available in diagrams folder
Final Words
This comprehensive study guide is designed to take you from complete novice to exam-ready in 6-10 weeks. The key to success is:
Consistency: Study 2-3 hours daily, every day
Understanding: Focus on WHY, not just WHAT
Practice: Take all practice tests and review thoroughly
Patience: Don't rush - mastery takes time
Confidence: Trust your preparation and stay calm
Remember: This guide is self-sufficient. You have everything you need to pass the SAA-C03 exam. Follow the study plan, complete all practice questions, and you'll be ready!
Good luck on your certification journey! š
Next Step: Begin with 01_fundamentals - Essential Background
Chapter 0: Essential Background and Prerequisites
Chapter Overview
What you'll learn:
AWS Global Infrastructure (Regions, Availability Zones, Edge Locations)
AWS Shared Responsibility Model
Core AWS concepts and terminology
AWS Well-Architected Framework fundamentals
Basic networking and cloud computing concepts
Time to complete: 8-10 hours Prerequisites: None - this chapter starts from the basics
Why this matters: Understanding these foundational concepts is critical for the SAA-C03 exam. Every question assumes you know how AWS infrastructure works, what AWS is responsible for versus what you're responsible for, and how to apply architectural best practices. Without this foundation, the domain-specific chapters won't make sense.
Section 1: What is Cloud Computing?
Introduction
The problem: Traditional IT infrastructure requires companies to buy, install, and maintain physical servers in their own data centers. This means:
Large upfront capital expenses (buying servers, networking equipment, cooling systems)
Long lead times (weeks or months to procure and set up new hardware)
Capacity planning challenges (over-provision and waste money, or under-provision and run out of capacity)
Difficulty scaling globally (need to build data centers in every region you serve)
The solution: Cloud computing provides on-demand access to computing resources (servers, storage, databases, networking) over the internet, with pay-as-you-go pricing. Instead of owning and maintaining physical infrastructure, you rent it from a cloud provider like AWS.
Why it's tested: The SAA-C03 exam assumes you understand the fundamental benefits of cloud computing and can design solutions that leverage these benefits. Questions often test whether you can identify when cloud-native solutions are more appropriate than traditional approaches.
Core Concepts
What is Cloud Computing?
What it is: Cloud computing is the on-demand delivery of IT resources over the internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services such as computing power, storage, and databases on an as-needed basis from a cloud provider like Amazon Web Services (AWS).
Why it exists: Before cloud computing, every company that needed IT infrastructure had to build and maintain their own data centers. This was expensive, time-consuming, and required specialized expertise. Cloud computing emerged to solve these problems by allowing companies to rent infrastructure instead of owning it, similar to how you rent an apartment instead of building a house.
Real-world analogy: Think of cloud computing like electricity from a power company. You don't build your own power plant - you plug into the grid and pay for what you use. Similarly, you don't build your own data center - you connect to AWS and pay for the computing resources you consume.
How it works (Detailed step-by-step):
You identify your need: Your application needs a server to run a web application. Instead of buying physical hardware, you decide to use AWS.
You provision resources via API/Console: You log into the AWS Management Console (a web interface) or use the AWS API (programmatic access) and request a virtual server (called an EC2 instance). You specify what type of server you need (CPU, memory, storage).
AWS allocates resources: Within minutes, AWS provisions a virtual server for you from their massive pool of physical servers in their data centers. This virtual server is isolated from other customers' servers using virtualization technology.
You use the resources: Your virtual server is now running and accessible over the internet. You can install your application, configure it, and start serving users. The server behaves just like a physical server you might have in your own data center.
You pay for what you use: AWS meters your usage (how many hours the server runs, how much data you transfer, how much storage you use) and charges you accordingly. If you stop using the server, you stop paying for it.
You scale as needed: If your application becomes popular and needs more servers, you can provision additional servers in minutes. If traffic decreases, you can terminate servers and stop paying for them. This elasticity is a key benefit of cloud computing.
The Six Advantages of Cloud Computing
ā Must Know: These six advantages appear frequently in exam questions. You need to recognize scenarios where each advantage applies.
Trade capital expense for variable expense
What it means: Instead of paying large upfront costs for data centers and servers (capital expense), you pay only for the computing resources you consume (variable expense).
Example: A startup doesn't need $100,000 to buy servers before launching. They can start with $10/month on AWS and scale up as they grow.
Exam relevance: Questions test whether you can identify cost optimization opportunities by moving from fixed to variable costs.
Benefit from massive economies of scale
What it means: AWS buys hardware and operates data centers at massive scale, achieving lower costs than individual companies could. These savings are passed to customers through lower prices.
Example: AWS can negotiate better prices with hardware vendors because they buy millions of servers. You benefit from these bulk discounts.
Exam relevance: Questions may ask why cloud solutions are often more cost-effective than on-premises solutions.
Stop guessing capacity
What it means: You don't need to predict how much infrastructure you'll need months in advance. You can scale up or down based on actual demand.
Example: A retail website doesn't need to buy enough servers to handle Black Friday traffic all year round. They can scale up for Black Friday and scale down afterward.
Exam relevance: Questions test your understanding of auto-scaling and elastic architectures.
Increase speed and agility
What it means: New IT resources are available in minutes instead of weeks. This allows faster experimentation and innovation.
Example: A developer can spin up a test environment in 5 minutes to try a new idea, instead of waiting weeks for IT to procure and configure hardware.
Exam relevance: Questions test whether you can design solutions that enable rapid deployment and iteration.
Stop spending money running and maintaining data centers
What it means: You can focus on your business and applications instead of managing physical infrastructure (racking servers, managing power and cooling, physical security).
Example: A healthcare company can focus on improving patient care instead of hiring data center technicians.
Exam relevance: Questions test whether you understand the operational benefits of managed services.
Go global in minutes
What it means: You can deploy your application in multiple geographic regions around the world with just a few clicks, providing lower latency to global users.
Example: A gaming company can deploy servers in North America, Europe, and Asia simultaneously to provide low-latency gameplay to players worldwide.
Exam relevance: Questions test your understanding of multi-region architectures and global deployment strategies.
š” Tip: When you see exam questions asking "Why should the company move to AWS?" or "What are the benefits of this cloud solution?", think about these six advantages. The correct answer often relates to one or more of them.
Section 2: AWS Global Infrastructure
Introduction
The problem: Applications need to be available to users around the world with low latency (fast response times). If all your servers are in one location, users far away will experience slow performance. Additionally, if that one location experiences a disaster (power outage, natural disaster, network failure), your entire application goes down.
The solution: AWS has built a global infrastructure with data centers distributed around the world. This allows you to deploy your application close to your users for low latency, and across multiple isolated locations for high availability and disaster recovery.
Why it's tested: Understanding AWS global infrastructure is fundamental to the SAA-C03 exam. Questions frequently test your ability to design architectures that leverage Regions, Availability Zones, and Edge Locations for resilience, performance, and compliance.
Core Concepts
AWS Regions
What it is: An AWS Region is a physical geographic area where AWS has multiple data centers. Each Region is completely independent and isolated from other Regions. As of 2025, AWS has 33+ Regions worldwide, with names like us-east-1 (N. Virginia), eu-west-1 (Ireland), and ap-southeast-1 (Singapore).
Why it exists: Regions exist to allow you to deploy applications close to your users (reducing latency), comply with data residency requirements (some countries require data to stay within their borders), and provide geographic redundancy (if one Region fails, your application can continue running in another Region).
Real-world analogy: Think of AWS Regions like different branches of a bank. Each branch operates independently - if the New York branch has a problem, the London branch continues operating normally. You choose which branch to use based on where you live (proximity) and local regulations.
How it works (Detailed step-by-step):
AWS builds data centers in a geographic area: AWS selects a location (like Northern Virginia) and builds multiple data centers in that area. These data centers are connected with high-speed, low-latency networking.
The Region is isolated: Each Region is completely independent. Resources in us-east-1 don't automatically replicate to eu-west-1. This isolation provides fault tolerance - a problem in one Region doesn't affect other Regions.
You choose a Region for your resources: When you create AWS resources (like EC2 instances, S3 buckets, RDS databases), you must specify which Region to create them in. This decision is based on:
Proximity to users: Choose a Region close to your users for low latency
Compliance requirements: Some regulations require data to stay in specific countries
Service availability: Not all AWS services are available in all Regions
Cost: Pricing varies slightly between Regions
Resources stay in that Region: Once created, resources remain in that Region unless you explicitly copy or move them. For example, an EC2 instance in us-east-1 cannot be directly moved to eu-west-1 - you would need to create a new instance in eu-west-1.
You can deploy across multiple Regions: For global applications, you can deploy resources in multiple Regions and use services like Route 53 (DNS) and CloudFront (CDN) to route users to the nearest Region.
ā Must Know:
Each Region is completely isolated and independent
Resources don't automatically replicate across Regions
You choose the Region based on latency, compliance, service availability, and cost
Region names follow the pattern: geographic-area-number (e.g., us-east-1, eu-west-2)
Detailed Example 1: E-commerce Application Deployment
Imagine you're running an e-commerce website that sells products to customers in the United States and Europe. Here's how you would use Regions:
Scenario: Your company is based in the US, but 40% of your customers are in Europe. European customers complain about slow page load times.
Solution using Regions:
Deploy your application in us-east-1 (N. Virginia) to serve US customers
Deploy a copy of your application in eu-west-1 (Ireland) to serve European customers
Use Route 53 with geolocation routing to automatically direct US users to us-east-1 and European users to eu-west-1
Each Region has its own EC2 instances, load balancers, and databases
You replicate product catalog data between Regions so both have the same inventory information
Result: US customers connect to servers in Virginia (low latency), European customers connect to servers in Ireland (low latency). If the Virginia Region experiences an outage, European customers are unaffected because Ireland is completely independent.
Detailed Example 2: Compliance Requirements
Scenario: A German healthcare company must comply with GDPR, which requires patient data to remain within the European Union.
Solution using Regions:
Deploy all application resources in eu-central-1 (Frankfurt, Germany)
Configure S3 buckets with region restrictions to prevent accidental data transfer outside the EU
Use AWS Organizations with Service Control Policies (SCPs) to prevent developers from creating resources in non-EU Regions
Enable CloudTrail logging to audit all data access and ensure compliance
Result: All patient data stays within the EU, satisfying GDPR requirements. The company can prove to regulators that data never leaves the EU Region.
Detailed Example 3: Disaster Recovery Across Regions
Scenario: A financial services company needs to ensure their trading platform remains available even if an entire AWS Region fails.
Solution using Regions:
Primary deployment in us-east-1 (N. Virginia) handles all production traffic
Standby deployment in us-west-2 (Oregon) remains ready but doesn't serve traffic
Database replication from us-east-1 to us-west-2 keeps data synchronized
Route 53 health checks monitor the us-east-1 deployment
If us-east-1 fails, Route 53 automatically redirects traffic to us-west-2
Result: If the entire us-east-1 Region becomes unavailable (extremely rare but possible), the application automatically fails over to us-west-2 within minutes, minimizing downtime.
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Assuming resources automatically replicate across Regions
Why it's wrong: AWS Regions are completely isolated. If you create an EC2 instance in us-east-1, it doesn't automatically appear in eu-west-1.
Correct understanding: You must explicitly configure cross-region replication for services that support it (like S3, RDS, DynamoDB) or manually deploy resources in multiple Regions.
Mistake 2: Thinking all AWS services are available in all Regions
Why it's wrong: New AWS services typically launch in a few Regions first, then gradually expand to other Regions over time.
Correct understanding: Always check the AWS Regional Services List to confirm a service is available in your chosen Region before designing your architecture.
Mistake 3: Choosing a Region based only on cost
Why it's wrong: While cost is a factor, choosing a Region far from your users can result in poor performance (high latency), which may cost you more in lost customers than you save on infrastructure.
Correct understanding: Prioritize proximity to users and compliance requirements, then consider cost as a secondary factor.
š Connections to Other Topics:
Relates to Availability Zones (covered next) because: Each Region contains multiple Availability Zones
Builds on Disaster Recovery (covered in Domain 2) by: Providing geographic redundancy for business continuity
Often used with Route 53 (covered in Domain 3) to: Route users to the nearest Region for optimal performance
Availability Zones (AZs)
What it is: An Availability Zone (AZ) is one or more discrete data centers within an AWS Region, each with redundant power, networking, and connectivity. Each Region has multiple AZs (typically 3-6), and they are physically separated from each other (different buildings, sometimes different flood plains) but connected with high-speed, low-latency networking.
Why it exists: Even within a single geographic region, you need protection against localized failures. A single data center could experience power outages, cooling failures, network issues, or natural disasters. By distributing your application across multiple AZs within a Region, you protect against these single-point-of-failure scenarios while maintaining low latency between components.
Real-world analogy: Think of Availability Zones like different buildings in a corporate campus. All buildings are in the same city (Region) and connected with high-speed fiber optic cables, but each building has its own power supply, cooling system, and network connection. If one building loses power, the others continue operating normally.
How it works (Detailed step-by-step):
AWS builds multiple isolated data centers in a Region: Within each Region, AWS constructs 3-6 separate data center facilities. These are physically separated (typically 10-100 km apart) to protect against localized disasters, but close enough for low-latency communication (typically <2ms latency between AZs).
Each AZ has independent infrastructure: Each AZ has its own:
Power supply (with backup generators and UPS systems)
Cooling systems
Network connectivity (multiple ISPs)
Physical security This independence means a failure in one AZ (like a power outage) doesn't affect other AZs.
AZs are connected with redundant, high-speed networking: AWS connects AZs within a Region using multiple redundant 100 Gbps fiber optic connections. This allows your application components in different AZs to communicate quickly and reliably.
You distribute resources across AZs: When designing your architecture, you deploy resources (EC2 instances, databases, load balancers) across multiple AZs. For example:
Deploy web servers in AZ-1a, AZ-1b, and AZ-1c
Use an Application Load Balancer that distributes traffic across all three AZs
Use RDS Multi-AZ to automatically replicate your database to a standby in a different AZ
AWS handles failover automatically (for some services): Many AWS services automatically handle AZ failures. For example:
Elastic Load Balancers automatically stop sending traffic to unhealthy AZs
RDS Multi-AZ automatically fails over to the standby database in another AZ
S3 automatically replicates data across multiple AZs
You benefit from high availability: If one AZ fails completely, your application continues running in the remaining AZs with minimal disruption.
ā Must Know:
Each Region has multiple AZs (minimum 3, typically 3-6)
AZs are physically separated but connected with low-latency networking
AZ names are Region-specific: us-east-1a, us-east-1b, us-east-1c, etc.
Deploying across multiple AZs is the primary way to achieve high availability in AWS
Some services (like S3, DynamoDB) automatically use multiple AZs; others (like EC2) require you to explicitly deploy across AZs
š Global Infrastructure Diagram:
graph TB
subgraph "AWS Global Infrastructure"
subgraph "Region: us-east-1 (N. Virginia)"
subgraph "AZ-1a"
DC1[Data Center 1]
DC2[Data Center 2]
end
subgraph "AZ-1b"
DC3[Data Center 3]
DC4[Data Center 4]
end
subgraph "AZ-1c"
DC5[Data Center 5]
DC6[Data Center 6]
end
end
subgraph "Region: eu-west-1 (Ireland)"
subgraph "AZ-2a"
DC7[Data Center 7]
end
subgraph "AZ-2b"
DC8[Data Center 8]
end
subgraph "AZ-2c"
DC9[Data Center 9]
end
end
subgraph "Edge Locations"
EDGE1[CloudFront Edge<br/>New York]
EDGE2[CloudFront Edge<br/>London]
EDGE3[CloudFront Edge<br/>Tokyo]
end
end
DC1 -.Low-latency connection.-> DC3
DC1 -.Low-latency connection.-> DC5
DC3 -.Low-latency connection.-> DC5
style DC1 fill:#c8e6c9
style DC3 fill:#c8e6c9
style DC5 fill:#c8e6c9
style EDGE1 fill:#e1f5fe
style EDGE2 fill:#e1f5fe
style EDGE3 fill:#e1f5fe
This diagram illustrates the hierarchical structure of AWS global infrastructure. At the highest level, we have Regions - completely independent geographic areas like us-east-1 (Northern Virginia) and eu-west-1 (Ireland). Each Region is isolated from other Regions, meaning resources don't automatically replicate between them and a failure in one Region doesn't affect others.
Within each Region, we see multiple Availability Zones (AZ-1a, AZ-1b, AZ-1c in us-east-1). Each AZ contains one or more data centers (shown as DC1, DC2, etc.). The green data centers in us-east-1 represent active data centers within different AZs, connected by low-latency, high-bandwidth networking (shown as dotted lines). This low-latency connection (typically <2ms) allows your application components in different AZs to communicate quickly, enabling you to build highly available architectures without sacrificing performance.
The physical separation between AZs (they're in different buildings, sometimes different flood plains) protects against localized failures. If AZ-1a experiences a power outage, AZ-1b and AZ-1c continue operating normally because they have independent power supplies, cooling systems, and network connections.
At the bottom, we see Edge Locations (shown in blue) - these are separate from Regions and AZs. Edge Locations are part of AWS's content delivery network (CloudFront) and are distributed in major cities worldwide (200+ locations). They cache content close to end users for faster delivery. Unlike Regions and AZs where you deploy your application infrastructure, Edge Locations are managed by AWS and used automatically when you enable CloudFront.
The key architectural principle shown here is defense in depth: Regions protect against geographic disasters, Availability Zones protect against localized failures within a Region, and multiple data centers within each AZ protect against individual data center failures. This multi-layered approach enables AWS to achieve extremely high availability (99.99% or higher for many services).
Detailed Example 1: Multi-AZ Web Application
Imagine you're deploying a three-tier web application (web servers, application servers, database) that needs to be highly available.
Scenario: Your e-commerce application must remain available even if an entire data center fails. Downtime costs $10,000 per minute in lost sales.
Solution using Multiple AZs:
Web Tier (in 3 AZs):
Deploy 2 EC2 instances in us-east-1a running your web application
Deploy 2 EC2 instances in us-east-1b running your web application
Deploy 2 EC2 instances in us-east-1c running your web application
Total: 6 web servers distributed across 3 AZs
Load Balancer (automatically multi-AZ):
Create an Application Load Balancer (ALB) and enable all 3 AZs
The ALB automatically distributes traffic across all 6 web servers
The ALB performs health checks every 30 seconds
If servers in one AZ become unhealthy, the ALB automatically stops sending traffic to that AZ
Application Tier (in 3 AZs):
Deploy 2 EC2 instances in each AZ running your application logic
Total: 6 application servers distributed across 3 AZs
Database Tier (Multi-AZ RDS):
Create an RDS database with Multi-AZ enabled
Primary database runs in us-east-1a
Standby database automatically created in us-east-1b
AWS synchronously replicates all data from primary to standby
If primary fails, AWS automatically promotes standby to primary (1-2 minute failover)
What happens when AZ-1a fails:
The power goes out in the entire us-east-1a Availability Zone
All EC2 instances in us-east-1a become unreachable (2 web servers, 2 app servers)
The ALB detects failed health checks for servers in us-east-1a within 30 seconds
The ALB stops sending new traffic to us-east-1a, routing all traffic to us-east-1b and us-east-1c
RDS detects the primary database is unreachable and automatically fails over to the standby in us-east-1b (takes 1-2 minutes)
Your application continues serving customers with 4 web servers and 4 app servers (instead of 6 each)
Performance may be slightly degraded due to reduced capacity, but the application remains available
When us-east-1a recovers, the ALB automatically starts sending traffic to those servers again
Result: Total downtime is approximately 1-2 minutes (during database failover), compared to potentially hours if you had deployed everything in a single AZ. The cost of running resources in 3 AZs instead of 1 is minimal (no extra charge for using multiple AZs, just the cost of the additional EC2 instances), but the benefit is massive (avoiding $10,000/minute in lost sales).
Detailed Example 2: Multi-AZ Database for Data Durability
Scenario: A financial services company stores transaction records in a database. Losing this data would be catastrophic (regulatory violations, customer lawsuits, loss of trust).
Solution using RDS Multi-AZ:
Enable RDS Multi-AZ: When creating the RDS database, enable the Multi-AZ option
Primary database in AZ-1a: Handles all read and write operations
Standby database in AZ-1b: Receives synchronous replication of every transaction
Synchronous replication: When your application writes data to the primary database:
The write is sent to the primary database in AZ-1a
The primary database immediately replicates the write to the standby in AZ-1b
Only after the standby confirms it has received the data does the primary acknowledge the write to your application
This ensures zero data loss - if the primary fails immediately after acknowledging a write, the standby already has that data
Automatic failover: If the primary database fails:
RDS detects the failure within 60 seconds
RDS automatically promotes the standby to primary
RDS updates the DNS record to point to the new primary
Your application reconnects and continues operating
Total failover time: 1-2 minutes
Result: Even if the entire us-east-1a Availability Zone is destroyed (extremely unlikely but theoretically possible), you lose zero data because every transaction was synchronously replicated to us-east-1b before being acknowledged. The cost is approximately 2x the single-AZ database cost (you're running two database instances), but the benefit is guaranteed data durability and high availability.
Detailed Example 3: Auto Scaling Across AZs
Scenario: A news website experiences unpredictable traffic spikes when breaking news occurs. Traffic can increase from 1,000 requests/second to 50,000 requests/second within minutes.
Solution using Auto Scaling across AZs:
Create an Auto Scaling Group: Configure it to maintain a minimum of 6 EC2 instances (2 per AZ) and scale up to 60 instances (20 per AZ)
Distribute across 3 AZs: Configure the Auto Scaling Group to balance instances evenly across us-east-1a, us-east-1b, and us-east-1c
Set scaling policies: When CPU utilization exceeds 70%, add 3 instances (1 per AZ). When CPU drops below 30%, remove 3 instances (1 per AZ)
Use an ALB: The Application Load Balancer distributes traffic across all instances in all AZs
What happens during a traffic spike:
Breaking news causes traffic to spike from 1,000 to 50,000 requests/second
CPU utilization on existing instances quickly rises above 70%
Auto Scaling detects high CPU and launches 3 new instances (1 in each AZ)
The new instances register with the ALB and start receiving traffic within 2-3 minutes
If CPU remains high, Auto Scaling continues adding instances (3 at a time, distributed across AZs) until traffic is handled or the maximum of 60 instances is reached
When the traffic spike ends and CPU drops below 30%, Auto Scaling gradually terminates instances (3 at a time, maintaining balance across AZs)
Result: The application automatically scales to handle traffic spikes without manual intervention, and the multi-AZ distribution ensures that if one AZ fails during a traffic spike, the other two AZs continue serving traffic. The even distribution across AZs also ensures balanced load and prevents any single AZ from becoming a bottleneck.
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Deploying all resources in a single AZ to save costs
Why it's wrong: There's no cost savings - AWS doesn't charge extra for using multiple AZs. You pay for the resources (EC2 instances, storage, etc.), not for the number of AZs you use.
Correct understanding: Always deploy across at least 2 AZs (preferably 3) for production workloads. The only "cost" is the additional resources you run for redundancy (e.g., running 6 servers instead of 3), but this is necessary for high availability.
Mistake 2: Assuming AZ names are consistent across AWS accounts
Why it's wrong: AWS randomizes AZ names across accounts. Your us-east-1a might be a different physical data center than someone else's us-east-1a. This prevents all customers from concentrating resources in the same physical AZ.
Correct understanding: Use AZ IDs (like use1-az1) when coordinating across accounts, not AZ names (like us-east-1a).
Mistake 3: Thinking data automatically replicates across AZs
Why it's wrong: Only certain services automatically replicate across AZs (S3, DynamoDB, EFS). For EC2 instances and EBS volumes, you must explicitly configure replication or deploy resources in multiple AZs.
Correct understanding: Check each service's documentation to understand its AZ behavior. For EC2, you must manually launch instances in multiple AZs. For RDS, you must enable Multi-AZ. For S3, replication across AZs is automatic.
š Connections to Other Topics:
Relates to High Availability (Domain 2) because: Multi-AZ deployments are the foundation of highly available architectures
Builds on Load Balancing (Domain 2) by: Using load balancers to distribute traffic across AZs
Often used with Auto Scaling (Domain 3) to: Automatically maintain balanced capacity across AZs
š” Tips for Understanding:
Think of AZs as "failure domains" - design your architecture so that the failure of any single AZ doesn't bring down your application
The rule of thumb: Always use at least 2 AZs for production workloads, preferably 3
Remember: Low latency between AZs (<2ms) means you can treat them almost like a single data center for performance purposes, but they're isolated for fault tolerance
Edge Locations and CloudFront
What it is: Edge Locations are AWS data centers specifically designed to deliver content to end users with the lowest possible latency. They are part of Amazon CloudFront, AWS's Content Delivery Network (CDN). AWS has 400+ Edge Locations in 90+ cities across 48 countries, far more than the 33 Regions.
Why it exists: Even if you deploy your application in multiple Regions, users far from those Regions will still experience high latency. For example, if your application is in us-east-1 and eu-west-1, users in Australia will have high latency to both Regions (200-300ms). Edge Locations solve this by caching content close to users worldwide, reducing latency to 10-50ms.
Real-world analogy: Think of Edge Locations like local convenience stores. The main warehouse (Region) is far away, but the convenience store (Edge Location) in your neighborhood stocks popular items. You can get those items quickly from the local store without traveling to the warehouse. If the store doesn't have what you need, it orders from the warehouse, but most requests are served locally.
How it works (Detailed step-by-step):
You enable CloudFront: You create a CloudFront distribution and point it to your origin (the source of your content, like an S3 bucket or an EC2 web server in a Region).
User requests content: A user in Tokyo requests an image from your website (www.example.com/logo.png).
DNS routes to nearest Edge Location: CloudFront's DNS automatically routes the user to the nearest Edge Location (in this case, Tokyo).
Edge Location checks cache: The Tokyo Edge Location checks if it has logo.png cached locally.
Cache hit (content is cached): If the Edge Location has the content cached and it hasn't expired:
The Edge Location immediately returns the content to the user
Latency: 10-20ms (very fast)
The origin server (in us-east-1) is never contacted
This is the most common scenario for popular content
Cache miss (content not cached): If the Edge Location doesn't have the content cached:
The Edge Location requests the content from the origin server (in us-east-1)
The origin server sends the content to the Edge Location
The Edge Location caches the content locally and returns it to the user
Latency: 150-200ms for this first request (slower)
Subsequent requests from users in Tokyo will be cache hits (fast)
Content expires and refreshes: You configure a Time-To-Live (TTL) for cached content (e.g., 24 hours). After 24 hours, the Edge Location requests fresh content from the origin to ensure users get updated content.
ā Must Know:
Edge Locations are separate from Regions and AZs - they're specifically for content delivery
There are 400+ Edge Locations worldwide, far more than the 33 Regions
Edge Locations cache content from your origin (S3, EC2, ALB, etc.)
CloudFront is the service that uses Edge Locations
Edge Locations can also be used for uploading content (S3 Transfer Acceleration)
Detailed Example 1: Global Website Performance
Scenario: A media company hosts video content in S3 buckets in us-east-1. They have users worldwide, but users in Asia and Australia complain about slow video loading times.
Problem without CloudFront:
User in Sydney requests a video from S3 in us-east-1
Request travels from Sydney to Virginia (approximately 15,000 km)
Latency: 200-250ms per request
Video takes 30-60 seconds to start playing
Buffering occurs frequently during playback
Solution with CloudFront:
Create a CloudFront distribution with the S3 bucket as the origin
Enable CloudFront in all Edge Locations worldwide
Update the website to use the CloudFront URL instead of the direct S3 URL
What happens:
User in Sydney requests a video
DNS routes the request to the Sydney Edge Location (closest to the user)
First request (cache miss):
Sydney Edge Location requests the video from S3 in us-east-1
S3 sends the video to Sydney Edge Location
Sydney Edge Location caches the video and streams it to the user
Latency: 200ms for the initial request, but subsequent chunks stream quickly
Second user in Sydney requests the same video (cache hit):
Sydney Edge Location already has the video cached
Video streams immediately from Sydney Edge Location
Latency: 10-20ms
Video starts playing in 2-3 seconds
No buffering during playback
Result: Video loading time reduced from 30-60 seconds to 2-3 seconds for users in Sydney. The first user experiences slightly slower loading (cache miss), but all subsequent users in the region benefit from the cached content. The media company's bandwidth costs also decrease because most requests are served from Edge Locations instead of the origin S3 bucket.
Detailed Example 2: Dynamic Content Acceleration
Scenario: An e-commerce application serves dynamic content (personalized product recommendations, shopping cart, user profiles) that can't be cached. Users in Europe experience slow page loads because the application servers are in us-east-1.
Solution with CloudFront (even for dynamic content):
CloudFront can accelerate dynamic content through network optimizations, even though the content isn't cached:
Create a CloudFront distribution with the ALB (Application Load Balancer) in us-east-1 as the origin
Enable CloudFront for dynamic content (set TTL to 0 for non-cacheable content)
CloudFront uses AWS's private backbone network to route requests
What happens:
User in London requests their shopping cart (dynamic, personalized content)
Request goes to London Edge Location
Edge Location forwards the request to us-east-1 using AWS's private backbone network (not the public internet)
AWS's backbone network is optimized for low latency and high reliability
Application server in us-east-1 generates the personalized shopping cart
Response travels back through AWS's backbone network to London Edge Location
Edge Location forwards the response to the user
Result: Even though the content isn't cached, latency is reduced by 20-40% because AWS's private network is faster and more reliable than the public internet. Additionally, CloudFront maintains persistent connections to the origin, reducing the overhead of establishing new connections for each request.
Detailed Example 3: S3 Transfer Acceleration
Scenario: A video production company in Australia needs to upload large video files (5-50 GB each) to S3 in us-east-1. Direct uploads to S3 are slow (taking hours) and frequently fail due to network issues.
Solution with S3 Transfer Acceleration:
S3 Transfer Acceleration uses CloudFront Edge Locations to accelerate uploads:
Enable S3 Transfer Acceleration on the S3 bucket
Use the Transfer Acceleration endpoint instead of the standard S3 endpoint
Upload files using the Transfer Acceleration endpoint
What happens:
Video file upload starts from Sydney
File is uploaded to the Sydney Edge Location (close to the user, low latency)
Sydney Edge Location uses AWS's private backbone network to transfer the file to S3 in us-east-1
AWS's backbone network is optimized for high throughput and reliability
File arrives at S3 in us-east-1
Result: Upload speed increases by 50-500% (depending on distance and network conditions). A 10 GB file that previously took 3 hours to upload now takes 30-45 minutes. Upload reliability also improves because the long-distance transfer happens over AWS's reliable backbone network instead of the public internet.
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Thinking Edge Locations are the same as Regions
Why it's wrong: Edge Locations are much smaller and only cache content - you can't deploy EC2 instances or databases in Edge Locations.
Correct understanding: Regions are where you deploy your application infrastructure. Edge Locations are where CloudFront caches content to serve users quickly.
Mistake 2: Assuming all content should be cached at Edge Locations
Why it's wrong: Some content shouldn't be cached (personalized data, real-time data, sensitive data). Caching this content could show users stale or incorrect information.
Correct understanding: Use CloudFront for static content (images, videos, CSS, JavaScript) and public content. For dynamic or personalized content, either don't cache it or use very short TTLs.
Mistake 3: Forgetting to invalidate cached content after updates
Why it's wrong: If you update content at the origin but don't invalidate the CloudFront cache, users will continue seeing old content until the TTL expires.
Correct understanding: When you update content, create a CloudFront invalidation to immediately clear the cached content, or use versioned file names (logo-v2.png instead of logo.png) to force cache misses.
š Connections to Other Topics:
Relates to Performance Optimization (Domain 3) because: CloudFront reduces latency and improves user experience
Builds on S3 (Domain 3) by: Caching S3 content at Edge Locations for faster delivery
Often used with Route 53 (Domain 3) to: Provide DNS routing to the nearest Edge Location
š” Tips for Understanding:
Think of CloudFront as a global caching layer that sits in front of your application
Use CloudFront for any content that's accessed by users in multiple geographic locations
Remember: Edge Locations are read-only for most use cases (except S3 Transfer Acceleration, which allows writes)
šÆ Exam Focus: Questions often test whether you understand when to use CloudFront (global content delivery, reducing latency) versus when to use multi-Region deployments (compliance, disaster recovery). CloudFront is for performance; multi-Region is for availability and compliance.
Section 3: AWS Shared Responsibility Model
Introduction
The problem: When you move to the cloud, security responsibilities are split between you (the customer) and AWS (the cloud provider). If you don't understand who is responsible for what, you might assume AWS is protecting something that you're actually responsible for, leading to security vulnerabilities. Conversely, you might waste time and money protecting things that AWS already handles.
The solution: The AWS Shared Responsibility Model clearly defines which security responsibilities belong to AWS ("Security OF the Cloud") and which belong to you ("Security IN the Cloud"). This model varies depending on the type of service you use (IaaS, PaaS, SaaS).
Why it's tested: The SAA-C03 exam frequently tests your understanding of the Shared Responsibility Model. Questions ask you to identify who is responsible for specific security tasks, or to design solutions that properly address customer responsibilities while leveraging AWS's responsibilities.
Core Concepts
Understanding "Security OF the Cloud" vs "Security IN the Cloud"
What it is: The Shared Responsibility Model divides security and compliance responsibilities between AWS and the customer:
AWS Responsibility: "Security OF the Cloud": AWS is responsible for protecting the infrastructure that runs all AWS services. This includes the physical data centers, hardware, software, networking, and facilities.
Customer Responsibility: "Security IN the Cloud": Customers are responsible for securing their data, applications, operating systems, and configurations within AWS. The extent of customer responsibility varies based on the service used.
Why it exists: In traditional on-premises IT, you're responsible for everything - from physical security of the building to application security. In the cloud, AWS takes over the lower layers (physical security, hardware, infrastructure), allowing you to focus on your applications and data. However, you still need to secure what you put in the cloud. The Shared Responsibility Model clarifies this division to prevent security gaps.
Real-world analogy: Think of AWS like a secure apartment building. The building owner (AWS) is responsible for:
Physical security (locks on the building, security cameras, guards)
Building infrastructure (electricity, plumbing, HVAC)
Structural integrity (foundation, walls, roof)
You (the tenant) are responsible for:
Locking your apartment door
Securing your belongings inside the apartment
Who you give keys to
What you do inside your apartment
The building owner can't enter your apartment to secure your belongings, and you can't modify the building's foundation. Each party has clear responsibilities.
How it works (Detailed step-by-step):
AWS secures the infrastructure: AWS is responsible for:
Physical security: Data centers with 24/7 security guards, biometric access controls, video surveillance, and intrusion detection systems
This diagram illustrates the division of security responsibilities between customers and AWS, organized in three layers: Customer Responsibility (red), Shared Controls (orange), and AWS Responsibility (blue).
Customer Responsibility (Top Layer - Red): At the top, we see customer responsibilities, which represent "Security IN the Cloud." The customer is responsible for everything they put into AWS:
Customer Data: This is the most critical customer responsibility. You must classify your data (public, confidential, restricted), implement appropriate encryption, and control who can access it. AWS provides the tools (KMS, encryption options), but you must use them correctly.
Platform & Application Management: You're responsible for securing your applications, including patching application vulnerabilities, implementing secure coding practices, and managing application configurations.
Operating System, Network & Firewall Configuration: For IaaS services like EC2, you must patch the OS, configure firewalls (security groups), and harden the OS according to security best practices. For managed services like RDS, AWS handles this.
Client-Side Data Encryption & Server-Side Encryption: You decide whether to encrypt data and manage encryption keys. AWS provides encryption services (KMS), but you must enable and configure them.
Network Traffic Protection: You must configure VPCs, subnets, security groups, and NACLs to control network traffic. You also decide whether to use VPNs or Direct Connect for encrypted connections.
IAM & Access Management: You create IAM users, groups, roles, and policies. You implement MFA, rotate credentials, and follow the principle of least privilege. This is entirely your responsibility.
Shared Controls (Middle Layer - Orange): These responsibilities are shared between AWS and customers, but each party handles different aspects:
Patch Management: AWS patches the underlying infrastructure, hypervisor, and managed service software (like RDS database engine). You patch your guest operating systems (EC2) and applications.
Configuration Management: AWS configures the infrastructure and provides secure defaults. You configure your resources (security groups, bucket policies, etc.) according to your security requirements.
Awareness & Training: AWS trains its employees on security best practices and compliance. You must train your employees on how to use AWS securely and follow your organization's security policies.
AWS Responsibility (Bottom Layer - Blue): At the bottom, we see AWS responsibilities, which represent "Security OF the Cloud." AWS is responsible for the entire infrastructure:
Software Layer: AWS manages and secures the software that provides compute (EC2 hypervisor), storage (S3 software), database (RDS engine), and networking services. AWS patches vulnerabilities, monitors for threats, and ensures service availability.
Hardware/AWS Global Infrastructure: AWS maintains all physical hardware - servers, storage devices, networking equipment. AWS replaces failed hardware, upgrades capacity, and ensures hardware security.
Regions, Availability Zones, Edge Locations: AWS designs, builds, and operates the global infrastructure. AWS ensures Regions are isolated, AZs are connected with low-latency networking, and Edge Locations are strategically placed.
Physical Security of Data Centers: AWS implements multiple layers of physical security - perimeter fencing, security guards, biometric access controls, video surveillance, intrusion detection, and environmental controls. Customers never have physical access to AWS data centers.
The key insight from this diagram is that security is a partnership. AWS provides a secure infrastructure, but you must use it securely. AWS can't access your data to encrypt it for you, and you can't access AWS data centers to verify physical security. Each party must fulfill their responsibilities for the overall system to be secure.
Detailed Example 1: EC2 Instance Security (IaaS)
Scenario: You're deploying a web application on EC2 instances. Who is responsible for what?
AWS Responsibilities:
Physical security of the data center where the EC2 instance runs
Security of the hypervisor that creates the virtual machine
Network infrastructure connecting the data center
Hardware maintenance and replacement
Patching the hypervisor and underlying infrastructure
Your Responsibilities:
Choosing a secure AMI (Amazon Machine Image) to launch the instance
Patching the guest operating system (e.g., applying Ubuntu security updates)
Configuring the OS securely (disabling unnecessary services, hardening SSH)
Installing and patching application software (e.g., Apache, Nginx)
Configuring security groups to control inbound/outbound traffic
Managing SSH keys and ensuring they're not compromised
Configuring IAM roles for the EC2 instance to access other AWS services
Monitoring logs and responding to security incidents
What happens if there's a security breach:
If the hypervisor is compromised: AWS is responsible and will fix it
If your OS is compromised due to unpatched vulnerabilities: You are responsible
If your application has a SQL injection vulnerability: You are responsible
If someone gains physical access to the data center: AWS is responsible
Result: For EC2 (IaaS), you have significant security responsibilities because you control the operating system and everything above it. This gives you flexibility but requires security expertise.
Detailed Example 2: RDS Database Security (PaaS)
Scenario: You're using Amazon RDS for your database. Who is responsible for what?
AWS Responsibilities:
Physical security of the data center
Security of the hypervisor and underlying infrastructure
Patching the database operating system
Patching the database engine (MySQL, PostgreSQL, etc.)
Performing automated backups
Implementing Multi-AZ replication for high availability
Monitoring database health and performance
Your Responsibilities:
Configuring database security groups to control network access
Creating database users and managing their permissions
Encrypting data at rest (enabling RDS encryption)
Encrypting data in transit (enforcing SSL/TLS connections)
Configuring automated backups and retention periods
Implementing application-level access controls
Classifying and protecting sensitive data in the database
Monitoring database access logs and responding to suspicious activity
What happens if there's a security breach:
If the database engine has a vulnerability: AWS patches it automatically
If the database OS has a vulnerability: AWS patches it automatically
If database credentials are leaked: You are responsible for rotating them
If unauthorized users access the database: You are responsible (check your security groups and IAM policies)
Result: For RDS (PaaS), AWS handles more security responsibilities than EC2. You don't need to patch the OS or database engine, but you're still responsible for access control, encryption, and data protection.
Detailed Example 3: S3 Bucket Security (SaaS-like)
Scenario: You're storing files in Amazon S3. Who is responsible for what?
AWS Responsibilities:
Physical security of the data centers storing S3 data
Durability of data (S3 automatically replicates data across multiple AZs)
Availability of the S3 service
Patching and maintaining S3 infrastructure
Protecting against infrastructure-level DDoS attacks
Your Responsibilities:
Configuring S3 bucket policies to control access
Enabling S3 bucket versioning to protect against accidental deletion
Enabling S3 encryption (SSE-S3, SSE-KMS, or SSE-C)
Configuring S3 Block Public Access to prevent accidental public exposure
Implementing S3 Object Lock for compliance requirements
Managing IAM policies for users accessing S3
Classifying data and applying appropriate security controls
Monitoring S3 access logs and responding to suspicious activity
Configuring S3 lifecycle policies for data retention
Enabling MFA Delete for critical buckets
What happens if there's a security breach:
If S3 infrastructure is compromised: AWS is responsible
If your bucket is publicly accessible due to misconfigured policies: You are responsible
If someone gains access using stolen IAM credentials: You are responsible for rotating credentials
If data is lost due to S3 infrastructure failure: AWS is responsible (and will restore from replicas)
If data is deleted by an authorized user: You are responsible (use versioning and MFA Delete to prevent this)
Result: For S3, AWS handles almost all infrastructure security, but you're responsible for access control and data protection. Most S3 security breaches are due to misconfigured bucket policies, not AWS infrastructure failures.
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Assuming AWS is responsible for patching your EC2 instances
Why it's wrong: EC2 is IaaS - you have full control over the guest OS, which means you're responsible for patching it.
Correct understanding: AWS patches the hypervisor and infrastructure, but you must patch the OS and applications on your EC2 instances. Use AWS Systems Manager Patch Manager to automate this.
Mistake 2: Thinking AWS can access your data to help with security
Why it's wrong: AWS has a strict policy of not accessing customer data without explicit permission. AWS can't encrypt your data, configure your security groups, or fix your application vulnerabilities.
Correct understanding: You are solely responsible for your data and configurations. AWS provides tools and services, but you must use them correctly.
Mistake 3: Believing that using AWS automatically makes you compliant with regulations
Why it's wrong: AWS provides a compliant infrastructure (AWS is responsible for infrastructure compliance), but you're responsible for how you use that infrastructure. You must configure services correctly to meet your compliance requirements.
Correct understanding: AWS provides compliance certifications for the infrastructure (SOC 2, ISO 27001, PCI DSS, etc.), but you must implement appropriate controls in your applications and configurations to achieve compliance.
Mistake 4: Assuming managed services mean AWS handles all security
Why it's wrong: Even with managed services like RDS, you're still responsible for access control, encryption, and data protection.
Correct understanding: Managed services reduce your operational burden (AWS handles patching, backups, etc.), but you're always responsible for IAM, encryption, and data security.
š Connections to Other Topics:
Relates to IAM (Domain 1) because: You're responsible for all access management
Builds on Encryption (Domain 1) by: Clarifying that you must enable and configure encryption
Often tested with Compliance (Domain 1) to: Verify you understand customer vs. AWS responsibilities for compliance
š” Tips for Understanding:
Remember the simple rule: AWS secures the infrastructure; you secure what you put on the infrastructure
For IaaS (EC2), you have more responsibility; for PaaS (RDS), AWS handles more; for SaaS, AWS handles almost everything
When in doubt, ask: "Can I configure this?" If yes, you're responsible for configuring it securely
šÆ Exam Focus: Exam questions often present a security scenario and ask "Who is responsible for fixing this?" or "What should the customer do to secure this?" Always think about whether the issue is in the infrastructure (AWS) or in the customer's configuration/data (customer).
Section 4: AWS Well-Architected Framework
Introduction
The problem: When designing cloud architectures, there are countless decisions to make: which services to use, how to configure them, how to ensure security, how to optimize costs, and how to maintain reliability. Without a structured framework, architects might make suboptimal decisions, leading to systems that are insecure, unreliable, expensive, or difficult to operate.
The solution: The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and implementing designs that scale over time. It consists of six pillars - Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability - each with design principles and best practices.
Why it's tested: The SAA-C03 exam is fundamentally about designing well-architected solutions. Every question tests your ability to apply Well-Architected principles to real-world scenarios. Understanding this framework is essential for passing the exam and for your career as a solutions architect.
Core Concepts
What is the AWS Well-Architected Framework?
What it is: The AWS Well-Architected Framework is a set of best practices, design principles, and questions that help you evaluate and improve your cloud architectures. It was developed by AWS solutions architects based on years of experience designing systems for thousands of customers. The framework is organized into six pillars, each focusing on a different aspect of architecture.
Why it exists: AWS recognized that customers were repeatedly making the same architectural mistakes and facing similar challenges. By codifying best practices into a framework, AWS helps customers avoid common pitfalls and build better systems from the start. The framework also provides a common language for discussing architecture, making it easier for teams to collaborate and for AWS to provide guidance.
Real-world analogy: Think of the Well-Architected Framework like building codes for construction. When building a house, you follow building codes that specify requirements for structural integrity, electrical safety, plumbing, fire safety, etc. These codes are based on decades of experience and prevent common problems. Similarly, the Well-Architected Framework provides "building codes" for cloud architectures, helping you avoid common problems and build robust systems.
How it works (Detailed step-by-step):
You design an architecture: You're planning to build a new application on AWS or evaluating an existing application.
You review against the six pillars: For each pillar, you ask yourself a series of questions:
Operational Excellence: How do you operate and monitor your system?
Security: How do you protect your data and systems?
Reliability: How do you ensure your system recovers from failures?
Performance Efficiency: How do you use resources efficiently?
Cost Optimization: How do you avoid unnecessary costs?
Sustainability: How do you minimize environmental impact?
You identify gaps: As you answer the questions, you identify areas where your architecture doesn't follow best practices. For example, you might discover that you're not using Multi-AZ deployments (Reliability pillar) or that you're not encrypting data at rest (Security pillar).
You implement improvements: You prioritize the gaps based on business impact and implement improvements. For example, you might enable RDS Multi-AZ for your database or enable S3 encryption for your data.
You iterate continuously: Architecture is not a one-time activity. You regularly review your architecture against the framework as your application evolves, new AWS services become available, and best practices change.
You use AWS tools: AWS provides tools to help you apply the framework:
AWS Well-Architected Tool: A free service that helps you review your workloads against the framework
AWS Trusted Advisor: Provides automated checks for some Well-Architected best practices
AWS Well-Architected Labs: Hands-on labs to learn and implement best practices
ā Must Know: The six pillars of the Well-Architected Framework:
Operational Excellence: Run and monitor systems to deliver business value
Security: Protect information, systems, and assets
Reliability: Recover from failures and meet demand
Performance Efficiency: Use resources efficiently
Cost Optimization: Avoid unnecessary costs
Sustainability: Minimize environmental impact
š Well-Architected Framework Diagram:
graph TB
WAF[AWS Well-Architected Framework]
WAF --> OP[Operational Excellence]
WAF --> SEC[Security]
WAF --> REL[Reliability]
WAF --> PERF[Performance Efficiency]
WAF --> COST[Cost Optimization]
WAF --> SUS[Sustainability]
OP --> OP1[Perform operations as code]
OP --> OP2[Make frequent, small, reversible changes]
OP --> OP3[Refine operations procedures frequently]
OP --> OP4[Anticipate failure]
OP --> OP5[Learn from operational failures]
SEC --> SEC1[Implement strong identity foundation]
SEC --> SEC2[Enable traceability]
SEC --> SEC3[Apply security at all layers]
SEC --> SEC4[Automate security best practices]
SEC --> SEC5[Protect data in transit and at rest]
SEC --> SEC6[Keep people away from data]
SEC --> SEC7[Prepare for security events]
REL --> REL1[Automatically recover from failure]
REL --> REL2[Test recovery procedures]
REL --> REL3[Scale horizontally]
REL --> REL4[Stop guessing capacity]
REL --> REL5[Manage change through automation]
PERF --> PERF1[Democratize advanced technologies]
PERF --> PERF2[Go global in minutes]
PERF --> PERF3[Use serverless architectures]
PERF --> PERF4[Experiment more often]
PERF --> PERF5[Consider mechanical sympathy]
COST --> COST1[Implement cloud financial management]
COST --> COST2[Adopt consumption model]
COST --> COST3[Measure overall efficiency]
COST --> COST4[Stop spending on undifferentiated heavy lifting]
COST --> COST5[Analyze and attribute expenditure]
SUS --> SUS1[Understand your impact]
SUS --> SUS2[Establish sustainability goals]
SUS --> SUS3[Maximize utilization]
SUS --> SUS4[Anticipate and adopt new efficient offerings]
SUS --> SUS5[Use managed services]
SUS --> SUS6[Reduce downstream impact]
style WAF fill:#e1f5fe
style OP fill:#f3e5f5
style SEC fill:#ffebee
style REL fill:#c8e6c9
style PERF fill:#fff3e0
style COST fill:#e8f5e9
style SUS fill:#e0f2f1
This diagram illustrates the AWS Well-Architected Framework's hierarchical structure, with the framework at the center branching into six pillars, each with its own design principles.
The Six Pillars (Color-Coded):
Operational Excellence (Purple): Focuses on running and monitoring systems to deliver business value and continually improving processes. The design principles include:
Perform operations as code: Define your infrastructure and operations as code (Infrastructure as Code) so you can version, test, and automate them
Make frequent, small, reversible changes: Deploy changes incrementally so failures have minimal impact and can be easily rolled back
Refine operations procedures frequently: Continuously improve your operational procedures based on lessons learned
Anticipate failure: Perform "pre-mortem" exercises to identify potential failures before they occur
Learn from operational failures: Share lessons learned across teams and implement improvements
Security (Red): Focuses on protecting information, systems, and assets while delivering business value. The design principles include:
Implement a strong identity foundation: Use IAM with least privilege, eliminate long-term credentials, implement MFA
Enable traceability: Monitor and log all actions and changes (CloudTrail, CloudWatch Logs)
Apply security at all layers: Defense in depth - secure network, compute, storage, data, and application layers
Automate security best practices: Use automation to enforce security controls consistently
Protect data in transit and at rest: Encrypt data using TLS for transit and KMS for data at rest
Keep people away from data: Reduce direct access to data to minimize risk of human error or malicious activity
Prepare for security events: Have incident response plans and practice them regularly
Reliability (Green): Focuses on ensuring a workload performs its intended function correctly and consistently. The design principles include:
Automatically recover from failure: Monitor systems and trigger automated recovery when thresholds are breached
Test recovery procedures: Regularly test your disaster recovery and failover procedures
Scale horizontally: Distribute load across multiple smaller resources instead of one large resource
Stop guessing capacity: Use Auto Scaling to match capacity to demand automatically
Manage change through automation: Use Infrastructure as Code to make changes predictable and reversible
Performance Efficiency (Orange): Focuses on using computing resources efficiently to meet requirements. The design principles include:
Democratize advanced technologies: Use managed services so your team can focus on applications instead of infrastructure
Go global in minutes: Deploy in multiple Regions to reduce latency for global users
Use serverless architectures: Eliminate operational burden of managing servers
Experiment more often: Easy to test different configurations and instance types
Consider mechanical sympathy: Understand how cloud services work and choose the right tool for the job
Cost Optimization (Light Green): Focuses on avoiding unnecessary costs. The design principles include:
Implement cloud financial management: Establish cost awareness and accountability across the organization
Adopt a consumption model: Pay only for what you use; scale down when not needed
Measure overall efficiency: Monitor business metrics and costs to understand ROI
Stop spending money on undifferentiated heavy lifting: Use managed services instead of managing infrastructure
Analyze and attribute expenditure: Use cost allocation tags to understand where money is spent
Sustainability (Teal): Focuses on minimizing environmental impact. The design principles include:
Understand your impact: Measure and monitor your carbon footprint
Establish sustainability goals: Set targets for reducing environmental impact
Maximize utilization: Right-size resources and use Auto Scaling to avoid idle capacity
Anticipate and adopt new, more efficient hardware and software offerings: Use latest instance types and services
Use managed services: Managed services are more efficient due to economies of scale
Reduce the downstream impact of your cloud workloads: Optimize data transfer and storage
The key insight from this diagram is that well-architected systems balance all six pillars. You can't focus only on cost optimization while ignoring security, or prioritize performance while neglecting reliability. The framework helps you make informed trade-offs and ensures you consider all aspects of architecture.
How the Pillars Relate to the SAA-C03 Exam Domains:
Operational Excellence ā Tested across all domains
Sustainability ā Tested across all domains (newer addition to framework)
The exam is essentially testing your ability to apply Well-Architected principles to real-world scenarios. Every question can be mapped back to one or more pillars of the framework.
Pillar Trade-offs and Balancing
Understanding Trade-offs: In real-world architecture, you often need to make trade-offs between pillars. Understanding these trade-offs is crucial for the exam.
Common Trade-offs:
Performance vs. Cost:
Scenario: You can use larger EC2 instances for better performance, but they cost more
Trade-off: Balance performance requirements with budget constraints
Example: Use c5.2xlarge instances (8 vCPUs, $0.34/hour) for compute-intensive workloads instead of c5.24xlarge (96 vCPUs, $4.08/hour) if 8 vCPUs meet your needs
Exam relevance: Questions test whether you can identify the most cost-effective solution that still meets performance requirements
Trade-off: Balance security requirements with operational overhead
Example: Requiring MFA for all users improves security but adds friction to the user experience
Exam relevance: Questions test whether you can implement appropriate security without over-engineering
Reliability vs. Cost:
Scenario: Multi-AZ and multi-Region deployments improve reliability but increase costs
Trade-off: Balance availability requirements with budget
Example: Use Multi-AZ RDS for production databases (2x cost) but single-AZ for development databases
Exam relevance: Questions test whether you can design appropriately resilient architectures without over-provisioning
Performance vs. Sustainability:
Scenario: Over-provisioning resources for peak performance wastes energy during low-utilization periods
Trade-off: Balance performance needs with environmental impact
Example: Use Auto Scaling to match capacity to demand instead of running maximum capacity 24/7
Exam relevance: Questions test whether you can design efficient architectures that scale with demand
š” Tip for the Exam: When questions present multiple valid solutions, the correct answer usually represents the best balance of the pillars. Look for solutions that meet requirements without over-engineering or under-engineering.
Section 5: Essential Networking Concepts
Introduction
The problem: Cloud architectures rely heavily on networking to connect components, control access, and deliver content to users. Without understanding basic networking concepts, you can't design secure, performant, or reliable architectures.
The solution: This section covers the essential networking concepts you need for the SAA-C03 exam: IP addressing, subnets, routing, DNS, and load balancing. These concepts form the foundation for understanding AWS networking services like VPC, Route 53, and Elastic Load Balancing.
Why it's tested: Networking questions appear throughout the exam, especially in Domain 1 (Security) and Domain 3 (Performance). You need to understand how to design VPCs, configure security groups, route traffic, and optimize network performance.
Core Concepts
IP Addresses and CIDR Notation
What it is: An IP address is a unique identifier for a device on a network. IPv4 addresses are 32-bit numbers typically written as four octets (e.g., 192.168.1.10). CIDR (Classless Inter-Domain Routing) notation specifies a range of IP addresses using a prefix (e.g., 10.0.0.0/16).
Why it exists: Networks need a way to identify and route traffic to specific devices. IP addresses provide this identification. CIDR notation allows efficient allocation of IP address ranges without wasting addresses.
Real-world analogy: Think of IP addresses like street addresses. Just as every house has a unique address (123 Main Street), every device on a network has a unique IP address. CIDR notation is like specifying a neighborhood - "all addresses on Main Street" instead of listing each house individually.
How it works:
IPv4 Address Structure: An IPv4 address consists of 32 bits divided into 4 octets:
Example: 192.168.1.10
Binary: 11000000.10101000.00000001.00001010
Each octet ranges from 0 to 255
CIDR Notation: Specifies a network and the number of bits used for the network portion:
Example: 10.0.0.0/16
/16 means the first 16 bits are the network portion
This leaves 32 - 16 = 16 bits for host addresses
Total addresses: 2^16 = 65,536 addresses
Common CIDR Blocks:
/32: Single IP address (1 address)
/24: 256 addresses (common for small subnets)
/16: 65,536 addresses (common for VPCs)
/8: 16,777,216 addresses (very large networks)
ā Must Know for Exam:
/16 provides 65,536 IP addresses (recommended for VPCs)
/24 provides 256 IP addresses (common for subnets)
AWS reserves 5 IP addresses in each subnet (first 4 and last 1)
Private IP ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
Detailed Example: Planning VPC and Subnet IP Ranges
Scenario: You're designing a VPC for a three-tier application (web, app, database) that needs to run in 3 Availability Zones.
Solution:
VPC CIDR: 10.0.0.0/16 (provides 65,536 addresses)
Subnet allocation (9 subnets total):
Public subnets (for web tier):
us-east-1a: 10.0.1.0/24 (256 addresses)
us-east-1b: 10.0.2.0/24 (256 addresses)
us-east-1c: 10.0.3.0/24 (256 addresses)
Private subnets (for app tier):
us-east-1a: 10.0.11.0/24 (256 addresses)
us-east-1b: 10.0.12.0/24 (256 addresses)
us-east-1c: 10.0.13.0/24 (256 addresses)
Database subnets (for database tier):
us-east-1a: 10.0.21.0/24 (256 addresses)
us-east-1b: 10.0.22.0/24 (256 addresses)
us-east-1c: 10.0.23.0/24 (256 addresses)
Result: Each subnet has 256 addresses (minus 5 reserved by AWS = 251 usable), which is sufficient for most applications. The VPC has room for additional subnets if needed (you've used 9 /24 subnets out of 256 possible /24 subnets in a /16 VPC).
Public vs. Private IP Addresses
What it is: Public IP addresses are routable on the internet and can be accessed from anywhere. Private IP addresses are only routable within a private network (like a VPC) and cannot be accessed directly from the internet.
Why it exists: Not all resources should be accessible from the internet. Private IP addresses allow resources to communicate within a network while remaining isolated from the internet, improving security.
How it works:
Public IP: Assigned to resources that need internet access (web servers, NAT gateways)
Private IP: Assigned to all resources in a VPC; used for internal communication
Elastic IP: A static public IP address that you can associate with resources
ā Must Know:
All EC2 instances get a private IP address
Public IP addresses are optional and can be auto-assigned or manually attached (Elastic IP)
Resources in private subnets can access the internet through a NAT Gateway (which has a public IP)
DNS (Domain Name System)
What it is: DNS translates human-readable domain names (www.example.com) into IP addresses (192.0.2.1) that computers use to communicate.
Why it exists: Remembering IP addresses is difficult for humans. DNS allows us to use memorable names instead of numeric addresses.
Regions are isolated: Resources don't automatically replicate across Regions. You must explicitly configure cross-region replication or deploy resources in multiple Regions.
Availability Zones provide high availability: Always deploy production workloads across at least 2 AZs (preferably 3) to protect against data center failures.
Shared Responsibility varies by service: For EC2 (IaaS), you manage the OS and applications. For RDS (PaaS), AWS manages the OS and database software. Always understand who is responsible for what.
Well-Architected Framework guides all decisions: Every architecture decision should consider all six pillars. The exam tests your ability to apply these principles to real-world scenarios.
Security is always a priority: When in doubt, choose the more secure option. The exam heavily emphasizes security best practices.
Self-Assessment Checklist
Test yourself before moving to the next chapter:
I can explain the six advantages of cloud computing and give examples of each
I understand the difference between Regions, Availability Zones, and Edge Locations
I can design a multi-AZ architecture for high availability
I know when to use multi-Region deployments (compliance, disaster recovery, global performance)
I understand the Shared Responsibility Model and can identify customer vs. AWS responsibilities
I can explain all six pillars of the Well-Architected Framework
I understand IP addressing and CIDR notation
I know the difference between public and private IP addresses
I can explain how DNS works and why it's important
Practice Questions
Try these from your practice test bundles:
Fundamentals questions in Domain 1 Bundle 1
Global Infrastructure questions in Domain 2 Bundle 1
Expected score: 80%+ to proceed
If you scored below 80%:
Review Section 2 (AWS Global Infrastructure) for Region/AZ concepts
Review Section 3 (Shared Responsibility Model) for security responsibilities
Review Section 4 (Well-Architected Framework) for design principles
Quick Reference Card
AWS Global Infrastructure:
Region: Geographic area with multiple AZs (e.g., us-east-1)
Availability Zone: One or more data centers within a Region (e.g., us-east-1a)
Edge Location: CDN endpoint for CloudFront (400+ worldwide)
Shared Responsibility:
AWS: Physical security, hardware, infrastructure, managed service software
Customer: Data, applications, OS (for EC2), access management, encryption
Well-Architected Pillars:
Operational Excellence: Run and monitor systems
Security: Protect data and systems
Reliability: Recover from failures
Performance Efficiency: Use resources efficiently
Cost Optimization: Avoid unnecessary costs
Sustainability: Minimize environmental impact
Networking Basics:
/16 CIDR: 65,536 addresses (VPC)
/24 CIDR: 256 addresses (subnet)
Private IP ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
AWS reserves: 5 IP addresses per subnet
Next Steps
You're now ready to dive into the exam domains! The next chapter covers Domain 1: Design Secure Architectures, which accounts for 30% of the exam. You'll learn about:
IAM (users, groups, roles, policies)
VPC security (security groups, NACLs)
Data encryption (KMS, encryption at rest and in transit)
Security services (WAF, Shield, GuardDuty, Macie)
Proceed to: 02_domain1_secure_architectures
Chapter 0 Complete - Total Words: ~11,000 Diagrams Created: 3 Estimated Study Time: 8-10 hours
Chapter Summary
What We Covered
This foundational chapter established the essential knowledge needed for the AWS Certified Solutions Architect - Associate exam. We explored:
ā AWS Global Infrastructure: Regions, Availability Zones, Edge Locations, and how they enable high availability and low latency
ā Well-Architected Framework: The six pillars (Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability) that guide architectural decisions
ā Shared Responsibility Model: Understanding what AWS manages versus what customers manage across different service types
ā Core AWS Services: Introduction to compute (EC2, Lambda), storage (S3, EBS), networking (VPC), and database services
ā Key Terminology: Essential terms like elasticity, scalability, fault tolerance, high availability, and disaster recovery
ā Service Categories: How AWS services are organized and when to use each category
Critical Takeaways
Global Infrastructure Design: AWS has 30+ Regions worldwide, each with multiple isolated Availability Zones. Design for multi-AZ deployments for high availability and multi-Region for disaster recovery.
Well-Architected Framework is Your Guide: Every architectural decision should be evaluated against the six pillars. This framework appears throughout the exam in scenario-based questions.
Shared Responsibility: AWS secures the infrastructure (hardware, facilities, network), while customers secure what they put in the cloud (data, applications, access management). Know the boundaries.
Service Selection Matters: Choose the right service for the job - managed services reduce operational overhead, serverless eliminates infrastructure management, and purpose-built services optimize for specific workloads.
Regions and AZs are Foundational: Understanding how to leverage multiple AZs for fault tolerance and multiple Regions for disaster recovery is critical for 26% of the exam (Domain 2).
Self-Assessment Checklist
Test yourself before moving to Domain 1. You should be able to:
Explain AWS Global Infrastructure: Describe the relationship between Regions, Availability Zones, and Edge Locations
List the Six Pillars: Name all six pillars of the Well-Architected Framework and give an example of each
Draw the Shared Responsibility Model: Sketch what AWS manages vs. what customers manage for IaaS, PaaS, and SaaS
Identify Service Categories: Given a requirement, identify which AWS service category to use (compute, storage, database, networking)
Define Key Terms: Explain the difference between:
High availability vs. fault tolerance
Scalability vs. elasticity
RPO vs. RTO
Vertical scaling vs. horizontal scaling
Choose Deployment Strategies: Explain when to use single-AZ, multi-AZ, and multi-Region deployments
Understand Service Models: Differentiate between IaaS (EC2), PaaS (Elastic Beanstalk), and SaaS (WorkMail)
Networking: VPC, Route 53, CloudFront, Direct Connect, VPN
Design Principles:
Design for failure (assume everything fails)
Decouple components (loose coupling)
Implement elasticity (scale automatically)
Think parallel (horizontal scaling)
Use managed services (reduce operational burden)
Next Steps
You're now ready to dive into Domain 1: Design Secure Architectures (Chapter 2). This domain covers:
IAM and access management (30% of exam weight)
Network security (VPC, security groups, NACLs)
Data protection (encryption, key management)
The fundamentals you learned here will be applied throughout all four domains. Keep this chapter as a reference as you progress through the more advanced topics.
AWS Global Infrastructure: Regions contain multiple isolated Availability Zones for fault tolerance; Edge Locations provide low-latency content delivery
Well-Architected Framework: Five pillars guide architectural decisions - always consider all five when designing solutions
Shared Responsibility: AWS secures the infrastructure; customers secure their data, applications, and access management
Design for Failure: Assume everything fails; use multiple AZs, implement health checks, and automate recovery
Loose Coupling: Decouple components using queues, load balancers, and managed services to improve resilience and scalability
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between Regions, Availability Zones, and Edge Locations
I understand all five pillars of the Well-Architected Framework
I can describe the Shared Responsibility Model and give examples of AWS vs customer responsibilities
I know the main AWS service categories (Compute, Storage, Database, Networking)
I understand key design principles: design for failure, loose coupling, elasticity, horizontal scaling
I can explain when to use EC2 vs Lambda vs containers
I understand the difference between S3, EBS, and EFS storage types
Compliance and governance: Organizations, SCPs, Control Tower
Time to complete: 12-15 hours Prerequisites: Chapter 0 (Fundamentals) Exam weight: 30% of scored content
Why this matters: Security is the highest-weighted domain on the SAA-C03 exam. Every architecture you design must be secure by default. This chapter teaches you how to implement defense-in-depth security using AWS services, following the principle of least privilege and the AWS Shared Responsibility Model.
Section 1: IAM (Identity and Access Management) Fundamentals
Introduction
The problem: In any IT system, you need to control who can access what resources and what actions they can perform. Without proper access control, unauthorized users could access sensitive data, malicious actors could compromise systems, and legitimate users might accidentally delete critical resources. Traditional on-premises systems use Active Directory and file permissions, but cloud environments need more flexible, scalable access control.
The solution: AWS Identity and Access Management (IAM) provides centralized control over access to AWS resources. IAM allows you to create users, groups, and roles, and attach policies that define permissions. IAM is free, globally available, and integrates with all AWS services.
Why it's tested: IAM questions appear throughout the SAA-C03 exam, not just in Domain 1. Understanding IAM is fundamental to designing secure architectures. Questions test your ability to implement least privilege, use roles instead of long-term credentials, configure cross-account access, and troubleshoot permission issues.
Core Concepts
What is IAM?
What it is: IAM is a web service that helps you securely control access to AWS resources. You use IAM to control who is authenticated (signed in) and authorized (has permissions) to use resources. IAM is a feature of your AWS account offered at no additional charge.
Why it exists: Before IAM, AWS accounts had only a root user with full access to everything. This was insecure because:
You couldn't give different people different levels of access
You couldn't revoke access without changing the root password
You couldn't audit who did what
You couldn't implement least privilege
IAM solves these problems by allowing you to create multiple identities with specific permissions, audit all actions, and implement security best practices.
Real-world analogy: Think of IAM like a corporate office building's security system. The building owner (root user) has master access to everything. IAM users are like employees with ID badges - each badge grants access to specific floors and rooms based on their job role. IAM groups are like departments (all engineers get access to the engineering floor). IAM roles are like temporary visitor badges that grant specific access for a limited time.
How it works (Detailed step-by-step):
You create an AWS account: When you create an AWS account, you start with a root user that has complete access to all AWS services and resources. This root user is identified by the email address used to create the account.
You create IAM users: Instead of using the root user for daily tasks, you create IAM users for each person who needs access to AWS. Each IAM user has:
A unique name (e.g., "alice", "bob")
Credentials (password for console access, access keys for programmatic access)
Permissions (defined by attached policies)
You organize users into groups: To simplify permission management, you create IAM groups (e.g., "Developers", "Administrators", "Auditors") and add users to groups. Policies attached to a group apply to all users in that group.
You create IAM roles: For applications and services (not people), you create IAM roles. Roles are assumed temporarily and don't have long-term credentials. For example, an EC2 instance assumes a role to access S3.
You attach policies: Policies are JSON documents that define permissions. You attach policies to users, groups, or roles to grant permissions. Policies specify:
Which actions are allowed (e.g., s3:GetObject, ec2:StartInstances)
Which resources the actions apply to (e.g., specific S3 buckets, all EC2 instances)
Conditions (e.g., only allow access from specific IP addresses)
AWS evaluates permissions: When a user or role tries to perform an action, AWS evaluates all applicable policies to determine if the action is allowed. By default, all actions are denied unless explicitly allowed.
ā Must Know:
IAM is global - users, groups, roles, and policies are not Region-specific
Root user has complete access and should be secured with MFA and rarely used
IAM users are for people; IAM roles are for applications and services
Policies define permissions; they can be attached to users, groups, or roles
By default, all actions are denied (implicit deny) unless explicitly allowed
An explicit deny in any policy overrides all allows
This diagram illustrates the complete IAM architecture and how different components interact within an AWS account.
Root User (Red - Top): The root user sits at the top with complete, unrestricted access to all AWS services and resources. The dotted line with "Should not use" emphasizes that the root user should be secured with MFA and used only for tasks that specifically require root access (like changing account settings or closing the account). For day-to-day operations, you should use IAM users or roles instead.
IAM Users (Blue): Three IAM users are shown: Alice (Developer), Bob (Administrator), and Charlie (Auditor). Each user represents a real person who needs access to AWS. Users have long-term credentials (passwords and/or access keys) and are assigned to groups based on their job function. Notice that users don't have direct policy attachments in this diagram - they inherit permissions from their groups, which is a best practice for easier management.
IAM Groups (Purple): Groups are collections of users with similar access needs. The diagram shows three groups:
Developers: Contains Alice and other developers who need access to development resources
Administrators: Contains Bob and other admins who need broad access to manage AWS resources
Auditors: Contains Charlie and other auditors who need read-only access to review configurations and logs
Groups simplify permission management - instead of attaching policies to each user individually, you attach policies to groups. When a user joins or leaves a team, you simply add or remove them from the appropriate group.
IAM Roles (Orange): Roles are shown for non-human entities:
EC2-S3-Access: A role that EC2 instances can assume to access S3 buckets
Lambda-Execution: A role that Lambda functions assume to write logs to CloudWatch
Cross-Account-Access: A role that allows users from another AWS account to access resources in this account
Roles don't have long-term credentials. Instead, they provide temporary security credentials when assumed. This is more secure than embedding access keys in application code.
IAM Policies (Green): Policies are JSON documents that define permissions. The diagram shows three policies:
S3-Read-Only: Allows reading objects from S3 buckets but not writing or deleting
EC2-Full-Access: Allows all EC2 actions (start, stop, terminate instances, etc.)
CloudWatch-Logs: Allows writing logs to CloudWatch Logs
Policies are attached to groups and roles. The Developers group has the S3-Read-Only policy, meaning all developers can read S3 objects. The EC2-S3-Access role has the S3-Read-Only policy, meaning EC2 instances with this role can read S3 objects.
AWS Resources (Bottom): The diagram shows how IAM entities interact with AWS resources:
The EC2 instance has the EC2-S3-Access role attached, allowing it to access S3
The Lambda function has the Lambda-Execution role attached, allowing it to write logs
Bob (Administrator) can manage EC2 instances because his Administrators group has the EC2-Full-Access policy
Alice (Developer) can read from S3 because her Developers group has the S3-Read-Only policy
Key Architectural Principles Shown:
Least Privilege: Each entity has only the permissions it needs. Developers can read S3 but not delete. Auditors can view but not modify.
Separation of Duties: Different groups have different permissions. Developers can't perform administrative tasks.
Roles for Applications: EC2 and Lambda use roles, not embedded credentials, to access other services.
Group-Based Management: Users inherit permissions from groups, making it easy to manage permissions for many users.
Root User Protection: The root user is not used for daily operations, reducing the risk of compromise.
This architecture represents IAM best practices and is the foundation for secure AWS environments. Understanding this structure is critical for the SAA-C03 exam.
IAM Users
What it is: An IAM user is an entity that represents a person or application that interacts with AWS. Each IAM user has a unique name within the AWS account and can have credentials (password for console access, access keys for programmatic access) and permissions.
Why it exists: You need a way to give individuals access to AWS without sharing the root user credentials. IAM users provide individual identities with specific permissions, enabling accountability (you know who did what) and security (you can revoke access for specific users).
Real-world analogy: Think of IAM users like employee accounts in a company's computer system. Each employee has their own username and password, their own email address, and their own set of permissions based on their role. If an employee leaves, you disable their account without affecting others.
How it works (Detailed step-by-step):
Creating an IAM user:
You navigate to the IAM console and click "Add users"
You specify a username (e.g., "alice.smith")
You choose the type of access:
AWS Management Console access: Provides a password for signing into the AWS web console
Programmatic access: Provides access keys (Access Key ID and Secret Access Key) for using the AWS CLI, SDKs, or APIs
You can enable both types of access for a single user
Setting credentials:
Console password: You can auto-generate a password or create a custom password. You can require the user to change their password on first sign-in.
Access keys: AWS generates an Access Key ID (like a username) and Secret Access Key (like a password). The Secret Access Key is shown only once - if you lose it, you must create new access keys.
Assigning permissions:
You can attach policies directly to the user (not recommended for most cases)
You can add the user to one or more groups (recommended - easier to manage)
You can set a permissions boundary (advanced - limits the maximum permissions the user can have)
For programmatic access: User configures the AWS CLI or SDK with their access keys
AWS authenticates and authorizes:
AWS verifies the credentials (authentication)
AWS evaluates all policies attached to the user and their groups to determine what actions are allowed (authorization)
The user can perform only the actions explicitly allowed by their policies
ā Must Know:
IAM users are for long-term credentials (people who need ongoing access)
Each user should represent one person - don't share IAM user credentials
Users can have console access, programmatic access, or both
Access keys should be rotated regularly (every 90 days is a common practice)
Users can have up to 2 active access keys (allows rotation without downtime)
Enable MFA (Multi-Factor Authentication) for all users, especially those with administrative access
Detailed Example 1: Creating a Developer User
Scenario: You're hiring a new developer, Alice, who needs access to AWS to deploy applications. She needs console access to view resources and programmatic access to deploy code.
Step-by-step implementation:
Create the IAM user:
aws iam create-user --user-name alice.smith
Enable console access:
aws iam create-login-profile --user-name alice.smith --password 'TempPassword123!' --password-reset-required
This creates a temporary password that Alice must change on first sign-in.
Result: Alice can now sign into the AWS console with her username and password (plus MFA code), and she can use the AWS CLI with her access keys. Her permissions are determined by the policies attached to the Developers group. If Alice leaves the company, you can delete her IAM user without affecting other developers.
Detailed Example 2: Rotating Access Keys
Scenario: Alice's access keys are 90 days old and need to be rotated for security. You need to rotate them without causing downtime for her applications.
Step-by-step implementation:
Create a second access key (Alice can have up to 2 active keys):
aws configure set aws_access_key_id AKIAI44QH8DHBEXAMPLE
aws configure set aws_secret_access_key je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
Alice updates any applications or scripts that use the old key
Alice tests that everything works with the new key
Deactivate the old key (don't delete yet - keep it as a backup):
aws iam update-access-key --user-name alice.smith --access-key-id AKIAIOSFODNN7EXAMPLE --status Inactive
Monitor for errors (wait 24-48 hours):
Check CloudTrail logs for any API calls using the old key
If any applications are still using the old key, they'll fail and you can identify them
Update those applications to use the new key
Delete the old key (after confirming nothing is using it):
aws iam delete-access-key --user-name alice.smith --access-key-id AKIAIOSFODNN7EXAMPLE
Result: Alice's access keys have been rotated without downtime. The two-key system allows graceful rotation - you create the new key, update applications, verify everything works, then delete the old key.
Detailed Example 3: Troubleshooting Permission Issues
Scenario: Alice tries to terminate an EC2 instance but gets an "Access Denied" error. You need to troubleshoot why.
Step-by-step troubleshooting:
Check what policies are attached to Alice:
aws iam list-attached-user-policies --user-name alice.smith
aws iam list-groups-for-user --user-name alice.smith
Output shows Alice is in the "Developers" group.
Check what policies are attached to the Developers group:
aws iam list-attached-group-policies --group-name Developers
Output shows the group has the "DevelopersPolicy" attached.
View the policy document:
aws iam get-policy-version --policy-arn arn:aws:iam::123456789012:policy/DevelopersPolicy --version-id v1
This updated policy allows terminating instances, but only if they're tagged with Environment=Development. This prevents developers from accidentally terminating production instances.
Result: You've identified the permission issue, understood why it exists, and implemented a solution that grants the necessary permission while maintaining security (developers can only terminate development instances, not production).
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Sharing IAM user credentials among multiple people
Why it's wrong: You lose accountability - you can't tell who performed which action. If one person leaves, you have to change credentials for everyone.
Correct understanding: Create a separate IAM user for each person. Use IAM groups to manage permissions for multiple users with similar needs.
Mistake 2: Embedding access keys in application code
Why it's wrong: If the code is shared (e.g., pushed to GitHub), the access keys are exposed. Anyone with the keys can access your AWS account.
Correct understanding: Use IAM roles for applications running on AWS (EC2, Lambda, ECS). For applications running outside AWS, use temporary credentials from AWS STS or store credentials in a secrets manager.
Mistake 3: Never rotating access keys
Why it's wrong: If access keys are compromised, attackers have unlimited time to use them. Old keys might be embedded in forgotten scripts or applications.
Correct understanding: Rotate access keys every 90 days. Use AWS IAM Access Analyzer to identify unused access keys and delete them.
Mistake 4: Granting overly broad permissions
Why it's wrong: If an IAM user is compromised, the attacker has access to everything the user can access. This violates the principle of least privilege.
Correct understanding: Grant only the permissions needed for the user's job. Start with minimal permissions and add more as needed, rather than starting with broad permissions and trying to restrict them.
š Connections to Other Topics:
Relates to IAM Roles (covered next) because: Roles are preferred over users for applications
Builds on IAM Policies (covered later) by: Policies define what users can do
Often used with MFA (covered later) to: Add an extra layer of security
š” Tips for Understanding:
Think of IAM users as "people accounts" - each person gets their own user
Remember: Users have long-term credentials; roles have temporary credentials
When troubleshooting permissions, always check: user policies, group policies, and resource policies
šÆ Exam Focus: Questions often test whether you understand when to use IAM users vs. roles, how to implement least privilege, and how to troubleshoot permission issues. Remember: roles are preferred for applications; users are for people.
IAM Groups
What it is: An IAM group is a collection of IAM users. Groups let you specify permissions for multiple users, making it easier to manage permissions. Users in a group automatically inherit the permissions assigned to the group.
Why it exists: Managing permissions for individual users becomes unmanageable as your organization grows. If you have 50 developers and need to change their permissions, you don't want to update 50 individual users. Groups solve this by allowing you to manage permissions once for the entire group.
Real-world analogy: Think of IAM groups like departments in a company. All employees in the Engineering department get access to the engineering tools and resources. When a new engineer joins, you add them to the Engineering department and they automatically get the appropriate access. When they leave, you remove them from the department.
How it works (Detailed step-by-step):
Creating a group:
You create a group with a descriptive name (e.g., "Developers", "DatabaseAdmins", "Auditors")
You attach policies to the group that define what members can do
You add users to the group
Users inherit permissions:
When a user is added to a group, they inherit all policies attached to that group
A user can be in multiple groups (e.g., Alice might be in both "Developers" and "OnCallEngineers")
The user's effective permissions are the union of all policies from all their groups plus any policies attached directly to the user
Managing permissions at scale:
To grant a new permission to all developers, you update the Developers group policy once
All users in the group immediately get the new permission
To revoke access for a user, you remove them from the group
ā Must Know:
Groups are collections of users - they simplify permission management
Users can be in multiple groups (up to 10 groups per user)
Groups cannot be nested (a group cannot contain another group)
Groups cannot be used as principals in resource-based policies (you can't grant S3 bucket access to a group directly)
Best practice: Attach policies to groups, not individual users
Detailed Example 1: Organizing Users by Job Function
Scenario: Your company has developers, database administrators, and auditors. Each group needs different permissions.
Step-by-step implementation:
Create groups for each job function:
aws iam create-group --group-name Developers
aws iam create-group --group-name DatabaseAdmins
aws iam create-group --group-name Auditors
aws iam put-group-policy --group-name Developers --policy-name DevelopersPolicy --policy-document file://developers-policy.json
aws iam put-group-policy --group-name DatabaseAdmins --policy-name DatabaseAdminsPolicy --policy-document file://dbadmins-policy.json
aws iam put-group-policy --group-name Auditors --policy-name AuditorsPolicy --policy-document file://auditors-policy.json
Add users to appropriate groups:
aws iam add-user-to-group --user-name alice.smith --group-name Developers
aws iam add-user-to-group --user-name bob.jones --group-name DatabaseAdmins
aws iam add-user-to-group --user-name charlie.brown --group-name Auditors
Result: You've organized users by job function. When a new developer joins, you simply add them to the Developers group and they automatically get all developer permissions. When you need to grant developers access to a new service, you update the Developers group policy once instead of updating each developer individually.
Detailed Example 2: Multi-Group Membership
Scenario: Alice is a developer who is also on the on-call rotation. During on-call, she needs additional permissions to restart services and view logs.
aws iam put-group-policy --group-name OnCallEngineers --policy-name OnCallPolicy --policy-document file://oncall-policy.json
Add Alice to both groups:
aws iam add-user-to-group --user-name alice.smith --group-name Developers
aws iam add-user-to-group --user-name alice.smith --group-name OnCallEngineers
Alice's effective permissions:
From Developers group: Can start/stop EC2, read/write S3, invoke Lambda (in us-east-1 and us-west-2)
From OnCallEngineers group: Can reboot/terminate EC2, reboot RDS, manage CloudWatch alarms, read logs, publish SNS messages (for Production and Staging resources)
Combined: Alice has all permissions from both groups
When Alice's on-call rotation ends:
aws iam remove-user-from-group --user-name alice.smith --group-name OnCallEngineers
Alice loses the on-call permissions but retains her developer permissions.
Result: Alice has different permissions based on her current responsibilities. During on-call, she has elevated permissions to respond to incidents. When her rotation ends, you simply remove her from the OnCallEngineers group without affecting her developer permissions.
Detailed Example 3: Temporary Project Access
Scenario: Your company is working on a special project that requires access to a specific S3 bucket. Multiple users from different teams need access for 3 months.
Step-by-step implementation:
Create a project-specific group:
aws iam create-group --group-name ProjectPhoenixTeam
aws iam put-group-policy --group-name ProjectPhoenixTeam --policy-name ProjectPhoenixAccess --policy-document file://project-policy.json
Add team members from different departments:
aws iam add-user-to-group --user-name alice.smith --group-name ProjectPhoenixTeam # Developer
aws iam add-user-to-group --user-name bob.jones --group-name ProjectPhoenixTeam # DBA
aws iam add-user-to-group --user-name david.lee --group-name ProjectPhoenixTeam # Data Scientist
After 3 months, when the project ends:
# Remove all users from the group
aws iam remove-user-from-group --user-name alice.smith --group-name ProjectPhoenixTeam
aws iam remove-user-from-group --user-name bob.jones --group-name ProjectPhoenixTeam
aws iam remove-user-from-group --user-name david.lee --group-name ProjectPhoenixTeam
# Delete the group
aws iam delete-group-policy --group-name ProjectPhoenixTeam --policy-name ProjectPhoenixAccess
aws iam delete-group --group-name ProjectPhoenixTeam
Result: You've granted temporary access to multiple users from different teams without modifying their permanent permissions. When the project ends, you clean up by removing users from the group and deleting the group. Each user retains their original permissions from their primary groups (Developers, DatabaseAdmins, etc.).
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Trying to nest groups (putting a group inside another group)
Why it's wrong: IAM doesn't support nested groups. You can't create a "SeniorDevelopers" group that contains the "Developers" group.
Correct understanding: If you need hierarchical permissions, create separate groups with different policies. Users can be in multiple groups to get combined permissions.
Mistake 2: Attaching policies directly to users instead of using groups
Why it's wrong: This becomes unmanageable as your organization grows. If you have 50 developers with individual policies, updating permissions requires 50 changes.
Correct understanding: Always use groups for permission management. Attach policies to groups, then add users to groups. Only attach policies directly to users in exceptional cases.
Mistake 3: Creating too many groups with overlapping permissions
Why it's wrong: This creates confusion and makes it hard to understand what permissions a user has. You might have "Developers", "BackendDevelopers", "FrontendDevelopers", "SeniorDevelopers", etc., with unclear distinctions.
Correct understanding: Create groups based on clear job functions or responsibilities. Use descriptive names. Document what each group is for and what permissions it grants.
Mistake 4: Forgetting that users can be in multiple groups
Why it's wrong: You might create overly broad groups because you think users can only be in one group.
Correct understanding: Users can be in up to 10 groups. Use this to your advantage - create focused groups (Developers, OnCallEngineers, ProjectTeam) and add users to multiple groups as needed.
š Connections to Other Topics:
Relates to IAM Users (covered previously) because: Groups contain users
Builds on IAM Policies (covered later) by: Policies attached to groups apply to all group members
Often used with Least Privilege (covered later) to: Grant minimum necessary permissions to groups
š” Tips for Understanding:
Think of groups as "permission templates" - create a group for each job function
Remember: Groups simplify management but don't provide additional security - they're just a way to organize users
When designing groups, think about how people's roles might change over time
šÆ Exam Focus: Questions often test whether you understand how to use groups effectively, how multi-group membership works, and how to troubleshoot permission issues involving groups. Remember: groups are for management convenience, not security boundaries.
IAM Roles
What it is: An IAM role is an IAM identity with specific permissions, but unlike users, roles are not associated with a specific person. Instead, roles are assumed by entities that need temporary access to AWS resources - such as EC2 instances, Lambda functions, or users from another AWS account. When an entity assumes a role, AWS provides temporary security credentials that expire after a specified time.
Why it exists: Embedding long-term credentials (access keys) in applications is insecure - if the application code is compromised or accidentally shared, the credentials are exposed. Roles solve this by providing temporary credentials that automatically rotate and expire. Roles also enable cross-account access and allow AWS services to access other AWS services on your behalf.
Real-world analogy: Think of IAM roles like temporary security badges at a conference. You don't get a permanent employee badge - instead, you check in at registration, show your ID, and receive a temporary badge that's valid for the day. The badge grants you access to specific areas based on your registration type (speaker, attendee, vendor). At the end of the day, the badge expires automatically. Similarly, when an application assumes a role, it gets temporary credentials that expire automatically.
How it works (Detailed step-by-step):
Creating a role:
You create a role and specify who can assume it (the trust policy)
You attach permissions policies that define what the role can do
You optionally set a maximum session duration (1 hour to 12 hours)
Trust policy (who can assume the role):
The trust policy is a JSON document that specifies which entities can assume the role
For EC2 instances: Trust policy allows the EC2 service to assume the role
For Lambda functions: Trust policy allows the Lambda service to assume the role
For cross-account access: Trust policy allows users from another AWS account to assume the role
Assuming the role:
An entity (EC2 instance, Lambda function, IAM user) requests to assume the role
AWS STS (Security Token Service) validates the request against the trust policy
These credentials are valid for the session duration (default 1 hour, configurable up to 12 hours)
Using temporary credentials:
The entity uses the temporary credentials to make AWS API calls
AWS validates the credentials and checks the role's permissions policies
The entity can perform only the actions allowed by the role's policies
Automatic rotation:
Before the credentials expire, AWS automatically provides new credentials
For EC2 instances and Lambda functions, this happens transparently - you don't need to do anything
The credentials expire automatically after the session duration, limiting the impact if they're compromised
ā Must Know:
Roles provide temporary credentials that automatically rotate and expire
Roles are for applications and services, not for people (though users can assume roles for cross-account access)
Roles have two types of policies: trust policy (who can assume) and permissions policy (what they can do)
EC2 instances and Lambda functions should always use roles, never embedded access keys
Roles can be assumed by: AWS services, IAM users (same or different account), federated users, web identity providers
š IAM Roles Flow Diagram:
sequenceDiagram
participant APP as Application<br/>(EC2 Instance)
participant EC2 as EC2 Service
participant STS as AWS STS<br/>(Security Token Service)
participant S3 as S3 Service
Note over APP,S3: Application needs to access S3
APP->>EC2: Request temporary credentials<br/>for attached IAM role
EC2->>STS: AssumeRole request<br/>for EC2-S3-Access role
STS->>STS: Validate role trust policy<br/>(EC2 is allowed to assume this role)
STS->>EC2: Return temporary credentials<br/>(Access Key, Secret Key, Session Token)<br/>Valid for 1-12 hours
EC2->>APP: Provide temporary credentials
Note over APP: Credentials are automatically<br/>rotated before expiration
APP->>S3: GetObject request<br/>using temporary credentials
S3->>S3: Validate credentials<br/>Check role permissions
S3->>APP: Return object data
Note over APP,S3: No long-term credentials stored!<br/>Credentials expire automatically
See: diagrams/02_domain1_iam_roles_flow.mmd
Diagram Explanation (detailed):
This sequence diagram illustrates how IAM roles work in practice, showing the complete flow from an application requesting access to receiving temporary credentials and using them to access AWS services.
Step 1: Application Needs Access: The application running on an EC2 instance needs to access an S3 bucket. Instead of having access keys embedded in the application code, the EC2 instance has an IAM role attached to it (EC2-S3-Access role).
Step 2: Request Temporary Credentials: The application uses the AWS SDK, which automatically detects that it's running on EC2 and requests temporary credentials from the EC2 metadata service. This happens transparently - the application code doesn't need to explicitly request credentials.
Step 3: AssumeRole Request to STS: The EC2 service forwards the request to AWS Security Token Service (STS), asking to assume the EC2-S3-Access role on behalf of the instance.
Step 4: Validate Trust Policy: STS checks the role's trust policy to verify that the EC2 service is allowed to assume this role. The trust policy for this role looks like:
Session Token: Additional credential that proves these are temporary credentials
Expiration Time: When these credentials will expire (default 1 hour, max 12 hours)
These credentials are returned to the EC2 service, which provides them to the application.
Step 6: Automatic Rotation: The AWS SDK automatically handles credential rotation. Before the credentials expire, the SDK requests new credentials from the metadata service. This happens transparently - the application doesn't need to handle credential rotation.
Step 7: Use Credentials to Access S3: The application makes an API call to S3 (GetObject) using the temporary credentials. The request includes the Access Key ID, Secret Access Key, and Session Token.
Step 8: Validate and Authorize: S3 validates the temporary credentials with STS and checks the role's permissions policy to determine if the GetObject action is allowed. The permissions policy for this role looks like:
This policy allows reading objects from the specific S3 bucket.
Step 9: Return Data: If the action is allowed, S3 returns the requested object data to the application.
Key Security Benefits Shown:
No Long-Term Credentials: The application never has access keys embedded in its code. If the application code is compromised, there are no permanent credentials to steal.
Automatic Expiration: The temporary credentials expire after 1-12 hours. Even if an attacker obtains the credentials, they have limited time to use them.
Automatic Rotation: The SDK automatically requests new credentials before the old ones expire, ensuring continuous operation without manual intervention.
Least Privilege: The role has permissions only to read from a specific S3 bucket, not all S3 buckets or other AWS services. If the credentials are compromised, the damage is limited.
Auditability: All actions performed using the role are logged in CloudTrail with the role name, making it easy to audit what happened and when.
This pattern is the recommended way to grant AWS services access to other AWS services. It's more secure than embedding access keys and requires no credential management by the application developer.
Detailed Example 2: Cross-Account Access with External ID
Imagine you're a SaaS company providing analytics services. Your customer (Company A) wants you to access their S3 bucket to analyze their data, but they want to ensure that only your application can access their data, not other customers' applications that might also use your service.
The Problem: If you just create an IAM role in Company A's account that trusts your AWS account, any application in your account could potentially assume that role. This is called the "confused deputy problem" - Company A's role might be tricked into granting access to the wrong application.
The Solution: Use an External ID, which acts like a secret password that only you and Company A know.
Setup Process:
You Generate a Unique External ID: Your application generates a random, unique identifier for Company A (e.g., "CompanyA-12345-abcde"). This External ID is stored in your database associated with Company A's account.
Company A Creates a Role: Company A creates an IAM role in their account with this trust policy:
The request comes from your AWS account (matches the Principal)
The External ID in the request matches the External ID in the trust policy
Only if both match does STS grant temporary credentials
Why This Works:
Even if another customer (Company B) tries to trick your application into accessing Company A's data, they don't know Company A's External ID
Each customer has a unique External ID, preventing cross-customer access
The External ID acts as a shared secret that proves the request is legitimate
Real-World Scenario: AWS CloudFormation uses this pattern. When you create a stack that needs to access resources in another account, CloudFormation requires an External ID to prevent unauthorized cross-account access.
Detailed Example 3: Service Control Policies (SCPs) in AWS Organizations
Imagine you're managing a large enterprise with 50 AWS accounts organized into different Organizational Units (OUs): Development, Testing, Production, and Security. You need to enforce company-wide security policies that cannot be overridden by individual account administrators.
The Challenge: Even if you create perfect IAM policies in each account, an account administrator could modify or delete those policies. You need a way to enforce policies at a higher level that cannot be bypassed.
The Solution: Service Control Policies (SCPs) in AWS Organizations act as guardrails that define the maximum permissions for all IAM entities in an account, regardless of their IAM policies.
How SCPs Work:
SCPs don't grant permissions - they define boundaries. An IAM entity can only perform actions that are allowed by BOTH:
Their IAM policy (identity-based or resource-based)
The SCPs applied to their account
Think of it like this: IAM policies say "what you can do," while SCPs say "what you're allowed to do." You need both to allow an action.
Example SCP Implementation:
Scenario: You want to prevent anyone in Development accounts from launching expensive EC2 instance types (like p3.16xlarge GPU instances that cost $24/hour), but Production accounts should be able to use them.
Step 2: Attach SCP to Development OU: This SCP is attached to the Development OU, which contains 20 development accounts.
What Happens:
In Development Account:
A developer has full EC2 permissions via their IAM policy
They try to launch a p3.16xlarge instance
AWS evaluates: IAM policy says "Allow", but SCP says "Deny"
Result: Denied - The SCP overrides the IAM policy
Even if the account administrator gives themselves full admin permissions, they still cannot launch these instance types
In Production Account:
Production OU doesn't have this restrictive SCP
A production engineer with EC2 permissions can launch p3.16xlarge instances
AWS evaluates: IAM policy says "Allow", SCP doesn't deny
Result: Allowed
Key SCP Characteristics:
Inheritance: SCPs attached to parent OUs apply to all child OUs and accounts. If you attach an SCP to the root of your organization, it applies to ALL accounts.
Explicit Deny Wins: If any SCP denies an action, that action is denied regardless of IAM policies. This is the most powerful feature - it cannot be overridden.
Default Deny: By default, SCPs use a "FullAWSAccess" policy that allows everything. When you create restrictive SCPs, you're adding denies on top of this.
No Effect on Root User: SCPs do not affect the root user of member accounts. This is why you should always secure root users with MFA and avoid using them.
Common SCP Use Cases:
Use Case 1: Prevent Region Usage: Force all resources to be created in specific regions for data residency compliance:
SCPs define maximum permissions - they don't grant permissions
Explicit deny in SCP cannot be overridden by any IAM policy
SCPs apply to all IAM entities in an account except the root user
SCPs are inherited from parent OUs to child OUs and accounts
You need both IAM policy Allow AND no SCP Deny for an action to succeed
SCPs are evaluated before IAM policies in the authorization flow
š” Tips for Understanding SCPs:
Think of SCPs as "permission boundaries for entire accounts"
Use SCPs for organization-wide security requirements that must not be bypassed
Start with broad SCPs at the root, then add more specific ones at OU level
Test SCPs in a non-production OU first to avoid accidentally blocking critical operations
ā ļø Common Mistakes with SCPs:
Mistake: Thinking SCPs grant permissions
Why it's wrong: SCPs only restrict permissions. You still need IAM policies to grant permissions.
Correct understanding: SCPs set boundaries; IAM policies grant permissions within those boundaries.
Mistake: Forgetting SCPs don't affect root user
Why it's wrong: Root user can still perform actions denied by SCPs
Correct understanding: Always secure root user separately with MFA and avoid using it for daily operations.
Mistake: Creating overly restrictive SCPs that block AWS service operations
Why it's wrong: Some AWS services need to perform actions on your behalf (like CloudFormation creating resources)
Correct understanding: Use condition keys to allow service-to-service calls while restricting user actions.
Section 2: Network Security & VPC Architecture
Introduction
The problem: Applications need to be accessible to users while remaining protected from attacks. Public internet exposure creates security risks, but complete isolation makes applications unusable.
The solution: Amazon Virtual Private Cloud (VPC) provides network isolation with fine-grained control over traffic flow, allowing you to create secure network architectures that balance accessibility with protection.
Why it's tested: Network security is fundamental to the "Design Secure Architectures" domain (30% of exam). Questions test your ability to design VPC architectures with proper segmentation, access controls, and traffic filtering.
Core Concepts
Virtual Private Cloud (VPC) Fundamentals
What it is: A VPC is a logically isolated section of the AWS cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including IP address ranges, subnets, route tables, and network gateways.
Why it exists: When AWS launched, all resources were in a shared network space. Customers needed network isolation for security, compliance, and to replicate their on-premises network architectures in the cloud. VPC provides this isolation while maintaining the flexibility and scalability of cloud computing.
Real-world analogy: Think of a VPC like a private office building within a large business district (AWS Region). The building has its own address range (CIDR block), multiple floors (subnets), security checkpoints (security groups and NACLs), and controlled entry/exit points (internet gateways and NAT gateways). Just as you control who enters your building and which floors they can access, you control network traffic in your VPC.
How it works (Detailed step-by-step):
Create VPC with CIDR Block: You define an IP address range for your VPC using CIDR notation (e.g., 10.0.0.0/16). This gives you 65,536 IP addresses to use within your VPC. AWS reserves 5 IP addresses in each subnet for networking purposes (network address, VPC router, DNS, future use, and broadcast).
Divide into Subnets: You create subnets within your VPC, each in a specific Availability Zone. Each subnet gets a portion of the VPC's IP address range (e.g., 10.0.1.0/24 for public subnet, 10.0.2.0/24 for private subnet). Subnets cannot span multiple Availability Zones.
Configure Route Tables: Each subnet has a route table that determines where network traffic is directed. The route table contains rules (routes) that specify which traffic goes where. For example, a route might say "send traffic destined for 10.0.0.0/16 to local (within VPC)" and "send traffic destined for 0.0.0.0/0 (internet) to the internet gateway."
Attach Internet Gateway (for public access): An internet gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between your VPC and the internet. You attach one internet gateway per VPC. Resources in subnets with routes to the internet gateway can communicate with the internet if they have public IP addresses.
Configure Security Groups: Security groups act as virtual firewalls for your EC2 instances. They control inbound and outbound traffic at the instance level. Security groups are stateful - if you allow inbound traffic, the response traffic is automatically allowed outbound.
Configure Network ACLs: Network Access Control Lists (NACLs) provide an additional layer of security at the subnet level. They control traffic entering and leaving subnets. NACLs are stateless - you must explicitly allow both inbound and outbound traffic.
Launch Resources: You launch EC2 instances, RDS databases, and other resources into your subnets. Each resource gets a private IP address from the subnet's CIDR range. Resources in public subnets can optionally receive public IP addresses or Elastic IPs for internet communication.
Traffic Flow: When an instance sends traffic, AWS evaluates security groups, NACLs, and route tables to determine if the traffic is allowed and where it should go. This evaluation happens at wire speed without impacting performance.
š VPC Architecture Diagram:
graph TB
subgraph "AWS Cloud"
subgraph "VPC 10.0.0.0/16"
IGW[Internet Gateway]
subgraph "Availability Zone A"
subgraph "Public Subnet 10.0.1.0/24"
WEB1[Web Server<br/>Public IP: 54.x.x.x<br/>Private IP: 10.0.1.10]
NAT1[NAT Gateway<br/>Elastic IP: 52.x.x.x]
end
subgraph "Private Subnet 10.0.2.0/24"
APP1[App Server<br/>Private IP: 10.0.2.10]
DB1[RDS Primary<br/>Private IP: 10.0.2.20]
end
end
subgraph "Availability Zone B"
subgraph "Public Subnet 10.0.3.0/24"
WEB2[Web Server<br/>Public IP: 54.x.x.y<br/>Private IP: 10.0.3.10]
NAT2[NAT Gateway<br/>Elastic IP: 52.x.x.y]
end
subgraph "Private Subnet 10.0.4.0/24"
APP2[App Server<br/>Private IP: 10.0.4.10]
DB2[RDS Standby<br/>Private IP: 10.0.4.20]
end
end
end
end
INTERNET[Internet Users]
INTERNET -->|HTTPS 443| IGW
IGW --> WEB1
IGW --> WEB2
See: diagrams/02_domain1_vpc_architecture.mmd
Diagram Explanation (Comprehensive):
This diagram shows a production-ready, highly available VPC architecture spanning two Availability Zones (AZ-A and AZ-B) within a single AWS Region. Let me explain each component and how they work together:
VPC Foundation (10.0.0.0/16): The entire VPC uses the 10.0.0.0/16 CIDR block, providing 65,536 IP addresses. This is a private IP range (RFC 1918) that won't conflict with public internet addresses. The /16 subnet mask means the first 16 bits are fixed (10.0), and the remaining 16 bits can vary, giving us flexibility to create many subnets.
Internet Gateway (IGW): The Internet Gateway is the entry and exit point for internet traffic. It's a highly available, horizontally scaled AWS-managed component attached to the VPC. The IGW performs Network Address Translation (NAT) for instances with public IP addresses, translating between private IPs (10.0.x.x) and public IPs (54.x.x.x). It's the only way for resources in public subnets to communicate directly with the internet.
Public Subnets (10.0.1.0/24 and 10.0.3.0/24): These subnets are "public" because their route tables have a route sending internet-bound traffic (0.0.0.0/0) to the Internet Gateway. Each public subnet provides 256 IP addresses (actually 251 usable, as AWS reserves 5). Resources in public subnets can have public IP addresses and communicate directly with the internet. In this architecture, we place web servers and NAT Gateways in public subnets because they need to accept connections from or initiate connections to the internet.
Web Servers (WEB1 and WEB2): Each web server has two IP addresses: a private IP from the subnet range (10.0.1.10 and 10.0.3.10) and a public IP (54.x.x.x and 54.x.x.y) for internet communication. When internet users send HTTPS requests to the public IP, the Internet Gateway translates it to the private IP and forwards it to the web server. The web server processes the request and sends the response back through the IGW. Having web servers in both AZs provides high availability - if AZ-A fails, WEB2 in AZ-B continues serving traffic.
NAT Gateways (NAT1 and NAT2): NAT Gateways enable instances in private subnets to initiate outbound connections to the internet (for software updates, API calls, etc.) while preventing inbound connections from the internet. Each NAT Gateway has an Elastic IP address (a static public IP) and is placed in a public subnet. When an app server in a private subnet sends traffic to the internet, the traffic is routed to the NAT Gateway, which translates the private IP to its Elastic IP, sends the traffic to the internet, receives the response, and forwards it back to the app server. Having separate NAT Gateways in each AZ provides high availability and reduces cross-AZ data transfer costs.
Private Subnets (10.0.2.0/24 and 10.0.4.0/24): These subnets are "private" because their route tables send internet-bound traffic to a NAT Gateway instead of directly to the Internet Gateway. Resources in private subnets only have private IP addresses and cannot be directly accessed from the internet. This provides an additional security layer - even if an attacker compromises the web server, they cannot directly access the app servers or databases. The private subnets can still initiate outbound connections through the NAT Gateway for updates and external API calls.
Application Servers (APP1 and APP2): These servers run the business logic and are placed in private subnets for security. They only have private IPs (10.0.2.10 and 10.0.4.10) and cannot be accessed directly from the internet. Web servers communicate with app servers using private IPs within the VPC. The app servers can make outbound internet connections through their respective NAT Gateways for tasks like calling external APIs or downloading updates.
RDS Database Instances (DB1 and DB2): The database instances are also in private subnets with only private IPs (10.0.2.20 and 10.0.4.20). DB1 is the primary instance handling all read and write operations, while DB2 is a standby replica in a different AZ for high availability. RDS automatically performs synchronous replication from DB1 to DB2, ensuring zero data loss. If DB1 fails, RDS automatically promotes DB2 to primary within 1-2 minutes. The databases are the most critical and sensitive components, so they're placed in the most protected layer with no internet access.
Route Tables:
Public Route Table: Contains two routes: (1) 10.0.0.0/16 ā local (traffic within VPC stays in VPC), and (2) 0.0.0.0/0 ā IGW (all other traffic goes to internet). This table is associated with both public subnets.
Private Route Table AZ-A: Contains (1) 10.0.0.0/16 ā local, and (2) 0.0.0.0/0 ā NAT1 (internet traffic goes through NAT Gateway in AZ-A). Associated with private subnets in AZ-A.
Private Route Table AZ-B: Same as AZ-A but routes to NAT2. Associated with private subnets in AZ-B.
Traffic Flow Examples:
User Request Flow: Internet user ā IGW ā WEB1 (public subnet) ā APP1 (private subnet) ā DB1 (private subnet) ā response back through same path.
Outbound Update Flow: APP1 needs to download updates ā traffic routed to NAT1 (via route table) ā NAT1 translates private IP to Elastic IP ā IGW ā Internet ā response back through same path.
Cross-AZ Communication: WEB1 (AZ-A) can communicate with APP2 (AZ-B) using private IPs because both are in the same VPC (10.0.0.0/16 ā local route).
Database Replication: DB1 ā DB2 synchronous replication happens over private IPs within the VPC, never leaving AWS's network.
Security Layers: This architecture implements defense in depth with multiple security layers:
Network Segmentation: Public and private subnets separate internet-facing and internal resources
No Direct Internet Access: App servers and databases cannot be accessed from internet
Controlled Outbound Access: Private resources can only reach internet through NAT Gateways
High Availability: Resources in multiple AZs ensure service continuity during failures
Least Privilege: Each tier only has the network access it needs
This is the recommended architecture pattern for production workloads on AWS, balancing security, availability, and operational requirements.
Detailed Example 1: Three-Tier Web Application VPC Design
Let's design a VPC for an e-commerce application with web servers, application servers, and databases. The application needs to be highly available, secure, and scalable.
Requirements:
Support 100 web servers, 200 application servers, 10 database instances
High availability across 2 Availability Zones
Web servers accessible from internet
Application servers and databases not directly accessible from internet
Application servers need to call external payment APIs
Comply with PCI-DSS requirements for payment processing
Design Solution:
Step 1: Choose VPC CIDR Block We'll use 10.0.0.0/16 (65,536 IPs) to ensure we have enough addresses for growth.
Step 2: Plan Subnet Structure We need 6 subnets (3 tiers Ć 2 AZs):
Public Subnet AZ-A: 10.0.1.0/24 (256 IPs) - Web servers
Public Subnet AZ-B: 10.0.2.0/24 (256 IPs) - Web servers
Data Transfer: Cross-AZ traffic costs $0.01/GB (minimize by using same-AZ NAT)
Elastic IPs: Free when attached to running NAT Gateways
Detailed Example 2: Security Group vs NACL - When to Use Each
Understanding the difference between Security Groups and Network ACLs is critical for the exam. Let's explore a scenario that demonstrates when to use each.
Scenario: You're securing a web application where you've noticed suspicious traffic patterns. Some IP addresses are making thousands of requests per second (potential DDoS), and you need to block them. You also need to ensure that only your application servers can access your database.
Security Groups Approach:
Security Groups are stateful, instance-level firewalls. When you allow inbound traffic, the response is automatically allowed outbound.
Problem with Security Groups for DDoS: Security Groups cannot block specific IP addresses. They can only allow traffic from specific sources. To block the malicious IPs, you would need to:
Remove the rule allowing 0.0.0.0/0
Add rules allowing only legitimate IP ranges
This is impractical when you need to allow all internet users except specific attackers
Network ACL Approach:
Network ACLs are stateless, subnet-level firewalls. You must explicitly allow both inbound and outbound traffic. NACLs support both ALLOW and DENY rules, and rules are evaluated in order by rule number.
Example NACL for Public Subnet:
Inbound Rules:
Rule # Type Protocol Port Source Allow/Deny
10 HTTP TCP 80 0.0.0.0/0 ALLOW
20 HTTPS TCP 443 0.0.0.0/0 ALLOW
30 Custom TCP 1024- 0.0.0.0/0 ALLOW (ephemeral ports)
65535
50 All traffic All All 198.51.100.5/32 DENY (malicious IP)
60 All traffic All All 198.51.100.6/32 DENY (malicious IP)
100 All traffic All All 0.0.0.0/0 DENY (default deny)
Outbound Rules:
Rule # Type Protocol Port Destination Allow/Deny
10 HTTP TCP 80 0.0.0.0/0 ALLOW
20 HTTPS TCP 443 0.0.0.0/0 ALLOW
30 Custom TCP 1024- 0.0.0.0/0 ALLOW (ephemeral ports)
65535
100 All traffic All All 0.0.0.0/0 DENY (default deny)
How NACL Blocks Malicious IPs:
Traffic from 198.51.100.5 arrives at the subnet
NACL evaluates rules in order (10, 20, 30, 50...)
Rule 50 matches (source IP 198.51.100.5) and denies the traffic
Traffic is blocked before reaching any instance in the subnet
This protects all instances in the subnet simultaneously
Why NACLs Are Better for IP Blocking:
Can explicitly DENY specific IPs or ranges
Evaluated before traffic reaches instances (reduces load)
Protects entire subnet, not just individual instances
Rules evaluated in order, allowing fine-grained control
Database Security Group Example:
For the database tier, Security Groups are ideal because you want to allow access only from specific sources (application servers), not block specific sources.
ā Allowing traffic from specific sources (other security groups, IP ranges)
ā You want stateful firewall behavior (automatic response traffic)
ā You need instance-level granularity
ā You want to reference other security groups dynamically
Use Network ACLs when:
ā Blocking specific IP addresses or ranges (DDoS mitigation)
ā Adding an additional layer of defense (defense in depth)
ā Enforcing subnet-level policies that apply to all resources
ā You need explicit control over both inbound and outbound traffic
ā Compliance requires stateless firewall rules
Use Both (Defense in Depth):
ā NACL blocks known malicious IPs at subnet boundary
ā Security Group allows only legitimate application traffic at instance level
ā Provides multiple layers of protection
Common Exam Scenario: "A web application is experiencing a DDoS attack from specific IP addresses. How can you quickly block these IPs?"
Answer: Use Network ACL DENY rules. Security Groups cannot deny traffic, only allow it. NACLs can explicitly deny specific IPs and are evaluated before traffic reaches instances.
VPN and Direct Connect for Hybrid Connectivity
What they are: AWS Site-to-Site VPN and AWS Direct Connect are services that securely connect your on-premises data center or office network to your AWS VPC, enabling hybrid cloud architectures.
Why they exist: Many organizations cannot move all their infrastructure to the cloud immediately. They need secure, reliable connections between on-premises systems and AWS resources. Public internet connections are insecure and unreliable for production workloads. VPN and Direct Connect provide secure, private connectivity options.
Real-world analogy: Think of your on-premises network and AWS VPC as two office buildings in different cities. VPN is like making a secure phone call over the public phone network - it's encrypted and private, but uses public infrastructure. Direct Connect is like having a dedicated private fiber optic cable between the buildings - it's more expensive but provides better performance, reliability, and security.
AWS Site-to-Site VPN:
A VPN connection creates an encrypted tunnel over the public internet between your on-premises network and your VPC. It uses IPsec (Internet Protocol Security) to encrypt all traffic.
How VPN Works (Step-by-step):
Create Virtual Private Gateway (VGW): Attach a VGW to your VPC. This is the VPN endpoint on the AWS side. The VGW is highly available across multiple AZs automatically.
Create Customer Gateway: Define your on-premises VPN device's public IP address in AWS. This tells AWS where to establish the VPN tunnel.
Create VPN Connection: AWS generates VPN configuration including pre-shared keys, tunnel IP addresses, and routing information. You download this configuration.
Configure On-Premises Device: Apply the AWS-provided configuration to your on-premises VPN device (firewall, router, or VPN appliance).
Establish Tunnels: AWS creates two VPN tunnels (for redundancy) to different AWS endpoints. Your device establishes IPsec tunnels to both endpoints.
Configure Routing: Update your VPC route tables to send traffic destined for your on-premises network (e.g., 192.168.0.0/16) to the VGW. Update your on-premises routing to send AWS-bound traffic through the VPN tunnels.
Traffic Flow: When an EC2 instance sends traffic to an on-premises IP, the VPC route table directs it to the VGW, which encrypts it and sends it through the VPN tunnel. Your on-premises device decrypts it and forwards it to the destination.
VPN Characteristics:
Bandwidth: Up to 1.25 Gbps per tunnel (2.5 Gbps total with both tunnels)
Latency: Variable, depends on internet path (typically 50-200ms)
Cost: $0.05/hour per VPN connection + data transfer charges
Setup Time: Minutes to hours
Encryption: IPsec encryption (AES-256)
Availability: Two tunnels for redundancy
When to Use VPN:
ā Quick setup needed (hours, not weeks)
ā Budget-conscious (low monthly cost)
ā Bandwidth requirements under 1 Gbps
ā Temporary or backup connectivity
ā Multiple remote offices need AWS access
ā Encryption required by compliance
AWS Direct Connect:
Direct Connect provides a dedicated network connection from your on-premises data center to AWS through a Direct Connect location (AWS partner facility). Traffic never traverses the public internet.
How Direct Connect Works (Step-by-step):
Choose Direct Connect Location: Select an AWS Direct Connect location near your data center. These are facilities operated by AWS partners (like Equinix, CoreSite).
Order Cross-Connect: Work with the facility provider to establish a physical fiber connection from your equipment to AWS's equipment in the same facility. This is called a "cross-connect."
Create Direct Connect Connection: In AWS console, create a Direct Connect connection specifying the location and bandwidth (1 Gbps or 10 Gbps).
Create Virtual Interface (VIF): Create a private VIF to access your VPC, or a public VIF to access AWS public services (S3, DynamoDB) without going through the internet.
Configure BGP: Direct Connect uses Border Gateway Protocol (BGP) for dynamic routing. You configure BGP on your router to exchange routes with AWS.
Attach to Virtual Private Gateway or Direct Connect Gateway: Connect your VIF to a VGW (for single VPC) or Direct Connect Gateway (for multiple VPCs/regions).
Update Route Tables: Configure VPC route tables to send on-premises traffic to the VGW. BGP automatically advertises your VPC routes to your on-premises network.
Traffic Flow: Traffic flows over the dedicated fiber connection, never touching the public internet. AWS routes it directly to your VPC.
Direct Connect Characteristics:
Bandwidth: 1 Gbps, 10 Gbps, or 100 Gbps dedicated connections
Hybrid Architecture Pattern: VPN + Direct Connect:
For maximum reliability, many organizations use both:
Primary: Direct Connect for production traffic (high bandwidth, low latency)
Backup: VPN for failover if Direct Connect fails
Configuration: Use BGP to prefer Direct Connect (lower BGP metric), automatically failover to VPN if Direct Connect is unavailable
Detailed Example: Hybrid Cloud Architecture with Direct Connect
Scenario: A financial services company has a data center in New York with 500 TB of customer data. They're migrating applications to AWS us-east-1 region but must keep the database on-premises for compliance. Applications in AWS need low-latency access to the on-premises database.
Requirements:
Consistent latency under 20ms for database queries
Bandwidth for 10 Gbps peak traffic
Highly available (99.99% uptime)
Secure connection (encrypted)
Access to multiple VPCs in us-east-1
Solution Design:
Step 1: Order Two Direct Connect Connections
Order two 10 Gbps Direct Connect connections at different Direct Connect locations (e.g., Equinix NY5 and CoreSite NY1) for redundancy
Each connection costs $2.25/hour = $1,620/month
Step 2: Create Direct Connect Gateway
Create a Direct Connect Gateway to connect multiple VPCs to the Direct Connect connections
This allows all VPCs to share the same Direct Connect connections
Step 3: Create Private Virtual Interfaces
Create two private VIFs, one on each Direct Connect connection
Associate both VIFs with the Direct Connect Gateway
Configure BGP with AS numbers and BGP keys
Step 4: Attach VPCs to Direct Connect Gateway
Attach Virtual Private Gateways from Production VPC, Development VPC, and Testing VPC to the Direct Connect Gateway
All three VPCs can now communicate with on-premises over Direct Connect
Step 5: Configure VPN for Encryption
Create Site-to-Site VPN connections over each Direct Connect connection
This provides IPsec encryption for data in transit (compliance requirement)
VPN over Direct Connect combines Direct Connect's performance with VPN's encryption
Step 6: Configure BGP Routing
On-premises router advertises 192.168.0.0/16 (on-premises network) to AWS via BGP
AWS advertises VPC CIDR blocks (10.0.0.0/16, 10.1.0.0/16, 10.2.0.0/16) to on-premises
Configure BGP weights to prefer primary Direct Connect connection, failover to secondary if primary fails
Direct Connect saves money at high data volumes (>40 TB/month)
Plus benefits of consistent performance and lower latency
AWS Security Services
AWS WAF (Web Application Firewall):
What it is: AWS WAF is a web application firewall that protects your web applications from common web exploits and bots that could affect availability, compromise security, or consume excessive resources.
Why it exists: Traditional network firewalls (security groups, NACLs) operate at the network layer (Layer 3/4) and cannot inspect HTTP/HTTPS request content. Web applications face application-layer attacks (Layer 7) like SQL injection, cross-site scripting (XSS), and bot attacks that require deep packet inspection. WAF provides this application-layer protection.
Real-world analogy: Think of WAF like a security guard at a nightclub entrance who checks IDs and searches bags. Network firewalls are like the fence around the building - they control who can approach, but WAF inspects what people are carrying and what they're trying to do once they're at the door.
How WAF Works:
Deploy WAF: Attach WAF to CloudFront distribution, Application Load Balancer, API Gateway, or AppSync GraphQL API.
Create Web ACL: A Web Access Control List (Web ACL) contains rules that define what traffic to allow, block, or count.
Add Rules: Rules inspect HTTP/HTTPS requests for patterns like:
SQL injection attempts (e.g., ' OR 1=1-- in query parameters)
Cross-site scripting (e.g., <script> tags in input fields)
Requests from specific countries (geo-blocking)
Requests from known malicious IP addresses
Rate limiting (e.g., max 2000 requests per 5 minutes from single IP)
Rule Evaluation: When a request arrives, WAF evaluates rules in priority order. First matching rule determines the action (allow, block, count).
Action:
Allow: Request passes through to your application
Block: WAF returns 403 Forbidden to the client
Count: WAF logs the match but allows the request (for testing rules)
Logging: WAF logs all requests to CloudWatch Logs, S3, or Kinesis Data Firehose for analysis.
PHP/WordPress: Protects PHP and WordPress applications
When to Use WAF:
ā Protecting web applications from OWASP Top 10 attacks
ā Blocking bot traffic and scrapers
ā Rate limiting to prevent DDoS
ā Geo-blocking for compliance or business reasons
ā Custom rules for application-specific threats
ā Protecting APIs from abuse
AWS Shield:
What it is: AWS Shield is a managed DDoS (Distributed Denial of Service) protection service that safeguards applications running on AWS.
Why it exists: DDoS attacks attempt to make applications unavailable by overwhelming them with traffic. These attacks can cost thousands of dollars per hour in bandwidth charges and lost revenue. Shield provides automatic protection against common DDoS attacks.
Two Tiers:
Shield Standard (Free, automatic):
Protects against most common Layer 3/4 DDoS attacks (SYN floods, UDP floods, reflection attacks)
HTTP Flood: Attacker sends legitimate-looking HTTP requests at high volume to exhaust application resources.
Shield Mitigation: Works with WAF to rate limit and filter malicious requests
When to Use Shield Advanced:
ā Business-critical applications that cannot tolerate downtime
ā Applications that have been targeted by DDoS attacks before
ā Need for 24/7 expert support during attacks
ā Concern about DDoS-related AWS charges
ā Compliance requirements for DDoS protection
AWS GuardDuty:
What it is: Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data.
Why it exists: Traditional security tools require manual log analysis and correlation across multiple sources. GuardDuty uses machine learning to automatically analyze billions of events across AWS CloudTrail, VPC Flow Logs, and DNS logs to identify threats without requiring you to deploy or manage any infrastructure.
Real-world analogy: GuardDuty is like a security operations center (SOC) analyst who monitors security cameras, access logs, and network traffic 24/7, looking for suspicious patterns. Instead of you having to watch all the logs, GuardDuty does it automatically and alerts you only when it finds something suspicious.
How GuardDuty Works:
Enable GuardDuty: One-click enable in AWS console. No agents or sensors to deploy.
Data Sources: GuardDuty automatically analyzes:
CloudTrail Events: API calls and management events (who did what, when)
VPC Flow Logs: Network traffic patterns (who talked to whom)
DNS Logs: DNS queries (what domains were resolved)
What it detected: IAM credentials from an EC2 instance are being used from an external IP
Why it matters: Instance credentials were stolen and are being used outside AWS
Remediation: Revoke the credentials, investigate how they were stolen, rotate all credentials
Recon:IAMUser/MaliciousIPCaller:
What it detected: API calls are being made from a known malicious IP address
Why it matters: Attacker may have compromised IAM credentials and is reconnaissance
Remediation: Review CloudTrail for unauthorized actions, rotate credentials, enable MFA
When to Use GuardDuty:
ā Continuous threat detection without managing infrastructure
ā Detecting compromised instances and credentials
ā Identifying reconnaissance and data exfiltration
ā Compliance requirements for threat monitoring
ā Automated security monitoring across multiple accounts
Cost: $4.50 per million CloudTrail events analyzed + $1.00 per GB of VPC Flow Logs + $0.50 per million DNS queries. Typical cost: $50-200/month per account.
Section 3: Data Security & Encryption
Introduction
The problem: Data is the most valuable asset for most organizations. Data breaches can result in millions of dollars in losses, regulatory fines, and reputational damage. Data must be protected both when stored (at rest) and when transmitted (in transit).
The solution: AWS provides comprehensive encryption services and key management tools to protect data throughout its lifecycle. Encryption transforms readable data into unreadable ciphertext that can only be decrypted with the correct key.
Why it's tested: Data protection is a core component of the "Design Secure Architectures" domain. The exam tests your understanding of when and how to use encryption, key management best practices, and compliance requirements.
Core Concepts
AWS Key Management Service (KMS)
What it is: AWS KMS is a managed service that makes it easy to create and control the cryptographic keys used to encrypt your data. KMS uses Hardware Security Modules (HSMs) to protect the security of your keys.
Why it exists: Managing encryption keys is complex and risky. If you lose keys, you lose access to your data. If keys are compromised, your data is exposed. KMS provides secure, auditable key management without requiring you to operate your own HSM infrastructure.
Real-world analogy: Think of KMS like a bank's safe deposit box system. The bank (AWS) provides the secure vault (HSM) and manages access controls, but only you have the key to your specific box. You can authorize others to access your box, and the bank keeps detailed records of every access.
How KMS Works (Detailed step-by-step):
Create Customer Master Key (CMK): You create a CMK in KMS, which is a logical representation of a master key. The actual key material never leaves the HSM. You can choose:
AWS-managed CMK: AWS creates and manages the key (free, automatic rotation)
Customer-managed CMK: You create and manage the key ($1/month, optional rotation)
Custom key store: Keys stored in CloudHSM cluster you control (advanced use case)
Define Key Policy: The key policy is a resource-based policy that controls who can use and manage the key. It's similar to an IAM policy but attached to the key itself. Example policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow use of the key for encryption",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/EC2-S3-Access"
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:GenerateDataKey"
],
"Resource": "*"
}
]
}
Encrypt Data: When you need to encrypt data, you call KMS Encrypt API with your data and the CMK ID. KMS uses the CMK to encrypt your data and returns the ciphertext. The CMK never leaves KMS.
Store Ciphertext: You store the encrypted data (ciphertext) in your storage service (S3, EBS, RDS, etc.). The ciphertext is useless without the CMK to decrypt it.
Decrypt Data: When you need to access the data, you call KMS Decrypt API with the ciphertext. KMS verifies you have permission to use the CMK, decrypts the data, and returns the plaintext.
Audit: Every KMS API call is logged in CloudTrail, providing a complete audit trail of who used which keys, when, and for what purpose.
Envelope Encryption:
For large data (>4 KB), KMS uses envelope encryption to improve performance:
Generate Data Key: Call KMS GenerateDataKey API. KMS generates a data encryption key (DEK), encrypts it with your CMK, and returns both the plaintext DEK and encrypted DEK.
Encrypt Data Locally: Use the plaintext DEK to encrypt your data locally (in your application or AWS service). This is fast because it doesn't require network calls to KMS.
Store Encrypted Data + Encrypted DEK: Store both the encrypted data and the encrypted DEK together. Delete the plaintext DEK from memory.
Decrypt Data: To decrypt, send the encrypted DEK to KMS. KMS decrypts it with your CMK and returns the plaintext DEK. Use the plaintext DEK to decrypt your data locally.
Why Envelope Encryption:
KMS can only encrypt/decrypt up to 4 KB directly
Encrypting large data locally is faster than sending it to KMS
You only need to call KMS once per data key, not once per data block
Most AWS services (S3, EBS, RDS) use envelope encryption automatically
Detailed Example 1: S3 Bucket Encryption with KMS
Scenario: You're storing customer financial records in S3. Compliance requires that all data be encrypted at rest with keys you control, and you must be able to audit all access to the encryption keys.
Solution: Use S3 with SSE-KMS (Server-Side Encryption with KMS).
Volume Creation: When you create an encrypted EBS volume, AWS generates a unique data key for that volume using your CMK.
Data Encryption: All data written to the volume is encrypted using AES-256 with the data key. This happens in the EC2 hypervisor, transparent to your instance.
Data Key Storage: The encrypted data key is stored with the volume metadata. The plaintext data key is stored in memory on the EC2 host (never on disk).
Snapshots: When you create a snapshot of an encrypted volume, the snapshot is automatically encrypted with the same data key. You can copy the snapshot to another region and re-encrypt with a different CMK.
Volume Attachment: When you attach an encrypted volume to an instance, the EC2 service calls KMS to decrypt the data key. The plaintext data key is loaded into the EC2 host's memory.
Performance: Encryption/decryption happens in hardware on the EC2 host, with no performance impact compared to unencrypted volumes.
What You Get:
Transparent Encryption: No application changes required
Data at Rest: All data on volume encrypted
Snapshots: Automatically encrypted
Data in Transit: Data moving between EC2 and EBS is encrypted
No Performance Impact: Hardware-accelerated encryption
You cannot encrypt an existing unencrypted volume directly
To encrypt existing volume: Create snapshot ā Copy snapshot with encryption ā Create volume from encrypted snapshot
Root volumes can be encrypted (requires encrypted AMI or encryption during launch)
Encrypted volumes can only be attached to instance types that support EBS encryption
Detailed Example 3: RDS Database Encryption
Scenario: You're running a PostgreSQL database in RDS that stores customer credit card information. PCI-DSS requires encryption of cardholder data at rest.
What it is: AWS Certificate Manager is a service that lets you easily provision, manage, and deploy SSL/TLS certificates for use with AWS services and your internal connected resources.
Why it exists: Managing SSL/TLS certificates is complex and error-prone. Certificates expire and must be renewed, private keys must be securely stored, and certificate deployment must be coordinated across multiple servers. ACM automates certificate provisioning and renewal, eliminating these operational burdens.
Real-world analogy: Think of ACM like a passport office that issues and renews passports automatically. Instead of you having to remember to renew your passport every 10 years and go through the application process, the passport office automatically sends you a new passport before the old one expires.
How ACM Works:
Request Certificate: You request a certificate for your domain (e.g., www.example.com) through ACM console or API.
Domain Validation: ACM must verify you own the domain. Two methods:
DNS Validation: Add a CNAME record to your DNS (recommended, automatic renewal)
Email Validation: Click link in email sent to domain owner
Certificate Issuance: Once validated, ACM issues the certificate signed by Amazon's Certificate Authority.
Deploy Certificate: Attach the certificate to:
CloudFront distribution
Application Load Balancer
Network Load Balancer
API Gateway
Elastic Beanstalk
Automatic Renewal: ACM automatically renews certificates before they expire (60 days before expiration). No action required from you.
Private Key Security: ACM stores private keys securely in AWS. You never have access to the private key, reducing risk of compromise.
Detailed Example: HTTPS for Web Application
Scenario: You're deploying a web application on EC2 instances behind an Application Load Balancer. You need to enable HTTPS with a valid SSL certificate for www.example.com.
Browser validates certificate (trusted by Amazon CA)
TLS handshake completes, encrypted connection established
ALB decrypts HTTPS traffic, forwards HTTP to EC2 instances
EC2 instances process request, return response to ALB
ALB encrypts response, sends HTTPS to user
Important Notes:
ACM certificates are free when used with AWS services
ACM certificates cannot be exported (private key stays in AWS)
For use outside AWS (on-premises servers), use imported certificates or AWS Private CA
Certificates are regional (must request in same region as ALB/CloudFront)
CloudFront requires certificates in us-east-1 region
Comparison Tables
Encryption Options Comparison
Service
Encryption Method
Key Management
Use Case
Cost
S3 SSE-S3
AES-256
AWS-managed keys
Simple encryption, no key control needed
Free
S3 SSE-KMS
AES-256
Customer-managed CMK
Audit trail, key rotation, compliance
$1/month + API calls
S3 SSE-C
AES-256
Customer-provided keys
You manage keys outside AWS
Free (you manage keys)
S3 Client-Side
Your choice
You manage
Encrypt before upload, maximum control
Free (you manage)
EBS Encryption
AES-256
AWS or customer CMK
Transparent EC2 volume encryption
$1/month (if custom CMK)
RDS Encryption
AES-256
AWS or customer CMK
Database encryption at rest
$1/month (if custom CMK)
Security Services Comparison
Service
Layer
Purpose
Cost
When to Use
Security Groups
Instance (L3/L4)
Allow traffic to instances
Free
Control access between tiers
NACLs
Subnet (L3/L4)
Allow/deny traffic to subnets
Free
Block specific IPs, subnet-level rules
AWS WAF
Application (L7)
Block web exploits, bots
$5/month + rules
Protect web apps from OWASP Top 10
AWS Shield
Network (L3/L4)
DDoS protection
Free (Standard)
Automatic DDoS protection
GuardDuty
Account-wide
Threat detection
~$50-200/month
Detect compromised resources
Macie
S3 data
Sensitive data discovery
~$1/GB scanned
Find PII/PHI in S3
IAM Authentication Methods
Method
Use Case
Pros
Cons
IAM Users
Long-term credentials for people
Simple, direct access
Hard to manage at scale, credentials can leak
IAM Roles
Temporary credentials for services
Secure, automatic rotation
Requires trust relationship setup
IAM Identity Center
SSO for multiple accounts
Centralized, SAML/OIDC support
Requires setup, additional service
Cognito User Pools
Application user authentication
Built for web/mobile apps
Not for AWS resource access
Cognito Identity Pools
Temporary AWS credentials for app users
Federated access, mobile-friendly
Complex setup for advanced scenarios
Decision Frameworks
Choosing Encryption Method
When choosing S3 encryption:
š Decision Tree:
Start: Need S3 encryption?
āā Need audit trail of key usage?
ā āā Yes ā Use SSE-KMS (customer-managed CMK)
ā āā No ā Continue
āā Need to control key rotation?
ā āā Yes ā Use SSE-KMS (customer-managed CMK)
ā āā No ā Continue
āā Need to manage keys outside AWS?
ā āā Yes ā Use SSE-C or Client-Side Encryption
ā āā No ā Continue
āā Want simplest solution?
ā āā Yes ā Use SSE-S3 (AWS-managed keys)
Decision Logic Explained:
SSE-KMS: Choose when you need compliance audit trails, key rotation control, or ability to disable keys. Costs $1/month per CMK + API calls.
SSE-S3: Choose for simple encryption without key management overhead. Free and automatic.
SSE-C: Choose when you must manage keys in your own key management system. You provide keys with each request.
Client-Side: Choose when you need to encrypt data before it leaves your application. Maximum control but most complex.
Choosing Network Security Controls
When securing a multi-tier application:
Layer 1: Network Segmentation
ā Use separate subnets for each tier (web, app, database)
ā Public subnets for internet-facing resources only
ā Private subnets for internal resources
ā Separate subnets per Availability Zone
Layer 2: Security Groups
ā Web tier: Allow 80/443 from 0.0.0.0/0
ā App tier: Allow app port from web tier security group only
ā Database tier: Allow database port from app tier security group only
ā Use security group references instead of IP addresses
Layer 3: Network ACLs (optional, for additional security)
ā Block known malicious IPs at subnet boundary
ā Enforce subnet-level policies (e.g., no outbound to internet from database subnet)
ā Add explicit deny rules for compliance
Layer 4: AWS WAF (for web tier)
ā Attach to Application Load Balancer or CloudFront
ā Enable managed rule groups (Core Rule Set, Known Bad Inputs)
ā Add rate limiting rules
ā Enable logging for analysis
Layer 5: GuardDuty (account-wide)
ā Enable in all accounts and regions
ā Configure EventBridge rules for automated response
ā Integrate with Security Hub for centralized view
Choosing Hybrid Connectivity
When connecting on-premises to AWS:
Requirement
VPN
Direct Connect
Both
Quick setup (hours)
ā
ā
ā (VPN first, DX later)
Low cost (<$100/month)
ā
ā
ā
High bandwidth (>1 Gbps)
ā
ā
ā
Consistent latency
ā
ā
ā
Encryption required
ā
ā (add VPN)
ā
High availability
ā (2 tunnels)
ā (order 2)
ā
Temporary/backup
ā
ā
ā (VPN as backup)
Recommendation:
Start with VPN if you need connectivity quickly or have budget constraints
Upgrade to Direct Connect when you need consistent performance or high bandwidth
Use both for production workloads requiring high availability and encryption
Key Facts & Figures
IAM Limits:
Users per account: 5,000 (soft limit, can be increased)
VPCs per region: 5 (default, can be increased to 100s)
Subnets per VPC: 200
Internet Gateways per VPC: 1
NAT Gateways per AZ: 5
Security Groups per VPC: 2,500
Rules per Security Group: 60 inbound, 60 outbound
Security Groups per network interface: 5
NACLs per VPC: 200
Rules per NACL: 20 (default, can be increased to 40)
KMS Limits:
CMKs per region: 10,000 (customer-managed)
API request rate: 5,500/second (shared across all CMKs in region)
Encrypt/Decrypt: 4 KB maximum data size
GenerateDataKey: Returns 256-bit key (32 bytes)
Important Numbers to Remember:
ā Security Group: Stateful, allow rules only, evaluated as a whole
ā NACL: Stateless, allow and deny rules, evaluated in order by rule number
ā KMS API rate: 5,500 requests/second (use S3 Bucket Keys to reduce calls)
ā VPN bandwidth: 1.25 Gbps per tunnel, 2 tunnels per connection
ā Direct Connect: 1 Gbps, 10 Gbps, or 100 Gbps dedicated connections
ā WAF rate limit: Can configure per IP (e.g., 2000 requests per 5 minutes)
šÆ Exam Focus: Questions often test:
Difference between Security Groups (stateful) and NACLs (stateless)
When to use SSE-KMS vs SSE-S3 for S3 encryption
How to block specific IP addresses (use NACL, not Security Group)
Cross-account access patterns (IAM roles with trust policies)
VPN vs Direct Connect selection criteria
WAF use cases for application-layer protection
Chapter Summary
What We Covered
This chapter covered the "Design Secure Architectures" domain, which represents 30% of the SAA-C03 exam. We explored three major areas:
ā Section 1: Identity and Access Management
IAM users, groups, roles, and policies
IAM policy evaluation logic and best practices
Cross-account access with IAM roles and external IDs
AWS Organizations and Service Control Policies (SCPs)
IAM Identity Center for SSO
Federation with SAML and OIDC
Cognito for application user authentication
ā Section 2: Network Security & VPC Architecture
VPC fundamentals and subnet design
Security Groups vs Network ACLs
Multi-tier VPC architectures
NAT Gateways for private subnet internet access
VPN and Direct Connect for hybrid connectivity
AWS WAF for application-layer protection
AWS Shield for DDoS protection
GuardDuty for threat detection
ā Section 3: Data Security & Encryption
AWS KMS for key management
Encryption at rest (S3, EBS, RDS)
Encryption in transit (TLS/SSL)
AWS Certificate Manager for SSL certificates
Envelope encryption patterns
Compliance and audit requirements
Critical Takeaways
IAM Best Practices: Always use IAM roles for AWS services instead of embedding access keys. Enable MFA for all users. Follow principle of least privilege. Use SCPs to enforce organization-wide policies.
Network Segmentation: Separate public and private subnets. Place only internet-facing resources in public subnets. Use Security Groups for instance-level control and NACLs for subnet-level control.
Defense in Depth: Use multiple security layers (network segmentation + security groups + NACLs + WAF + GuardDuty). No single security control is sufficient.
Encryption Everywhere: Encrypt data at rest with KMS. Encrypt data in transit with TLS. Use customer-managed CMKs when you need audit trails or key rotation control.
Hybrid Connectivity: Use VPN for quick setup and low cost. Use Direct Connect for high bandwidth and consistent performance. Use both for high availability.
Stateful vs Stateless: Security Groups are stateful (return traffic automatically allowed). NACLs are stateless (must explicitly allow both directions). This is a frequent exam question.
Key Management: KMS provides secure, auditable key management. Use envelope encryption for large data. Enable automatic key rotation for compliance.
Application Security: Use WAF to protect against OWASP Top 10 vulnerabilities. Use Shield for DDoS protection. Use GuardDuty for threat detection.
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between IAM users, groups, and roles
I understand when to use IAM roles vs IAM users
I can describe how IAM policy evaluation works (explicit deny > explicit allow > implicit deny)
I understand the difference between Security Groups and NACLs
I can design a multi-tier VPC architecture with public and private subnets
I know when to use VPN vs Direct Connect
I understand how KMS encryption works (envelope encryption)
I can explain the difference between SSE-S3, SSE-KMS, and SSE-C
I know when to use AWS WAF vs AWS Shield
I understand how GuardDuty detects threats
I can describe how to implement cross-account access with IAM roles
I know how Service Control Policies (SCPs) work in AWS Organizations
Practice Questions
Try these from your practice test bundles:
Domain 1 Bundle 1: Questions 1-20 (IAM and access management)
ā Task 1.3 - Data Security Controls: KMS encryption, data at rest and in transit, ACM certificates, S3 encryption options, backup strategies, compliance frameworks
Critical Takeaways
IAM is the Foundation of AWS Security: Every AWS interaction requires authentication and authorization through IAM. Master the principle of least privilege, use roles instead of access keys, and always enable MFA for privileged accounts.
Defense in Depth with Multiple Security Layers: Combine security groups (stateful, instance-level), NACLs (stateless, subnet-level), WAF (application-level), and Shield (DDoS protection) for comprehensive security.
Encryption Everywhere: Encrypt data at rest using KMS, encrypt data in transit using TLS/SSL with ACM certificates. AWS provides encryption options for every storage service - use them.
Network Segmentation is Critical: Use public subnets for internet-facing resources, private subnets for application/database tiers, and isolated subnets for highly sensitive data. Control traffic flow with route tables and security groups.
Automate Security Monitoring: Use GuardDuty for threat detection, Macie for sensitive data discovery, Security Hub for centralized security findings, and Config for compliance monitoring.
Cross-Account Access Patterns: Use IAM roles with trust policies for cross-account access, not IAM users with access keys. Implement SCPs in AWS Organizations to enforce security boundaries.
Secrets Management: Never hardcode credentials. Use Secrets Manager for automatic rotation or Systems Manager Parameter Store for simple configuration data.
Self-Assessment Checklist
Test yourself before moving to Domain 2. You should be able to:
IAM and Access Management:
Explain the difference between IAM users, groups, roles, and policies
Design a cross-account access strategy using IAM roles
Implement MFA for root and privileged users
Create IAM policies with conditions and resource-level permissions
Configure AWS Organizations with SCPs to enforce security boundaries
Set up IAM Identity Center (SSO) for multi-account access
Understand when to use SAML federation vs. Cognito
Network Security:
Design a multi-tier VPC architecture with public and private subnets
Configure security groups with proper ingress/egress rules
Implement NACLs for subnet-level traffic control
Explain the difference between security groups (stateful) and NACLs (stateless)
Set up VPC endpoints to avoid internet traffic for AWS services
Configure AWS WAF rules to protect against common attacks
Implement AWS Shield Advanced for DDoS protection
Use GuardDuty findings to respond to threats
Data Protection:
Encrypt S3 buckets using SSE-S3, SSE-KMS, or SSE-C
Create and manage KMS customer managed keys (CMKs)
Implement key rotation policies
Configure RDS encryption at rest and in transit
Use ACM to provision and manage SSL/TLS certificates
Set up S3 bucket policies to enforce encryption
Implement S3 Object Lock for compliance requirements
Configure AWS Backup for automated backup management
ā Data Encryption: KMS, CloudHSM, ACM, Secrets Manager
ā Secure Connectivity: VPN, Direct Connect, PrivateLink
ā Application Security: Cognito, API Gateway authorization
Critical Takeaways
IAM Best Practices: Enable MFA for all users, use roles instead of access keys, apply least privilege principle, use IAM policies with conditions
Security Groups vs NACLs: Security groups are stateful (return traffic automatic), NACLs are stateless (must allow both directions); use security groups for instance-level, NACLs for subnet-level
Encryption Everywhere: Encrypt data at rest with KMS, encrypt in transit with TLS/SSL (ACM), rotate keys regularly, use envelope encryption for large data
Defense in Depth: Layer multiple security controls - WAF at edge, security groups at instance, encryption at rest, IAM for access, GuardDuty for threats
Zero Trust: Never trust, always verify - use IAM roles with temporary credentials, implement MFA, monitor with CloudTrail, detect threats with GuardDuty
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between IAM users, groups, and roles
I understand when to use identity-based vs resource-based policies
I can design a multi-account strategy using Organizations and SCPs
I know the difference between security groups and NACLs
I can explain how to protect against DDoS attacks using Shield and WAF
I understand KMS key types (AWS managed vs customer managed)
I can describe when to use VPN vs Direct Connect vs PrivateLink
I know how to implement encryption at rest and in transit
I understand how GuardDuty detects threats
I can explain the shared responsibility model for security
Practice Questions
Try these from your practice test bundles:
Domain 1 Bundle 1: Questions 1-20 (IAM and access management)
User authentication and authorization with Cognito
ā Task 1.3: Determine Appropriate Data Security Controls
Encryption at rest with KMS and CloudHSM
Encryption in transit with ACM and TLS
Data lifecycle management and retention policies
Backup and disaster recovery strategies
Compliance and governance with Config, CloudTrail, and Audit Manager
Critical Takeaways
IAM is the foundation of AWS security: Master users, groups, roles, and policies. Always apply least privilege principle. Use roles for applications, not access keys.
Defense in depth: Layer multiple security controls (security groups + NACLs + WAF + Shield). No single point of failure in security.
Encryption everywhere: Encrypt data at rest (KMS), in transit (TLS/ACM), and in use when possible. Use AWS managed keys for simplicity, customer managed keys for control.
Network segmentation is critical: Use public subnets for internet-facing resources, private subnets for backend systems. Control traffic flow with route tables and security groups.
Automate security: Use Config for compliance monitoring, GuardDuty for threat detection, Security Hub for centralized findings. Don't rely on manual checks.
Shared responsibility model: AWS secures the infrastructure, you secure your data, applications, and configurations. Know where the line is drawn.
Audit everything: Enable CloudTrail in all regions, use CloudWatch Logs for centralized logging, set up alerts for suspicious activity.
Secrets management: Never hardcode credentials. Use Secrets Manager for automatic rotation, Systems Manager Parameter Store for configuration.
Multi-account strategy: Use AWS Organizations for centralized management, SCPs for guardrails, Control Tower for automated account setup.
Compliance is continuous: Use AWS Artifact for compliance reports, Config for continuous monitoring, Audit Manager for audit readiness.
Key Services Quick Reference
Identity & Access Management:
IAM: Users, groups, roles, policies (identity-based and resource-based)
IAM Identity Center: Centralized SSO for multiple accounts
AWS Organizations: Multi-account management with SCPs
Control Tower: Automated account setup with guardrails
Cognito: User authentication for web/mobile apps
Network Security:
VPC: Isolated network with subnets, route tables, gateways
Security Groups: Stateful firewall at instance level
Security Hub: Centralized security findings from all services
Secure Connectivity:
VPN: $0.05/hour, up to 1.25 Gbps, encrypted over internet
Direct Connect: $0.30/hour (1 Gbps), dedicated, consistent latency
PrivateLink: $0.01/hour + data, private AWS service access
Transit Gateway: $0.05/hour + data, hub-and-spoke for multiple VPCs
Must Memorize:
Default VPC CIDR: 172.31.0.0/16
Security groups: Stateful, allow only, all rules evaluated
NACLs: Stateless, allow + deny, numbered order (lowest first)
IAM policy size limit: 2,048 characters (inline), 6,144 characters (managed)
KMS key rotation: Automatic every 365 days (AWS managed), manual (customer managed)
CloudTrail: 90 days in Event History (free), S3 for longer retention
Congratulations! You've completed Domain 1 (30% of exam). This is the most heavily weighted domain, so mastering this content is critical for exam success.
This comprehensive chapter explored the critical domain of designing secure architectures on AWS, covering 30% of the SAA-C03 exam content. We examined three major task areas:
Task 1.1: Design Secure Access to AWS Resources
ā IAM fundamentals: users, groups, roles, and policies
ā Multi-factor authentication and root account security
ā Cross-account access patterns and role switching
ā AWS Organizations and Service Control Policies
ā IAM Identity Center for centralized access management
ā Federation with SAML and OIDC providers
ā AWS Control Tower for multi-account governance
Task 1.2: Design Secure Workloads and Applications
ā VPC security architecture with security groups and NACLs
Private connectivity ā VPC endpoints (AWS services) or PrivateLink (third-party)
Congratulations! You've completed Domain 1: Design Secure Architectures. This is the largest domain (30% of the exam), so mastering this content is critical for exam success.
ā VPC endpoints and PrivateLink for private connectivity
Task 1.3: Determine Appropriate Data Security Controls
ā AWS KMS for encryption key management
ā Encryption at rest (S3, EBS, RDS, DynamoDB)
ā Encryption in transit (TLS/SSL, ACM)
ā Data backup and replication strategies
ā AWS Backup for centralized backup management
ā CloudTrail for API logging and audit trails
ā AWS Config for compliance monitoring
Critical Takeaways
Principle of Least Privilege: Always start with minimum permissions and add only what's needed. Use IAM roles instead of long-term credentials whenever possible.
Defense in Depth: Layer multiple security controls (security groups + NACLs + WAF + Shield) for comprehensive protection.
Encryption Everywhere: Enable encryption at rest for all storage services and encryption in transit for all data transfers. Use AWS KMS for centralized key management.
Audit and Monitor: Enable CloudTrail in all regions, use Config for compliance, and GuardDuty for threat detection. Centralize findings in Security Hub.
Secure by Default: Use AWS managed services that provide built-in security features. Enable MFA for all privileged accounts, especially root users.
Network Isolation: Use private subnets for backend resources, public subnets only for internet-facing components. Use VPC endpoints to avoid internet traffic.
Identity Federation: For enterprise environments, federate with existing identity providers (Active Directory, Okta) rather than creating duplicate IAM users.
Compliance Automation: Use AWS Config rules and Security Hub to continuously monitor compliance and automatically remediate violations.
Self-Assessment Checklist
Test yourself before moving on. Can you:
IAM and Access Management
Explain the difference between IAM users, groups, and roles?
Describe how to implement cross-account access securely?
Configure MFA for root and IAM users?
Create an IAM policy with conditions and variables?
Explain when to use resource-based vs identity-based policies?
Implement least privilege access using permissions boundaries?
Set up AWS Organizations with SCPs for multi-account governance?
Network Security
Design a multi-tier VPC architecture with proper security?
Explain the difference between security groups and NACLs?
Configure AWS WAF rules to protect against common attacks?
Implement DDoS protection using Shield and WAF?
Set up VPC endpoints for private AWS service access?
Design a hybrid network with VPN or Direct Connect?
Explain when to use PrivateLink vs VPC peering?
Data Protection
Enable encryption at rest for S3, EBS, RDS, and DynamoDB?
Configure KMS customer-managed keys with proper key policies?
Implement encryption in transit using TLS/SSL and ACM?
Set up automated backup strategies using AWS Backup?
Configure S3 Object Lock for compliance requirements?
Enable CloudTrail logging and log file validation?
Use AWS Config to monitor resource compliance?
Threat Detection and Response
Enable GuardDuty for threat detection?
Configure Macie to discover sensitive data in S3?
Set up Security Hub for centralized security findings?
Implement automated remediation using EventBridge and Lambda?
Use Systems Manager Session Manager for secure instance access?
Practice Questions
Try these from your practice test bundles:
Beginner Level (Build Confidence):
Domain 1 Bundle 1: Questions 1-20
Security Services Bundle: Questions 1-15
Expected score: 70%+ to proceed
Intermediate Level (Test Understanding):
Domain 1 Bundle 2: Questions 1-20
Full Practice Test 1: Domain 1 questions
Expected score: 75%+ to proceed
Advanced Level (Challenge Yourself):
Domain 1 Bundle 3: Questions 1-20
Expected score: 70%+ to proceed
If you scored below target:
Below 60%: Review the entire chapter again, focus on fundamentals
60-70%: Review specific sections where you struggled
70-80%: Review quick facts and decision points
80%+: You're ready! Move to next domain
Quick Reference Card
Copy this to your notes for quick review:
IAM Essentials
Users: Long-term credentials, use for humans
Roles: Temporary credentials, use for services and cross-account
Groups: Collection of users, attach policies to groups
GuardDuty threat detection and Macie data discovery
Secrets Manager and Parameter Store
VPN and Direct Connect for hybrid connectivity
VPC endpoints and PrivateLink
ā Task 1.3: Determine appropriate data security controls
Encryption at rest with KMS
Encryption in transit with ACM/TLS
Key management and rotation
S3 encryption options and bucket policies
RDS and EBS encryption
Backup strategies and compliance
CloudTrail logging and Config rules
Critical Takeaways
IAM Best Practices: Always use roles for applications, enable MFA for privileged users, follow least privilege, and never share credentials.
Defense in Depth: Layer multiple security controls (security groups + NACLs + WAF + Shield) for comprehensive protection.
Encryption Everywhere: Encrypt data at rest (KMS) and in transit (TLS/SSL), with proper key management and rotation.
Network Segmentation: Use public subnets for internet-facing resources, private subnets for backend, and VPC endpoints for AWS service access.
Monitoring and Compliance: Enable CloudTrail in all regions, use Config for compliance, GuardDuty for threats, and Security Hub for centralized visibility.
Cross-Account Access: Use IAM roles with trust policies, not access keys, for secure cross-account access.
Secrets Management: Never hardcode credentials - use Secrets Manager with automatic rotation or Parameter Store for configuration.
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between IAM users, groups, and roles
I understand when to use security groups vs NACLs
I can design a multi-tier VPC with proper security controls
I know how to implement encryption at rest and in transit
I understand cross-account access patterns with IAM roles
I can explain the purpose of WAF, Shield, GuardDuty, and Macie
I know when to use VPC endpoints vs internet gateway
I understand KMS key policies and grants
I can design a compliant architecture with proper logging
This chapter covered the essential concepts for designing secure architectures on AWS, which accounts for 30% of the SAA-C03 exam (the largest domain). We explored three major task areas:
Task 1.1: Design Secure Access to AWS Resources
ā IAM users, groups, roles, and policies
ā Multi-factor authentication (MFA) and password policies
ā IAM Identity Center (AWS SSO) for centralized access
ā Cross-account access and role switching
ā AWS Organizations and Service Control Policies (SCPs)
ā AWS Control Tower for multi-account governance
ā Federation with SAML 2.0 and OIDC
ā AWS STS for temporary credentials
ā Resource-based policies and permissions boundaries
ā Least privilege principle and policy evaluation logic
Task 1.2: Design Secure Workloads and Applications
Task 1.3: Determine Appropriate Data Security Controls
ā AWS KMS for encryption key management
ā Encryption at rest (S3, EBS, RDS, DynamoDB)
ā Encryption in transit (TLS/SSL with ACM)
ā S3 bucket encryption and policies
ā S3 Object Lock for compliance
ā S3 Versioning and MFA Delete
ā AWS CloudTrail for API logging
ā AWS Config for compliance monitoring
ā AWS Backup for centralized backup management
ā Key rotation and certificate renewal
ā Data classification and lifecycle policies
Critical Takeaways
Least Privilege: Always grant the minimum permissions necessary. Start with deny-all, then add specific permissions. Use IAM Access Analyzer to identify overly permissive policies.
IAM Policy Evaluation: Explicit Deny > Explicit Allow > Implicit Deny. If any policy has an explicit deny, access is denied regardless of allows.
MFA Everywhere: Enable MFA for root user (mandatory), IAM users with console access, and privileged operations (like S3 MFA Delete).
Root User Protection: Don't use root user for daily tasks. Enable MFA, delete access keys, use only for account-level tasks (billing, account closure).
Cross-Account Access: Use IAM roles with trust policies, not IAM users with access keys. Roles provide temporary credentials and are more secure.
Service Control Policies: SCPs set permission guardrails for entire AWS Organizations. They don't grant permissions, only limit what IAM policies can grant.
Security Groups vs NACLs: Security groups are stateful (return traffic automatic), NACLs are stateless (must allow both directions). Security groups support allow rules only, NACLs support both allow and deny.
VPC Endpoints: Gateway endpoints (S3, DynamoDB) are free and use route tables. Interface endpoints (most services) cost $0.01/hour + data transfer but provide private IPs.
AWS WAF: Protects against common web exploits (SQL injection, XSS). Use managed rules for quick deployment, custom rules for specific needs. Costs $5/month + $1/rule + $0.60/million requests.
AWS Shield: Standard (free, automatic DDoS protection), Advanced ($3,000/month, enhanced protection + DDoS Response Team + cost protection).
GuardDuty: Threat detection using ML, analyzes VPC Flow Logs, CloudTrail, DNS logs. Costs $4.50/million events. Findings can trigger automated remediation via EventBridge.
Secrets Manager: Automatic rotation for RDS, Redshift, DocumentDB. Costs $0.40/secret/month + $0.05/10,000 API calls. Use for database credentials, API keys, OAuth tokens.
KMS Encryption: Customer Managed Keys (CMK) give full control, AWS Managed Keys are free but limited control. CMK costs $1/month + $0.03/10,000 requests.
S3 Object Lock: WORM (Write Once Read Many) for compliance. Governance mode (can be overridden with permissions), Compliance mode (cannot be deleted even by root).
CloudTrail: Logs all API calls, essential for security auditing. Enable log file validation to detect tampering. Store logs in separate security account.
Encryption in Transit: Use TLS 1.2+ for all connections. ACM provides free SSL/TLS certificates with automatic renewal. Use ALB or CloudFront for TLS termination.
Defense in Depth: Layer multiple security controls (IAM + Security Groups + NACLs + WAF + Encryption). If one layer fails, others provide protection.
Credential rotation ā Secrets Manager with Lambda
Secure instance access ā Systems Manager Session Manager
Web application protection ā WAF + Shield + CloudFront
Congratulations! You've completed Chapter 1: Design Secure Architectures. You now understand how to implement comprehensive security controls for AWS resources, workloads, and data.
RDS: Encrypt at creation, can't encrypt existing DB
In-transit: Use TLS/SSL, ACM for certificate management
Monitoring & Compliance:
CloudTrail: API call logging (who did what when)
Config: Resource configuration tracking and compliance
GuardDuty: Threat detection using ML
Security Hub: Centralized security findings
Macie: Sensitive data discovery in S3
Decision Points:
Need to audit API calls? ā CloudTrail
Need to detect threats? ā GuardDuty
Need to protect web app? ā WAF + Shield
Need to rotate secrets? ā Secrets Manager
Need cross-account access? ā IAM role with trust policy
Need to encrypt data? ā KMS with appropriate key policy
Chapter Summary
What We Covered
This chapter covered the three critical task areas for designing secure architectures on AWS:
ā Task 1.1: Secure Access to AWS Resources
IAM fundamentals: users, groups, roles, and policies
Multi-factor authentication (MFA) and credential management
Cross-account access patterns and role switching
AWS Organizations and Service Control Policies (SCPs)
Federation with SAML and OIDC identity providers
AWS IAM Identity Center for centralized SSO
Least privilege principle and permissions boundaries
ā Task 1.2: Secure Workloads and Applications
VPC security architecture with security groups and NACLs
Network segmentation with public and private subnets
AWS WAF for application-layer protection
AWS Shield for DDoS protection
Amazon GuardDuty for threat detection
AWS Secrets Manager for credential rotation
VPN and Direct Connect for hybrid connectivity
VPC endpoints and PrivateLink for private AWS service access
ā Task 1.3: Data Security Controls
Encryption at rest with AWS KMS
Encryption in transit with TLS/SSL and ACM
S3 encryption options (SSE-S3, SSE-KMS, SSE-C)
EBS and RDS encryption
Data backup strategies with AWS Backup
Compliance frameworks and AWS Config
CloudTrail for audit logging
Data lifecycle and retention policies
Critical Takeaways
IAM Best Practices: Always use IAM roles for applications, never embed credentials. Enable MFA on root and privileged accounts. Apply least privilege principle to all policies.
Defense in Depth: Layer security controls - use security groups AND NACLs, encrypt data at rest AND in transit, implement WAF AND Shield for web applications.
Encryption Everywhere: Encrypt all sensitive data. Use KMS for centralized key management. Enable encryption by default on new resources.
Network Segmentation: Isolate resources in private subnets. Use VPC endpoints to avoid internet traffic. Implement bastion hosts or Systems Manager Session Manager for secure access.
Monitoring and Compliance: Enable CloudTrail in all regions. Use Config for compliance tracking. Set up GuardDuty for threat detection. Centralize findings in Security Hub.
Cross-Account Security: Use IAM roles with trust policies for cross-account access. Implement SCPs at the organization level. Use AWS Control Tower for multi-account governance.
Secret Management: Never hardcode credentials. Use Secrets Manager or Parameter Store. Enable automatic rotation for database credentials.
Compliance Automation: Use AWS Config rules to enforce compliance. Implement AWS Backup for automated backups. Use S3 Object Lock for WORM compliance.
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
IAM and Access Management:
Explain the difference between IAM users, groups, and roles
Describe how to implement cross-account access securely
Configure MFA for root and IAM users
Write IAM policies using least privilege principle
Explain when to use resource-based vs identity-based policies
Implement federation with SAML or OIDC
Configure AWS Organizations with SCPs
Use IAM Access Analyzer to identify external access
Network Security:
Design a multi-tier VPC architecture with security groups and NACLs
Explain the difference between security groups (stateful) and NACLs (stateless)
Configure VPC endpoints for S3 and DynamoDB
Implement AWS PrivateLink for third-party services
Set up AWS WAF rules to protect against common attacks
Configure AWS Shield Advanced for DDoS protection
Design hybrid connectivity with VPN or Direct Connect
Implement network segmentation with public and private subnets
Data Protection:
Configure S3 bucket encryption with SSE-S3, SSE-KMS, or SSE-C
Enable EBS encryption by default
Encrypt RDS databases at creation
Implement encryption in transit with TLS/SSL
Manage certificates with AWS Certificate Manager
Configure KMS key policies and grants
Implement automatic key rotation
Set up cross-region replication with encryption
Monitoring and Compliance:
Enable CloudTrail for API logging across all regions
Configure AWS Config rules for compliance checking
Set up Amazon GuardDuty for threat detection
Use Amazon Macie to discover sensitive data in S3
Centralize security findings in AWS Security Hub
Implement automated remediation with EventBridge and Lambda
This chapter covered the three critical task areas for designing secure architectures on AWS:
ā Task 1.1: Secure Access to AWS Resources
IAM fundamentals: users, groups, roles, policies
Multi-factor authentication (MFA) and root user security
Cross-account access and role switching
AWS Organizations and Service Control Policies (SCPs)
Federation with SAML and OIDC
IAM Identity Center (AWS SSO) for centralized access
Least privilege principle and permissions boundaries
ā Task 1.2: Secure Workloads and Applications
VPC security architecture with security groups and NACLs
Network segmentation with public and private subnets
AWS WAF for application protection
AWS Shield for DDoS protection
GuardDuty for threat detection
Secrets Manager for credential management
VPN and Direct Connect for hybrid connectivity
VPC endpoints and PrivateLink for private connectivity
ā Task 1.3: Data Security Controls
Encryption at rest with AWS KMS
Encryption in transit with TLS/SSL and ACM
S3 encryption options (SSE-S3, SSE-KMS, SSE-C)
RDS and EBS encryption
Key rotation and certificate management
Data backup and replication strategies
CloudTrail for audit logging
AWS Config for compliance monitoring
Critical Takeaways
IAM Best Practices: Always use IAM roles for applications, never embed credentials. Enable MFA on all accounts, especially root. Apply least privilege principle to all policies.
Defense in Depth: Use multiple layers of security - security groups, NACLs, WAF, Shield. No single point of failure in security architecture.
Encryption Everywhere: Encrypt data at rest with KMS, encrypt data in transit with TLS. Use envelope encryption for large data sets.
Audit and Monitor: Enable CloudTrail in all regions, use Config for compliance, GuardDuty for threats, and Security Hub for centralized visibility.
Shared Responsibility: AWS secures the infrastructure, you secure what you put in the cloud. Understand where your responsibilities begin.
Network Isolation: Use VPC endpoints to keep traffic within AWS network. Use PrivateLink for private access to services. Segment networks with multiple subnets.
Secrets Management: Never hardcode credentials. Use Secrets Manager or Parameter Store with automatic rotation.
Cross-Account Access: Use IAM roles with trust policies, not IAM users. Implement SCPs at organization level for guardrails.
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between IAM users, groups, and roles
I understand when to use resource-based vs identity-based policies
I can design a multi-account architecture with Organizations and SCPs
I know how to implement cross-account access securely
I understand the difference between security groups and NACLs
I can design a VPC with proper network segmentation
I know when to use WAF, Shield, and GuardDuty
I understand the different S3 encryption options
I can explain how KMS works and when to use it
I know how to implement encryption in transit
I understand CloudTrail, Config, and their use cases
I can design a secure hybrid architecture with VPN or Direct Connect
Practice Questions
Try these from your practice test bundles:
Domain 1 Bundle 1: Questions 1-20 (IAM and access control)
Exam Weight: 26% of exam questions (approximately 17 out of 65 questions)
Section 1: Scalable and Loosely Coupled Architectures
Introduction
The problem: Traditional monolithic applications are tightly coupled, making them difficult to scale, update, and maintain. When one component fails, the entire application can fail. When traffic increases, you must scale the entire application even if only one component needs more capacity.
The solution: Loosely coupled architectures separate components so they can scale independently, fail independently, and be updated independently. Components communicate through well-defined interfaces (APIs, message queues, event buses) rather than direct dependencies.
Why it's tested: This is a core principle of cloud architecture and represents 26% of the exam. Questions test your ability to design systems that scale automatically, handle failures gracefully, and minimize dependencies between components.
Core Concepts
Loose Coupling Fundamentals
What it is: Loose coupling is an architectural principle where components are designed to have minimal dependencies on each other. Components interact through standardized interfaces and don't need to know the internal implementation details of other components.
Why it exists: Tightly coupled systems are fragile. If Component A directly calls Component B, and B fails, A fails. If B needs to be updated, A might break. If B is overloaded, A must wait. Loose coupling solves these problems by introducing intermediaries (queues, load balancers, event buses) that buffer and route requests.
Real-world analogy: Think of a restaurant. In a tightly coupled system, customers would go directly into the kitchen and tell the chef what to cook. If the chef is busy, customers wait. If the chef is sick, no one eats. In a loosely coupled system, customers place orders with a waiter (queue), the waiter gives orders to the kitchen (producer), and the kitchen prepares food at its own pace (consumer). If one chef is busy, another chef can take the order. If a chef is sick, orders queue up until another chef is available.
How loose coupling works (Detailed step-by-step):
Identify Components: Break your application into logical components (web tier, application tier, database tier, background processing, etc.).
Define Interfaces: Each component exposes a well-defined interface (REST API, message format, event schema) that other components use to interact with it.
Introduce Intermediaries: Place intermediaries between components:
Load Balancers: Distribute requests across multiple instances
Message Queues: Buffer requests between producers and consumers
Event Buses: Route events from publishers to subscribers
API Gateways: Provide a single entry point for multiple backend services
Implement Asynchronous Communication: Instead of synchronous request-response (Component A waits for Component B), use asynchronous messaging (Component A sends message and continues, Component B processes when ready).
Handle Failures Gracefully: Design components to handle failures of other components:
What it is: Amazon SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. SQS eliminates the complexity and overhead of managing message-oriented middleware.
Why it exists: When Component A produces work faster than Component B can process it, you need a buffer. Without a queue, Component A must either wait (wasting resources) or drop requests (losing data). SQS provides a reliable, scalable buffer that holds messages until consumers are ready to process them.
Real-world analogy: SQS is like a post office mailbox. You (producer) drop letters (messages) in the mailbox at any time, even if the mail carrier (consumer) isn't there. The mail carrier picks up letters when they're ready and delivers them. If you drop 100 letters at once, they wait in the mailbox until the carrier can handle them. If the carrier is sick, letters wait until another carrier is available.
How SQS works (Detailed step-by-step):
Create Queue: You create an SQS queue with a name and configuration (standard or FIFO, visibility timeout, message retention period).
Producer Sends Messages: Your application (producer) sends messages to the queue using the SQS SendMessage API. Each message can be up to 256 KB and contains:
Message Body: The actual data (JSON, XML, plain text)
Message Attributes: Metadata about the message (optional)
Message ID: Unique identifier assigned by SQS
Messages Stored: SQS stores messages redundantly across multiple Availability Zones for durability. Messages are retained for 4 days by default (configurable from 1 minute to 14 days).
Consumer Polls Queue: Your application (consumer) polls the queue using the SQS ReceiveMessage API. SQS returns up to 10 messages per request.
Visibility Timeout: When a consumer receives a message, SQS makes it invisible to other consumers for a visibility timeout period (default 30 seconds, configurable up to 12 hours). This prevents multiple consumers from processing the same message simultaneously.
Process Message: The consumer processes the message (e.g., resize image, send email, update database).
Delete Message: After successfully processing, the consumer deletes the message using the SQS DeleteMessage API. If the consumer doesn't delete the message before the visibility timeout expires, the message becomes visible again and another consumer can process it.
Failure Handling: If a consumer fails to process a message (crashes, throws exception), it doesn't delete the message. After the visibility timeout, the message becomes visible again for retry. After a configurable number of receive attempts (default 5), SQS can move the message to a Dead Letter Queue (DLQ) for investigation.
SQS Queue Types:
Standard Queue:
Throughput: Unlimited, nearly unlimited transactions per second
Ordering: Best-effort ordering (messages usually delivered in order, but not guaranteed)
Delivery: At-least-once delivery (message might be delivered more than once)
Use Case: High throughput, order doesn't matter, can handle duplicates
FIFO Queue:
Throughput: 300 transactions per second (3,000 with batching)
Ordering: Strict ordering (messages delivered in exact order sent)
Delivery: Exactly-once processing (no duplicates)
Use Case: Order matters, cannot handle duplicates (e.g., financial transactions)
Detailed Example 1: Image Processing Pipeline with SQS
Scenario: You're building a photo sharing application. Users upload photos that need to be resized into multiple sizes (thumbnail, medium, large) and have metadata extracted (location, date, camera model). Uploads are bursty - sometimes 10 photos per minute, sometimes 1,000 photos per minute.
Without SQS (Tightly Coupled):
Web server receives upload
Web server resizes images (CPU-intensive, takes 5 seconds per image)
Web server extracts metadata (takes 2 seconds per image)
User waits 7+ seconds for upload to complete
During traffic spikes, web servers become overloaded
Users experience timeouts and failed uploads
With SQS (Loosely Coupled):
Architecture:
Upload Service: Web servers receive uploads, store original image in S3, send message to SQS queue
SQS Queue: Buffers resize requests
Resize Workers: Auto Scaling group of EC2 instances polls queue, processes images
S3: Stores original and resized images
Step-by-Step Flow:
User Uploads Photo:
User uploads photo to web server
Web server stores original in S3: s3://photos/originals/photo123.jpg
Auto Scaling: Additional instances only during spikes
S3: Storage and transfer costs
Total: ~$100-200/month for millions of photos
Amazon SNS (Simple Notification Service)
What it is: Amazon SNS is a fully managed pub/sub (publish/subscribe) messaging service that enables you to decouple microservices, distributed systems, and event-driven serverless applications. SNS provides topics for high-throughput, push-based, many-to-many messaging.
Why it exists: Sometimes you need to send the same message to multiple recipients (fan-out pattern). With point-to-point messaging (like SQS), you'd need to send the message multiple times. SNS allows you to publish once and deliver to many subscribers simultaneously.
Real-world analogy: SNS is like a news broadcaster. The broadcaster (publisher) sends news (messages) to a channel (topic). Anyone interested (subscribers) can tune in to that channel. When news is broadcast, all subscribers receive it simultaneously. Subscribers can be TV viewers (Lambda functions), radio listeners (SQS queues), or newspaper readers (email addresses).
How SNS works (Detailed step-by-step):
Create Topic: You create an SNS topic, which is a communication channel with a unique ARN (Amazon Resource Name).
Subscribe Endpoints: You subscribe endpoints to the topic:
SQS Queue: Messages delivered to queue for processing
Lambda Function: Function invoked with message as input
HTTP/HTTPS Endpoint: POST request sent to your web server
Email/Email-JSON: Email sent to address
SMS: Text message sent to phone number
Mobile Push: Notification sent to mobile app
Publish Message: Your application publishes a message to the topic using the SNS Publish API. The message contains:
Subject: Brief description (optional)
Message: The actual content (up to 256 KB)
Message Attributes: Metadata for filtering (optional)
Fan-Out: SNS immediately delivers the message to all subscribed endpoints in parallel. Each subscriber receives a copy of the message.
Retry Logic: If delivery fails (e.g., Lambda function throttled, HTTP endpoint unavailable), SNS retries with exponential backoff. After multiple failures, SNS can send failed messages to a Dead Letter Queue.
Message Filtering: Subscribers can specify filter policies to receive only messages matching certain criteria. SNS evaluates filters and delivers only matching messages.
SNS vs SQS:
Feature
SNS (Pub/Sub)
SQS (Queue)
Pattern
Publish/Subscribe (1-to-many)
Point-to-Point (1-to-1)
Delivery
Push (SNS pushes to subscribers)
Pull (consumers poll queue)
Persistence
No (messages not stored)
Yes (messages stored up to 14 days)
Subscribers
Multiple (fan-out)
Single consumer per message
Use Case
Notify multiple systems of event
Decouple producer and consumer
SNS + SQS Fan-Out Pattern:
The most powerful pattern combines SNS and SQS: publish to SNS topic, which fans out to multiple SQS queues. Each queue has its own consumer that processes messages independently.
Detailed Example 2: Order Processing with SNS Fan-Out
Scenario: You're building an e-commerce platform. When a customer places an order, multiple systems need to be notified:
Inventory Service: Reduce stock levels
Shipping Service: Create shipping label
Email Service: Send confirmation email
Analytics Service: Record order for reporting
Fraud Detection Service: Check for suspicious activity
Architecture:
Order Service: Publishes order event to SNS topic
SNS Topic: "OrderPlaced" topic
SQS Queues: One queue per service (5 queues total)
Consumers: Each service has workers polling its queue
Step-by-Step Flow:
Customer Places Order:
Order service validates order, charges credit card
Order service publishes message to SNS topic "OrderPlaced":
Result: Only orders >= $1,000 delivered to fraud queue. Low-value orders filtered out, reducing processing load.
Amazon EventBridge
What it is: Amazon EventBridge is a serverless event bus service that makes it easy to connect applications using events. EventBridge receives events from AWS services, custom applications, and SaaS applications, and routes them to targets based on rules.
Why it exists: Modern applications are event-driven - things happen (user signs up, file uploaded, payment processed) and other systems need to react. EventBridge provides a central event bus where all events flow, with powerful routing and filtering capabilities.
Real-world analogy: EventBridge is like a smart mail sorting facility. Letters (events) arrive from many sources (AWS services, your apps, SaaS apps). The facility reads the address and contents (event pattern matching), then routes each letter to the correct destination (targets) based on rules. Some letters might go to multiple destinations (fan-out).
How EventBridge works (Detailed step-by-step):
Event Bus: You use the default event bus (receives AWS service events) or create custom event buses for your applications.
Event Sources: Events come from:
AWS Services: EC2 state changes, S3 object uploads, CloudWatch alarms
Custom Applications: Your apps send events via PutEvents API
SaaS Partners: Zendesk, Datadog, Auth0, etc.
Event Structure: Events are JSON documents with standard structure:
EventBridge sends event directly to S3 (no Lambda needed)
Event stored in S3: s3://security-logs/guardduty/2024/01/15/finding-12345.json
Retained for 7 years (compliance requirement)
Timeline:
T+0s: GuardDuty detects threat
T+1s: EventBridge receives event, matches rule
T+2s: Instance isolated
T+3s: Slack notification sent
T+4s: Jira ticket created
T+5s: Forensics snapshot initiated
T+5s: Event logged to S3
Total Response Time: 5 seconds (vs hours for manual response)
Benefits:
Fast Response: Automated response in seconds
Consistent: Same response every time, no human error
Comprehensive: Multiple actions in parallel
Auditable: All events logged to S3
Scalable: Handles 1 or 1,000 incidents identically
Cost:
EventBridge: $1 per million events (1,000 incidents = $0.001)
Lambda: $0.20 per million requests + compute time
Total: < $1/month for typical incident volume
AWS Lambda for Event-Driven Processing
What it is: AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the compute resources. You don't provision or manage servers - Lambda handles everything.
Why it exists: Traditional servers require provisioning, patching, scaling, and monitoring. For event-driven workloads (process file upload, respond to API call, handle queue message), you pay for idle time when no events occur. Lambda eliminates this waste by running code only when triggered and charging only for compute time used.
Real-world analogy: Lambda is like hiring a contractor instead of a full-time employee. You only pay when they're working on your project (per-request billing). You don't pay for their idle time, vacation, or benefits. When you need more work done, you hire more contractors (automatic scaling). When work is done, contractors leave (no idle resources).
How Lambda works (Detailed step-by-step):
Create Function: You upload your code (Python, Node.js, Java, Go, etc.) and specify:
Runtime: Programming language and version
Handler: Function to invoke (e.g., lambda_function.lambda_handler)
Memory: 128 MB to 10,240 MB (CPU scales proportionally)
Timeout: Maximum execution time (1 second to 15 minutes)
IAM Role: Permissions for function to access AWS services
Configure Trigger: You specify what invokes the function:
API Gateway: HTTP request
S3: Object upload
DynamoDB: Table update
SQS: Message in queue
EventBridge: Event pattern match
CloudWatch Events: Schedule (cron)
And 20+ other event sources
Event Occurs: When the trigger event happens, AWS invokes your Lambda function.
Lambda Execution:
Lambda finds an available execution environment (or creates new one)
Lambda loads your code into the environment
Lambda invokes your handler function with event data
Your code executes (processes event, calls AWS services, returns response)
Lambda captures logs and sends to CloudWatch Logs
Scaling: If multiple events occur simultaneously, Lambda automatically creates multiple execution environments and runs them in parallel. Lambda can scale to thousands of concurrent executions.
Billing: You pay for:
Requests: $0.20 per million requests
Compute Time: $0.0000166667 per GB-second (memory Ć duration)
Free Tier: 1 million requests and 400,000 GB-seconds per month
Detailed Example 4: Thumbnail Generation with Lambda
Scenario: Users upload images to S3. You need to automatically generate thumbnails (200x200) for each uploaded image.
Architecture:
S3 Bucket: Users upload images
S3 Event: Triggers Lambda on object creation
Lambda Function: Generates thumbnail
S3 Bucket: Stores thumbnail
Lambda Function Code (Python):
import boto3
import os
from PIL import Image
import io
s3 = boto3.client('s3')
def lambda_handler(event, context):
# Extract bucket and key from S3 event
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
# Don't process thumbnails (avoid infinite loop)
if key.startswith('thumbnails/'):
return
# Download image from S3
response = s3.get_object(Bucket=bucket, Key=key)
image_data = response['Body'].read()
# Open image with Pillow
image = Image.open(io.BytesIO(image_data))
# Resize to thumbnail (200x200)
image.thumbnail((200, 200))
# Save to bytes buffer
buffer = io.BytesIO()
image.save(buffer, format=image.format)
buffer.seek(0)
# Upload thumbnail to S3
thumbnail_key = f'thumbnails/{key}'
s3.put_object(
Bucket=bucket,
Key=thumbnail_key,
Body=buffer,
ContentType=response['ContentType']
)
return {
'statusCode': 200,
'body': f'Thumbnail created: {thumbnail_key}'
}
Parallel Processing: All 1,000 images processed simultaneously
Total Time: 500ms (same as single image)
Cost: 1,000 Ć $0.0000083 = $0.0083 (less than 1 cent)
Without Lambda (EC2 approach):
Need to provision enough EC2 instances to handle peak load (1,000 concurrent)
Instances idle most of the time (waste money)
Need to implement scaling, monitoring, patching
Cost: $100s/month for idle capacity
Lambda Benefits:
No Servers: No provisioning, patching, or management
Automatic Scaling: Handles 1 or 1,000,000 requests
Pay Per Use: Only pay for actual compute time
High Availability: Runs across multiple AZs automatically
Integrated: Native integration with 20+ AWS services
Section 2: High Availability and Fault Tolerance
Introduction
The problem: Hardware fails, software crashes, networks partition, and entire data centers can go offline. Traditional architectures with single points of failure experience downtime when components fail, resulting in lost revenue, poor user experience, and SLA violations.
The solution: High availability (HA) architectures eliminate single points of failure by deploying redundant components across multiple Availability Zones. When one component fails, traffic automatically shifts to healthy components. Fault tolerance goes further by ensuring the system continues operating correctly even during failures.
Why it's tested: This is a core AWS architectural principle and represents a significant portion of the exam. Questions test your ability to design systems that achieve 99.9%, 99.99%, or 99.999% availability using AWS services.
Core Concepts
Availability Zones and Regions
What they are: AWS Regions are geographic areas (e.g., us-east-1 in Virginia, eu-west-1 in Ireland) that contain multiple isolated Availability Zones (AZs). Each AZ is one or more discrete data centers with redundant power, networking, and connectivity.
Why they exist: A single data center can fail due to power outages, network issues, natural disasters, or human error. By distributing resources across multiple physically separated data centers (AZs), you can survive individual data center failures. Regions provide geographic diversity for disaster recovery and data residency requirements.
Real-world analogy: Think of a Region as a city (e.g., New York) and Availability Zones as different neighborhoods in that city (Manhattan, Brooklyn, Queens). Each neighborhood has its own power grid, water supply, and infrastructure. If Manhattan loses power, Brooklyn and Queens continue operating. If you need disaster recovery, you also have resources in a different city (e.g., Los Angeles).
How AZs work (Detailed):
Physical Separation: AZs are physically separated by meaningful distances (miles apart) to reduce risk of simultaneous failure from natural disasters, power outages, or network issues.
Independent Infrastructure: Each AZ has:
Independent power supply (multiple utility providers, backup generators)
Independent cooling systems
Independent network connectivity (multiple ISPs)
Independent physical security
Low-Latency Interconnection: AZs are connected with high-bandwidth, low-latency private fiber networks. Latency between AZs in the same Region is typically < 2ms, enabling synchronous replication.
Fault Isolation: Failures in one AZ don't affect other AZs. AWS designs services to isolate faults within a single AZ.
Availability Zone Naming:
AZ names are account-specific (your us-east-1a might be different from another account's us-east-1a)
This distributes load across physical AZs
Use AZ IDs (use1-az1, use1-az2) for consistent identification across accounts
Detailed Example 1: Multi-AZ RDS Deployment
Scenario: You're running a MySQL database for a critical e-commerce application. The database must be available 99.95% of the time (< 4.5 hours downtime per year). Single-AZ deployment doesn't meet this requirement because AZ failures occur occasionally.
Solution: RDS Multi-AZ deployment.
Architecture:
Primary DB Instance: In AZ-A (us-east-1a), handles all read and write operations
Standby DB Instance: In AZ-B (us-east-1b), synchronously replicates from primary
DNS Endpoint: Single endpoint (mydb.abc123.us-east-1.rds.amazonaws.com) that points to current primary
How Multi-AZ Works:
Normal Operation:
Application connects to DNS endpoint
DNS resolves to primary instance IP in AZ-A
Application sends queries to primary
Primary processes queries and returns results
Primary synchronously replicates every transaction to standby in AZ-B
Standby acknowledges replication before primary commits transaction
This ensures zero data loss (RPO = 0)
Synchronous Replication:
Application writes data: INSERT INTO orders VALUES (...)
Primary writes to its storage
Primary sends transaction to standby
Standby writes to its storage
Standby sends acknowledgment to primary
Primary commits transaction and returns success to application
Replication adds < 5ms latency (AZs are close)
Failure Detection:
RDS continuously monitors primary instance health
Health checks every 1-2 seconds:
Network connectivity
Instance responsiveness
Storage availability
Database process status
If 3 consecutive health checks fail (3-6 seconds), RDS initiates failover
Automatic Failover:
RDS detects primary failure (e.g., AZ-A power outage)
RDS promotes standby in AZ-B to primary
RDS updates DNS record to point to new primary IP
DNS TTL is 30 seconds, but RDS forces immediate update
Applications reconnect and resume operations
Total failover time: 60-120 seconds
Post-Failover:
New primary (formerly standby) handles all traffic
RDS automatically creates new standby in another AZ (AZ-C)
Synchronous replication resumes
System returns to fully redundant state
Failure Scenarios:
Scenario 1: AZ-A Power Outage:
T+0s: Power outage in AZ-A, primary instance becomes unreachable
T+3s: RDS detects failure (3 failed health checks)
T+5s: RDS initiates failover, promotes standby
T+30s: DNS propagates to most clients
T+60s: Applications reconnect to new primary
T+120s: All applications operational
Downtime: 60-120 seconds
Data Loss: Zero (synchronous replication)
Scenario 2: Primary Instance Crash:
T+0s: Database process crashes on primary
T+2s: RDS detects failure
T+5s: RDS initiates failover
T+60s: Applications reconnect
Downtime: 60 seconds
Data Loss: Zero
Scenario 3: Storage Failure:
T+0s: EBS volume fails on primary
T+3s: RDS detects failure
T+5s: RDS initiates failover
T+60s: Applications operational on standby
Downtime: 60 seconds
Data Loss: Zero
Scenario 4: Planned Maintenance:
You need to upgrade database version
RDS performs maintenance on standby first
RDS fails over to upgraded standby (60 seconds downtime)
RDS upgrades old primary (now standby)
Downtime: 60 seconds (vs hours for single-AZ)
What You Get:
High Availability: 99.95% uptime SLA
Zero Data Loss: Synchronous replication (RPO = 0)
Fast Recovery: 60-120 second failover (RTO = 1-2 minutes)
Automatic: No manual intervention required
Transparent: Same endpoint before and after failover
Worth it for production workloads requiring high availability
Important Notes:
Standby is not accessible for reads (use read replicas for read scaling)
Failover is automatic, but applications must handle reconnection
Use connection pooling with retry logic for seamless failover
Multi-AZ is within a single Region (use cross-region read replicas for DR)
Elastic Load Balancing
What it is: Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple targets (EC2 instances, containers, IP addresses, Lambda functions) in multiple Availability Zones.
Why it exists: Without a load balancer, you'd need to manually distribute traffic across instances, handle instance failures, and manage scaling. Load balancers automate this, providing high availability, fault tolerance, and automatic scaling.
Real-world analogy: A load balancer is like a restaurant host who seats customers. Instead of customers choosing their own table (which could overload some servers while others are idle), the host distributes customers evenly across all servers. If a server is busy or unavailable, the host sends customers to other servers. If the restaurant gets crowded, the host calls in more servers.
Routes based on content (URL path, hostname, headers, query parameters)
Supports WebSocket and HTTP/2
Integrates with AWS WAF for application security
Best for web applications and microservices
Network Load Balancer (NLB) - Layer 4 (TCP/UDP):
Ultra-high performance (millions of requests per second)
Static IP addresses (Elastic IPs)
Preserves source IP address
Best for TCP/UDP traffic, extreme performance requirements
Gateway Load Balancer (GWLB) - Layer 3 (IP):
Deploys, scales, and manages third-party virtual appliances
Transparent network gateway + load balancer
Best for firewalls, intrusion detection, deep packet inspection
How ALB Works (Detailed step-by-step):
Create Load Balancer:
Choose subnets in multiple AZs (minimum 2)
ALB creates load balancer nodes in each subnet
Each node has its own IP address
DNS name resolves to all node IPs (round-robin)
Configure Target Groups:
Target group is a logical grouping of targets (EC2 instances, IPs, Lambda functions)
Define health check: protocol, path, interval, timeout, thresholds
Example: HTTP GET /health every 30 seconds, timeout 5 seconds, 2 consecutive successes = healthy
Register Targets:
Add EC2 instances to target group
ALB starts sending health checks to each target
Targets must pass health checks before receiving traffic
Configure Listeners:
Listener checks for connection requests on specified protocol and port
Example: HTTPS listener on port 443
Listener rules route requests to target groups based on conditions
Traffic Flow:
Client sends request to ALB DNS name
DNS resolves to ALB node IPs (multiple IPs for redundancy)
Client connects to ALB node
ALB terminates TLS connection (if HTTPS)
ALB selects healthy target using routing algorithm (round-robin, least outstanding requests)
ALB forwards request to target
Target processes request and returns response
ALB forwards response to client
Health Checks:
ALB continuously sends health checks to all targets
If target fails health check (returns non-200 status, times out), ALB marks it unhealthy
ALB stops sending traffic to unhealthy targets
When target passes health checks again, ALB resumes sending traffic
Auto Scaling Integration:
Auto Scaling group launches/terminates instances based on load
New instances automatically registered with target group
ALB starts health checking new instances
Once healthy, ALB sends traffic to new instances
Terminated instances automatically deregistered
Detailed Example 2: High-Availability Web Application with ALB
Scenario: You're deploying a web application that must handle 10,000 requests per second with 99.99% availability. The application runs on EC2 instances and must survive AZ failures.
Architecture:
ALB: In 3 AZs (us-east-1a, us-east-1b, us-east-1c)
Auto Scaling Group: Launches EC2 instances across 3 AZs
Target Group: Contains all EC2 instances
Minimum Instances: 6 (2 per AZ)
Maximum Instances: 30 (10 per AZ)
Step-by-Step Flow:
Initial Deployment:
Auto Scaling launches 6 t3.medium instances (2 per AZ)
Instances install application, start web server
ALB health checks instances (GET /health)
After 2 successful health checks (60 seconds), instances marked healthy
ALB starts sending traffic
Normal Traffic (1,000 req/sec):
Clients send requests to ALB DNS: myapp-123456.us-east-1.elb.amazonaws.com
DNS returns 3 IP addresses (one per AZ)
Clients connect to ALB nodes
ALB distributes traffic evenly: ~167 req/sec per instance
All instances healthy, handling load comfortably
Traffic Spike (10,000 req/sec):
Traffic increases 10x
CloudWatch alarm triggers: CPU > 70%
Auto Scaling adds 12 instances (4 per AZ)
New instances launch, install application (5 minutes)
ALB health checks new instances
Once healthy, ALB includes in rotation
Traffic distributed across 18 instances: ~556 req/sec per instance
CPU drops to 50%, system stable
AZ Failure (us-east-1a):
Power outage in us-east-1a
6 instances in us-east-1a become unreachable
ALB health checks fail for us-east-1a instances
After 2 failed health checks (60 seconds), ALB marks them unhealthy
ALB stops sending traffic to us-east-1a
ALB redistributes traffic to us-east-1b and us-east-1c (12 instances)
Traffic per instance: ~833 req/sec
CPU increases to 65%, still acceptable
Auto Scaling detects high CPU, adds 6 more instances in us-east-1b and us-east-1c
System returns to normal load distribution
AZ Recovery:
Power restored in us-east-1a
Instances in us-east-1a restart
ALB health checks pass
ALB resumes sending traffic to us-east-1a
Traffic redistributes across all 3 AZs
Failure Scenarios:
Scenario 1: Single Instance Failure:
Instance crashes (application bug, out of memory)
ALB health check fails
After 60 seconds, ALB marks instance unhealthy
ALB stops sending traffic to failed instance
Traffic redistributed to healthy instances
Auto Scaling detects failed instance, terminates it
Auto Scaling launches replacement instance
Impact: None (other instances handle traffic)
Recovery: 5 minutes (new instance launch time)
Scenario 2: Entire AZ Failure:
AZ-A fails (power, network, AWS issue)
All instances in AZ-A unreachable
ALB marks all AZ-A instances unhealthy
ALB sends traffic only to AZ-B and AZ-C
Impact: Minimal (60 seconds to detect, traffic redistributed)
Capacity: Reduced by 33%, but Auto Scaling adds instances
Recovery: Automatic when AZ recovers
Scenario 3: ALB Node Failure:
ALB node in AZ-A fails (extremely rare)
Clients connecting to that node experience errors
Clients retry, connect to ALB nodes in AZ-B or AZ-C
Impact: Minimal (clients retry automatically)
Recovery: Immediate (other ALB nodes available)
Scenario 4: Deployment Gone Wrong:
You deploy new application version
New version has bug, returns 500 errors
ALB health checks fail for new instances
ALB keeps sending traffic to old instances (still healthy)
You rollback deployment
Impact: None (ALB prevented bad deployment from affecting users)
ALB Features for High Availability:
Cross-Zone Load Balancing (enabled by default):
Distributes traffic evenly across all targets in all AZs
Without it: Traffic distributed evenly to AZs, then to targets within AZ
With it: Traffic distributed evenly to all targets regardless of AZ
Example: 2 instances in AZ-A, 4 instances in AZ-B
Without cross-zone: AZ-A instances get 25% each, AZ-B instances get 12.5% each
With cross-zone: All instances get 16.67% each
Connection Draining (deregistration delay):
When instance is deregistered (terminating, unhealthy), ALB stops sending new requests
ALB waits for in-flight requests to complete (default 300 seconds)
Prevents abrupt connection termination
Ensures graceful shutdown
Sticky Sessions (session affinity):
Routes requests from same client to same target
Uses cookie to track client-target mapping
Useful for applications that store session state locally
Duration: 1 second to 7 days
Slow Start Mode:
Gradually increases traffic to newly registered targets
Gives targets time to warm up (load caches, establish connections)
Duration: 30 to 900 seconds
Prevents overwhelming new instances
What You Get:
High Availability: 99.99% SLA (ALB itself is highly available)
Fault Tolerance: Survives instance and AZ failures
Automatic Scaling: Integrates with Auto Scaling
Health Checks: Automatic detection and removal of unhealthy targets
SSL Termination: Offloads TLS processing from instances
Content-Based Routing: Route based on URL, headers, etc.
Cost:
ALB: $0.0225/hour = $16.43/month
LCU (Load Balancer Capacity Unit): $0.008 per LCU-hour
LCU measures: new connections, active connections, processed bytes, rule evaluations
Typical cost: $50-200/month depending on traffic
Auto Scaling
What it is: Amazon EC2 Auto Scaling automatically adjusts the number of EC2 instances in response to changing demand. It ensures you have the right number of instances to handle your application load while minimizing costs.
Why it exists: Manual scaling is slow, error-prone, and inefficient. You either over-provision (waste money on idle instances) or under-provision (poor performance during spikes). Auto Scaling automates this, scaling out during high demand and scaling in during low demand.
Real-world analogy: Auto Scaling is like a restaurant manager who adjusts staffing based on customer volume. During lunch rush, the manager calls in more servers. During slow periods, the manager sends servers home. The manager monitors wait times (performance metrics) and adjusts staffing to maintain service quality while controlling labor costs.
How Auto Scaling Works (Detailed step-by-step):
Create Launch Template:
Defines instance configuration: AMI, instance type, security groups, user data
Like a blueprint for launching instances
Can have multiple versions for easy updates
Create Auto Scaling Group (ASG):
Specify launch template
Choose VPC subnets (multiple AZs for high availability)
Set capacity:
Minimum: Minimum number of instances (always running)
Desired: Target number of instances
Maximum: Maximum number of instances (cost control)
Example: Min=2, Desired=4, Max=10
Configure Health Checks:
EC2 Health Check: Instance running and reachable
ELB Health Check: Instance passing load balancer health checks
Unhealthy instances automatically replaced
Create Scaling Policies:
Target Tracking: Maintain metric at target value (e.g., CPU at 50%)
Step Scaling: Add/remove instances based on CloudWatch alarms
Scheduled Scaling: Scale at specific times (e.g., scale up at 9 AM)
Predictive Scaling: Use ML to predict future load and scale proactively
Minimum 4 instances (2 per AZ) ensures service during AZ failure
Auto Scaling automatically replaces failed instances
Distributes instances evenly across AZs
Integrates with load balancer for seamless failover
Disaster Recovery Strategies
What it is: Disaster Recovery (DR) is the process of preparing for and recovering from events that negatively affect business operations. DR strategies define how quickly you can recover (RTO) and how much data you can afford to lose (RPO).
Why it exists: Disasters happen - natural disasters, cyber attacks, human errors, hardware failures. Without a DR plan, these events can cause permanent data loss, extended downtime, and business failure. DR strategies provide a roadmap for recovery.
Real-world analogy: DR is like having insurance and emergency plans for your house. You have smoke detectors (monitoring), fire extinguishers (immediate response), insurance (financial protection), and a plan for where your family will stay if the house burns down (recovery strategy). The level of preparation depends on risk tolerance and budget.
Key Metrics:
Recovery Time Objective (RTO):
How long can your business survive without the system?
Time from disaster to full recovery
Example: RTO = 4 hours means system must be operational within 4 hours
Recovery Point Objective (RPO):
How much data can your business afford to lose?
Time between last backup and disaster
Example: RPO = 1 hour means you can lose up to 1 hour of data
DR Strategies (from least to most expensive):
1. Backup and Restore (Lowest Cost, Highest RTO/RPO)
What it is: Regularly back up data to AWS (S3, Glacier). When disaster occurs, provision infrastructure and restore data from backups.
RTO: Hours to days (time to provision infrastructure + restore data) RPO: Hours (time since last backup) Cost: Very low (only pay for backup storage)
How it works:
Normal Operation: Application runs on-premises or in primary AWS region
Backup: Daily/hourly backups to S3 using AWS Backup, snapshots, or custom scripts
Disaster: Primary site fails
Recovery:
Provision infrastructure (EC2, RDS, etc.) using CloudFormation
Restore data from S3/Glacier
Update DNS to point to new infrastructure
Resume operations
Example:
Primary: On-premises data center
Backup: Daily database backups to S3, weekly full backups to Glacier
Disaster: Data center floods
Recovery:
Day 1: Provision EC2 instances and RDS in AWS (4 hours)
Day 1: Restore database from last night's backup (2 hours)
Day 1: Update DNS, test application (2 hours)
Total RTO: 8 hours
RPO: 24 hours (lost 1 day of data)
When to use:
ā Non-critical applications (can tolerate hours of downtime)
ā Budget-constrained (minimal ongoing cost)
ā Infrequent data changes (low RPO acceptable)
ā Compliance requires backups but not high availability
Cost: $50-500/month (backup storage only)
2. Pilot Light (Low Cost, Medium RTO/RPO)
What it is: Maintain minimal infrastructure in DR site (database replication only). When disaster occurs, quickly scale up remaining infrastructure.
What it is: Maintain scaled-down but fully functional version of production environment in DR site. When disaster occurs, scale up to production capacity.
RTO: Minutes (infrastructure running, just needs scaling) RPO: Seconds to minutes (continuous replication) Cost: Medium (running infrastructure at reduced capacity)
How it works:
Normal Operation: Full production in primary region
Warm Standby: Scaled-down version in DR region (e.g., 25% capacity)
What it is: Run full production capacity in multiple regions simultaneously. Traffic distributed across all regions. When disaster occurs, remaining regions absorb traffic.
RTO: Zero to seconds (no recovery needed, automatic failover) RPO: Zero to seconds (synchronous or near-synchronous replication) Cost: High (2x+ production cost)
How it works:
Normal Operation: Full production in multiple regions
Cost: $10,000-50,000+/month (2-3x production cost)
DR Strategy Comparison:
Strategy
RTO
RPO
Cost
Use Case
Backup & Restore
Hours-Days
Hours
$
Non-critical, budget-constrained
Pilot Light
Minutes-Hours
Minutes
$$
Business-critical, moderate budget
Warm Standby
Minutes
Seconds
$$$
Mission-critical, need fast recovery
Active-Active
Seconds
Seconds
$$$$
Zero-downtime, global applications
Amazon Route 53 for High Availability
What it is: Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service. Route 53 connects user requests to infrastructure running in AWS or on-premises.
Why it exists: DNS is critical infrastructure - if DNS fails, users can't reach your application even if it's running perfectly. Route 53 provides 100% availability SLA and advanced routing policies for high availability and disaster recovery.
Real-world analogy: Route 53 is like a GPS navigation system. When you want to go somewhere (access a website), GPS (Route 53) tells you the best route based on current conditions (traffic, road closures). If your usual route is blocked (server down), GPS automatically reroutes you to an alternate path (healthy server).
Route 53 Routing Policies:
1. Simple Routing:
Returns single resource (one IP address)
No health checks
Use case: Single server, no failover needed
2. Weighted Routing:
Distributes traffic across multiple resources based on weights
Example: 70% to us-east-1, 30% to us-west-2
Use case: A/B testing, gradual migration, traffic distribution
3. Latency-Based Routing:
Routes to resource with lowest latency for user
Route 53 measures latency from user's location to each region
Use case: Global applications, optimize user experience
4. Failover Routing:
Routes to primary resource, fails over to secondary if primary unhealthy
Requires health checks
Use case: Active-passive DR, simple failover
5. Geolocation Routing:
Routes based on user's geographic location
Example: EU users ā eu-west-1, US users ā us-east-1
Use case: Content localization, data residency compliance
6. Geoproximity Routing:
Routes based on geographic location with bias
Can shift traffic toward or away from resources
Use case: Gradual traffic migration, load balancing with geographic preference
7. Multi-Value Answer Routing:
Returns multiple IP addresses (up to 8)
Client chooses which to use
Health checks ensure only healthy IPs returned
Use case: Simple load balancing, multiple healthy resources
Health Checks:
Route 53 health checks monitor endpoint health and automatically route traffic away from unhealthy endpoints.
Health Check Types:
Endpoint Health Check: Monitors specific IP or domain
Protocol: HTTP, HTTPS, TCP
Interval: 30 seconds (standard) or 10 seconds (fast)
Amazon Route 53 routing policies and health checks
Critical Takeaways
Loose Coupling: Decouple components using queues (SQS), pub/sub (SNS), and event buses (EventBridge). This enables independent scaling, fault isolation, and easier maintenance.
Multi-AZ for High Availability: Always deploy across multiple Availability Zones. Use RDS Multi-AZ for databases, ALB across multiple AZs, and Auto Scaling with minimum 2 instances per AZ.
SQS vs SNS: Use SQS for point-to-point messaging (producer ā queue ā consumer). Use SNS for fan-out (publisher ā topic ā multiple subscribers). Combine them for powerful patterns.
Auto Scaling: Use target tracking policies for dynamic scaling, scheduled policies for predictable patterns, and set appropriate min/max/desired capacity for cost control and availability.
DR Strategy Selection: Choose based on RTO/RPO requirements and budget. Backup & Restore (cheapest, slowest), Pilot Light (moderate), Warm Standby (faster), Active-Active (fastest, most expensive).
Health Checks: Always configure health checks for load balancers, Auto Scaling, and Route 53. Health checks enable automatic detection and recovery from failures.
Route 53 Routing: Use latency-based routing for global applications, failover routing for DR, weighted routing for A/B testing, and geolocation for compliance.
Lambda for Events: Use Lambda for event-driven processing (S3 uploads, SQS messages, EventBridge events). Lambda scales automatically and you only pay for execution time.
Self-Assessment Checklist
Test yourself before moving on:
I understand the difference between loose coupling and tight coupling
I can explain when to use SQS vs SNS
I know how SQS visibility timeout works
I understand the SNS + SQS fan-out pattern
I can describe how EventBridge routes events
I know when to use Lambda vs EC2
I understand Multi-AZ deployments for RDS
I can explain how ALB health checks work
I know how Auto Scaling policies work (target tracking, step, scheduled)
I understand the 4 DR strategies and when to use each
I can calculate RTO and RPO for different scenarios
I know Route 53 routing policies and their use cases
I understand how Route 53 health checks enable failover
sequenceDiagram
participant P as Producer
participant SQS as SQS Queue
participant C1 as Consumer 1
participant C2 as Consumer 2
P->>SQS: Send Message 1
P->>SQS: Send Message 2
P->>SQS: Send Message 3
Note over SQS: Messages stored<br/>redundantly across AZs
C1->>SQS: Poll for messages
SQS-->>C1: Return Message 1
Note over C1: Processing...<br/>(Visibility timeout: 30s)
C2->>SQS: Poll for messages
SQS-->>C2: Return Message 2
C1->>SQS: Delete Message 1
Note over SQS: Message 1 removed
C2->>SQS: Delete Message 2
Note over SQS: Message 2 removed
See: diagrams/03_domain2_sqs_standard_flow.mmd
Diagram Explanation (Detailed): This sequence diagram illustrates how SQS Standard queues handle message processing with multiple consumers. The Producer sends three messages to the SQS queue, which stores them redundantly across multiple Availability Zones for durability (99.999999999% durability). When Consumer 1 polls the queue, it receives Message 1, which immediately becomes invisible to other consumers for the visibility timeout period (default 30 seconds). This prevents duplicate processing. Meanwhile, Consumer 2 can poll and receive Message 2 simultaneously, enabling parallel processing. The visibility timeout gives each consumer time to process and delete the message. If a consumer fails to delete the message within the timeout, it becomes visible again for retry. After successful processing, consumers explicitly delete messages from the queue. This pattern enables horizontal scaling - you can add more consumers to process messages faster. The at-least-once delivery guarantee means messages might be delivered multiple times, so your processing logic should be idempotent. Standard queues provide unlimited throughput (thousands of messages per second) and best-effort ordering, making them ideal for high-throughput scenarios where strict ordering isn't required.
Detailed Example 1: E-commerce Order Processing An e-commerce platform receives 10,000 orders per minute during Black Friday sales. Each order needs to be validated, charged, and fulfilled. The system uses an SQS Standard queue to decouple order submission from processing. When a customer places an order, the web application sends a message to the SQS queue containing order details (order ID, customer ID, items, total). The message is immediately acknowledged, and the customer sees "Order received" within 100ms. Behind the scenes, 50 EC2 instances running order processing workers continuously poll the queue using long polling (20-second wait time to reduce empty responses). Each worker receives a batch of up to 10 messages, processes them in parallel, and deletes successfully processed messages. If a worker crashes while processing, the visibility timeout (set to 5 minutes) ensures the message becomes visible again for another worker to retry. The system handles the traffic spike without losing orders, and customers don't experience delays because order submission is decoupled from processing.
Detailed Example 2: Image Processing Pipeline A photo-sharing application allows users to upload images that need to be resized into multiple formats (thumbnail, medium, large). When a user uploads an image to S3, an S3 event notification sends a message to an SQS queue. The message contains the S3 bucket name and object key. A fleet of Lambda functions (configured with SQS as an event source) automatically polls the queue and processes images in parallel. Each Lambda function downloads the original image from S3, creates three resized versions using ImageMagick, uploads them back to S3, and deletes the message from the queue. If a Lambda function times out (15-minute limit), the message becomes visible again after the visibility timeout (10 minutes) and another Lambda function retries it. The system automatically scales based on queue depth - AWS Lambda can scale to 1,000 concurrent executions, processing 1,000 images simultaneously. This architecture handles traffic spikes without provisioning servers and only charges for actual processing time.
Detailed Example 3: Log Aggregation System A distributed application running on 500 EC2 instances needs to centralize logs for analysis. Each instance sends log entries to an SQS queue (up to 256 KB per message). A log aggregation service with 10 consumer instances polls the queue, batches log entries, and writes them to S3 in compressed format every 5 minutes. The visibility timeout is set to 10 minutes to allow time for batching and S3 upload. If a consumer crashes, another consumer picks up the messages after the timeout. The system uses SQS's at-least-once delivery, so the log aggregation service deduplicates entries based on a unique log ID before writing to S3. This architecture handles 100,000 log entries per second without losing data, and the decoupled design allows the log aggregation service to be updated without affecting the application instances.
ā Must Know (Critical Facts):
Unlimited throughput: SQS Standard can handle thousands of messages per second per API action (SendMessage, ReceiveMessage, DeleteMessage)
At-least-once delivery: Messages are delivered at least once, but occasionally more than once (design for idempotency)
Best-effort ordering: Messages are generally delivered in the order sent, but not guaranteed (use FIFO for strict ordering)
Visibility timeout: Default 30 seconds, configurable 0 seconds to 12 hours (set based on processing time)
Message retention: Default 4 days, configurable 1 minute to 14 days (messages auto-delete after retention period)
Message size: Maximum 256 KB per message (use S3 for larger payloads with Extended Client Library)
Long polling: Reduces empty responses and costs by waiting up to 20 seconds for messages (recommended over short polling)
Dead Letter Queue: Automatically moves messages that fail processing after maxReceiveCount attempts (useful for debugging)
SQS FIFO Queue Flow
š SQS FIFO Message Flow Diagram:
sequenceDiagram
participant P as Producer
participant SQS as SQS FIFO Queue
participant C as Consumer
P->>SQS: Send Message 1 (Group A)
P->>SQS: Send Message 2 (Group A)
P->>SQS: Send Message 3 (Group B)
P->>SQS: Send Message 4 (Group A)
Note over SQS: Strict ordering<br/>within message groups
C->>SQS: Poll for messages
SQS-->>C: Message 1 (Group A)
C->>SQS: Delete Message 1
C->>SQS: Poll for messages
SQS-->>C: Message 2 (Group A)
Note over C: Must process in order<br/>within Group A
C->>SQS: Delete Message 2
C->>SQS: Poll for messages
SQS-->>C: Message 3 (Group B)
Note over SQS: Group B can be processed<br/>in parallel with Group A
See: diagrams/03_domain2_sqs_fifo_flow.mmd
Diagram Explanation (Detailed): This sequence diagram demonstrates SQS FIFO (First-In-First-Out) queue behavior with message groups. The Producer sends four messages, with Messages 1, 2, and 4 belonging to Group A, and Message 3 belonging to Group B. FIFO queues guarantee strict ordering within each message group - Messages 1, 2, and 4 will be delivered to consumers in exactly that order. The Consumer must process and delete Message 1 before receiving Message 2 from Group A. However, Message 3 from Group B can be processed in parallel because it's in a different message group. This allows for parallelism while maintaining ordering where it matters. Message groups are defined by the MessageGroupId attribute set by the producer. FIFO queues also provide exactly-once processing using MessageDeduplicationId - if the same message is sent twice within the 5-minute deduplication interval, SQS automatically discards the duplicate. This is critical for financial transactions or inventory updates where duplicate processing would cause errors. FIFO queues have a throughput limit of 300 messages per second (3,000 with batching), which is lower than Standard queues but sufficient for most ordered processing scenarios. The queue name must end with .fifo suffix.
Detailed Example 1: Stock Trading Order Processing A stock trading platform receives buy and sell orders that must be processed in the exact order received to ensure fair pricing. Each user's orders are assigned a MessageGroupId based on their user ID. When User A places three orders (Buy 100 shares, Sell 50 shares, Buy 25 shares), they're sent to an SQS FIFO queue with MessageGroupId="UserA". The order processing system polls the queue and receives orders in exact sequence. It processes "Buy 100" first, updating the user's portfolio, then "Sell 50", then "Buy 25". Meanwhile, User B's orders (MessageGroupId="UserB") are processed in parallel by another consumer, maintaining ordering per user while allowing concurrent processing across users. The exactly-once delivery guarantee ensures that if the producer retries due to a network error, duplicate orders aren't created. The system uses MessageDeduplicationId based on a hash of order details (user ID + timestamp + order type + quantity). This architecture ensures regulatory compliance (orders must be processed in sequence) while maintaining high throughput (thousands of users trading simultaneously).
Detailed Example 2: Banking Transaction Processing A banking system processes account transactions (deposits, withdrawals, transfers) that must be applied in order to maintain accurate balances. Each account's transactions use MessageGroupId based on account number. When Account 12345 has three transactions (Deposit $1000, Withdraw $500, Deposit $200), they're sent to an SQS FIFO queue. The transaction processor receives them in exact order, updating the account balance sequentially: $0 ā $1000 ā $500 ā $700. If the processor crashes after the first transaction, the visibility timeout ensures the second transaction isn't processed until the first is confirmed deleted. The exactly-once processing prevents duplicate transactions - if a deposit message is sent twice due to a retry, SQS deduplicates it using MessageDeduplicationId (transaction ID). This prevents the dreaded "double deposit" bug. The system processes 10,000 accounts concurrently (each account is a message group), achieving 300 transactions per second per account while maintaining strict ordering and exactly-once semantics.
ā Must Know (Critical Facts):
Strict ordering: Messages within a message group are delivered in exact FIFO order (guaranteed)
Diagram Explanation (Detailed): This architecture diagram illustrates the SNS fan-out pattern, where a single message published to an SNS topic is automatically delivered to multiple subscribers simultaneously. The Producer Application publishes one message to the SNS Topic (e.g., "Order Placed" event). SNS immediately fans out this message to all four subscribers: SQS Queue 1 for order processing, SQS Queue 2 for inventory updates, a Lambda function for sending email notifications, and an HTTP endpoint for an external system. Each subscriber receives the same message independently and processes it according to its own logic. This pattern decouples the producer from consumers - the producer doesn't need to know how many systems need the data or how they process it. If a new system needs order data, you simply add another subscription without changing the producer. SNS provides at-least-once delivery to each subscriber with automatic retries (up to 100,015 retries over 23 days for HTTP endpoints). The fan-out pattern is ideal for event-driven architectures where multiple systems need to react to the same event. SNS supports up to 12.5 million subscriptions per topic and 100,000 topics per account, enabling massive scale. Message filtering allows subscribers to receive only relevant messages based on message attributes, reducing unnecessary processing.
Detailed Example 1: E-commerce Order Workflow When a customer places an order on an e-commerce website, multiple backend systems need to be notified simultaneously. The order service publishes an "OrderPlaced" message to an SNS topic containing order details (order ID, customer ID, items, total, shipping address). SNS fans out to five subscribers: (1) SQS queue for payment processing - charges the customer's credit card, (2) SQS queue for inventory management - reserves items and updates stock levels, (3) SQS queue for shipping - creates shipping label and schedules pickup, (4) Lambda function - sends order confirmation email to customer, (5) HTTP endpoint - notifies external analytics platform for business intelligence. Each system processes the order independently and at its own pace. If the email service is down, it doesn't affect payment or shipping. The SQS queues buffer messages, so if inventory management is slow, messages wait in the queue without blocking other systems. This architecture reduces order processing time from 5 seconds (sequential) to 1 second (parallel) and improves reliability - if one system fails, others continue working.
Detailed Example 2: IoT Sensor Data Distribution An IoT platform collects temperature data from 10,000 sensors deployed in warehouses. Each sensor publishes temperature readings to an SNS topic every minute. SNS fans out to multiple subscribers: (1) Kinesis Data Firehose - stores all readings in S3 for long-term analysis, (2) Lambda function - checks for temperature anomalies and triggers alerts if temperature exceeds thresholds, (3) SQS queue - feeds real-time dashboard showing current temperatures, (4) HTTP endpoint - sends data to third-party monitoring service. The fan-out pattern allows adding new consumers without modifying sensor code. When the company adds a machine learning system to predict equipment failures, they simply add another subscription. SNS handles 10,000 messages per minute (167 per second) easily, and each subscriber processes data independently. Message filtering is used so the alert Lambda only receives messages where temperature > 80°F, reducing unnecessary invocations and costs.
ā Must Know (Critical Facts):
Fan-out pattern: One message published to SNS is delivered to all subscribers simultaneously (parallel processing)
Message filtering: Subscribers can filter messages based on message attributes (reduces unnecessary processing)
Delivery retries: Automatic retries with exponential backoff (up to 100,015 retries for HTTP)
Message size: Maximum 256 KB per message (same as SQS)
Throughput: Unlimited (can handle millions of messages per second)
Durability: Messages stored redundantly across multiple AZs (99.999999999% durability)
SNS + SQS pattern: Combine for reliable fan-out with buffering and retry logic (best practice)
EventBridge Event Routing
š EventBridge Event Routing Diagram:
graph TB
subgraph "Event Sources"
EC2[EC2 State Change]
S3[S3 Object Created]
Custom[Custom Application]
end
subgraph "EventBridge"
Bus[Event Bus]
Rule1[Rule 1: EC2 Stopped]
Rule2[Rule 2: S3 Upload]
Rule3[Rule 3: Custom Event]
end
subgraph "Targets"
Lambda1[Lambda: Notify Team]
Lambda2[Lambda: Process File]
SQS[SQS: Queue for Processing]
SNS[SNS: Alert Topic]
end
EC2 --> Bus
S3 --> Bus
Custom --> Bus
Bus --> Rule1
Bus --> Rule2
Bus --> Rule3
Rule1 --> Lambda1
Rule1 --> SNS
Rule2 --> Lambda2
Rule3 --> SQS
style Bus fill:#ff9800
style Rule1 fill:#e1f5fe
style Rule2 fill:#e1f5fe
style Rule3 fill:#e1f5fe
See: diagrams/03_domain2_eventbridge_routing.mmd
Diagram Explanation (Detailed): This diagram shows EventBridge's powerful event routing capabilities. EventBridge receives events from three sources: EC2 state changes (AWS service events), S3 object creation (AWS service events), and custom application events. All events flow into the Event Bus, which acts as a central router. EventBridge Rules evaluate each event against pattern matching criteria and route matching events to appropriate targets. Rule 1 matches EC2 "stopped" events and routes them to both a Lambda function (to notify the operations team) and an SNS topic (to send alerts). Rule 2 matches S3 "ObjectCreated" events and routes them to a Lambda function for file processing. Rule 3 matches custom application events and routes them to an SQS queue for asynchronous processing. EventBridge supports complex pattern matching using JSON-based event patterns, allowing you to filter events by specific attributes (e.g., only EC2 instances in production environment, only S3 uploads to specific bucket prefix). Each rule can have up to 5 targets, and EventBridge automatically retries failed deliveries with exponential backoff. EventBridge also provides schema registry to discover event structures and generate code bindings, making it easier to work with events. The service integrates with 90+ AWS services and SaaS applications (Salesforce, Zendesk, etc.), making it the central nervous system for event-driven architectures.
Detailed Example 1: Automated Security Response A company uses EventBridge to automatically respond to security events. When an EC2 instance's security group is modified (CloudTrail event), EventBridge receives the event and evaluates it against a rule that matches "ModifySecurityGroup" actions. The rule routes the event to three targets: (1) Lambda function that checks if the change violates security policies (e.g., opening port 22 to 0.0.0.0/0) and automatically reverts unauthorized changes, (2) SNS topic that notifies the security team via email and Slack, (3) SQS queue that feeds a security audit dashboard. The entire response happens within 5 seconds of the security group change, preventing potential breaches. EventBridge's pattern matching allows filtering to only trigger on high-risk changes (e.g., only alert if port 22, 3389, or 3306 is opened to the internet). This automated response reduces security incident response time from hours (manual detection) to seconds (automated).
Detailed Example 2: Multi-Account Event Aggregation An enterprise with 50 AWS accounts uses EventBridge to centralize monitoring. Each account has an Event Bus that forwards events to a central monitoring account's Event Bus using cross-account event routing. The central account has rules that process events from all accounts: (1) Rule for EC2 state changes routes to Lambda for inventory tracking, (2) Rule for RDS failures routes to SNS for immediate alerts, (3) Rule for S3 access denied events routes to SQS for security analysis. EventBridge's schema registry automatically discovers event structures from all accounts, making it easy to write rules. The central monitoring team can see events from all accounts in one place, reducing operational complexity. EventBridge handles 10,000 events per second across all accounts without performance degradation.
ā Must Know (Critical Facts):
Event pattern matching: JSON-based patterns filter events by attributes (more flexible than SNS filtering)
Multiple targets: Each rule can route to up to 5 targets simultaneously (Lambda, SQS, SNS, Step Functions, etc.)
Schema registry: Automatically discovers event structures and generates code bindings (reduces development time)
Cross-account routing: Events can be routed across AWS accounts (centralized monitoring)
Diagram Explanation (Detailed): This architecture diagram shows a highly available Application Load Balancer (ALB) deployment across three Availability Zones. Users access the application through Route 53, which resolves the domain name to the ALB's DNS name. The ALB is deployed in public subnets across all three AZs (us-east-1a, us-east-1b, us-east-1c), providing automatic failover if an entire AZ fails. Behind the ALB, EC2 instances run in private subnets (no direct internet access) across all three AZs, registered with a Target Group. The ALB continuously performs health checks on each instance (default: every 30 seconds, checking /health endpoint). If an instance fails two consecutive health checks (unhealthy threshold), the ALB stops routing traffic to it and marks it unhealthy. When the instance passes two consecutive health checks (healthy threshold), traffic resumes. The ALB uses round-robin or least outstanding requests algorithm to distribute traffic across healthy instances. If an entire AZ fails (e.g., power outage in us-east-1a), the ALB automatically routes all traffic to instances in the remaining two AZs within seconds. The ALB operates at Layer 7 (HTTP/HTTPS), allowing advanced routing based on URL path, hostname, HTTP headers, and query strings. It also provides SSL/TLS termination, reducing CPU load on backend instances. The ALB supports WebSocket and HTTP/2, making it suitable for modern web applications.
Detailed Example 1: Microservices Routing A company runs a microservices application with three services: user service (/users/), order service (/orders/), and product service (/products/). A single ALB routes traffic to different target groups based on URL path. Requests to example.com/users/ route to the user service target group (5 EC2 instances), requests to /orders/* route to the order service target group (10 EC2 instances - higher traffic), and requests to /products/* route to the product service target group (3 EC2 instances). Each target group has instances across three AZs for high availability. The ALB performs health checks on each service's /health endpoint. When the order service deploys a new version, the ALB's connection draining feature (default 300 seconds) ensures in-flight requests complete before instances are terminated. The ALB handles 10,000 requests per second, automatically scaling its capacity without manual intervention. This architecture reduces costs (one ALB instead of three) and simplifies management (single entry point).
Detailed Example 2: Blue-Green Deployment A company uses ALB for zero-downtime deployments. The production environment (blue) has 10 EC2 instances in one target group receiving 100% of traffic. When deploying a new version, they launch 10 new instances (green) in a second target group. The ALB is configured with weighted target groups: blue (100%), green (0%). After the green instances pass health checks, they gradually shift traffic: blue (90%), green (10%) for 10 minutes to monitor for errors. If metrics look good, they continue: blue (50%), green (50%), then blue (0%), green (100%). If errors occur, they instantly roll back by setting blue (100%), green (0%). The entire deployment takes 30 minutes with zero downtime. The ALB's health checks ensure only healthy instances receive traffic, and connection draining ensures no requests are dropped during the transition.
ā Must Know (Critical Facts):
Layer 7 load balancing: Routes based on HTTP/HTTPS content (URL path, hostname, headers, query strings)
Target types: EC2 instances, IP addresses, Lambda functions, containers (ECS/EKS)
Diagram Explanation (Detailed): This comprehensive diagram compares four disaster recovery strategies, showing the trade-offs between cost, Recovery Time Objective (RTO), and Recovery Point Objective (RPO). Backup & Restore (green) is the most cost-effective strategy, where production data is regularly backed up to S3 in another region. During a disaster, you restore from backups and rebuild infrastructure using CloudFormation or Terraform. This approach has the highest RTO (hours to days) and RPO (hours) because you must restore data and provision resources. Cost is minimal - only S3 storage ($0.023/GB-month) and occasional data transfer. Pilot Light (light orange) maintains core infrastructure components (database with replication) in the DR region but keeps compute resources minimal or stopped. During a disaster, you scale up compute resources (launch EC2 instances, increase RDS capacity). RTO improves to minutes-hours, and RPO to minutes because data is continuously replicated. Cost is moderate - running a small RDS instance and minimal compute. Warm Standby (orange) runs a scaled-down but fully functional environment in the DR region. All components are running but at minimum capacity (e.g., 2 instances instead of 20). During a disaster, you scale up to full capacity using Auto Scaling. RTO is minutes, and RPO is seconds because data replication is real-time. Cost is higher - running all services at reduced capacity. Active-Active (red) runs full production capacity in both regions simultaneously, with Route 53 distributing traffic between them. Both regions serve production traffic, so there's no "failover" - if one region fails, the other continues serving 100% of traffic. RTO and RPO are both seconds. Cost is highest - running full infrastructure in two regions. The choice depends on business requirements: e-commerce might use Warm Standby (RTO < 1 hour), while banking might require Active-Active (RTO < 1 minute).
Detailed Example 1: E-commerce Platform - Warm Standby An e-commerce company generates $10,000 per minute in revenue and can tolerate 15 minutes of downtime (RTO: 15 minutes, RPO: 1 minute). They implement Warm Standby DR strategy. Production Region (us-east-1): 50 EC2 instances behind ALB, RDS Multi-AZ database (db.r5.4xlarge), ElastiCache cluster (3 nodes), S3 for images. DR Region (us-west-2): 5 EC2 instances behind ALB (10% capacity), RDS read replica (db.r5.4xlarge) with automated promotion, ElastiCache cluster (1 node), S3 cross-region replication. The RDS read replica continuously replicates data from production (replication lag < 1 second). During normal operations, the DR region serves no traffic. When us-east-1 fails (detected by Route 53 health checks in 60 seconds), the company executes the DR plan: (1) Promote RDS read replica to primary (2 minutes), (2) Update Route 53 to point to us-west-2 ALB (1 minute), (3) Auto Scaling scales EC2 instances from 5 to 50 (10 minutes). Total RTO: 13 minutes. Data loss is minimal (RPO: 1 minute) because the read replica was nearly synchronized. Monthly DR cost: $2,000 (5 EC2 instances + RDS replica + ElastiCache + data transfer) vs $150,000 potential revenue loss from 15 minutes downtime.
Detailed Example 2: Financial Services - Active-Active A stock trading platform requires zero downtime (RTO: 0 seconds) and zero data loss (RPO: 0 seconds) due to regulatory requirements. They implement Active-Active DR strategy. Region 1 (us-east-1): 100 EC2 instances, Aurora Global Database (primary), ElastiCache, S3. Region 2 (eu-west-1): 100 EC2 instances, Aurora Global Database (secondary with < 1 second replication lag), ElastiCache, S3. Route 53 uses latency-based routing to direct users to the nearest region. Both regions serve production traffic simultaneously. Aurora Global Database replicates data bidirectionally with conflict resolution. When us-east-1 fails, Route 53 health checks detect the failure within 30 seconds and automatically route all traffic to eu-west-1. Users experience no downtime - they're simply routed to the other region. The Aurora secondary is promoted to primary (< 1 minute), and the system continues operating. Data loss is zero because replication lag was < 1 second. Monthly cost: $50,000 (double infrastructure) vs potential $1 million regulatory fines and reputation damage from downtime.
Detailed Example 3: SaaS Application - Pilot Light A SaaS company with 1,000 customers can tolerate 2 hours of downtime (RTO: 2 hours, RPO: 15 minutes). They implement Pilot Light DR strategy. Production Region (us-east-1): 20 EC2 instances, RDS Multi-AZ (db.m5.large), ElastiCache, S3. DR Region (us-west-2): RDS read replica (db.m5.large) continuously replicating, S3 cross-region replication, AMIs for EC2 instances, but no running EC2 instances. During normal operations, only the RDS read replica runs in DR region ($200/month). When us-east-1 fails, the DR plan executes: (1) Promote RDS read replica to primary (2 minutes), (2) Launch 20 EC2 instances from AMIs using CloudFormation (15 minutes), (3) Update Route 53 to point to new ALB (1 minute), (4) Warm up ElastiCache (30 minutes). Total RTO: 48 minutes. Data loss is 15 minutes (last RDS snapshot). Monthly DR cost: $200 vs $5,000 for Warm Standby - significant savings for acceptable RTO.
ā Must Know (Critical Facts):
RTO (Recovery Time Objective): Maximum acceptable downtime (how long to recover)
RPO (Recovery Point Objective): Maximum acceptable data loss (how much data can be lost)
Diagram Explanation (Detailed): This diagram illustrates Route 53's failover routing policy for disaster recovery. During normal operation (top), Route 53 continuously performs health checks on the Primary Region (us-east-1) every 30 seconds. When health checks pass, Route 53 returns the primary record's IP address to users, directing all traffic to us-east-1. The Secondary Region (us-west-2) is on standby, also monitored by health checks but receiving no traffic. When the primary region fails (bottom), Route 53 detects the failure after missing consecutive health checks (configurable, typically 3 failures = 90 seconds). Route 53 automatically updates DNS responses to return the secondary record's IP address, directing all traffic to us-west-2. Users experience a brief interruption (DNS TTL duration, typically 60 seconds) as their DNS caches expire and refresh with the new IP. The failover is automatic - no manual intervention required. Route 53 continues monitoring both regions. When the primary region recovers and passes health checks, Route 53 can automatically fail back (if configured) or wait for manual failback. Health checks can monitor HTTP/HTTPS endpoints, TCP connections, or CloudWatch alarms, providing flexible failure detection. Route 53's global network of DNS servers ensures health check results are consistent worldwide, preventing split-brain scenarios where some users see the primary as healthy while others see it as failed.
Detailed Example 1: Web Application Failover A media streaming company runs its application in us-east-1 (primary) and us-west-2 (secondary). Route 53 is configured with failover routing: Primary record points to us-east-1 ALB (priority 1), Secondary record points to us-west-2 ALB (priority 2). Health checks monitor the /health endpoint on both ALBs every 30 seconds. During normal operation, all 1 million users are routed to us-east-1. At 2 AM, a network issue causes us-east-1 to become unreachable. Route 53 health checks fail three consecutive times (90 seconds). Route 53 automatically updates DNS responses to return the us-west-2 ALB IP address. Users with expired DNS caches (TTL 60 seconds) immediately get the new IP and connect to us-west-2. Users with cached DNS entries experience errors for up to 60 seconds until their cache expires. Within 3 minutes, all users are successfully streaming from us-west-2. The company's monitoring team receives a CloudWatch alarm about the failover and investigates us-east-1. After fixing the network issue, they manually fail back to us-east-1 during a maintenance window to avoid another brief interruption.
ā Must Know (Critical Facts):
Failover routing: Automatically routes traffic to secondary when primary fails (active-passive DR)
Health check interval: 30 seconds (standard) or 10 seconds (fast), configurable
Failure threshold: Typically 3 consecutive failures before marking unhealthy (90 seconds with 30s interval)
DNS TTL impact: Users experience interruption equal to TTL duration (recommend 60 seconds for DR)
Health check types: HTTP/HTTPS endpoint, TCP connection, CloudWatch alarm, calculated health check
Automatic failback: Can be configured to automatically fail back when primary recovers (or manual)
Multi-region failover: Can chain multiple failover records (primary ā secondary ā tertiary)
Chapter Summary
What We Covered
ā High Availability Fundamentals: Multi-AZ deployments, Availability Zones, fault tolerance
ā Auto Scaling: Dynamic, predictive, and scheduled scaling policies for elastic capacity
ā Load Balancing: ALB, NLB, GWLB - when to use each type and their features
ā Decoupling Patterns: SQS, SNS, EventBridge for building loosely coupled architectures
ā Serverless Architectures: Lambda, Fargate, API Gateway for event-driven systems
ā Container Orchestration: ECS and EKS for managing containerized applications
ā Disaster Recovery: Four DR strategies (backup/restore, pilot light, warm standby, active-active)
ā RTO/RPO: Understanding recovery objectives and selecting appropriate DR strategies
ā Multi-Region Architectures: Global databases, cross-region replication, Route 53 failover
ā Monitoring & Observability: CloudWatch, X-Ray, Health Dashboard for system visibility
Critical Takeaways
Multi-AZ is for HA, Read Replicas are for performance: Don't confuse these two concepts
Auto Scaling requires proper health checks: ELB health checks can trigger instance replacement
ALB for HTTP/HTTPS, NLB for TCP/UDP: Choose based on protocol and performance needs
SQS for decoupling, SNS for fan-out, EventBridge for routing: Each has specific use cases
Lambda scales automatically: No need to manage servers or capacity
ECS for AWS-native, EKS for Kubernetes: Choose based on team expertise and requirements
DR strategy depends on RTO/RPO: Lower RTO/RPO = higher cost
Aurora Global Database for active-active: < 1 second replication lag across regions
Route 53 failover for automatic DR: Health checks trigger automatic failover
Monitoring is essential: Can't improve what you don't measure
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between Multi-AZ and Read Replicas
I can design Auto Scaling policies for different workload patterns
I can choose the appropriate load balancer type for a given scenario
I can design decoupled architectures using SQS, SNS, and EventBridge
I understand when to use Lambda vs Fargate vs EC2
I can explain the four DR strategies and their RTO/RPO characteristics
I can calculate appropriate RTO/RPO for business requirements
I can design multi-region architectures with automatic failover
I can implement monitoring and observability for distributed systems
I can troubleshoot common resilience issues (scaling, failover, health checks)
ā Task 2.2 - Highly Available and Fault-Tolerant Architectures: Multi-AZ deployments, multi-Region strategies, Route 53 routing policies, disaster recovery (backup/restore, pilot light, warm standby, active-active), RDS Multi-AZ, Aurora Global Database, automated failover
Critical Takeaways
Loose Coupling is Essential for Resilience: Decouple components using queues (SQS), topics (SNS), and event buses (EventBridge). When one component fails, others continue operating independently.
Design for Failure: Assume everything fails. Use multiple Availability Zones for high availability, multiple Regions for disaster recovery, and implement automatic failover mechanisms.
Horizontal Scaling Over Vertical: Scale out (add more instances) rather than scale up (bigger instances). Use Auto Scaling groups with load balancers to distribute traffic across multiple instances.
Choose the Right DR Strategy: Match your disaster recovery strategy to your RPO/RTO requirements:
Backup/Restore: Hours (cheapest)
Pilot Light: 10s of minutes
Warm Standby: Minutes
Active-Active: Seconds (most expensive)
Leverage Managed Services: Use managed services like RDS Multi-AZ, Aurora, DynamoDB, and ECS Fargate to reduce operational overhead and increase resilience.
Event-Driven Architectures Scale Better: Use asynchronous communication patterns (SQS, SNS, EventBridge) instead of synchronous (direct API calls) for better scalability and fault tolerance.
Load Balancers are Critical: ALB for HTTP/HTTPS traffic with advanced routing, NLB for TCP/UDP with ultra-low latency, GLB for third-party virtual appliances.
Self-Assessment Checklist
Test yourself before moving to Domain 3. You should be able to:
Scalable and Loosely Coupled Architectures:
Design a queue-based architecture using SQS for decoupling
Implement pub/sub pattern using SNS for fanout
Configure EventBridge rules for event-driven workflows
Choose between SQS Standard (best-effort ordering) and FIFO (guaranteed ordering)
Design Lambda functions with proper concurrency limits
Implement API Gateway with caching and throttling
Choose between ALB (Layer 7) and NLB (Layer 4) for different use cases
Design microservices architecture using ECS or EKS
Implement Step Functions for workflow orchestration
Use ElastiCache (Redis or Memcached) for caching strategies
Highly Available and Fault-Tolerant Architectures:
Design multi-AZ deployments for high availability
Implement multi-Region architectures for disaster recovery
Configure Route 53 health checks and failover routing
Choose appropriate disaster recovery strategy based on RPO/RTO
Set up RDS Multi-AZ for automatic failover
Configure Aurora Global Database for cross-region replication
Implement DynamoDB Global Tables for multi-region active-active
Design Auto Scaling policies (target tracking, step scaling, scheduled)
Configure S3 Cross-Region Replication for data durability
Use CloudWatch alarms for automated recovery actions
Practice Questions
Try these from your practice test bundles:
Domain 2 Bundle 1: Questions 1-50 (scalability and loose coupling)
Loose Coupling: Use SQS for asynchronous processing, SNS for pub/sub, EventBridge for event-driven architectures - decouple components to improve resilience
Multi-AZ for HA: Deploy across multiple Availability Zones for fault tolerance - RDS Multi-AZ (1-2 min failover), Aurora (30 sec failover), ALB distributes traffic
Disaster Recovery: Choose strategy based on RTO/RPO - Backup/Restore (cheapest, hours), Pilot Light (minutes), Warm Standby (seconds), Multi-Site (no downtime)
Auto Scaling: Use dynamic scaling for variable workloads, predictive scaling for known patterns, scheduled scaling for predictable changes
Serverless for Scalability: Lambda scales automatically (1000 concurrent default), Fargate removes server management, API Gateway handles millions of requests
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between SQS Standard and FIFO queues
I understand when to use SNS vs SQS vs EventBridge
I know how to design a loosely coupled architecture using queues
I can describe Multi-AZ deployment patterns for RDS and Aurora
I understand the four disaster recovery strategies and when to use each
I know how to configure Auto Scaling with different scaling policies
I can explain Route 53 routing policies (failover, weighted, latency, geolocation)
I understand Lambda concurrency and how to handle throttling
I know the difference between ALB, NLB, and GWLB
I can design a highly available, fault-tolerant architecture
Route 53: Health checks, failover routing, multi-region support
Disaster Recovery:
Strategy
RTO
RPO
Cost
Use Case
Backup/Restore
Hours
Hours
$
Non-critical, cost-sensitive
Pilot Light
10-30 min
Minutes
$$
Core systems only
Warm Standby
Minutes
Seconds
$$$
Business-critical
Multi-Site
Real-time
None
$$$$
Mission-critical
Auto Scaling Policies:
Target Tracking: Maintain metric at target (e.g., 70% CPU)
Step Scaling: Scale based on CloudWatch alarm thresholds
Scheduled: Scale at specific times (e.g., business hours)
Predictive: ML-based forecasting for known patterns
Decision Points:
Need message queue? ā SQS Standard (high throughput) or FIFO (ordering)
Need pub/sub? ā SNS
Need event routing? ā EventBridge
Need API management? ā API Gateway
Need serverless compute? ā Lambda (functions) or Fargate (containers)
Need load balancing? ā ALB (HTTP) or NLB (TCP) or GWLB (appliances)
Need high availability? ā Multi-AZ deployment + Auto Scaling
Need disaster recovery? ā Choose based on RTO/RPO requirements
Chapter Summary
What We Covered
This chapter covered Domain 2: Design Resilient Architectures (26% of the exam), the second most heavily weighted domain. We explored two major task areas:
ā Task 2.1: Design Scalable and Loosely Coupled Architectures
Microservices design principles and patterns
Event-driven architectures with SNS, SQS, EventBridge
ā Task 2.2: Design Highly Available and Fault-Tolerant Architectures
Multi-AZ and multi-region architectures
Disaster recovery strategies: Backup/Restore, Pilot Light, Warm Standby, Multi-Site
Auto Scaling for elasticity and availability
Route 53 health checks and failover routing
Database high availability: RDS Multi-AZ, Aurora, DynamoDB global tables
Immutable infrastructure and blue/green deployments
Monitoring and observability with CloudWatch and X-Ray
Critical Takeaways
Design for failure: Assume everything will fail. Use Multi-AZ deployments, Auto Scaling, and health checks to automatically recover from failures.
Loose coupling is essential: Decouple components with SQS queues, SNS topics, and EventBridge. This allows independent scaling and failure isolation.
Horizontal scaling over vertical: Add more instances (scale out) rather than bigger instances (scale up). Use Auto Scaling groups and load balancers.
Choose the right DR strategy: Match RTO/RPO requirements to cost. Backup/Restore is cheapest but slowest. Multi-Site is fastest but most expensive.
Stateless applications scale better: Store session state in ElastiCache or DynamoDB, not on EC2 instances. This enables unlimited horizontal scaling.
Use managed services: RDS Multi-AZ, Aurora, DynamoDB, and Lambda handle availability automatically. Don't build what AWS already provides.
Health checks are critical: Use Route 53 health checks, ALB target health checks, and Auto Scaling health checks to detect and replace failed components.
Async communication for resilience: Use SQS queues between components to handle traffic spikes and component failures gracefully.
Multi-region for disaster recovery: Use Route 53 failover routing, S3 cross-region replication, and DynamoDB global tables for geographic redundancy.
Monitor everything: Use CloudWatch metrics, alarms, and dashboards. Use X-Ray for distributed tracing. Set up automated responses to failures.
Key Services Quick Reference
Compute & Scaling:
EC2 Auto Scaling: Automatically adjust capacity based on demand
Lambda: Serverless functions, automatic scaling, pay per invocation
Fargate: Serverless containers, no server management
ECS: Container orchestration on EC2 or Fargate
EKS: Managed Kubernetes for complex container workloads
Elastic Beanstalk: PaaS for web applications, handles infrastructure
Stateless application ā Store sessions in ElastiCache or DynamoDB
Congratulations! You've completed Domain 2: Design Resilient Architectures. This is the second-largest domain (26% of the exam), and mastering resilience patterns is essential for real-world AWS architectures.
This chapter covered the essential concepts for designing resilient architectures on AWS, which accounts for 26% of the SAA-C03 exam. We explored two major task areas:
Task 2.1: Scalable and Loosely Coupled Architectures
ā Messaging services (SQS, SNS, EventBridge) for decoupling components
ā Serverless compute (Lambda, Fargate) for elastic scaling
ā Container orchestration (ECS, EKS) for microservices
ā API Gateway for RESTful and WebSocket APIs
ā Load balancing strategies (ALB, NLB, GLB)
ā Auto Scaling policies and lifecycle management
ā Caching strategies (CloudFront, ElastiCache)
ā Step Functions for workflow orchestration
ā Event-driven architecture patterns
Task 2.2: Highly Available and Fault-Tolerant Architectures
ā Multi-AZ deployments for high availability
ā Multi-region architectures for disaster recovery
ā Route 53 routing policies and health checks
ā RDS Multi-AZ and Aurora Global Database
ā DynamoDB Global Tables for multi-region replication
ā S3 Cross-Region Replication for data durability
ā Disaster recovery strategies (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
ā Backup and restore strategies using AWS Backup
ā Monitoring and observability with CloudWatch and X-Ray
Critical Takeaways
Loose Coupling: Always decouple components using SQS queues, SNS topics, or EventBridge to prevent cascading failures and enable independent scaling.
Message Ordering: Use SQS FIFO queues when strict ordering is required; use Standard queues for maximum throughput when order doesn't matter.
Fan-Out Pattern: SNS + SQS fan-out enables one message to trigger multiple independent processing workflows without tight coupling.
Multi-AZ vs Multi-Region: Multi-AZ protects against AZ failures (automatic failover in minutes); Multi-Region protects against region failures (requires manual or automated failover).
RTO and RPO: Recovery Time Objective (how long to recover) and Recovery Point Objective (how much data loss acceptable) determine your DR strategy choice.
Auto Scaling Policies: Target Tracking for steady-state metrics, Step Scaling for threshold-based scaling, Scheduled for predictable patterns, Predictive for ML-based forecasting.
Load Balancer Selection: ALB for HTTP/HTTPS with advanced routing, NLB for TCP/UDP with ultra-low latency, GLB for third-party appliances.
Serverless Benefits: Lambda and Fargate eliminate server management, scale automatically, and charge only for actual usage (no idle costs).
State Management: Store session state in ElastiCache or DynamoDB (not on EC2 instances) to enable stateless application design and horizontal scaling.
Health Checks: Implement health checks at multiple layers (Route 53, ELB, Auto Scaling) to detect and route around failures automatically.
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Messaging and Decoupling:
Explain the difference between SQS Standard and FIFO queues
Describe when to use SQS vs SNS vs EventBridge
Design an SNS + SQS fan-out architecture
Configure SQS visibility timeout and dead-letter queues
Implement long polling to reduce costs
Serverless and Containers:
Explain when to use Lambda vs Fargate vs ECS on EC2
Configure Lambda concurrency limits and reserved concurrency
Design Step Functions workflows with error handling
Choose between ECS and EKS for container orchestration
Implement API Gateway with Lambda integration
Load Balancing and Auto Scaling:
Select the appropriate load balancer type (ALB vs NLB vs GLB)
Configure ALB path-based and host-based routing
Design Auto Scaling policies for different workload patterns
Implement lifecycle hooks for graceful instance termination
Configure cross-zone load balancing
High Availability:
Design Multi-AZ deployments for RDS, EFS, and ALB
Explain RDS Multi-AZ automatic failover process
Configure Aurora read replicas for read scaling
Implement Route 53 health checks and failover routing
Design stateless applications with external session storage
Disaster Recovery:
Calculate RTO and RPO for different DR strategies
Choose appropriate DR strategy based on business requirements
Design Backup and Restore strategy with AWS Backup
Implement Pilot Light architecture for critical systems
Configure Aurora Global Database for multi-region DR
Set up DynamoDB Global Tables for active-active replication
Design S3 Cross-Region Replication for data durability
Monitoring and Troubleshooting:
Configure CloudWatch alarms for Auto Scaling triggers
Use X-Ray for distributed tracing and bottleneck identification
Implement CloudWatch Logs for centralized logging
Monitor service quotas and request limit increases
Design retry strategies with exponential backoff
Practice Questions
Try these from your practice test bundles:
Domain 2 Bundle 1: Questions 1-25 (Focus: Messaging and decoupling)
Domain 2 Bundle 2: Questions 26-50 (Focus: High availability and DR)
Full Practice Test 1: Domain 2 questions (Mixed difficulty)
Expected score: 70%+ to proceed confidently
If you scored below 70%:
Review sections on messaging patterns (SQS, SNS, EventBridge)
Active-Active: Full capacity both regions, seconds RTO, highest cost
Auto Scaling:
Target Tracking: Maintain metric at target (e.g., 70% CPU)
Step Scaling: Scale based on alarm thresholds
Scheduled: Scale at specific times
Predictive: ML-based forecasting
Common Patterns:
Decouple ā SQS queue between components
Fan-out ā SNS + multiple SQS subscriptions
Ordering ā SQS FIFO with message group ID
Workflow ā Step Functions state machine
Stateless ā Store sessions in ElastiCache/DynamoDB
Global ā Route 53 + CloudFront + Multi-Region
Congratulations! You've completed Chapter 2: Design Resilient Architectures. You now understand how to build scalable, loosely coupled, highly available, and fault-tolerant systems on AWS.
Active-Active: Full capacity both regions, seconds RTO, highest cost
Auto Scaling:
Target Tracking: Maintain metric at target (e.g., 70% CPU)
Step Scaling: Scale based on alarm thresholds
Scheduled: Scale at specific times
Predictive: ML-based forecasting
Common Patterns:
Decouple ā SQS queue between components
Fan-out ā SNS + multiple SQS subscriptions
Ordering ā SQS FIFO with message group ID
Workflow ā Step Functions state machine
Stateless ā Store sessions in ElastiCache/DynamoDB
Global ā Route 53 + CloudFront + Multi-Region
Chapter Summary
What We Covered
This chapter covered the two critical task areas for designing resilient architectures on AWS:
ā Task 2.1: Scalable and Loosely Coupled Architectures
Decoupling patterns with SQS, SNS, and EventBridge
Serverless architectures with Lambda and Fargate
Container orchestration with ECS and EKS
API Gateway for RESTful and WebSocket APIs
Load balancing with ALB, NLB, and GLB
Caching strategies with CloudFront and ElastiCache
Microservices design patterns
Event-driven architectures
Auto Scaling for elastic compute
Step Functions for workflow orchestration
ā Task 2.2: Highly Available and Fault-Tolerant Architectures
Multi-AZ deployments for high availability
Multi-region architectures for disaster recovery
Route 53 routing policies for failover and load distribution
RDS Multi-AZ and Aurora Global Database
DynamoDB Global Tables for multi-region replication
S3 Cross-Region Replication (CRR)
Disaster recovery strategies (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
Health checks and automated failover
Backup strategies with AWS Backup
Monitoring and observability with CloudWatch and X-Ray
Critical Takeaways
Decouple Everything: Use queues (SQS) and topics (SNS) to decouple components. This prevents cascading failures and enables independent scaling.
Design for Failure: Assume everything will fail. Implement health checks, automatic failover, and retry logic. Use multiple Availability Zones.
Scale Horizontally: Add more instances rather than bigger instances. Use Auto Scaling groups with target tracking policies.
Choose the Right DR Strategy: Match your RTO/RPO requirements to cost. Backup/Restore is cheapest but slowest. Active-Active is fastest but most expensive.
Use Managed Services: Let AWS handle the heavy lifting. RDS Multi-AZ, Aurora, DynamoDB, and S3 provide built-in high availability.
Implement Caching: Cache at every layer - CloudFront for edge, ElastiCache for application, DAX for DynamoDB, RDS read replicas for databases.
Stateless Applications: Store session state externally (ElastiCache, DynamoDB). This enables easy horizontal scaling and failover.
Monitor Everything: Use CloudWatch for metrics and alarms. Use X-Ray for distributed tracing. Set up composite alarms for complex failure scenarios.
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Decoupling and Messaging:
Explain when to use SQS Standard vs FIFO queues
Design a fan-out pattern with SNS and SQS
Configure SQS visibility timeout and dead-letter queues
Implement event-driven architecture with EventBridge
Use Step Functions to orchestrate complex workflows
Design asynchronous processing with Lambda and SQS
Implement message filtering with SNS
Handle ordering requirements with SQS FIFO
Serverless and Containers:
Design serverless applications with Lambda and API Gateway
Configure Lambda concurrency limits and reserved capacity
Choose between ECS and EKS for container orchestration
Decide when to use Fargate vs EC2 launch type
Implement service discovery in ECS
Configure Lambda event source mappings
Use Lambda layers for code reuse
Design Lambda destinations for success/failure handling
Load Balancing and Auto Scaling:
Choose between ALB, NLB, and GLB for different use cases
Configure ALB path-based and host-based routing
Set up health checks for load balancers
Design Auto Scaling policies (target tracking, step, scheduled)
Implement lifecycle hooks for graceful shutdown
Configure cross-zone load balancing
Use NLB for ultra-low latency requirements
Implement sticky sessions with ALB
High Availability:
Design multi-AZ architectures for high availability
Configure RDS Multi-AZ for automatic failover
Implement Aurora Global Database for multi-region
Set up DynamoDB Global Tables
Configure S3 Cross-Region Replication
Use Route 53 health checks and failover routing
Implement EFS for shared file storage across AZs
Design for no single points of failure
Disaster Recovery:
Calculate RTO and RPO for business requirements
Choose appropriate DR strategy (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
Implement automated backups with AWS Backup
Configure cross-region backup replication
Design pilot light architecture with minimal running resources
This chapter covered the two critical task areas for designing resilient architectures on AWS:
ā Task 2.1: Scalable and Loosely Coupled Architectures
Microservices vs monolithic architectures
Event-driven architectures with EventBridge
Message queuing with SQS (Standard and FIFO)
Pub/sub messaging with SNS
API Gateway for RESTful and WebSocket APIs
Serverless compute with Lambda
Container orchestration with ECS and EKS
Workflow orchestration with Step Functions
Caching strategies with CloudFront and ElastiCache
Load balancing with ALB, NLB, and GLB
Auto Scaling for elastic capacity
ā Task 2.2: Highly Available and Fault-Tolerant Architectures
Multi-AZ deployments for high availability
Multi-region architectures for disaster recovery
Route 53 routing policies for failover and load distribution
RDS Multi-AZ and Aurora for database resilience
DynamoDB Global Tables for multi-region replication
S3 Cross-Region Replication for data durability
Disaster recovery strategies (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
RTO and RPO considerations
Health checks and monitoring with CloudWatch
Automated failover and recovery
Critical Takeaways
Loose Coupling is Key: Decouple components using queues (SQS), topics (SNS), and event buses (EventBridge). This allows independent scaling and failure isolation.
Stateless Design: Design applications to be stateless. Store session state in ElastiCache or DynamoDB, not on EC2 instances. This enables horizontal scaling.
Multi-AZ by Default: Always deploy across multiple Availability Zones. Use RDS Multi-AZ, Aurora with multiple replicas, ALB across AZs, and Auto Scaling groups spanning AZs.
Choose the Right DR Strategy: Match your DR strategy to your RTO/RPO requirements. Backup/Restore is cheapest but slowest. Active-Active is fastest but most expensive.
Automate Everything: Use Auto Scaling, health checks, and automated failover. Don't rely on manual intervention during failures.
Cache Aggressively: Use CloudFront for edge caching, ElastiCache for application caching, and DAX for DynamoDB. Caching reduces load and improves performance.
Message Ordering Matters: Use SQS FIFO when order matters (e.g., financial transactions). Use Standard SQS when order doesn't matter and you need maximum throughput.
Serverless for Scalability: Lambda and Fargate automatically scale to handle load. No need to provision capacity in advance.
Self-Assessment Checklist
Test yourself before moving on:
I can explain the difference between SQS Standard and FIFO
I understand when to use SNS vs SQS vs EventBridge
I can design a microservices architecture with loose coupling
I know how to implement event-driven patterns
I understand Lambda concurrency and scaling limits
I can design a multi-AZ architecture for high availability
I know the four disaster recovery strategies and when to use each
I understand RTO and RPO and how to calculate them
I can configure Route 53 for failover routing
I know the difference between RDS Multi-AZ and read replicas
I understand Aurora's high availability features
I can design a caching strategy for different use cases
Practice Questions
Try these from your practice test bundles:
Domain 2 Bundle 1: Questions 1-25 (Scalability and loose coupling)
Exam Weight: 24% of exam questions (approximately 16 out of 65 questions)
Section 1: High-Performing Storage Solutions
Introduction
The problem: Different workloads have vastly different storage requirements. A database needs low-latency block storage with high IOPS. A data lake needs cost-effective object storage for petabytes of data. A shared file system needs concurrent access from multiple servers. Using the wrong storage type results in poor performance, high costs, or both.
The solution: AWS provides multiple storage services optimized for different use cases. Understanding the characteristics of each service (performance, durability, cost, access patterns) enables you to choose the right storage for each workload.
Why it's tested: Storage performance directly impacts application performance. This domain represents 24% of the exam and tests your ability to select and configure storage services for optimal performance and cost.
Core Concepts
Amazon S3 Performance Optimization
What it is: Amazon S3 is object storage built to store and retrieve any amount of data from anywhere. S3 automatically scales to handle high request rates and provides 99.999999999% (11 9's) durability.
Why it exists: Traditional file systems don't scale to petabytes of data or millions of requests per second. S3 provides virtually unlimited scalability with built-in redundancy, versioning, and lifecycle management.
Real-world analogy: S3 is like a massive warehouse with infinite capacity. You can store anything (objects), organize with labels (metadata and tags), and retrieve items instantly. The warehouse automatically expands as you add more items, and items are replicated to multiple locations for safety.
S3 Performance Characteristics:
Request Rate Limits (per prefix):
GET/HEAD: 5,500 requests per second per prefix
PUT/COPY/POST/DELETE: 3,500 requests per second per prefix
Prefix: Any string between bucket name and object name
What it is: Amazon Elastic Block Store (EBS) provides block-level storage volumes for EC2 instances. EBS volumes are network-attached storage that persist independently of instance lifetime.
Why it exists: Instance store (ephemeral storage) is lost when instance stops. Applications need persistent storage that survives instance failures, can be backed up (snapshots), and can be attached to different instances.
Real-world analogy: EBS is like an external hard drive that you can plug into different computers. The drive retains data even when unplugged. You can make copies (snapshots) and create new drives from those copies.
EBS Volume Types:
General Purpose SSD (gp3) - Balanced price/performance:
Snapshots normally have performance penalty on first access (lazy loading)
Fast Snapshot Restore eliminates this penalty
Cost: $0.75 per snapshot per AZ per hour
Use case: Disaster recovery, quick instance launches
Amazon EFS Performance
What it is: Amazon Elastic File System (EFS) is a fully managed, elastic, shared file system for Linux workloads. Multiple EC2 instances can access the same EFS file system simultaneously.
Why it exists: EBS volumes can only be attached to one instance at a time. Applications that need shared file access (web servers serving same content, data processing pipelines, content management systems) require a shared file system.
Real-world analogy: EFS is like a shared network drive in an office. Multiple employees (EC2 instances) can access the same files simultaneously. When one person updates a file, others see the changes immediately. The drive automatically expands as you add more files.
EFS Performance Modes:
General Purpose (default):
Latency: Low latency (single-digit milliseconds)
Throughput: Up to 7,000 file operations per second
Use Case: Web serving, content management, development
Max I/O:
Latency: Higher latency (tens of milliseconds)
Throughput: >7,000 file operations per second
Use Case: Big data, media processing, high parallelism
Independent: Throughput independent of storage size
Cost: $6/MB/s-month
Use Case: Consistent high throughput needed
Elastic (recommended):
Automatic: Scales throughput automatically based on workload
Up to: 3 GB/s reads, 1 GB/s writes
Cost: Pay for throughput used (no provisioning)
Use Case: Unpredictable workloads, simplicity
Detailed Example 3: Shared Web Content with EFS
Scenario: You're running a WordPress site on multiple EC2 instances behind an ALB. All instances need access to the same uploaded media files (images, videos). Requirements:
Shared access from all web servers
Automatic scaling (don't want to manage storage)
Cost-effective
Architecture:
ALB: Distributes traffic to web servers
Auto Scaling Group: 2-10 EC2 instances
EFS: Shared file system for WordPress uploads
RDS: Database (separate from file storage)
Implementation:
Step 1: Create EFS File System:
# Create EFS file system
aws efs create-file-system \
--performance-mode generalPurpose \
--throughput-mode elastic \
--encrypted \
--tags Key=Name,Value=wordpress-media
# Create mount targets in each AZ
aws efs create-mount-target \
--file-system-id fs-12345678 \
--subnet-id subnet-1a \
--security-groups sg-efs
aws efs create-mount-target \
--file-system-id fs-12345678 \
--subnet-id subnet-1b \
--security-groups sg-efs
# Install EFS mount helper
sudo yum install -y amazon-efs-utils
# Create mount point
sudo mkdir -p /var/www/html/wp-content/uploads
# Mount EFS
sudo mount -t efs -o tls fs-12345678:/ /var/www/html/wp-content/uploads
# Add to /etc/fstab for automatic mount on boot
echo "fs-12345678:/ /var/www/html/wp-content/uploads efs _netdev,tls 0 0" | sudo tee -a /etc/fstab
EBS: Would need to sync files between instances (complex, error-prone)
EBS: Each instance needs separate volume (100 GB Ć 10 instances = 1 TB)
EBS Cost: 1 TB Ć $0.10 = $100/month
EFS Savings: $68.75/month (69% reduction)
Section 2: High-Performing Compute Solutions
Introduction
The problem: Different workloads have different compute requirements. A web server needs consistent CPU. A batch job needs high CPU for short bursts. A machine learning model needs GPU acceleration. Using the wrong compute type results in poor performance or wasted money.
The solution: AWS provides multiple compute options optimized for different workloads. Understanding instance families, sizing, and pricing models enables you to choose the right compute for each workload.
Why it's tested: Compute is the foundation of most applications. This section tests your ability to select appropriate instance types, configure auto scaling, and optimize compute costs while maintaining performance.
Core Concepts
EC2 Instance Types and Families
What they are: EC2 instance types are combinations of CPU, memory, storage, and networking capacity. Instance families are groups of instance types optimized for specific workloads.
Why they exist: One size doesn't fit all. A database needs lots of memory. A video encoder needs powerful CPU. A machine learning model needs GPU. Instance families provide optimized hardware for each use case.
Real-world analogy: Instance types are like vehicles. A sports car (compute-optimized) is fast but has little cargo space. A truck (memory-optimized) carries heavy loads but isn't fast. An SUV (general purpose) balances both. You choose based on your needs.
Instance Families:
General Purpose (T, M, A):
Balance: CPU, memory, networking
T3/T3a: Burstable CPU (baseline + burst credits)
Use case: Web servers, dev/test, small databases
Cost: $0.0104/hour (t3.medium)
M5/M5a: Consistent performance
Use case: Application servers, medium databases
Cost: $0.096/hour (m5.xlarge)
M6i: Latest generation (Intel Ice Lake)
Use case: General workloads, best price/performance
Cost: $0.192/hour (m6i.xlarge)
Compute Optimized (C):
High CPU: High CPU-to-memory ratio
C5/C5a: Intel/AMD processors
Use case: Batch processing, media transcoding, gaming servers
Cost: $0.085/hour (c5.xlarge)
C6i: Latest generation
Use case: High-performance computing, scientific modeling
Cost: $0.17/hour (c6i.xlarge)
Memory Optimized (R, X, Z):
High Memory: High memory-to-CPU ratio
R5/R5a: General memory-intensive
Use case: In-memory databases (Redis, Memcached), big data
Cost: $0.252/hour (r5.xlarge)
X1e: Extreme memory (up to 3,904 GB)
Use case: SAP HANA, in-memory databases
Cost: $26.688/hour (x1e.32xlarge)
Z1d: High frequency + memory
Use case: Electronic design automation, gaming
Cost: $0.744/hour (z1d.xlarge)
Storage Optimized (I, D, H):
High I/O: NVMe SSD instance store
I3/I3en: High IOPS, low latency
Use case: NoSQL databases, data warehousing
Cost: $0.312/hour (i3.xlarge)
D2: Dense HDD storage
Use case: MapReduce, Hadoop, log processing
Cost: $0.69/hour (d2.xlarge)
Accelerated Computing (P, G, F):
GPU/FPGA: Specialized processors
P3: NVIDIA V100 GPUs
Use case: Machine learning training, HPC
Cost: $3.06/hour (p3.2xlarge)
G4: NVIDIA T4 GPUs
Use case: ML inference, graphics workstations
Cost: $1.20/hour (g4dn.xlarge)
F1: FPGA
Use case: Genomics, financial analytics
Cost: $1.65/hour (f1.2xlarge)
Instance Sizing:
nano: 0.5 vCPU, 0.5 GB RAM
micro: 1 vCPU, 1 GB RAM
small: 1 vCPU, 2 GB RAM
medium: 2 vCPU, 4 GB RAM
large: 2 vCPU, 8 GB RAM
xlarge: 4 vCPU, 16 GB RAM
2xlarge: 8 vCPU, 32 GB RAM
4xlarge: 16 vCPU, 64 GB RAM
(continues to 96xlarge for some families)
Detailed Example 4: Right-Sizing EC2 Instances
Scenario: You're running a web application on m5.2xlarge instances (8 vCPU, 32 GB RAM). CloudWatch shows:
Week 2: Traffic spike, Auto Scaling adds 4 more instances ā Handles load
Week 3: Traffic normal, scales back to 4 instances ā Cost optimized
Result:
Performance: Same (adequate CPU/memory)
Cost: $2,803 ā $561/month (80% savings)
Scalability: Still scales to 12 instances during peaks
Section 3: High-Performing Database Solutions
Introduction
The problem: Databases are often the performance bottleneck in applications. Slow queries, insufficient IOPS, connection limits, and lack of caching can degrade application performance. Choosing the wrong database type or configuration results in poor performance and high costs.
The solution: AWS provides multiple database services optimized for different data models and access patterns. Understanding database types, performance tuning, caching strategies, and read scaling enables you to build high-performing data layers.
Why it's tested: Database performance is critical for most applications. This section tests your ability to select appropriate database services, configure for performance, and implement caching strategies.
Core Concepts
Amazon RDS Performance Optimization
What it is: Amazon RDS is a managed relational database service supporting MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server. RDS handles provisioning, patching, backup, and recovery.
Why it exists: Managing database servers is complex - patching, backups, replication, failover. RDS automates these tasks, allowing you to focus on application development and performance tuning.
RDS Performance Factors:
1. Instance Type:
db.t3: Burstable CPU (dev/test, small workloads)
db.m5: General purpose (balanced CPU/memory)
db.r5: Memory optimized (large datasets, caching)
db.x1e: Extreme memory (SAP HANA, in-memory)
2. Storage Type:
General Purpose SSD (gp3): 3,000-16,000 IOPS, 125-1,000 MB/s
Provisioned IOPS SSD (io1): Up to 64,000 IOPS, 1,000 MB/s
Magnetic: Legacy, not recommended
3. Read Replicas:
Asynchronous replication from primary
Offload read traffic from primary
Up to 15 read replicas per primary
Can be in different regions
4. Connection Pooling:
RDS Proxy manages connection pool
Reduces connection overhead
Improves scalability
Detailed Example 5: Database Performance Tuning
Scenario: You're running a MySQL database on RDS. Performance issues:
# Before: Direct connection to RDS
import pymysql
# Write connection (primary)
write_conn = pymysql.connect(
host='mydb.abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='password',
database='myapp'
)
# After: Connection through RDS Proxy
write_conn = pymysql.connect(
host='mydb-proxy.proxy-abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='password',
database='myapp'
)
# Read connection (proxy distributes to replicas)
read_conn = pymysql.connect(
host='mydb-proxy.proxy-abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='password',
database='myapp'
)
# Application logic
def get_user(user_id):
cursor = read_conn.cursor() # Use read connection
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
return cursor.fetchone()
def update_user(user_id, name):
cursor = write_conn.cursor() # Use write connection
cursor.execute("UPDATE users SET name = %s WHERE id = %s", (name, user_id))
write_conn.commit()
Performance Results:
Before:
Primary CPU: 90%
IOPS: 300 (throttled)
Query latency: 500ms (slow)
Connections: 500 (high overhead)
After:
Primary CPU: 30% (writes only)
Replica 1 CPU: 35% (reads)
Replica 2 CPU: 35% (reads)
IOPS: 3,000 (no throttling)
Query latency: 50ms (10x faster)
Connections: 50 (pooled by RDS Proxy)
Cost:
Storage upgrade: $10 ā $40/month (+$30)
Read replicas: 2 Ć $146/month (+$292)
RDS Proxy: $0.015/hour Ć 730 = $11/month (+$11)
Total increase: $333/month
Value: 10x performance improvement, handles 3x more traffic
Amazon DynamoDB Performance
What it is: Amazon DynamoDB is a fully managed NoSQL database that provides single-digit millisecond performance at any scale. DynamoDB automatically scales throughput and storage.
Why it exists: Relational databases struggle with massive scale (millions of requests per second, petabytes of data). DynamoDB provides consistent performance at any scale without manual sharding or capacity planning.
Real-world analogy: DynamoDB is like a massive library with instant retrieval. No matter how many books (items) or how many people (requests), you always get your book in the same time (single-digit milliseconds). The library automatically expands as you add more books.
DynamoDB Performance Characteristics:
Capacity Modes:
On-Demand:
Throughput: Unlimited (scales automatically)
Pricing: $1.25 per million write requests, $0.25 per million read requests
Use Case: Unpredictable workloads, new applications
Provisioned:
Throughput: Specify read/write capacity units (RCU/WCU)
Pricing: $0.00065 per WCU-hour, $0.00013 per RCU-hour
Auto Scaling: Automatically adjusts capacity based on load
Use Case: Predictable workloads, cost optimization
# Get top 10 players for game
response = dynamodb.query(
TableName='GameLeaderboard',
KeyConditionExpression='game_id = :game_id',
ExpressionAttributeValues={':game_id': 'game123'},
ScanIndexForward=False, # Descending order (highest score first)
Limit=10
)
Without DAX:
10,000 queries/sec Ć $0.25 per million = $2.50/sec = $6,480/day
Latency: 5ms (DynamoDB)
With DAX:
import amazondax
# Create DAX client
dax = amazondax.AmazonDaxClient()
# Query through DAX (same API as DynamoDB)
response = dax.query(
TableName='GameLeaderboard',
KeyConditionExpression='game_id = :game_id',
ExpressionAttributeValues={':game_id': 'game123'},
Limit=10
)
DynamoDB capacity modes (on-demand vs provisioned)
DynamoDB Accelerator (DAX) for caching
Partition key design for even distribution
Critical Takeaways
S3 Performance: Use multiple prefixes for high request rates (5,500 GET/sec per prefix). Use multipart upload for large files. Use Transfer Acceleration for long-distance uploads. Use CloudFront for frequently accessed objects.
EBS Selection: Use gp3 for most workloads (better price/performance than gp2). Use io2 for high-IOPS databases. Use st1 for throughput-intensive workloads. Use sc1 for infrequently accessed data.
EFS vs EBS: Use EFS for shared file access across multiple instances. Use EBS for single-instance block storage. EFS automatically scales; EBS requires manual resizing.
Instance Selection: Match instance family to workload (compute-optimized for CPU, memory-optimized for RAM, storage-optimized for I/O). Use burstable instances (T3) for variable workloads. Right-size based on actual utilization.
Database Performance: Use read replicas to offload read traffic. Use RDS Proxy for connection pooling. Upgrade storage to gp3 for better IOPS. Use appropriate instance type for workload.
DynamoDB Optimization: Design partition keys for even distribution. Use DAX for read-heavy workloads (95%+ cost reduction). Use batch operations to reduce request count. Choose on-demand for unpredictable workloads, provisioned for predictable.
Caching Strategy: Use CloudFront for static content. Use DAX for DynamoDB. Use ElastiCache for application caching. Caching reduces latency and costs.
Self-Assessment Checklist
Test yourself before moving on:
I understand S3 performance limits (requests per prefix)
I know when to use multipart upload
I can explain the difference between gp3 and io2 EBS volumes
I understand when to use EFS vs EBS
I know the different EC2 instance families and their use cases
I can right-size EC2 instances based on utilization
I understand how RDS read replicas improve performance
I know when to use RDS Proxy
I understand DynamoDB capacity modes (on-demand vs provisioned)
I can explain how DAX improves DynamoDB performance
I know how to design DynamoDB partition keys
I understand caching strategies (CloudFront, DAX, ElastiCache)
Practice Questions
Try these from your practice test bundles:
Domain 3 Bundle 1: Questions 1-25 (Storage and compute)
Domain 3 Bundle 2: Questions 26-50 (Database and caching)
Full Practice Test 1: Questions 38-53 (Domain 3 questions)
Expected score: 70%+ to proceed confidently
If you scored below 70%:
Review sections: Focus on areas where you missed questions
Key topics to strengthen:
S3 performance optimization techniques
EBS volume type selection
EC2 instance family characteristics
RDS read replica use cases
DynamoDB partition key design
Quick Reference Card
Storage Services:
S3: Object storage, unlimited scale, 5,500 GET/sec per prefix
EBS gp3: General purpose SSD, 3,000-16,000 IOPS, $0.08/GB-month
EBS io2: High-performance SSD, up to 64,000 IOPS, $0.125/GB-month
Diagram Explanation: This decision tree shows how to optimize S3 performance based on different requirements. For high request rates (>5,500 GET/sec), distribute objects across multiple prefixes to scale beyond single-prefix limits. For large objects (>100 MB), use multipart upload to parallelize uploads and improve reliability. For users far from the S3 region, enable Transfer Acceleration to route data over AWS's optimized network. For frequently accessed content, use CloudFront to cache at edge locations and reduce latency. For selective data retrieval, use S3 Select to filter data server-side and reduce data transfer.
ā Must Know (S3 Performance):
S3 supports 5,500 GET/sec and 3,500 PUT/sec per prefix (not per bucket)
Use multiple prefixes to scale beyond these limits (e.g., date-based prefixes)
Multipart upload is recommended for objects >100 MB and required for >5 GB
Transfer Acceleration can improve upload speeds by 50-500% for long distances
S3 Select reduces data transfer by filtering data server-side
CloudFront caching reduces S3 costs and improves latency for end users
When to use S3 Performance Features:
ā Use multiple prefixes when: Request rate exceeds 5,500 GET/sec or 3,500 PUT/sec
ā Use multipart upload when: Objects are >100 MB or upload reliability is critical
ā Use Transfer Acceleration when: Users are >1,000 miles from S3 region
ā Use S3 Select when: You need only a subset of data from large objects
ā Use CloudFront when: Content is accessed frequently from multiple locations
ā Don't use Transfer Acceleration when: Users are in same region as bucket (no benefit)
ā Don't use S3 Select when: You need the entire object (adds processing cost)
Amazon EBS Performance Optimization
What it is: Amazon Elastic Block Store (EBS) provides block-level storage volumes for EC2 instances. EBS volumes are network-attached storage that persist independently of instance lifetime.
Why it exists: EC2 instances need persistent storage that survives instance termination. Instance store (ephemeral storage) is lost when instance stops. EBS provides durable, high-performance block storage with snapshots, encryption, and multiple volume types optimized for different workloads.
Real-world analogy: EBS is like an external hard drive that you can attach to your computer (EC2 instance). You can detach it, attach it to a different computer, take snapshots (backups), and choose different drive types (SSD vs HDD) based on your needs.
EBS Volume Types and Performance:
Volume Type
Use Case
IOPS
Throughput
Latency
Cost
gp3 (General Purpose SSD)
Most workloads
3,000-16,000
125-1,000 MB/s
Single-digit ms
$0.08/GB-month
gp2 (General Purpose SSD)
Legacy, variable performance
100-16,000 (burst)
128-250 MB/s
Single-digit ms
$0.10/GB-month
io2 (Provisioned IOPS SSD)
High-performance databases
100-64,000
256-4,000 MB/s
Sub-millisecond
$0.125/GB-month + $0.065/IOPS
io2 Block Express
Highest performance
256,000 IOPS
4,000 MB/s
Sub-millisecond
$0.125/GB-month + $0.065/IOPS
st1 (Throughput Optimized HDD)
Big data, data warehouses
500 (max)
500 MB/s
Low ms
$0.045/GB-month
sc1 (Cold HDD)
Infrequent access
250 (max)
250 MB/s
Low ms
$0.015/GB-month
How EBS Performance Works:
1. IOPS (Input/Output Operations Per Second):
Measures number of read/write operations per second
gp3: Baseline 3,000 IOPS (regardless of size), can provision up to 16,000
io2: Provision exactly what you need (100-64,000 IOPS)
2. Throughput (MB/s):
Measures amount of data transferred per second
gp3: Baseline 125 MB/s, can provision up to 1,000 MB/s
gp2: Scales with IOPS (250 MB/s max)
st1: 500 MB/s max (optimized for sequential reads)
3. Burst Performance (gp2 only):
gp2 volumes accumulate I/O credits when idle
Can burst to 3,000 IOPS for short periods
Credit balance: 5.4 million I/O credits (30 minutes at 3,000 IOPS)
Problem: Credits deplete quickly under sustained load
Detailed Example 1: Database Server (High IOPS)
Scenario: You're running a PostgreSQL database with 500 transactions per second. Each transaction requires 10 IOPS (reads + writes). You need 5,000 IOPS sustained.
Option 1: gp2 (Legacy):
Need 5,000 IOPS Ć· 3 IOPS/GB = 1,667 GB volume
Cost: 1,667 GB Ć $0.10 = $166.70/month
Problem: Paying for storage you don't need just to get IOPS
Problem: Expensive for throughput-optimized workload
Option 2: st1 (Recommended):
Throughput: 500 MB/s (max)
Storage: 10,000 GB (10 TB)
Cost: 10,000 GB Ć $0.045 = $450/month
Savings: $385/month (46% cheaper)
Trade-off: Lower IOPS (500 max), but not needed for sequential reads
Detailed Example 3: Log Archive Storage (Infrequent Access)
Scenario: You need to store 50 TB of application logs for compliance. Logs are accessed once per month for audits.
Option 1: gp3:
Storage: 50,000 GB
Cost: 50,000 GB Ć $0.08 = $4,000/month
Problem: Paying for performance you don't need
Option 2: sc1 (Recommended):
Storage: 50,000 GB
Cost: 50,000 GB Ć $0.015 = $750/month
Savings: $3,250/month (81% cheaper)
Trade-off: Lower throughput (250 MB/s), but acceptable for infrequent access
š EBS Volume Type Selection Diagram:
graph TD
A[Select EBS Volume Type] --> B{Workload Type?}
B -->|Transactional| C{IOPS Requirement?}
C -->|< 16,000 IOPS| D[gp3 General Purpose SSD]
C -->|> 16,000 IOPS| E[io2 Provisioned IOPS SSD]
C -->|> 64,000 IOPS| F[io2 Block Express]
B -->|Throughput-Intensive| G{Access Pattern?}
G -->|Frequent Access| H[st1 Throughput Optimized HDD]
G -->|Infrequent Access| I[sc1 Cold HDD]
B -->|Boot Volume| J[gp3 or gp2]
style D fill:#c8e6c9
style E fill:#fff3e0
style F fill:#ffebee
style H fill:#c8e6c9
style I fill:#e1f5fe
style J fill:#c8e6c9
See: diagrams/04_domain3_ebs_volume_selection.mmd
Diagram Explanation: This decision tree helps select the appropriate EBS volume type based on workload characteristics. For transactional workloads (databases, applications), choose based on IOPS requirements: gp3 for most workloads (<16,000 IOPS), io2 for high-performance databases (16,000-64,000 IOPS), or io2 Block Express for extreme performance (>64,000 IOPS). For throughput-intensive workloads (big data, data warehouses), choose st1 for frequently accessed data or sc1 for infrequently accessed data. For boot volumes, gp3 or gp2 are appropriate choices.
ā Must Know (EBS Performance):
gp3 is the default choice for most workloads (better price/performance than gp2)
gp3 provides 3,000 IOPS and 125 MB/s baseline regardless of volume size
gp2 performance scales with size (3 IOPS per GB), making it expensive for high IOPS
io2 is for high-performance databases requiring >16,000 IOPS or sub-millisecond latency
st1 is for throughput-intensive workloads (big data, data warehouses)
sc1 is for infrequently accessed data (lowest cost per GB)
EBS volumes are AZ-specific (cannot attach to instance in different AZ)
Use EBS snapshots for backups (stored in S3, incremental)
EBS Performance Optimization Techniques:
1. Use EBS-Optimized Instances:
Provides dedicated bandwidth for EBS traffic
Prevents network contention between EBS and application traffic
Most modern instance types are EBS-optimized by default
Performance Impact: Up to 2x better EBS performance
VolumeThroughputPercentage: Percentage of provisioned throughput used
VolumeQueueLength: Number of pending I/O requests (should be low)
Amazon EFS Performance Optimization
What it is: Amazon Elastic File System (EFS) is a fully managed, elastic, shared file system for Linux-based workloads. Multiple EC2 instances can access EFS concurrently.
Why it exists: EBS volumes can only be attached to one instance at a time. Applications that need shared file access (web servers, content management, development environments) require a shared file system. EFS provides NFS-compatible shared storage that automatically scales.
Real-world analogy: EFS is like a shared network drive in an office. Multiple employees (EC2 instances) can access the same files simultaneously. The drive automatically expands as you add more files, and you only pay for what you use.
EFS Performance Modes:
Performance Mode
Throughput
Latency
Use Case
Cost
General Purpose
Up to 7,000 file ops/sec
Low (single-digit ms)
Most workloads
$0.30/GB-month
Max I/O
>7,000 file ops/sec
Higher (double-digit ms)
Big data, media processing
$0.30/GB-month
EFS Throughput Modes:
Throughput Mode
Throughput
Scaling
Cost
Bursting
50 MB/s per TB (baseline), burst to 100 MB/s
Scales with size
Included
Provisioned
1-1,024 MB/s (fixed)
Independent of size
$6/MB/s-month
Elastic
Scales automatically
Automatic
$0.30/GB-month (read), $0.90/GB-month (write)
How EFS Performance Works:
Bursting Throughput Mode:
Baseline: 50 MB/s per TB of storage
Burst: 100 MB/s per TB (using burst credits)
Burst credits: Accumulate when below baseline, deplete when above
Example: 1 TB file system
Baseline: 50 MB/s
Burst: 100 MB/s (for limited time)
Minimum: 1 MB/s (even for small file systems)
Provisioned Throughput Mode:
Provision exact throughput needed (1-1,024 MB/s)
Independent of storage size
Use case: Small file system needing high throughput
Consideration: Expensive for small dataset with high throughput needs
š EFS Performance Architecture Diagram:
graph TB
subgraph "EFS Shared File System"
EFS[EFS File System<br/>500 GB, 25 MB/s]
end
subgraph "Availability Zone 1"
EC2_1[Web Server 1]
EC2_2[Web Server 2]
EC2_3[Web Server 3]
end
subgraph "Availability Zone 2"
EC2_4[Web Server 4]
EC2_5[Web Server 5]
end
EC2_1 -.NFS Mount.-> EFS
EC2_2 -.NFS Mount.-> EFS
EC2_3 -.NFS Mount.-> EFS
EC2_4 -.NFS Mount.-> EFS
EC2_5 -.NFS Mount.-> EFS
EFS --> MT1[Mount Target AZ-1]
EFS --> MT2[Mount Target AZ-2]
style EFS fill:#c8e6c9
style MT1 fill:#e1f5fe
style MT2 fill:#e1f5fe
See: diagrams/04_domain3_efs_shared_access.mmd
Diagram Explanation: This diagram shows how EFS provides shared file system access across multiple EC2 instances in different Availability Zones. The EFS file system is accessed through mount targets in each AZ. All instances mount the same file system using NFS protocol, enabling shared access to the same files. This architecture is ideal for web servers serving static content, development environments, or any application requiring shared file access.
ā Must Know (EFS Performance):
EFS provides shared file system access (multiple instances can mount simultaneously)
Performance scales with storage size in Bursting mode (50 MB/s per TB baseline)
Use Provisioned Throughput when small file system needs high throughput
Use Elastic Throughput for variable workloads (automatic scaling)
General Purpose mode: Up to 7,000 file ops/sec (most workloads)
Max I/O mode: >7,000 file ops/sec (big data, many small files)
EFS is more expensive than EBS ($0.30/GB vs $0.08/GB for gp3)
Use EFS Infrequent Access (IA) for files not accessed frequently (90% cost savings)
When to use EFS vs EBS:
ā Use EFS when: Multiple instances need shared access to same files
ā Use EFS when: File system needs to scale automatically
ā Use EFS when: Application uses standard file system operations (POSIX)
ā Use EBS when: Single instance needs block storage
ā Use EBS when: Need highest IOPS (>16,000) or lowest latency
ā Use EBS when: Cost is primary concern (EBS is cheaper)
ā Don't use EFS when: Only one instance needs access (use EBS instead)
ā Don't use EFS when: Need Windows file system (use FSx for Windows instead)
Amazon FSx Performance Optimization
What it is: Amazon FSx provides fully managed third-party file systems optimized for specific workloads. FSx offers Windows File Server, Lustre (HPC), NetApp ONTAP, and OpenZFS.
Why it exists: Some applications require specific file system features not available in EFS. Windows applications need SMB protocol and Active Directory integration. High-performance computing needs parallel file systems like Lustre. FSx provides these specialized file systems as managed services.
FSx for Windows File Server:
Use case: Windows applications, Active Directory integration, SMB protocol
Performance: Up to 2 GB/s throughput, millions of IOPS
FSx for Windows: Use for Windows applications needing SMB protocol and AD integration
FSx for Lustre: Use for HPC workloads needing extreme performance (ML, video, genomics)
FSx for NetApp ONTAP: Use for multi-protocol access (NFS, SMB, iSCSI) and advanced data management
FSx for OpenZFS: Use for Linux workloads needing ZFS features (snapshots, compression)
FSx for Lustre integrates with S3 (can use S3 as data repository)
FSx for Lustre Scratch: Temporary data, no replication, lowest cost
FSx for Lustre Persistent: Production data, replicated, higher cost
Section 2: High-Performing Compute Solutions
Introduction
The problem: Different workloads have vastly different compute requirements. A web server needs consistent CPU for handling requests. A batch job needs massive parallel processing. A microservice needs to scale from zero to thousands of instances instantly. Using the wrong compute service results in poor performance, high costs, or operational complexity.
The solution: AWS provides multiple compute services optimized for different use cases. Understanding the characteristics of each service (performance, scalability, cost, operational overhead) enables you to choose the right compute for each workload.
Why it's tested: Compute is the foundation of every application. This section tests your ability to select and configure compute services for optimal performance, scalability, and cost.
Core Concepts
EC2 Instance Types and Families
What it is: Amazon EC2 provides virtual servers (instances) in the cloud. EC2 offers hundreds of instance types optimized for different workloads, organized into instance families.
Why it exists: Different applications have different resource requirements. A database needs lots of memory. A video encoder needs powerful CPUs. A machine learning model needs GPUs. EC2 provides specialized instance types optimized for each workload.
Real-world analogy: EC2 instance types are like different types of vehicles. A sports car (compute-optimized) is fast but has limited cargo space. A truck (memory-optimized) can carry heavy loads but isn't as fast. A van (general purpose) balances both. You choose the vehicle based on your needs.
EC2 Instance Families:
Family
Optimized For
vCPU:Memory Ratio
Use Cases
Example Types
T3/T3a
Burstable CPU
1:2
Variable workloads, dev/test
t3.micro, t3.medium
M5/M6i
General Purpose
1:4
Balanced workloads, web servers
m5.large, m6i.xlarge
C5/C6i
Compute Optimized
1:2
CPU-intensive, batch processing
c5.2xlarge, c6i.4xlarge
R5/R6i
Memory Optimized
1:8
In-memory databases, caching
r5.xlarge, r6i.2xlarge
I3/I3en
Storage Optimized
1:8 + NVMe SSD
NoSQL databases, data warehouses
i3.2xlarge, i3en.6xlarge
P3/P4
GPU Accelerated
GPUs
Machine learning, video encoding
p3.2xlarge, p4d.24xlarge
G4
Graphics Accelerated
GPUs
Graphics workloads, game streaming
g4dn.xlarge
Instance Size Naming Convention:
Format: {family}{generation}.{size}
Example: m5.2xlarge
m: General purpose family
5: 5th generation
2xlarge: Size (8 vCPUs, 32 GB RAM)
Instance Sizes (using M5 as example):
m5.large: 2 vCPUs, 8 GB RAM
m5.xlarge: 4 vCPUs, 16 GB RAM
m5.2xlarge: 8 vCPUs, 32 GB RAM
m5.4xlarge: 16 vCPUs, 64 GB RAM
m5.8xlarge: 32 vCPUs, 128 GB RAM
m5.12xlarge: 48 vCPUs, 192 GB RAM
m5.16xlarge: 64 vCPUs, 256 GB RAM
m5.24xlarge: 96 vCPUs, 384 GB RAM
Detailed Example 1: Web Application Server
Scenario: You're running a web application with moderate traffic (100 requests/sec). CPU usage varies between 20-60% throughout the day.
Option 1: T3 Burstable Instance (Recommended):
Instance: t3.medium (2 vCPUs, 4 GB RAM)
Baseline: 20% CPU utilization
Burst: Up to 100% CPU when needed
CPU Credits: Accumulate when below baseline, spend when above
Cost: $0.0416/hour = $30/month
Benefits: Cost-effective for variable workloads
Option 2: M5 General Purpose Instance:
Instance: m5.large (2 vCPUs, 8 GB RAM)
Performance: Consistent 100% CPU available
Cost: $0.096/hour = $70/month
When to use: Sustained high CPU usage (>40% average)
How T3 CPU Credits Work:
Baseline: t3.medium earns 24 CPU credits/hour (20% of 2 vCPUs)
Burst: Spending 100% CPU consumes 120 CPU credits/hour (2 vCPUs Ć 60 min)
Credit Balance: Maximum 288 credits (24 hours of baseline)
Detailed Example 2: Database Server (Memory-Intensive)
Scenario: You're running PostgreSQL with a 100 GB working set (data that must fit in memory for good performance). Need 128 GB RAM.
Option 1: M5 General Purpose:
Instance: m5.8xlarge (32 vCPUs, 128 GB RAM)
Cost: $1.536/hour = $1,121/month
Problem: Paying for 32 vCPUs when you only need 8
Option 2: R5 Memory Optimized (Recommended):
Instance: r5.4xlarge (16 vCPUs, 128 GB RAM)
Cost: $1.008/hour = $736/month
Savings: $385/month (34% cheaper)
Benefits: Same memory, fewer vCPUs (better ratio for database)
Detailed Example 3: Batch Processing (CPU-Intensive)
Scenario: You're running video encoding jobs that max out CPU for hours. Need to process 1,000 videos per day.
Option 1: M5 General Purpose:
Instance: m5.4xlarge (16 vCPUs, 64 GB RAM)
Cost: $0.768/hour
Processing: 10 videos/hour
Time: 100 hours/day
Daily cost: 100 hours Ć $0.768 = $76.80
Option 2: C5 Compute Optimized (Recommended):
Instance: c5.4xlarge (16 vCPUs, 32 GB RAM)
Cost: $0.68/hour
Processing: 12 videos/hour (better CPU performance)
Time: 83 hours/day
Daily cost: 83 hours Ć $0.68 = $56.44
Savings: $20.36/day (27% cheaper)
š EC2 Instance Family Selection Diagram:
graph TD
A[Select EC2 Instance Type] --> B{Workload Characteristics?}
B -->|Variable CPU| C[T3/T3a Burstable]
B -->|Balanced| D[M5/M6i General Purpose]
B -->|CPU-Intensive| E[C5/C6i Compute Optimized]
B -->|Memory-Intensive| F[R5/R6i Memory Optimized]
B -->|Storage-Intensive| G[I3/I3en Storage Optimized]
B -->|GPU Workload| H{GPU Type?}
H -->|ML Training| I[P3/P4 GPU Instances]
H -->|Graphics| J[G4 Graphics Instances]
C --> K[Web servers, dev/test]
D --> L[Application servers, microservices]
E --> M[Batch processing, HPC]
F --> N[Databases, caching]
G --> O[NoSQL, data warehouses]
style C fill:#e1f5fe
style D fill:#c8e6c9
style E fill:#fff3e0
style F fill:#f3e5f5
style G fill:#ffebee
style I fill:#ffe0b2
style J fill:#ffe0b2
Diagram Explanation: This decision tree helps select the appropriate EC2 instance family based on workload characteristics. For variable CPU workloads, use T3/T3a burstable instances. For balanced workloads, use M5/M6i general purpose. For CPU-intensive workloads, use C5/C6i compute optimized. For memory-intensive workloads, use R5/R6i memory optimized. For storage-intensive workloads, use I3/I3en storage optimized. For GPU workloads, choose P3/P4 for ML training or G4 for graphics.
ā Must Know (EC2 Instance Types):
T3 burstable instances are cost-effective for variable workloads (accumulate CPU credits)
M5 general purpose instances provide balanced CPU/memory (1:4 ratio)
C5 compute optimized instances provide high CPU-to-memory ratio (1:2 ratio)
R5 memory optimized instances provide high memory-to-CPU ratio (1:8 ratio)
I3 storage optimized instances provide NVMe SSD for high IOPS
Instance size doubles resources with each step (large ā xlarge ā 2xlarge)
Use Compute Optimizer to get right-sizing recommendations
Newer generations (M6i vs M5) provide better price/performance
EC2 Performance Optimization Techniques:
1. Use Placement Groups for Low Latency:
Cluster: Instances in same AZ, low-latency network (10 Gbps)
Spread: Instances on different hardware (max 7 per AZ)
Partition: Instances in different partitions (for distributed systems)
Scheduled Scaling: Scale based on time (e.g., business hours)
AWS Lambda Performance Optimization
What it is: AWS Lambda is a serverless compute service that runs code in response to events. You don't manage servers; AWS automatically scales and manages infrastructure.
Why it exists: Managing servers is complex and expensive. You pay for idle capacity, handle scaling, patch operating systems, and monitor infrastructure. Lambda eliminates this operational overhead by running code only when needed and automatically scaling.
Real-world analogy: Lambda is like hiring a contractor for specific tasks instead of a full-time employee. You only pay when they're working (per request), they bring their own tools (runtime), and you don't manage their schedule (automatic scaling).
Key Insight: For CPU-intensive workloads, increasing memory often reduces execution time proportionally, resulting in same cost but better performance.
Detailed Example 2: API Backend (Low Latency)
Scenario: You're building an API that queries DynamoDB and returns results. Need <100ms response time.
Cold Start Problem:
Cold start: 500ms (Lambda initialization)
Warm start: 10ms (Lambda already initialized)
Problem: First request after idle period is slow
Solution 1: Provisioned Concurrency:
Pre-initializes Lambda functions
Eliminates cold starts
Cost: $0.000004167 per GB-second (in addition to execution cost)
Concurrency: 1,000 Lambda functions running in parallel
Time: 1,000,000 records Ć· 1,000 = 1,000 records per function
Time per function: 1,000 Ć 100ms = 100 seconds
Total time: 100 seconds (1.7 minutes)
Speedup: 1,000x faster
How to Achieve Parallelism:
Use S3 event notifications (one Lambda per object)
Use SQS with batch size (Lambda polls queue)
Use Step Functions Map state (parallel execution)
Use Kinesis Data Streams (one Lambda per shard)
š Lambda Performance Optimization Diagram:
graph TB
A[Lambda Performance Optimization] --> B{Optimization Goal?}
B -->|Reduce Cost| C{Workload Type?}
C -->|CPU-Intensive| D[Increase Memory<br/>Faster = Same Cost]
C -->|I/O-Intensive| E[Minimize Memory<br/>Waiting ā CPU]
B -->|Reduce Latency| F{Cold Start Issue?}
F -->|Yes| G[Provisioned Concurrency]
F -->|No| H[Optimize Code]
B -->|Increase Throughput| I[Parallel Invocations]
I --> J[S3 Events]
I --> K[SQS Batching]
I --> L[Kinesis Shards]
style D fill:#c8e6c9
style E fill:#c8e6c9
style G fill:#fff3e0
style I fill:#e1f5fe
See: diagrams/04_domain3_lambda_optimization.mmd
Diagram Explanation: This decision tree shows Lambda performance optimization strategies based on goals. To reduce cost for CPU-intensive workloads, increase memory (faster execution = same cost). For I/O-intensive workloads, minimize memory (waiting doesn't use CPU). To reduce latency with cold start issues, use Provisioned Concurrency. To increase throughput, use parallel invocations via S3 events, SQS batching, or Kinesis shards.
ā Must Know (Lambda Performance):
Lambda allocates CPU proportional to memory (1,769 MB = 1 vCPU)
For CPU-intensive workloads, increasing memory reduces execution time proportionally
Cold starts occur on first invocation or after idle period (100-1,000ms)
Provisioned Concurrency eliminates cold starts but costs more
Lambda scales automatically up to concurrency limit (1,000 default)
Use parallel invocations for high throughput (S3 events, SQS, Kinesis)
Lambda timeout maximum is 15 minutes (use Step Functions for longer workflows)
Ephemeral storage (/tmp) is 512 MB default, can increase to 10 GB
Section 3: High-Performing Database Solutions
Introduction
The problem: Databases are often the performance bottleneck in applications. Slow queries, connection limits, insufficient IOPS, and poor caching strategies result in slow response times and poor user experience.
The solution: AWS provides multiple database services optimized for different data models and access patterns. Understanding database performance characteristics (IOPS, throughput, latency, connection pooling, caching) enables you to design high-performing data layers.
Why it's tested: Database performance directly impacts application performance. This section tests your ability to select and configure database services for optimal performance.
Core Concepts
Amazon RDS Performance Optimization
What it is: Amazon RDS is a managed relational database service supporting MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server. RDS handles backups, patching, and replication.
Why it exists: Managing database servers is complex. You must handle backups, replication, failover, patching, and monitoring. RDS automates these operational tasks, allowing you to focus on application development.
Real-world analogy: RDS is like hiring a database administrator who handles all maintenance tasks. You focus on your application while RDS handles backups, updates, and keeping the database running.
RDS Performance Factors:
1. Instance Type:
db.t3: Burstable CPU, cost-effective for variable workloads
db.m5: General purpose, balanced CPU/memory
db.r5: Memory optimized, high memory for large working sets
db.x1e: Extreme memory, up to 3,904 GB RAM
2. Storage Type:
gp3: General purpose SSD, 3,000-16,000 IOPS, 125-1,000 MB/s
Can be in different regions (cross-region read replicas)
4. RDS Proxy:
Connection pooling and management
Reduces database connections
Improves scalability for serverless applications
Automatic failover (faster than DNS-based failover)
Detailed Example 1: E-Commerce Database (High Read Traffic)
Scenario: You have an e-commerce site with 10,000 product page views per minute. Each page view requires 5 database queries. Database CPU is at 80% due to read queries.
Detailed Example 2: Serverless Application (Connection Pooling)
Scenario: You have a Lambda function that queries RDS. Each Lambda invocation creates a new database connection. With 1,000 concurrent Lambda executions, you hit the database connection limit (100 connections).
gp3 storage provides better price/performance than gp2
Use Performance Insights to identify slow queries
Multi-AZ provides high availability but NOT performance improvement
Cross-region read replicas have higher replication lag (network latency)
Amazon Aurora Performance Optimization
What it is: Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora provides up to 5x performance of MySQL and 3x performance of PostgreSQL.
Why it exists: Traditional databases were designed for single servers with local storage. Cloud databases need to scale across multiple servers and storage nodes. Aurora was built from the ground up for cloud architecture, providing better performance, availability, and scalability.
Real-world analogy: Aurora is like a high-performance sports car designed specifically for racing, while RDS is like a regular car modified for racing. Both can race, but the purpose-built car performs better.
Aurora Performance Advantages:
1. Storage Architecture:
Traditional RDS: Single EBS volume (limited IOPS)
Aurora: Distributed storage across 6 copies in 3 AZs
What it is: Amazon DynamoDB is a fully managed NoSQL database that provides single-digit millisecond latency at any scale. DynamoDB automatically scales to handle millions of requests per second.
Why it exists: Relational databases struggle with massive scale and require complex sharding. NoSQL databases like DynamoDB are designed for horizontal scaling, providing consistent performance regardless of data size.
Real-world analogy: DynamoDB is like a massive filing system where you can instantly retrieve any document by its ID. The system automatically adds more filing cabinets as you add more documents, and retrieval time stays constant.
DynamoDB Performance Characteristics:
Capacity Modes:
On-Demand: Pay per request, automatic scaling, no capacity planning
Provisioned: Specify RCU/WCU, predictable cost, can use Auto Scaling
Read/Write Capacity Units:
RCU (Read Capacity Unit): 1 strongly consistent read/sec for items up to 4 KB
WCU (Write Capacity Unit): 1 write/sec for items up to 1 KB
Eventually consistent reads: 2 reads per RCU (half the cost)
Performance: No throttling, scales to millions of users
Key Principle: Partition key should have high cardinality (many unique values) to distribute data evenly.
Detailed Example 2: DynamoDB Accelerator (DAX) for Caching
Scenario: You have a product catalog with 100,000 products. Each product page view requires reading product details. You have 10,000 page views per minute.
ā Infrequently accessed items (low cache hit rate)
ā Must Know (DynamoDB Performance):
Partition key design is critical (use high-cardinality keys)
Hot partitions cause throttling (distribute data evenly)
Use DAX for read-heavy workloads (microsecond latency)
On-Demand mode: No capacity planning, pay per request
Provisioned mode: Predictable cost, can use Auto Scaling
Eventually consistent reads are half the cost of strongly consistent
Global Secondary Indexes (GSI) enable different query patterns
Avoid Scan operations in production (reads entire table)
Section 4: High-Performing Network Architectures
Introduction
The problem: Network latency and bandwidth limitations impact application performance. Users far from your servers experience slow load times. Inefficient routing increases costs. Poor network design creates bottlenecks.
The solution: AWS provides multiple networking services to optimize performance. CloudFront caches content at edge locations. Global Accelerator routes traffic over AWS's optimized network. VPC design and load balancing strategies improve throughput and reduce latency.
Why it's tested: Network performance affects user experience. This section tests your ability to design network architectures for optimal performance and cost.
Core Concepts
Amazon CloudFront Performance Optimization
What it is: Amazon CloudFront is a content delivery network (CDN) that caches content at edge locations worldwide. CloudFront reduces latency by serving content from the location closest to users.
Why it exists: Serving content from a single region results in high latency for distant users. A user in Australia accessing content in US-East-1 experiences 200-300ms latency. CloudFront caches content at 400+ edge locations, reducing latency to 10-50ms.
Real-world analogy: CloudFront is like having local warehouses in every city instead of one central warehouse. Customers get products faster because they're shipped from the nearest warehouse.
CloudFront Performance Characteristics:
Latency Reduction:
Direct to S3: 100-300ms (depends on distance)
Via CloudFront: 10-50ms (edge location nearby)
Improvement: 2-10x faster
Cache Hit Ratio:
High cache hit ratio (>80%): Most requests served from edge
Low cache hit ratio (<50%): Many requests go to origin (slower, more expensive)
Detailed Example: Global Website
Scenario: You have a website with users worldwide. Static assets (images, CSS, JavaScript) are 10 MB per page. You have 1 million page views per day.
Without CloudFront:
Data transfer: 1M views Ć 10 MB = 10 TB/day
S3 data transfer cost: 10 TB Ć $0.09/GB = $900/day
Regional Edge Caches provide additional caching between edge and origin
AWS Global Accelerator
What it is: AWS Global Accelerator routes traffic over AWS's global network infrastructure instead of the public internet. It provides static IP addresses that route to optimal AWS endpoints.
Why it exists: Public internet routing is unpredictable and can be slow. Global Accelerator uses AWS's private network, which is faster and more reliable than public internet.
Real-world analogy: Global Accelerator is like taking a private highway instead of public roads. The private highway has less traffic, better maintenance, and faster speeds.
Performance Benefits:
Latency reduction: 10-60% faster than public internet
Consistent performance: AWS network is more reliable
Automatic failover: Routes to healthy endpoints
Static IPs: No DNS caching issues
When to use Global Accelerator vs CloudFront:
CloudFront: Static content, caching, HTTP/HTTPS
Global Accelerator: Dynamic content, TCP/UDP, non-HTTP protocols
Chapter Summary
What We Covered
ā Section 1: High-Performing Storage Solutions
S3 performance optimization (prefixes, multipart upload, Transfer Acceleration)
EBS volume types and performance characteristics (gp3, io2, st1, sc1)
EFS performance modes and throughput modes
FSx for specialized file systems (Windows, Lustre, ONTAP, OpenZFS)
S3 Performance: Use multiple prefixes for >5,500 GET/sec. Use multipart upload for >100 MB objects. Use Transfer Acceleration for long-distance uploads. Use CloudFront for frequently accessed content.
EBS Selection: Use gp3 for most workloads (better price/performance than gp2). Use io2 for high-IOPS databases (>16,000 IOPS). Use st1 for throughput-intensive workloads. Use sc1 for infrequently accessed data.
EFS vs EBS: Use EFS for shared file access across multiple instances. Use EBS for single-instance block storage. EFS automatically scales; EBS requires manual resizing.
EC2 Instance Selection: Match instance family to workload (T3 for variable, M5 for balanced, C5 for CPU, R5 for memory, I3 for storage). Use Compute Optimizer for right-sizing recommendations.
Lambda Optimization: For CPU-intensive workloads, increasing memory reduces execution time proportionally (same cost, better performance). Use Provisioned Concurrency to eliminate cold starts. Use parallel invocations for high throughput.
RDS Performance: Use read replicas to offload read traffic (up to 5 replicas). Use RDS Proxy for connection pooling (critical for Lambda). Use gp3 storage for better price/performance. Use Performance Insights to identify slow queries.
Aurora Advantages: Up to 15 read replicas (vs 5 for RDS). <10ms replication lag (vs 100ms+ for RDS). 30-second failover (vs 60-120 seconds for RDS). Continuous backup with no performance impact.
DynamoDB Optimization: Design partition keys for even distribution (high cardinality). Use DAX for read-heavy workloads (microsecond latency). Use On-Demand mode for unpredictable workloads. Avoid Scan operations in production.
CloudFront Performance: Caches content at 400+ edge locations. Aim for >80% cache hit ratio. Use Cache-Control headers to control TTL. Reduces latency by 2-10x for global users.
Global Accelerator: Routes traffic over AWS network (10-60% faster than internet). Use for dynamic content and non-HTTP protocols. Provides static IPs and automatic failover.
Self-Assessment Checklist
Test yourself before moving on:
I understand S3 performance limits (5,500 GET/sec per prefix)
I know when to use multipart upload and Transfer Acceleration
I can explain the difference between gp3 and io2 EBS volumes
I understand when to use EFS vs EBS
I know the different EC2 instance families and their use cases
I can right-size EC2 instances based on utilization
I understand how Lambda memory affects CPU and performance
I know when to use Provisioned Concurrency
I understand how RDS read replicas improve performance
I know when to use RDS Proxy
I can explain Aurora's performance advantages over RDS
I understand DynamoDB partition key design principles
I know when to use DAX for DynamoDB
I understand how CloudFront reduces latency
I can explain when to use Global Accelerator vs CloudFront
Practice Questions
Try these from your practice test bundles:
Domain 3 Bundle 1: Questions 1-25 (Storage and compute)
Domain 3 Bundle 2: Questions 26-50 (Database and networking)
Full Practice Test 1: Questions 38-53 (Domain 3 questions)
Expected score: 75%+ to proceed confidently
If you scored below 75%:
Review sections: Focus on areas where you missed questions
Key topics to strengthen:
S3 performance optimization techniques
EBS volume type selection criteria
EC2 instance family characteristics
Lambda memory and concurrency
RDS read replica use cases
DynamoDB partition key design
CloudFront caching strategies
Quick Reference Card
Storage Services:
S3: Object storage, 5,500 GET/sec per prefix, unlimited scale
EBS gp3: General purpose SSD, 3,000-16,000 IOPS, $0.08/GB-month
EBS io2: High-performance SSD, up to 64,000 IOPS, $0.125/GB-month
EFS: Shared file system, 50 MB/s per TB, $0.30/GB-month
FSx Lustre: HPC file system, 200 MB/s per TB, $0.145/GB-month
Compute Services:
T3: Burstable CPU, cost-effective for variable workloads
M5: General purpose, balanced CPU/memory (1:4 ratio)
C5: Compute optimized, high CPU-to-memory ratio (1:2 ratio)
R5: Memory optimized, high memory-to-CPU ratio (1:8 ratio)
ā Task 3.4 - High-Performing Network Architectures: CloudFront edge caching, Global Accelerator, VPC design for performance, Direct Connect, load balancer optimization
ā Task 3.5 - Data Ingestion and Transformation: Kinesis streaming, Glue ETL, Athena query optimization, EMR big data processing, data lake architectures
Critical Takeaways
Match Storage to Workload: Use S3 for object storage with 11 9's durability, EBS for block storage with low latency, EFS for shared file systems, and FSx for specialized workloads (Windows, Lustre, NetApp).
Choose the Right Compute: EC2 for full control, Lambda for event-driven serverless, Fargate for serverless containers, and ECS/EKS for container orchestration. Match instance types to workload characteristics.
Database Performance is Multi-Faceted: Consider read/write patterns, use read replicas for read-heavy workloads, implement caching with ElastiCache, and choose between relational (RDS/Aurora) and NoSQL (DynamoDB) based on data structure.
Edge Services Reduce Latency: Use CloudFront for content delivery, Global Accelerator for static IP and TCP/UDP optimization, and Route 53 latency-based routing for global applications.
Caching is Critical: Implement caching at multiple layers - CloudFront for static content, ElastiCache for database queries, DAX for DynamoDB, API Gateway for API responses.
Streaming vs. Batch Processing: Use Kinesis for real-time streaming data, Glue for batch ETL, and EMR for large-scale data processing. Choose based on latency requirements.
Optimize Data Transfer: Use S3 Transfer Acceleration for long-distance uploads, multipart upload for large files, and VPC endpoints to avoid internet traffic.
Self-Assessment Checklist
Test yourself before moving to Domain 4. You should be able to:
High-Performing Storage:
Choose appropriate S3 storage class based on access patterns
Optimize S3 performance using prefixes and multipart upload
Select EBS volume type (gp3, io2, st1, sc1) based on IOPS/throughput needs
Configure EFS performance mode (General Purpose vs. Max I/O)
Choose FSx file system (Windows, Lustre, NetApp, OpenZFS) for specific workloads
Implement S3 Transfer Acceleration for global uploads
Use Storage Gateway for hybrid cloud storage
High-Performing Compute:
Select EC2 instance family (C, M, R, T, I, G, P) based on workload
Configure EC2 placement groups (Cluster, Spread, Partition)
Optimize Lambda function memory and timeout settings
Implement Lambda provisioned concurrency for consistent performance
Choose between ECS EC2 and ECS Fargate based on requirements
Configure Auto Scaling policies for optimal performance and cost
Use Compute Optimizer for right-sizing recommendations
High-Performing Databases:
Choose between RDS and Aurora based on performance needs
Configure RDS read replicas for read-heavy workloads
Select DynamoDB capacity mode (On-Demand vs. Provisioned)
Design DynamoDB partition keys for even distribution
Implement ElastiCache (Redis or Memcached) for caching
Use DynamoDB DAX for microsecond latency
Configure RDS Proxy for connection pooling
High-Performing Networks:
Configure CloudFront distributions with optimal caching policies
Use Global Accelerator for static IP and improved performance
Design VPC with appropriate subnet sizing and routing
Choose between ALB and NLB based on performance requirements
Implement Direct Connect for consistent network performance
Use VPC endpoints to reduce latency and data transfer costs
Configure Route 53 latency-based routing for global applications
Data Ingestion and Transformation:
Design Kinesis Data Streams for real-time data ingestion
Use Kinesis Data Firehose for data delivery to S3/Redshift
Configure Glue ETL jobs for data transformation
Optimize Athena queries with partitioning and columnar formats
Choose between EMR and Glue for big data processing
Implement data lake architecture with Lake Formation
Use QuickSight for data visualization
Practice Questions
Try these from your practice test bundles:
Domain 3 Bundle 1: Questions 1-50 (storage and compute performance)
Domain 3 Bundle 2: Questions 1-50 (database and network performance)
Aurora performance features (Parallel Query, Global Database)
DynamoDB capacity modes and DAX
ElastiCache (Redis vs Memcached)
ā Network Optimization
CloudFront edge caching
Global Accelerator
VPC endpoints (Gateway vs Interface)
Direct Connect and LAG
ā Data Ingestion and Analytics
Kinesis (Data Streams, Firehose, Analytics)
Glue ETL and Data Catalog
Athena query optimization
EMR for big data processing
Critical Takeaways
Storage Performance: Use S3 prefixes for parallelization (3500 PUT/5500 GET per prefix), gp3 for cost-effective IOPS, io2 for mission-critical workloads, EFS for shared file access
Compute Optimization: Choose instance types based on workload (compute-optimized for CPU, memory-optimized for RAM, storage-optimized for I/O), use Auto Scaling for elasticity, Lambda for event-driven
Database Performance: Aurora for high-performance relational (15 read replicas), DynamoDB for single-digit millisecond NoSQL, DAX for microsecond caching, ElastiCache for sub-millisecond
Network Acceleration: CloudFront for global content delivery (50+ edge locations), Global Accelerator for static IP and health-based routing, VPC endpoints to avoid internet gateway
Data Analytics: Kinesis for real-time streaming, Glue for ETL, Athena for serverless SQL on S3, EMR for big data frameworks (Spark, Hadoop)
Self-Assessment Checklist
Test yourself before moving on:
I can explain S3 performance optimization techniques (prefixes, multipart, Transfer Acceleration)
I understand EBS volume types and when to use each (gp3, io2, st1, sc1)
I know the difference between EFS performance modes
I can select appropriate EC2 instance types for different workloads
I understand Lambda memory and concurrency optimization
I know when to use RDS vs Aurora vs DynamoDB
I can explain DynamoDB capacity modes (On-Demand vs Provisioned)
I understand CloudFront caching strategies
I know when to use Global Accelerator vs CloudFront
I can design a high-performing data ingestion pipeline with Kinesis
Practice Questions
Try these from your practice test bundles:
Domain 3 Bundle 1: Questions 1-25 (Storage and compute)
Domain 3 Bundle 2: Questions 1-25 (Database and network)
Storage Services Bundle: Questions 1-25
Database Services Bundle: Questions 1-25
Expected score: 75%+ to proceed
If you scored below 75%:
Review sections: S3 performance, EBS volume types, Database selection, CloudFront vs Global Accelerator
Focus on: Understanding performance characteristics and when to use each service
Quick Reference Card
Storage Performance:
S3: 3,500 PUT/5,500 GET per prefix, multipart for >100MB, Transfer Acceleration for global uploads
EBS gp3: 3,000 IOPS baseline, up to 16,000 IOPS, 125-1,000 MB/s
EBS io2: Up to 64,000 IOPS, 1,000 MB/s, 99.999% durability
EFS: 10+ GB/s aggregate, General Purpose (low latency) or Max I/O (high throughput)
CloudFront for global content delivery and edge caching
Global Accelerator for TCP/UDP performance improvement
VPC networking: subnets, route tables, endpoints
Direct Connect for dedicated high-bandwidth connectivity
Load balancing strategies for optimal traffic distribution
ā Task 3.5: Determine High-Performing Data Ingestion and Transformation Solutions
Kinesis Data Streams for real-time data ingestion
Kinesis Firehose for near real-time delivery to data stores
Glue for serverless ETL and data cataloging
Athena for serverless SQL queries on S3
EMR for big data processing with Hadoop/Spark
Critical Takeaways
Choose the right storage for the workload: S3 for objects, EBS for block, EFS for shared files. Match storage class to access patterns (Frequent ā IA ā Glacier).
IOPS matter for databases: Use io2 Block Express for highest IOPS (256,000). Use gp3 for cost-effective performance. Provision IOPS for consistent performance.
Right-size compute instances: Use Compute Optimizer recommendations. Match instance family to workload (c5 for compute, r5 for memory, i3 for storage).
Lambda optimization is critical: More memory = more CPU. Use provisioned concurrency for consistent latency. Use layers for shared code. Optimize cold starts.
Caching reduces latency and cost: Use CloudFront for static content, ElastiCache for database queries, DAX for DynamoDB, API Gateway caching for APIs.
Database choice affects performance: Aurora for high-performance relational, DynamoDB for single-digit millisecond NoSQL, ElastiCache for sub-millisecond caching.
Read replicas for read-heavy workloads: RDS supports up to 5 read replicas, Aurora supports up to 15. Use for reporting and analytics without impacting primary.
Global performance requires edge services: CloudFront for content delivery, Global Accelerator for TCP/UDP, Route 53 latency-based routing for optimal endpoint selection.
Real-time vs batch processing: Kinesis Data Streams for real-time (sub-second), Kinesis Firehose for near real-time (60 seconds), Glue/EMR for batch (minutes to hours).
Partition data for performance: S3 prefixes for parallel requests, DynamoDB partition keys for even distribution, Athena partitions for faster queries.
Key Services Quick Reference
Storage Services:
S3: Object storage, 11 9's durability, 5,500 GET/3,500 PUT per prefix per second
S3 Intelligent-Tiering: Automatic cost optimization based on access patterns
EBS gp3: General purpose SSD, 3,000-16,000 IOPS, 125-1,000 MB/s
EBS io2: Provisioned IOPS SSD, up to 64,000 IOPS, 99.999% durability
EBS io2 Block Express: Up to 256,000 IOPS, 4,000 MB/s, sub-millisecond latency
EFS: Shared file storage, automatic scaling, bursting and provisioned throughput
FSx Lustre: HPC file system, up to 1 TB/s throughput, millions IOPS
FSx Windows: Windows file server, SMB protocol, Active Directory integration
Enhanced Networking: SR-IOV, up to 100 Gbps, lower latency, higher PPS
VPC Endpoints: Private connectivity, no internet gateway, reduced latency
Data Ingestion Quick Facts
Kinesis Data Streams: Real-time, 1 MB/s per shard, 24h-365d retention
Kinesis Data Firehose: Near real-time (60s), auto-scaling, S3/Redshift delivery
Kinesis Data Analytics: SQL on streams, real-time analytics
Glue: Serverless ETL, data catalog, crawlers for schema discovery
Athena: Serverless SQL on S3, pay per query, Presto-based
EMR: Managed Hadoop/Spark, big data processing, auto-scaling
Decision Points
High IOPS database ā io2 or io2 Block Express EBS volumes
Reduce database load ā ElastiCache or DAX for caching
Global content delivery ā CloudFront with edge locations
Low-latency HPC ā Cluster placement group with enhanced networking
Variable Lambda workload ā Provisioned concurrency for predictable latency
Read-heavy database ā Read replicas (up to 15 for Aurora)
Real-time analytics ā Kinesis Data Streams + Lambda or Kinesis Data Analytics
Large file uploads ā S3 multipart upload + Transfer Acceleration
Congratulations! You've completed Domain 3: Design High-Performing Architectures. Performance optimization is critical for real-world applications, and this domain (24% of the exam) tests your ability to choose the right services for optimal performance.
Task 3.5: Determine High-Performing Data Ingestion and Transformation
ā Kinesis Data Streams for real-time streaming
ā Kinesis Data Firehose for near real-time delivery
ā Kinesis Data Analytics for stream processing
ā Glue for serverless ETL
ā Athena for serverless SQL on S3
ā EMR for big data processing
ā Lake Formation for data lake management
Critical Takeaways
Match Storage to Workload: Use gp3 for general purpose, io2 for high IOPS databases, st1 for throughput-intensive workloads, and sc1 for cold data.
Cache Aggressively: Implement caching at multiple layers (CloudFront, ElastiCache, DAX) to reduce latency and database load.
Choose the Right Compute: Use Lambda for event-driven, Fargate for containers without management, EC2 for full control, and Batch for large-scale batch jobs.
Database Performance: Use read replicas for read scaling, Aurora for best performance, DynamoDB for single-digit millisecond latency, and caching for frequently accessed data.
Global Performance: Use CloudFront for content delivery, Global Accelerator for static IPs and health checks, and multi-region deployments for global applications.
Network Optimization: Use Direct Connect for consistent low latency, Enhanced Networking for high throughput, and VPC endpoints to avoid internet traffic.
Real-Time Processing: Use Kinesis Data Streams for real-time analytics, Firehose for near real-time delivery, and Lambda for stream processing.
Right-Size Everything: Use Compute Optimizer, Performance Insights, and CloudWatch metrics to continuously optimize resource sizing.
Self-Assessment Checklist
Test yourself before moving on. Can you:
Storage Performance
Choose the appropriate EBS volume type for different workloads?
Explain when to use EFS vs FSx vs S3?
Optimize S3 performance with multipart upload and Transfer Acceleration?
Select the right S3 storage class for access patterns?
Configure EFS performance and throughput modes?
Use Storage Gateway for hybrid storage scenarios?
Compute Performance
Select the appropriate EC2 instance type for workloads?
Configure placement groups for low-latency applications?
Implement Auto Scaling with appropriate policies?
Optimize Lambda memory and concurrency settings?
Choose between ECS and EKS for container workloads?
Use Batch for large-scale batch processing?
Database Performance
Choose between RDS, Aurora, and DynamoDB?
Configure read replicas for read scaling?
Implement database caching with ElastiCache or DAX?
Use RDS Proxy for connection pooling?
Optimize DynamoDB with partition key design?
Select appropriate database capacity modes?
Network Performance
Configure CloudFront for global content delivery?
Use Global Accelerator for static anycast IPs?
Implement Direct Connect for dedicated connectivity?
Enable Enhanced Networking for high throughput?
Choose the appropriate load balancer type?
Use VPC endpoints for private connectivity?
Data Ingestion and Analytics
Design real-time streaming architectures with Kinesis?
Use Glue for serverless ETL jobs?
Query S3 data with Athena?
Process big data with EMR?
Build data lakes with Lake Formation?
Practice Questions
Try these from your practice test bundles:
Beginner Level (Build Confidence):
Domain 3 Bundle 1: Questions 1-20
Storage Services Bundle: Questions 1-15
Expected score: 70%+ to proceed
Intermediate Level (Test Understanding):
Domain 3 Bundle 2: Questions 1-20
Compute Services Bundle: Questions 1-15
Database Services Bundle: Questions 1-15
Expected score: 75%+ to proceed
Advanced Level (Challenge Yourself):
Full Practice Test 2: Domain 3 questions
Expected score: 70%+ to proceed
If you scored below target:
Below 60%: Review storage and compute fundamentals
60-70%: Focus on database and network optimization
70-80%: Review quick facts and decision points
80%+: Outstanding! Move to next domain
Quick Reference Card
Copy this to your notes for quick review:
Storage Performance
gp3: 3,000-16,000 IOPS, 125-1,000 MB/s, general purpose
io2: Up to 64,000 IOPS, 1,000 MB/s, high-performance databases
io2 Block Express: Up to 256,000 IOPS, 4,000 MB/s, largest databases
st1: 500 IOPS, 500 MB/s, throughput-intensive (big data, data warehouses)
sc1: 250 IOPS, 250 MB/s, cold data, lowest cost
Compute Performance
General Purpose: t3, t4g (burstable), m5, m6g (balanced)
ā Database Performance: RDS, Aurora, DynamoDB, ElastiCache, and caching strategies
ā Network Performance: CloudFront, Global Accelerator, Direct Connect, and network optimization
ā Data Ingestion: Kinesis, Glue, Athena, EMR, and real-time analytics
ā Performance Monitoring: CloudWatch, X-Ray, and performance troubleshooting
Critical Takeaways
Choose the Right Storage: Match storage type to access pattern - S3 for objects, EBS for block, EFS for shared file, FSx for specialized workloads
Optimize Compute: Use appropriate instance types (compute-optimized for CPU, memory-optimized for RAM), placement groups for HPC, and provisioned concurrency for Lambda
Cache Aggressively: Implement caching at multiple layers (CloudFront edge, ElastiCache/DAX, application) to reduce latency and database load
Scale Databases Properly: Use read replicas for read-heavy workloads, Aurora for high performance, DynamoDB for massive scale
Leverage Edge Services: Use CloudFront for global content delivery, Global Accelerator for static IPs and health checks
Monitor and Optimize: Use CloudWatch metrics, X-Ray tracing, and Compute Optimizer recommendations to continuously improve performance
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Storage Performance:
Choose between S3 storage classes based on access patterns
This chapter covered the essential concepts for designing high-performing architectures on AWS, which accounts for 24% of the SAA-C03 exam. We explored five major task areas:
Task 3.1: High-Performing Storage Solutions
ā S3 storage classes and performance optimization
ā EBS volume types (gp3, io2, st1, sc1) and use cases
ā EFS performance modes and throughput modes
ā FSx file systems (Windows, Lustre, NetApp ONTAP)
ā Storage Gateway for hybrid cloud storage
ā DataSync for large-scale data migration
Task 3.2: High-Performing Compute Solutions
ā EC2 instance families and types selection
ā Placement groups (Cluster, Spread, Partition)
ā Enhanced networking and ENA
ā Auto Scaling policies and strategies
ā Lambda memory and concurrency optimization
ā ECS and EKS capacity providers
ā Batch for large-scale batch processing
Task 3.3: High-Performing Database Solutions
ā RDS instance types and storage optimization
ā Aurora Serverless and performance features
ā DynamoDB capacity modes and DAX caching
ā ElastiCache (Redis vs Memcached)
ā Database read replicas and replication
ā RDS Proxy for connection pooling
Task 3.4: High-Performing Network Architectures
ā CloudFront edge locations and caching
ā Global Accelerator for global applications
ā Direct Connect for dedicated connectivity
ā VPC design and subnet optimization
ā Load balancer performance characteristics
ā PrivateLink for private connectivity
Task 3.5: Data Ingestion and Transformation
ā Kinesis Data Streams for real-time ingestion
ā Kinesis Firehose for serverless delivery
ā Glue for ETL and data cataloging
ā Athena for serverless SQL queries
ā EMR for big data processing
ā Lake Formation for data lake management
Critical Takeaways
Storage Performance: Choose gp3 for general purpose (16,000 IOPS), io2 Block Express for extreme performance (256,000 IOPS), EFS for shared file systems.
EBS Optimization: Use gp3 instead of gp2 (20% cheaper, configurable IOPS/throughput), enable EBS optimization on instances, use Fast Snapshot Restore for quick recovery.
S3 Performance: Use multipart upload for files >100 MB, enable Transfer Acceleration for global uploads, implement request rate optimization (3,500 PUT/5,500 GET per prefix).
Compute Selection: Memory-optimized (R/X) for databases, Compute-optimized (C) for batch processing, General purpose (M/T) for web servers, GPU (P/G) for ML/graphics.
Placement Groups: Cluster for low-latency HPC (single AZ), Spread for critical instances (max 7 per AZ), Partition for distributed systems (Hadoop, Cassandra).
Lambda Optimization: More memory = more CPU (1,769 MB = 1 vCPU), use Provisioned Concurrency for consistent latency, optimize package size for faster cold starts.
Database Caching: ElastiCache for general caching, DAX for DynamoDB (microsecond latency), RDS Proxy for connection pooling (reduce connection overhead).
Aurora Performance: Up to 5x faster than MySQL, 3x faster than PostgreSQL, 15 read replicas, automatic failover <30 seconds, parallel query for analytics.
DynamoDB Optimization: Use On-Demand for unpredictable workloads, Provisioned for steady-state (cheaper), design partition keys for even distribution, use GSI for query flexibility.
CloudFront Benefits: Reduce origin load by 60-90%, cache at 450+ edge locations, Origin Shield for additional caching layer, signed URLs for private content.
Global Accelerator: Static anycast IPs, intelligent routing to optimal endpoint, instant regional failover, TCP/UDP support (not just HTTP).
Kinesis Streams: 1 MB/s write per shard, 2 MB/s read per shard, 1,000 records/s per shard, 24-hour default retention (up to 365 days).
Data Format Optimization: Convert CSV to Parquet (10x compression, 100x faster queries), use columnar formats for analytics, partition data by query patterns.
Network Performance: Enhanced networking (25 Gbps), Elastic Fabric Adapter for HPC (100 Gbps), placement groups for low latency (<1 ms).
Monitoring: Use CloudWatch for metrics, X-Ray for distributed tracing, Performance Insights for database bottlenecks, VPC Flow Logs for network analysis.
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Storage Performance:
Select appropriate EBS volume type based on IOPS and throughput requirements
Explain the difference between gp3 and io2 Block Express
Choose between EFS and FSx for different file system needs
Optimize S3 performance with multipart upload and Transfer Acceleration
Design hybrid storage solutions with Storage Gateway
Compute Optimization:
Select appropriate EC2 instance family for different workload types
Configure placement groups for HPC and distributed applications
Optimize Lambda function memory and concurrency settings
Choose between ECS on EC2 vs Fargate based on requirements
Design Auto Scaling policies for predictable and variable workloads
Database Performance:
Select appropriate RDS instance type and storage configuration
Explain when to use Aurora vs RDS vs DynamoDB
Configure DynamoDB partition keys for even distribution
Implement caching with ElastiCache or DAX
Design read replica strategy for read-heavy workloads
Use RDS Proxy to reduce connection overhead
Network Performance:
Configure CloudFront for optimal caching and performance
Explain when to use Global Accelerator vs CloudFront
Design Direct Connect for hybrid connectivity
Select appropriate load balancer based on performance needs
Optimize VPC design for high-throughput applications
Data Ingestion:
Design Kinesis Data Streams architecture with appropriate shard count
Choose between Kinesis Streams and Firehose
Configure Glue ETL jobs for data transformation
Optimize Athena queries with partitioning and columnar formats
Select appropriate EMR instance types for big data processing
Performance Monitoring:
Configure CloudWatch metrics and alarms for performance monitoring
Use X-Ray for distributed tracing and bottleneck identification
Analyze RDS Performance Insights for database optimization
Implement VPC Flow Logs for network performance analysis
Use Compute Optimizer for right-sizing recommendations
Practice Questions
Try these from your practice test bundles:
Domain 3 Bundle 1: Questions 1-25 (Focus: Storage and compute)
Domain 3 Bundle 2: Questions 26-50 (Focus: Database and networking)
Full Practice Test 2: Domain 3 questions (Mixed difficulty)
Expected score: 70%+ to proceed confidently
If you scored below 70%:
Review EBS volume types and use cases
Focus on database selection criteria (RDS vs Aurora vs DynamoDB)
Study CloudFront vs Global Accelerator differences
Practice Lambda optimization techniques
Review Kinesis architecture and shard calculations
RDS Read Replicas: Read scaling (up to 15 replicas)
DynamoDB DAX: Microsecond caching
Performance Optimization Checklist:
Use gp3 instead of gp2 for EBS (20% cheaper)
Enable S3 Transfer Acceleration for global uploads
Implement CloudFront for static content delivery
Use ElastiCache/DAX for frequently accessed data
Configure RDS read replicas for read-heavy workloads
Use Provisioned Concurrency for Lambda (consistent latency)
Enable enhanced networking on EC2 instances
Use placement groups for low-latency HPC
Convert data to Parquet for analytics (10x compression)
Partition data by query patterns in Athena
Congratulations! You've completed Chapter 3: Design High-Performing Architectures. You now understand how to optimize storage, compute, database, network, and data ingestion for maximum performance on AWS.
Global Accelerator for global traffic optimization
VPC design for performance
Direct Connect for dedicated connectivity
Load balancer selection (ALB, NLB, GLB)
VPC endpoints for private connectivity
Enhanced networking for EC2
ā Task 3.5: High-Performing Data Ingestion and Transformation
Kinesis Data Streams for real-time streaming
Kinesis Firehose for data delivery
Glue for ETL and data cataloging
Athena for serverless SQL queries
EMR for big data processing
Lake Formation for data lakes
Data format optimization (Parquet, ORC)
Critical Takeaways
Choose the Right Storage: Match storage type to access patterns. Use gp3 for general purpose, io2 for high IOPS, S3 for object storage, EFS for shared file systems.
Right-Size Compute: Use Compute Optimizer recommendations. Choose instance families based on workload (C for compute, R for memory, I for storage).
Implement Caching Everywhere: Cache at edge (CloudFront), application (ElastiCache), and database (DAX, read replicas) layers.
Optimize Database Performance: Use Aurora for high performance, DynamoDB for single-digit millisecond latency, ElastiCache for sub-millisecond caching.
Use CDN for Global Performance: CloudFront reduces latency for global users. Use Origin Shield for additional caching layer.
Partition and Compress Data: Use Parquet format for analytics (10x compression). Partition data by query patterns in Athena.
Scale Horizontally: Add more instances rather than bigger instances. Use read replicas for read-heavy workloads.
Monitor Performance: Use CloudWatch for metrics, Performance Insights for databases, X-Ray for distributed tracing.
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Storage Performance:
Choose appropriate EBS volume type for workload (gp3, io2, st1, sc1)
Configure S3 Transfer Acceleration for global uploads
Implement S3 multipart upload for large files
Select EFS performance mode (General Purpose vs Max I/O)
Choose FSx file system type (Windows, Lustre, NetApp ONTAP)
Optimize S3 performance with prefixes and parallelization
Use Storage Gateway for hybrid storage scenarios
Configure DataSync for large-scale migrations
Compute Performance:
Select appropriate EC2 instance family (C, M, R, I, T, P, G)
Configure Lambda memory for optimal performance
Implement Lambda Provisioned Concurrency for consistent latency
Use EC2 placement groups for low-latency HPC
Configure Auto Scaling with appropriate policies
Choose between Fargate and EC2 launch type for containers
Use Batch for large-scale batch processing
Implement Compute Optimizer recommendations
Database Performance:
Choose between RDS and Aurora based on performance needs
Configure Aurora Serverless v2 for variable workloads
Implement DynamoDB DAX for microsecond caching
Design DynamoDB partition keys for even distribution
Use RDS Proxy for connection pooling
Configure read replicas for read-heavy workloads
Choose between ElastiCache Redis and Memcached
Optimize database queries with Performance Insights
Network Performance:
Configure CloudFront for edge caching
Use Global Accelerator for global traffic optimization
Choose appropriate load balancer (ALB, NLB, GLB)
Implement VPC endpoints for private connectivity
Configure Direct Connect for dedicated bandwidth
Use enhanced networking on EC2 instances
Optimize VPC design for performance
Implement CloudFront Origin Shield
Data Ingestion and Analytics:
Design streaming architecture with Kinesis Data Streams
Use Kinesis Firehose for data delivery to S3/Redshift
Configure Glue ETL jobs for data transformation
Optimize Athena queries with partitioning
Choose appropriate data format (Parquet, ORC, JSON)
ā Task 3.5: High-Performing Data Ingestion and Transformation
Kinesis Data Streams for real-time streaming
Kinesis Data Firehose for data delivery
Glue for ETL and data cataloging
Athena for serverless SQL queries
EMR for big data processing
Lake Formation for data lake management
QuickSight for data visualization
Data format optimization (Parquet, ORC)
Critical Takeaways
Choose the Right Storage: Use gp3 for general purpose (cheaper than gp2), io2 Block Express for high IOPS (>64,000), EFS for shared file systems, and FSx for specialized workloads.
Instance Selection Matters: Match instance type to workload - compute-optimized (C) for CPU-intensive, memory-optimized (R/X) for in-memory databases, storage-optimized (I/D) for high IOPS.
Cache Everything: Use CloudFront for static content, ElastiCache for application data, DAX for DynamoDB, and RDS read replicas for read-heavy workloads.
Serverless for Variable Workloads: Lambda and Fargate automatically scale. Use Provisioned Concurrency for Lambda when you need consistent low latency.
Database Performance: Use Aurora for high performance and scalability. Use DynamoDB for single-digit millisecond latency. Use ElastiCache to reduce database load.
Network Optimization: Use CloudFront to reduce latency globally. Use Direct Connect for consistent network performance. Use VPC endpoints to avoid internet gateway.
Data Format Matters: Convert to Parquet for analytics (10x compression). Partition data by query patterns in Athena. Use columnar formats for analytical workloads.
Monitoring is Essential: Use CloudWatch for metrics, X-Ray for distributed tracing, and Performance Insights for database performance.
Self-Assessment Checklist
Test yourself before moving on:
I can choose the right EBS volume type for different workloads
I understand when to use EFS vs FSx vs S3
I know how to optimize S3 performance with Transfer Acceleration
I can select the appropriate EC2 instance type for a workload
I understand Lambda memory and concurrency configuration
I know when to use ECS vs EKS vs Fargate
I can design a caching strategy with multiple layers
I understand the difference between Aurora and RDS
I know when to use DynamoDB vs RDS
I can configure read replicas for read scaling
I understand CloudFront caching behaviors
I know when to use Global Accelerator vs CloudFront
I can design a data ingestion pipeline with Kinesis
I understand data format optimization for analytics
Practice Questions
Try these from your practice test bundles:
Domain 3 Bundle 1: Questions 1-25 (Storage and compute performance)
Domain 3 Bundle 2: Questions 1-25 (Database and network performance)
Storage Services Bundle: Questions 1-30
Database Services Bundle: Questions 1-30
Compute Services Bundle: Questions 1-30
Expected score: 75%+ to proceed confidently
If you scored below 75%:
Review EBS volume types and their IOPS limits
Focus on understanding Lambda concurrency and memory configuration
Study database caching strategies (ElastiCache, DAX, read replicas)
Practice CloudFront caching and invalidation scenarios
Prerequisites: Chapters 1-3 (understanding of services before optimizing costs)
Exam Weight: 20% of exam questions (approximately 13 out of 65 questions)
Section 1: Cost-Optimized Storage Solutions
Introduction
The problem: Storage costs can spiral out of control without proper management. Storing infrequently accessed data in expensive storage, not using lifecycle policies, and paying for unnecessary data transfer all waste money.
The solution: AWS provides multiple storage classes with different price points. Understanding access patterns, implementing lifecycle policies, and optimizing data transfer enables significant cost savings without sacrificing availability or durability.
Why it's tested: Storage is often the largest AWS cost component. This domain represents 20% of the exam and tests your ability to optimize storage costs while meeting performance and availability requirements.
Core Concepts
S3 Storage Classes and Lifecycle Policies
What they are: S3 offers multiple storage classes optimized for different access patterns and durability requirements. Lifecycle policies automatically transition objects between storage classes based on age or access patterns.
Why they exist: Not all data needs the same level of access speed or durability. Frequently accessed data needs fast retrieval. Infrequently accessed data can tolerate slower retrieval for lower cost. Lifecycle policies automate cost optimization without manual intervention.
S3 Storage Classes:
S3 Standard - Frequent access:
Durability: 99.999999999% (11 9's)
Availability: 99.99%
Retrieval: Milliseconds
Cost: $0.023/GB-month (first 50 TB)
Use Case: Frequently accessed data, primary storage
S3 Intelligent-Tiering - Unknown/changing access:
Automatic: Moves objects between tiers based on access patterns
Tiers: Frequent (same as Standard), Infrequent (40% cheaper), Archive (68% cheaper), Deep Archive (95% cheaper)
Monitoring: $0.0025 per 1,000 objects
Cost: Same as Standard for frequent, cheaper for infrequent
Use Case: Unknown access patterns, automatic optimization
S3 Standard-IA - Infrequent access:
Durability: 99.999999999% (11 9's)
Availability: 99.9%
Retrieval: Milliseconds
Cost: $0.0125/GB-month (46% cheaper than Standard)
Retrieval Fee: $0.01/GB
Minimum: 30 days, 128 KB per object
Use Case: Backups, disaster recovery, infrequently accessed data
S3 One Zone-IA - Infrequent access, single AZ:
Durability: 99.999999999% (11 9's) within single AZ
Availability: 99.5%
Retrieval: Milliseconds
Cost: $0.01/GB-month (57% cheaper than Standard)
Retrieval Fee: $0.01/GB
Use Case: Reproducible data, secondary backups
S3 Glacier Instant Retrieval - Archive with instant access:
Durability: 99.999999999% (11 9's)
Availability: 99.9%
Retrieval: Milliseconds
Cost: $0.004/GB-month (83% cheaper than Standard)
Retrieval Fee: $0.03/GB
Minimum: 90 days, 128 KB per object
Use Case: Medical images, news archives (rarely accessed but need instant retrieval)
S3 Glacier Flexible Retrieval - Archive with flexible retrieval:
Glacier Deep Archive: 10 GB/year Ć $0.02 = $0.20/year
Total: ~$30/year (negligible compared to storage savings)
Section 2: Cost-Optimized Compute Solutions
Introduction
The problem: Running EC2 instances 24/7 at On-Demand prices is expensive. Many workloads don't need continuous availability or can tolerate interruptions. Not using Reserved Instances, Savings Plans, or Spot Instances wastes money.
The solution: AWS provides multiple pricing models for EC2. Understanding workload characteristics and commitment levels enables 50-90% cost savings without sacrificing performance.
Why it's tested: Compute is typically the second-largest AWS cost. This section tests your ability to select appropriate pricing models and optimize compute costs.
Core Concepts
EC2 Pricing Models
On-Demand - Pay by the hour/second:
Pricing: Standard hourly rate (e.g., $0.096/hour for m5.xlarge)
Commitment: None
Flexibility: Start/stop anytime
Use Case: Short-term, unpredictable workloads, testing
Reserved Instances - 1 or 3-year commitment:
Discount: 40-60% vs On-Demand
Payment: All Upfront, Partial Upfront, No Upfront
Types:
Standard RI: Highest discount (60%), no flexibility
Convertible RI: Lower discount (54%), can change instance family
Use Case: Steady-state workloads, predictable usage
Savings Plans - 1 or 3-year commitment:
Discount: Up to 72% vs On-Demand
Flexibility: Apply to any instance family, size, region, OS
Types:
Compute Savings Plans: Most flexible, 66% discount
EC2 Instance Savings Plans: Less flexible, 72% discount
Use Case: Flexible workloads, multiple instance types
Spot Instances - Bid on spare capacity:
Discount: Up to 90% vs On-Demand
Interruption: Can be terminated with 2-minute warning
Use Case: Fault-tolerant, flexible workloads (batch, big data, CI/CD)
Detailed Example 2: Compute Cost Optimization Strategy
Scenario: You're running a web application with the following workload:
Cost optimization strategies for different workloads
Critical Takeaways
S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes based on age. Can save 70-90% on storage costs for infrequently accessed data.
Storage Class Selection: Use Standard for frequent access, Standard-IA for infrequent access (>30 days), Glacier for archives (>90 days), Deep Archive for long-term retention (>180 days).
Savings Plans: Most flexible commitment option. Compute Savings Plans apply to any instance family/region. EC2 Instance Savings Plans offer higher discounts but less flexibility.
Reserved Instances: Good for predictable workloads with specific instance requirements. Standard RIs offer highest discount (60%) but no flexibility. Convertible RIs offer flexibility (54% discount).
Spot Instances: Up to 90% discount for fault-tolerant workloads. Must handle 2-minute interruption warnings. Best for batch processing, big data, CI/CD.
Cost Optimization Strategy: Use Savings Plans for baseline, On-Demand for variable peaks, Spot for fault-tolerant batch workloads. Can achieve 40-60% total cost reduction.
Intelligent-Tiering: Automatic cost optimization for unknown access patterns. Monitors access and moves objects between tiers. No retrieval fees, small monitoring fee.
Self-Assessment Checklist
Test yourself before moving on:
I understand all S3 storage classes and their use cases
I can design S3 lifecycle policies for cost optimization
I know the minimum storage durations for each storage class
I understand the difference between Savings Plans and Reserved Instances
I know when to use Spot Instances
I can handle Spot Instance interruptions
I understand how to optimize costs for different workload patterns
I can calculate cost savings for different pricing models
Practice Questions
Try these from your practice test bundles:
Domain 4 Bundle 1: Questions 1-25 (Storage and compute costs)
Savings Plans: Up to 72% discount, 1-3 year commitment
Reserved Instances: Up to 60% discount, 1-3 year commitment
On-Demand: No discount, no commitment
Decision Points:
Infrequent access (>30 days) ā Use S3 Standard-IA
Archive (>90 days) ā Use S3 Glacier
Long-term archive (>180 days) ā Use S3 Glacier Deep Archive
Unknown access pattern ā Use S3 Intelligent-Tiering
Steady-state workload ā Use Savings Plans or Reserved Instances
Fault-tolerant batch ā Use Spot Instances
Variable workload ā Use On-Demand
Next Chapter: 06_integration - Integration & Cross-Domain Scenarios
Section 3: Cost-Optimized Database Solutions
Introduction
The problem: Database costs can be significant, especially for high-throughput or large-storage workloads. Running oversized instances, not using serverless options, and paying for unused capacity waste money.
The solution: AWS provides multiple database pricing models and optimization strategies. Understanding workload patterns, using serverless options, and right-sizing instances enables significant cost savings.
Core Concepts
RDS Cost Optimization
RDS Pricing Factors:
Instance Type: db.t3 (burstable) vs db.m5 (general) vs db.r5 (memory)
Storage: gp2 vs gp3 vs io1 (IOPS costs)
Multi-AZ: Doubles instance cost (but necessary for production)
Backups: Automated backups (free up to DB size), manual snapshots (charged)
Data Transfer: Cross-region replication, read replica traffic
Cost Optimization Strategies:
1. Right-Size Instances:
Monitor CPU, memory, IOPS utilization
Downsize if consistently under 50% utilization
Use CloudWatch metrics and RDS Performance Insights
What it is: Aurora Serverless v2 is an on-demand, auto-scaling configuration for Amazon Aurora. It automatically scales database capacity based on application demand.
Why it exists: Traditional databases require provisioning fixed capacity. During low traffic, you pay for idle capacity. During spikes, you may not have enough capacity. Aurora Serverless eliminates this waste by scaling automatically.
How it works:
Define Capacity Range: Set minimum and maximum ACUs (Aurora Capacity Units)
Automatic Scaling: Aurora scales up/down in 0.5 ACU increments
Pay Per Second: Only pay for ACUs used per second
Instant Scaling: Scales in seconds (vs minutes for instance resizing)
Pricing:
ACU: $0.12 per ACU-hour (MySQL/PostgreSQL)
Storage: $0.10/GB-month
I/O: $0.20 per million requests
Detailed Example 4: Aurora Serverless Cost Comparison
Scenario: E-commerce database with variable traffic:
Wait, that's more expensive! Let's recalculate with realistic scaling:
Realistic Scenario (gradual scaling):
Baseline (6 hours/day): 2 ACUs
Ramp up (2 hours/day): 4 ACUs average
Normal (14 hours/day): 8 ACUs
Peak (2 hours/day): 16 ACUs average (not full 32)
Usage:
2 ACUs Ć 6 Ć 30 = 360 ACU-hours
4 ACUs Ć 2 Ć 30 = 240 ACU-hours
8 ACUs Ć 14 Ć 30 = 3,360 ACU-hours
16 ACUs Ć 2 Ć 30 = 960 ACU-hours
Total: 4,920 ACU-hours/month
Cost: 4,920 Ć $0.12 = $590/month
Comparison:
Provisioned: $423/month (fixed capacity)
Serverless: $590/month (variable capacity)
When Serverless Wins:
If traffic is more variable (long idle periods)
If peak is rare (< 10% of time)
If you want to avoid over-provisioning
When Provisioned Wins:
If traffic is consistent (> 50% at peak capacity)
If you can use Reserved Instances (40-60% discount)
If predictable workload
Section 4: Cost-Optimized Network Architectures
Introduction
The problem: Data transfer costs can be significant, especially for high-traffic applications. Cross-region transfers, NAT Gateway costs, and unnecessary data movement waste money.
The solution: Understanding data transfer pricing, using VPC endpoints, optimizing NAT Gateway usage, and leveraging CloudFront enables significant cost savings.
Core Concepts
Data Transfer Pricing
AWS Data Transfer Costs:
Inbound (to AWS):
Free: All data transfer into AWS from internet
Outbound (from AWS to internet):
First 10 TB/month: $0.09/GB
Next 40 TB/month: $0.085/GB
Next 100 TB/month: $0.07/GB
Over 150 TB/month: $0.05/GB
Inter-Region (between AWS regions):
Cost: $0.02/GB (both directions)
Intra-Region (within same region):
Same AZ: Free (if using private IP)
Different AZ: $0.01/GB (each direction)
VPC Peering:
Same Region: $0.01/GB
Different Region: $0.02/GB
NAT Gateway:
Hourly: $0.045/hour
Data Processed: $0.045/GB
Detailed Example 5: Network Cost Optimization
Scenario: Web application with:
EC2 instances: Private subnets, need internet access for updates
S3 access: Frequent reads/writes to S3
Data transfer: 10 TB/month to internet, 5 TB/month to S3
What they are: VPC endpoints enable private connections between your VPC and AWS services without using internet gateway, NAT device, VPN, or AWS Direct Connect.
Types:
Gateway Endpoints (Free):
Services: S3, DynamoDB
Cost: Free (no hourly or data charges)
Routing: Uses route table entries
Interface Endpoints (Paid):
Services: Most AWS services (EC2, SNS, SQS, etc.)
Cost: $0.01/hour per AZ + $0.01/GB data processed
Implementation: ENI in your subnet
When to Use:
ā High S3/DynamoDB traffic from private subnets
ā Want to avoid NAT Gateway data processing charges
ā Need private connectivity to AWS services
ā Security requirement (no internet access)
Chapter Summary
What We Covered
This chapter covered the "Design Cost-Optimized Architectures" domain, which represents 20% of the SAA-C03 exam. We explored four major areas:
ā Section 1: Cost-Optimized Storage Solutions
S3 storage classes and pricing
S3 lifecycle policies for automatic cost optimization
Cost comparison and use cases for each storage class
S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes. Can save 70-96% on storage costs for infrequently accessed data.
Storage Class Selection: Standard ($0.023/GB) for frequent access, Standard-IA ($0.0125/GB) for infrequent, Glacier ($0.004/GB) for archives, Deep Archive ($0.00099/GB) for long-term.
Compute Optimization: Use Savings Plans (66-72% discount) for baseline, On-Demand for variable peaks, Spot (90% discount) for fault-tolerant workloads.
Database Right-Sizing: Monitor utilization, downsize if under 50% CPU/memory. Switch to gp3 storage (20% cheaper than gp2). Use Reserved Instances for 40-60% discount.
Aurora Serverless: Best for variable workloads with long idle periods. Pay per ACU per second. Not always cheaper than provisioned for consistent workloads.
Network Optimization: Use VPC endpoints (free for S3/DynamoDB) to avoid NAT Gateway data processing charges ($0.045/GB). Use CloudFront to reduce data transfer costs.
Data Transfer: Inbound is free. Outbound starts at $0.09/GB. Cross-region is $0.02/GB. Cross-AZ is $0.01/GB. Optimize by keeping traffic within same AZ when possible.
Self-Assessment Checklist
Test yourself before moving on:
I understand all S3 storage classes and their pricing
I can design S3 lifecycle policies for cost optimization
I know when to use Reserved Instances vs Savings Plans
I understand Spot Instance use cases and limitations
I can right-size RDS instances based on utilization
I know when Aurora Serverless is cost-effective
I understand data transfer pricing (inbound, outbound, cross-region, cross-AZ)
I know how VPC endpoints reduce costs
I can calculate cost savings for different optimization strategies
Use RDS Reserved Instances for production databases
Add VPC endpoints for S3/DynamoDB
Use CloudFront for static content delivery
Delete old snapshots and unused resources
Monitor costs with AWS Cost Explorer
Next Chapter: 06_integration - Integration & Cross-Domain Scenarios
Section 2: Cost-Optimized Compute Solutions
Introduction
The problem: Compute is often the largest AWS cost after storage. Running instances 24/7 when only needed during business hours, using On-Demand pricing for predictable workloads, and over-provisioning instances all waste money.
The solution: AWS provides multiple pricing models (On-Demand, Reserved Instances, Savings Plans, Spot Instances) and instance types optimized for different workloads. Understanding usage patterns and selecting appropriate pricing models can reduce compute costs by 50-90%.
Why it's tested: Compute cost optimization is critical for AWS cost management. This section tests your ability to select appropriate pricing models and instance types for different workload patterns.
Core Concepts
EC2 Pricing Models
What they are: AWS offers four pricing models for EC2 instances, each optimized for different usage patterns and commitment levels.
Why they exist: Different workloads have different characteristics. Production workloads run 24/7 and benefit from commitment discounts. Development workloads run during business hours and benefit from flexible pricing. Batch jobs tolerate interruptions and benefit from spot pricing.
EC2 Pricing Models Comparison:
Pricing Model
Discount
Commitment
Flexibility
Interruption
Use Case
On-Demand
0%
None
Full
No
Variable workloads, short-term
Reserved Instances
Up to 72%
1 or 3 years
Limited
No
Steady-state workloads
Savings Plans
Up to 72%
1 or 3 years
High
No
Flexible compute usage
Spot Instances
Up to 90%
None
Full
Yes (2-min warning)
Fault-tolerant workloads
Detailed Example 1: Production Web Application (Reserved Instances)
Scenario: You run a web application on 10 Ć m5.large instances (2 vCPUs, 8 GB RAM each) 24/7 for production. Application has been stable for 2 years and will continue for 3+ years.
Option 1: On-Demand Pricing:
Cost per instance: $0.096/hour
Total cost: 10 instances Ć $0.096/hour Ć 24 hours Ć 365 days = $8,410/year
3-year cost: $25,230
Option 2: 1-Year Standard Reserved Instance (All Upfront):
Upfront cost: $561 per instance
Hourly cost: $0 (paid upfront)
Total cost: 10 instances Ć $561 = $5,610/year
Savings: $2,800/year (33% discount)
3-year cost: $16,830 (need to renew each year)
Option 3: 3-Year Standard Reserved Instance (All Upfront):
Upfront cost: $1,424 per instance
Hourly cost: $0 (paid upfront)
Total cost: 10 instances Ć $1,424 = $14,240 for 3 years
Inflexibility: Can't easily shift between workloads
Option 3: Compute Savings Plan (Recommended):
Commitment: $30/day ($900/month, $10,800/year)
Discount: 40% on committed amount
Savings: $4,320/year (30% overall savings)
Flexibility: Applies to any instance family, size, region, OS
Overage: $10/day charged at On-Demand rates
How Savings Plans Work:
Commit to $30/day of compute usage
First $30/day gets 40% discount ($18/day actual cost)
Usage above $30/day charged at On-Demand rates
Commitment applies to any EC2, Fargate, or Lambda usage
Detailed Example 3: Batch Processing (Spot Instances)
Scenario: You run nightly batch jobs processing 1,000 files. Each file takes 10 minutes to process. Jobs can be interrupted and restarted without data loss.
Option 1: On-Demand Instances:
Instance: c5.4xlarge (16 vCPUs, 32 GB RAM)
Cost: $0.68/hour
Processing: 6 files/hour (10 min each)
Time: 1,000 files Ć· 6 = 167 hours
Total cost: 167 hours Ć $0.68 = $113.56/day
Option 2: Spot Instances (Recommended):
Instance: c5.4xlarge
Spot price: $0.068/hour (90% discount)
Processing: 6 files/hour
Time: 1,000 files Ć· 6 = 167 hours (may take longer due to interruptions)
Result: Reduces interruption frequency (more capacity pools)
š EC2 Pricing Model Selection Diagram:
graph TD
A[Select EC2 Pricing Model] --> B{Workload Characteristics?}
B -->|Steady-State 24/7| C{Commitment Length?}
C -->|3 Years| D[3-Year Reserved Instance<br/>44% discount]
C -->|1 Year| E[1-Year Reserved Instance<br/>33% discount]
C -->|Flexible| F[Compute Savings Plan<br/>40% discount]
B -->|Variable Usage| G{Need Flexibility?}
G -->|Yes| H[Compute Savings Plan<br/>Applies to any instance]
G -->|No| I[On-Demand<br/>No commitment]
B -->|Fault-Tolerant| J[Spot Instances<br/>Up to 90% discount]
B -->|Short-Term| K[On-Demand<br/>No commitment]
style D fill:#c8e6c9
style E fill:#c8e6c9
style F fill:#fff3e0
style H fill:#fff3e0
style J fill:#e1f5fe
Diagram Explanation: This decision tree helps select the appropriate EC2 pricing model based on workload characteristics. For steady-state 24/7 workloads, use Reserved Instances (3-year for maximum savings, 1-year for shorter commitment) or Compute Savings Plans for flexibility. For variable usage, use Compute Savings Plans if you need flexibility across instance types, or On-Demand if you need no commitment. For fault-tolerant workloads that can handle interruptions, use Spot Instances for up to 90% discount. For short-term or unpredictable workloads, use On-Demand pricing.
ā Must Know (EC2 Cost Optimization):
Reserved Instances provide up to 72% discount for 1-3 year commitments
Savings Plans provide similar discounts with more flexibility (any instance type/region)
Spot Instances provide up to 90% discount but can be interrupted with 2-minute notice
Use Spot for fault-tolerant workloads (batch processing, data analysis, CI/CD)
Compute Optimizer provides right-sizing recommendations based on actual usage
Graviton instances (ARM-based) provide 20-40% better price/performance
Use Auto Scaling to match capacity to demand (avoid over-provisioning)
Stop instances when not needed (dev/test environments during off-hours)
AWS Lambda Cost Optimization
What it is: Lambda charges based on number of requests and duration (GB-seconds). Optimizing memory allocation and execution time directly reduces costs.
Why it matters: Lambda costs can add up quickly with millions of invocations. Understanding the relationship between memory, CPU, and execution time enables cost optimization.
Lambda Pricing:
Requests: $0.20 per 1 million requests
Duration: $0.0000166667 per GB-second
Free Tier: 1 million requests + 400,000 GB-seconds per month
Detailed Example: Lambda Memory Optimization
Scenario: You have a Lambda function that processes images (CPU-intensive). Function runs 10 million times per month.
Recommendation: Use minimum memory for I/O-bound workloads
Section 3: Cost-Optimized Database Solutions
Introduction
The problem: Database costs can be significant, especially for production workloads running 24/7. Over-provisioned instances, expensive storage, and inefficient capacity modes waste money.
The solution: AWS provides multiple database pricing models (On-Demand, Reserved Instances, Serverless) and storage options. Understanding workload patterns and selecting appropriate pricing models can reduce database costs by 40-70%.
Why it's tested: Database cost optimization is critical for overall AWS cost management. This section tests your ability to select appropriate database services and pricing models.
Core Concepts
RDS Cost Optimization
What it is: RDS offers Reserved Instances for 1-3 year commitments, providing significant discounts over On-Demand pricing.
RDS Reserved Instance Discounts:
1-Year Standard RI: Up to 40% discount
3-Year Standard RI: Up to 60% discount
Payment options: All Upfront, Partial Upfront, No Upfront
Detailed Example: Production Database
Scenario: You run a PostgreSQL database on db.r5.2xlarge (8 vCPUs, 64 GB RAM) 24/7 for production.
Option 1: On-Demand:
Cost: $1.008/hour
Annual cost: $1.008 Ć 24 Ć 365 = $8,830/year
Option 2: 1-Year Reserved Instance (All Upfront):
Upfront cost: $5,300
Hourly cost: $0
Annual cost: $5,300
Savings: $3,530/year (40% discount)
Option 3: 3-Year Reserved Instance (All Upfront):
Upfront cost: $12,700 (for 3 years)
Hourly cost: $0
Annual equivalent: $4,233/year
Savings: $4,597/year (52% discount)
Aurora Serverless Cost Optimization
What it is: Aurora Serverless automatically scales database capacity based on application demand. You pay only for the capacity used (measured in Aurora Capacity Units - ACUs).
Why it exists: Traditional databases require provisioning fixed capacity, resulting in over-provisioning for peak load. Aurora Serverless scales automatically, reducing costs for variable workloads.
Aurora Serverless v2 Pricing:
ACU: Aurora Capacity Unit (2 GB RAM, equivalent CPU/network)
Cost: $0.12 per ACU-hour
Scaling: 0.5 ACU minimum, 128 ACU maximum
Scaling speed: Instant (sub-second)
Detailed Example: Development Database
Scenario: You have a development database used during business hours (8 AM - 6 PM, Monday-Friday). Peak usage requires 8 ACUs, idle usage requires 0.5 ACUs.
Option 1: RDS db.r5.large (Provisioned):
Capacity: 2 vCPUs, 16 GB RAM (always running)
Cost: $0.252/hour Ć 24 hours Ć 365 days = $2,207/year
Utilization: 25% (only used 50 hours/week out of 168 hours)
Option 2: Aurora Serverless v2 (Recommended):
Business hours (50 hours/week): 8 ACUs Ć $0.12 = $0.96/hour
Off hours (118 hours/week): 0.5 ACUs Ć $0.12 = $0.06/hour
Rule of thumb: If traffic is predictable and >1M requests/month, use Provisioned
ā Must Know (Database Cost Optimization):
Use RDS Reserved Instances for production databases (40-60% discount)
Use Aurora Serverless for unpredictable or infrequent workloads
Stop RDS instances when not needed (dev/test environments)
Use DynamoDB Provisioned Capacity for predictable traffic (80% cheaper)
Use DynamoDB On-Demand for unpredictable traffic (no capacity planning)
Use read replicas to offload read traffic (cheaper than scaling primary)
Use Aurora for high-traffic applications (better price/performance than RDS)
Delete old database snapshots (storage costs add up)
Section 4: Cost-Optimized Network Architectures
Introduction
The problem: Data transfer costs can be significant, especially for applications with high traffic or multi-region architectures. Inefficient routing, unnecessary data transfer, and not using VPC endpoints waste money.
The solution: AWS provides multiple networking options to optimize costs. VPC endpoints eliminate data transfer charges for AWS services. CloudFront reduces origin requests. Proper network design minimizes cross-region and cross-AZ data transfer.
Why it's tested: Network costs are often overlooked but can be substantial. This section tests your ability to design cost-optimized network architectures.
Core Concepts
Data Transfer Costs
What they are: AWS charges for data transfer between regions, between AZs, and out to the internet. Understanding these costs is critical for cost optimization.
Data Transfer Pricing (simplified):
Inbound to AWS: Free
Within same AZ (private IP): Free
Between AZs (same region): $0.01/GB each direction
Between regions: $0.02/GB
Out to internet: $0.09/GB (first 10 TB)
Detailed Example 1: Multi-AZ Application
Scenario: You have a web application with EC2 instances in multiple AZs for high availability. Application transfers 1 TB/day between AZs.
Use private IPs: Ensure instances communicate via private IPs (not public)
Minimize cross-AZ traffic: Cache data locally, use read replicas in same AZ
Result: Reduce cross-AZ traffic by 80% = $2,949/year savings
VPC Endpoints Cost Optimization
What they are: VPC endpoints enable private connectivity to AWS services without using internet gateway, NAT gateway, or VPN. This eliminates data transfer charges and improves security.
VPC Endpoint Types:
Gateway Endpoints: Free (S3, DynamoDB)
Interface Endpoints: $0.01/hour per AZ + $0.01/GB data processed
Detailed Example: S3 Access from EC2
Scenario: You have 100 EC2 instances accessing S3. Each instance downloads 10 GB/day from S3.
Option 1: NAT Gateway (Without VPC Endpoint):
Data transfer: 100 instances Ć 10 GB/day = 1,000 GB/day
Data transfer costs (cross-AZ, cross-region, internet)
VPC endpoints to eliminate NAT Gateway costs
CloudFront for global content delivery
Network design to minimize data transfer
Critical Takeaways
S3 Lifecycle: Transition infrequently accessed data to cheaper storage classes (Standard-IA, Glacier). Use Intelligent-Tiering for unknown access patterns.
EC2 Pricing: Use Reserved Instances or Savings Plans for steady-state workloads (40-72% discount). Use Spot for fault-tolerant workloads (up to 90% discount).
Right-Sizing: Use Compute Optimizer to identify over-provisioned instances. Target 70-80% utilization. Stop instances when not needed.
Database Optimization: Use RDS Reserved Instances for production databases. Use Aurora Serverless for variable workloads. Use DynamoDB Provisioned Capacity for predictable traffic.
VPC Endpoints: Always use Gateway Endpoints for S3 and DynamoDB (free). Eliminates NAT Gateway costs and improves security.
Data Transfer: Minimize cross-AZ and cross-region data transfer. Use private IPs within same AZ (free). Use CloudFront for global content delivery.
Cost Monitoring: Use AWS Cost Explorer to identify cost trends. Set up billing alerts. Use cost allocation tags to track costs by project/team.
Quick Wins: Switch EBS from gp2 to gp3 (20% cheaper). Delete old snapshots. Use S3 lifecycle policies. Add VPC endpoints for S3/DynamoDB.
Self-Assessment Checklist
Test yourself before moving on:
I understand S3 storage classes and when to use each
I know how to create S3 lifecycle policies
I can explain the difference between Reserved Instances and Savings Plans
I understand when to use Spot Instances
I know how Lambda memory affects cost
I can calculate cost savings for different EC2 pricing models
I understand when to use Aurora Serverless vs RDS
I know the difference between DynamoDB On-Demand and Provisioned
I understand data transfer costs (cross-AZ, cross-region, internet)
I know when to use VPC endpoints
I can explain how CloudFront reduces costs
I understand cost optimization strategies for each service
Practice Questions
Try these from your practice test bundles:
Domain 4 Bundle 1: Questions 1-25 (Storage and compute)
Domain 4 Bundle 2: Questions 26-50 (Database and network)
Full Practice Test 1: Questions 54-65 (Domain 4 questions)
Expected score: 75%+ to proceed confidently
If you scored below 75%:
Review sections: Focus on areas where you missed questions
Key topics to strengthen:
S3 storage class selection criteria
EC2 pricing model comparison
Reserved Instance vs Savings Plan differences
Spot Instance use cases
Database pricing optimization
Data transfer cost minimization
Quick Reference Card
S3 Storage Classes (by cost):
Deep Archive: $0.00099/GB-month (96% cheaper, 12-48 hour retrieval)
ā Task 4.4 - Cost-Optimized Network Architectures: Data transfer costs, NAT Gateway optimization, VPC endpoints, CloudFront cost savings, Direct Connect vs. VPN
Critical Takeaways
Storage Lifecycle Management Saves Money: Implement S3 lifecycle policies to automatically transition objects to cheaper storage classes (S3-IA, Glacier, Deep Archive) based on access patterns.
Compute Pricing Models Matter: Use Reserved Instances or Savings Plans for steady-state workloads (up to 72% savings), Spot Instances for fault-tolerant workloads (up to 90% savings), and On-Demand for unpredictable workloads.
Right-Sizing is Continuous: Use AWS Compute Optimizer and Cost Explorer to identify underutilized resources. Downsize or terminate idle resources regularly.
Data Transfer Costs Add Up: Keep data within the same Region when possible, use VPC endpoints to avoid internet data transfer charges, and leverage CloudFront for content delivery.
Serverless Can Be Cost-Effective: Lambda charges only for execution time, Aurora Serverless scales to zero when not in use, and DynamoDB On-Demand eliminates capacity planning.
Monitoring and Budgets Prevent Surprises: Set up AWS Budgets with alerts, use Cost Allocation Tags for granular tracking, and review Cost Explorer regularly.
Reserved Capacity Requires Planning: Commit to 1-year or 3-year terms for Reserved Instances, Savings Plans, or Reserved Capacity only after analyzing usage patterns.
Self-Assessment Checklist
Test yourself before moving to integration topics. You should be able to:
Cost-Optimized Storage:
Design S3 lifecycle policies to transition objects between storage classes
Choose appropriate S3 storage class based on access frequency and retrieval time
Optimize EBS volumes by selecting appropriate volume types (gp3 vs. gp2)
Implement EBS snapshot lifecycle policies to reduce backup costs
Use S3 Intelligent-Tiering for unpredictable access patterns
Calculate data transfer costs between Regions and to internet
Implement S3 Requester Pays for cost sharing
Cost-Optimized Compute:
Choose between On-Demand, Reserved Instances, Savings Plans, and Spot Instances
Calculate savings from Reserved Instances (Standard vs. Convertible)
Implement Spot Instances for fault-tolerant workloads
Use Auto Scaling to match capacity with demand
Right-size EC2 instances using Compute Optimizer recommendations
Optimize Lambda costs by adjusting memory and timeout settings
Choose between EC2 and Fargate based on cost and operational overhead
Cost-Optimized Databases:
Purchase RDS Reserved Instances for steady-state workloads
Use Aurora Serverless for variable workloads
Choose between DynamoDB On-Demand and Provisioned capacity
Implement caching with ElastiCache to reduce database load
Optimize backup retention periods to balance cost and compliance
Use read replicas to offload read traffic from primary database
Configure database auto-scaling to match demand
Cost-Optimized Networks:
Minimize data transfer costs by keeping traffic within same Region
Use VPC endpoints to avoid NAT Gateway and internet data transfer charges
Choose between NAT Gateway and NAT instance based on cost
Implement CloudFront to reduce origin data transfer costs
Calculate Direct Connect vs. VPN costs for hybrid connectivity
Optimize load balancer costs by choosing appropriate type (ALB vs. NLB)
Use Transit Gateway for hub-and-spoke network topology
Storage Lifecycle: Use S3 Intelligent-Tiering for automatic cost optimization, transition to Glacier for archives (90% cheaper), use gp3 instead of gp2 (20% cheaper)
Compute Savings: Reserved Instances save 40-60%, Spot Instances save 70-90%, Savings Plans offer flexibility, right-size instances to avoid over-provisioning
Database Cost Control: Aurora Serverless v2 for variable workloads, DynamoDB On-Demand for unpredictable traffic, Reserved capacity for steady-state, use read replicas instead of larger instances
Network Cost Reduction: Use VPC Endpoints to avoid NAT Gateway charges ($0.045/GB), CloudFront to reduce data transfer costs, keep traffic within same AZ when possible
Cost Monitoring: Use Cost Explorer for analysis, AWS Budgets for alerts, Cost Allocation Tags for tracking, Trusted Advisor for recommendations
Self-Assessment Checklist
Test yourself before moving on:
I can explain S3 storage classes and when to use each
I understand EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
I know how to optimize EBS costs (gp3, right-sizing, snapshots)
I can calculate savings from Reserved Instances vs On-Demand
I understand when to use Spot Instances and how to handle interruptions
I know the difference between Compute Savings Plans and EC2 Savings Plans
I can explain DynamoDB capacity modes and cost implications
I understand data transfer costs and how to minimize them
I know when to use NAT Gateway vs NAT Instance
I can design a cost-optimized architecture using multiple strategies
Practice Questions
Try these from your practice test bundles:
Domain 4 Bundle 1: Questions 1-25 (Storage and compute costs)
Domain 4 Bundle 2: Questions 1-25 (Database and network costs)
Expected score: 75%+ to proceed
If you scored below 75%:
Review sections: S3 lifecycle policies, EC2 pricing models, Data transfer costs
Focus on: Understanding cost implications of architectural decisions
Right-sizing is the #1 cost saver: Use Compute Optimizer to identify over-provisioned resources. Downsize instances that are consistently under 40% utilization.
Reserved capacity for steady workloads: 40-60% savings with Reserved Instances or Savings Plans. Commit to 1 or 3 years for predictable workloads.
Spot Instances for fault-tolerant workloads: 70-90% savings for batch processing, data analysis, containerized workloads. Not for databases or stateful applications.
S3 lifecycle policies automate cost savings: Transition to IA after 30 days, Glacier after 90 days, Deep Archive after 180 days. Delete after retention period.
Serverless reduces idle costs: Lambda and Fargate charge only for actual usage. No cost when idle. Perfect for variable or unpredictable workloads.
Data transfer costs add up: Keep traffic within same AZ when possible ($0 vs $0.01/GB). Use VPC endpoints to avoid NAT Gateway charges. Use CloudFront to reduce origin data transfer.
Delete unused resources: Unattached EBS volumes, old snapshots, unused load balancers, idle RDS instances. Set up AWS Budgets alerts to catch waste.
Aurora Serverless for variable databases: Pay per second, auto-scales, pauses when idle. Perfect for dev/test, infrequent workloads, unpredictable traffic.
DynamoDB on-demand for unpredictable traffic: No capacity planning, pay per request. Switch to provisioned when traffic becomes predictable for 20-30% savings.
Monitor and optimize continuously: Use Cost Explorer to identify trends, Trusted Advisor for recommendations, AWS Budgets for alerts. Cost optimization is ongoing.
Key Services Quick Reference
Cost Management Tools:
Cost Explorer: Visualize and analyze costs, identify trends, forecast spending
AWS Budgets: Set custom budgets, receive alerts when exceeding thresholds
Cost and Usage Report: Detailed billing data, integrate with Athena/QuickSight
Compute Optimizer: ML-based recommendations for right-sizing EC2, Lambda, EBS
Trusted Advisor: Best practice checks, cost optimization recommendations
Cost Allocation Tags: Track costs by project, team, environment
Focus on S3 storage class selection and lifecycle policies
Study data transfer costs and optimization strategies
Practice cost monitoring tool selection
If you scored 70-80%:
Review advanced topics: Savings Plans vs Reserved Instances
Study database cost optimization strategies
Practice network cost optimization
Focus on cost allocation and tagging strategies
If you scored 80%+:
Excellent! You've completed all four domains
Continue practicing with full practice tests
Review integration scenarios in the next chapter
Congratulations! You've completed all four exam domains (100% of exam content). You're now ready to practice integration scenarios and prepare for the exam.
Next Steps: Proceed to 06_integration to learn about cross-domain integration scenarios and advanced topics.
Chapter Summary
What We Covered
This chapter explored designing cost-optimized architectures on AWS, representing 20% of the SAA-C03 exam. We covered four major task areas:
Task 4.1: Design Cost-Optimized Storage Solutions
ā S3 storage classes and lifecycle policies
ā S3 Intelligent-Tiering for automatic cost optimization
ā Glacier and Glacier Deep Archive for long-term archival
Congratulations! You've completed Domain 4: Design Cost-Optimized Architectures. Cost optimization (20% of the exam) is critical for real-world AWS deployments, and understanding pricing models and optimization strategies will help you design cost-effective solutions.
Next Chapter: 06_integration - Integration & Advanced Topics
Chapter Summary
What We Covered
This chapter covered the four major task areas of Domain 4: Design Cost-Optimized Architectures (20% of exam):
Task 4.1: Design Cost-Optimized Storage Solutions
ā S3 storage classes and lifecycle policies
ā S3 Intelligent-Tiering for automatic optimization
ā Glacier and Glacier Deep Archive for long-term archival
ā EBS volume optimization (gp3 vs gp2, right-sizing)
ā Data transfer pricing (inter-AZ, inter-region, internet)
ā NAT Gateway vs NAT Instance cost comparison
ā VPC endpoints to eliminate data transfer costs
ā CloudFront for reduced origin transfer costs
ā Direct Connect vs VPN cost analysis
ā Load balancer cost optimization
ā Network cost monitoring and allocation
Critical Takeaways
Commitment Saves Money: Reserved Instances and Savings Plans offer up to 72% savings for predictable workloads. Commit for 1-3 years based on usage patterns.
Spot for Fault-Tolerant: Use Spot Instances for batch processing, big data, and containerized workloads. Save up to 90% compared to On-Demand.
Storage Lifecycle Management: Implement S3 lifecycle policies to automatically transition objects to cheaper storage classes. Use Intelligent-Tiering for unknown access patterns.
Right-Size Everything: Use Compute Optimizer, Trusted Advisor, and CloudWatch metrics to identify oversized resources. Downsize or use burstable instances.
Eliminate Data Transfer: Use VPC endpoints for AWS service access to avoid data transfer charges. Use CloudFront to reduce origin transfer costs.
Serverless for Variable Workloads: Aurora Serverless, Lambda, and DynamoDB On-Demand automatically scale and you pay only for what you use.
Monitor and Alert: Set up AWS Budgets with alerts, use Cost Explorer to identify trends, and implement cost allocation tags for accountability.
Delete Unused Resources: Regularly audit and delete unattached EBS volumes, old snapshots, unused Elastic IPs, and idle load balancers.
Self-Assessment Checklist
Test yourself before moving on. Can you:
Storage Cost Optimization
Choose the appropriate S3 storage class for access patterns?
Implement S3 lifecycle policies for automatic transitions?
Use S3 Intelligent-Tiering for unknown access patterns?
Select the right EBS volume type for cost vs performance?
Implement EFS lifecycle management for cost savings?
Optimize data transfer costs with VPC endpoints?
Compute Cost Optimization
Explain the difference between Reserved Instances and Savings Plans?
Choose between Standard and Convertible Reserved Instances?
Identify workloads suitable for Spot Instances?
Optimize Lambda costs with appropriate memory settings?
Use Fargate Spot for container cost savings?
Implement Auto Scaling for right-sizing?
Use Compute Optimizer for recommendations?
Database Cost Optimization
Choose between RDS and Aurora based on cost?
Use Aurora Serverless for variable workloads?
Select DynamoDB On-Demand vs Provisioned capacity?
Purchase DynamoDB Reserved Capacity for predictable workloads?
Optimize database storage and backup retention?
Use read replicas vs caching for cost efficiency?
Network Cost Optimization
Understand data transfer pricing between AZs and regions?
Choose between NAT Gateway and NAT Instance?
Use VPC endpoints to eliminate data transfer costs?
Implement CloudFront to reduce origin transfer costs?
Choose between Direct Connect and VPN based on cost?
Optimize load balancer costs?
Cost Monitoring
Set up AWS Budgets with alerts?
Use Cost Explorer to analyze spending trends?
Implement cost allocation tags?
Use Trusted Advisor for cost optimization recommendations?
Analyze Cost and Usage Reports?
Practice Questions
Try these from your practice test bundles:
Beginner Level (Build Confidence):
Domain 4 Bundle 1: Questions 1-20
Expected score: 70%+ to proceed
Intermediate Level (Test Understanding):
Domain 4 Bundle 2: Questions 1-20
Full Practice Test 1: Domain 4 questions
Expected score: 75%+ to proceed
Advanced Level (Challenge Yourself):
Full Practice Test 3: Domain 4 questions
Expected score: 70%+ to proceed
If you scored below target:
Below 60%: Review pricing models and storage classes
60-70%: Focus on Reserved Instances and Savings Plans
Data transfer pricing (same AZ, cross-AZ, cross-region, internet)
NAT Gateway vs NAT instance cost comparison
VPC endpoints to eliminate data transfer costs
CloudFront for reduced origin costs
Direct Connect vs VPN cost analysis
Load balancer cost optimization
Critical Takeaways
Reserved Capacity: Use Reserved Instances or Savings Plans for predictable workloads (up to 72% savings over On-Demand).
Spot Instances: Use Spot for fault-tolerant batch processing, data analysis, and containerized workloads (up to 90% savings).
S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes (Standard ā IA ā Glacier ā Deep Archive) based on access patterns.
Right-Sizing: Use Compute Optimizer and Cost Explorer to identify oversized resources and right-size them.
Data Transfer Optimization: Use VPC endpoints to eliminate data transfer costs to S3/DynamoDB, CloudFront to reduce origin costs.
Serverless for Variable Workloads: Use Lambda, Aurora Serverless, or DynamoDB On-Demand for unpredictable workloads to pay only for what you use.
Cost Monitoring: Enable cost allocation tags, set up AWS Budgets with alerts, use Cost Explorer for analysis.
Delete Unused Resources: Regularly delete unattached EBS volumes, old snapshots, unused Elastic IPs, and idle resources.
Self-Assessment Checklist
Test yourself before moving on:
I understand the difference between Reserved Instances and Savings Plans
I know when to use Spot Instances vs On-Demand
I can design S3 lifecycle policies for cost optimization
I understand data transfer pricing between AZs and regions
I know how to use VPC endpoints to reduce costs
I can select the right database pricing model for a workload
I understand NAT Gateway vs NAT instance cost trade-offs
I know how to use Cost Explorer and AWS Budgets
I can identify cost optimization opportunities in an architecture
I understand the cost implications of different design choices
ā Network Cost Optimization: Data transfer costs, NAT Gateway alternatives, VPC endpoints, and CloudFront
ā Cost Monitoring: Cost Explorer, Budgets, Cost and Usage Reports, and cost allocation tags
ā Cost Management: Right-sizing, resource cleanup, and continuous optimization
Critical Takeaways
Use the Right Pricing Model: Reserved Instances and Savings Plans for predictable workloads (72% savings), Spot for fault-tolerant batch (90% savings), On-Demand for variable
Optimize Storage Lifecycle: Use S3 Intelligent-Tiering for unknown patterns, transition to IA after 30 days, archive to Glacier for long-term retention
Minimize Data Transfer: Use VPC endpoints to eliminate internet transfer costs, CloudFront to reduce origin costs, same-region transfers when possible
Right-Size Resources: Use Compute Optimizer recommendations, delete unused resources (unattached volumes, old snapshots), and match instance types to workload
Leverage Serverless: Use Lambda, Fargate, Aurora Serverless, and DynamoDB On-Demand for variable workloads to pay only for actual usage
Monitor and Alert: Set up Cost Explorer for analysis, Budgets for alerts, and cost allocation tags for tracking spending by project/team
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Storage Cost Optimization:
Design S3 lifecycle policies to transition objects between storage classes
Choose appropriate S3 storage class based on access patterns
Calculate cost savings from S3 Intelligent-Tiering
Optimize EBS volumes (gp3 vs gp2, delete unattached volumes)
Implement data transfer optimization strategies
Compute Cost Optimization:
Compare Reserved Instances, Savings Plans, and Spot Instances
Calculate cost savings from different pricing models
Design Spot Fleet strategies for fault-tolerant workloads
This chapter covered the essential concepts for designing cost-optimized architectures on AWS, which accounts for 20% of the SAA-C03 exam. We explored four major task areas:
Task 4.1: Cost-Optimized Storage Solutions
ā S3 storage classes and lifecycle policies
ā S3 Intelligent-Tiering for automatic cost optimization
ā Glacier and Glacier Deep Archive for long-term archival
ā EBS volume types and cost optimization strategies
ā Compute Savings Plans vs EC2 Instance Savings Plans
ā Spot Instances and Spot Fleet strategies
ā Lambda pricing and cost optimization
ā Fargate pricing and Fargate Spot
ā Auto Scaling for cost efficiency
ā EC2 right-sizing and Compute Optimizer
Task 4.3: Cost-Optimized Database Solutions
ā RDS pricing models and Reserved Instances
ā Aurora Serverless for variable workloads
ā DynamoDB On-Demand vs Provisioned capacity
ā DynamoDB Reserved Capacity
ā ElastiCache Reserved Nodes
ā Database backup and snapshot costs
ā Read replica cost considerations
ā Database migration cost optimization
Task 4.4: Cost-Optimized Network Architectures
ā Data transfer pricing and optimization
ā NAT Gateway vs NAT Instance cost comparison
ā VPC endpoints for eliminating data transfer costs
ā PrivateLink cost considerations
ā CloudFront for reducing origin costs
ā Direct Connect vs VPN cost analysis
ā Load balancer cost optimization
ā Transit Gateway and VPC peering costs
Critical Takeaways
Compute Pricing Models: On-Demand (flexibility), Reserved Instances (up to 72% savings), Spot (up to 90% savings), Savings Plans (flexible commitment).
Reserved Instances: Standard RI (highest discount, no flexibility), Convertible RI (lower discount, can change instance family), 1-year or 3-year terms.
Savings Plans: Compute Savings Plans (most flexible, any instance family/region), EC2 Instance Savings Plans (higher discount, specific family/region).
Spot Instances: Up to 90% discount, 2-minute interruption notice, best for fault-tolerant batch processing, not for databases or stateful apps.
S3 Storage Classes: Standard ($0.023/GB), Standard-IA ($0.0125/GB, 30-day minimum), One Zone-IA ($0.01/GB, single AZ), Glacier ($0.004/GB, 90-day minimum), Glacier Deep Archive ($0.00099/GB, 180-day minimum).
S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes based on age (e.g., Standard ā Standard-IA after 30 days ā Glacier after 90 days).
S3 Intelligent-Tiering: Automatic cost optimization for unknown access patterns, $0.0025/1,000 objects monitoring fee, no retrieval fees.
EBS Cost Optimization: Use gp3 instead of gp2 (20% cheaper), delete unattached volumes, delete old snapshots, use st1/sc1 for throughput-intensive workloads.
DynamoDB Pricing: On-Demand ($1.25/million writes, $0.25/million reads) for unpredictable, Provisioned ($0.00065/WCU-hour, $0.00013/RCU-hour) for steady-state.
Aurora Serverless: Pay per ACU-hour ($0.06/ACU-hour), auto-scales from 0.5 to 128 ACUs, ideal for variable workloads, can pause when idle.
Data Transfer Costs: Free inbound, $0.09/GB outbound to internet, $0.01/GB between regions, $0.01/GB between AZs, free within same AZ.
VPC Endpoints: Gateway endpoints (S3, DynamoDB) are free, Interface endpoints cost $0.01/hour + $0.01/GB, eliminate data transfer costs to AWS services.
NAT Gateway: $0.045/hour + $0.045/GB processed, NAT instance can be cheaper for low traffic but requires management.
CloudFront Cost Savings: Reduces origin data transfer costs by 60-90%, caches at edge locations, $0.085/GB (cheaper than S3 direct access for global users).
Cost Monitoring: Use Cost Explorer for analysis, Budgets for alerts, Cost Allocation Tags for tracking, Cost and Usage Report for detailed billing.
Right-Sizing: Use Compute Optimizer for recommendations, can save 20-40% by downsizing over-provisioned instances.
Unused Resources: Delete unattached EBS volumes, old snapshots, unused Elastic IPs, idle load balancers, stopped instances (still charged for EBS).
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Compute Cost Optimization:
Explain the difference between Reserved Instances and Savings Plans
Calculate cost savings from different pricing models
Choose appropriate Spot Instance strategies for different workloads
Determine when to use Standard vs Convertible Reserved Instances
Optimize Lambda costs through memory and timeout configuration
Use Compute Optimizer for right-sizing recommendations
Storage Cost Optimization:
Design S3 lifecycle policies for automatic cost optimization
Select appropriate S3 storage class based on access patterns
Explain when to use S3 Intelligent-Tiering
Calculate storage costs for different S3 storage classes
Optimize EBS costs by selecting appropriate volume types
Implement EFS lifecycle management for cost savings
Database Cost Optimization:
Choose between RDS On-Demand and Reserved Instances
Determine when to use Aurora Serverless vs provisioned Aurora
Select DynamoDB On-Demand vs Provisioned capacity mode
Calculate DynamoDB Reserved Capacity savings
Optimize database backup retention policies
Design cost-effective read replica strategies
Network Cost Optimization:
Explain data transfer pricing between regions and AZs
Calculate cost savings from VPC endpoints
Choose between NAT Gateway and NAT Instance
Determine when to use CloudFront for cost optimization
Compare Direct Connect vs VPN costs
Optimize load balancer costs (ALB vs NLB)
Cost Monitoring and Management:
Use Cost Explorer to analyze spending patterns
Configure Budgets with alerts for cost thresholds
Implement cost allocation tags for tracking
Analyze Cost and Usage Report for detailed billing
Use AWS Cost Anomaly Detection for unusual spending
Create cost optimization action plans
Cost Optimization Strategies:
Identify and delete unused resources
Right-size over-provisioned instances
Implement Auto Scaling for variable workloads
Use Spot Instances for fault-tolerant workloads
Configure S3 lifecycle policies for automatic tiering
Implement VPC endpoints to eliminate data transfer costs
Unused resources? ā Delete unattached volumes, old snapshots
Congratulations! You've completed Chapter 4: Design Cost-Optimized Architectures. You now understand how to minimize costs while maintaining performance, availability, and security on AWS.
Use the Right Pricing Model: Reserved Instances and Savings Plans for steady-state workloads (up to 72% savings). Spot Instances for fault-tolerant workloads (up to 90% savings).
Implement Lifecycle Policies: Automatically transition S3 objects to cheaper storage classes. Use Intelligent-Tiering for unpredictable access patterns.
Right-Size Resources: Use Compute Optimizer and Trusted Advisor recommendations. Don't over-provision - scale horizontally instead.
Eliminate Data Transfer Costs: Use VPC endpoints for S3 and DynamoDB. Keep data in same region when possible. Use CloudFront to reduce origin costs.
Use Serverless for Variable Workloads: Lambda, Aurora Serverless, and DynamoDB On-Demand eliminate idle capacity costs.
Monitor and Optimize Continuously: Use Cost Explorer to identify trends. Set up Budgets with alerts. Tag resources for cost allocation.
Delete Unused Resources: Unattached EBS volumes, old snapshots, unused Elastic IPs, idle load balancers all cost money.
Choose Cost-Effective Services: gp3 instead of gp2 (20% cheaper), Graviton instances (20% cheaper), S3 Standard-IA for infrequent access.
Self-Assessment Checklist
Test yourself before moving on. You should be able to:
Storage Cost Optimization:
Design S3 lifecycle policies to transition objects to cheaper storage classes
Choose appropriate S3 storage class for access patterns
Configure S3 Intelligent-Tiering for automatic optimization
Select Glacier retrieval option based on urgency (Expedited, Standard, Bulk)
Optimize EBS costs by switching gp2 to gp3
Implement EFS lifecycle management to move to Infrequent Access
Calculate data transfer costs and optimize with VPC endpoints
Use S3 Requester Pays for shared datasets
Compute Cost Optimization:
Choose between Reserved Instances and Savings Plans
Calculate break-even point for Reserved Instances
Implement Spot Instances for fault-tolerant workloads
Configure Spot Fleet with multiple instance types
Optimize Lambda costs by adjusting memory allocation
Use Auto Scaling to match capacity to demand
Implement scheduled scaling for predictable patterns
Right-size instances using Compute Optimizer
Database Cost Optimization:
Choose between RDS and Aurora based on cost and performance
Configure Aurora Serverless for variable workloads
Select DynamoDB On-Demand vs Provisioned capacity
Purchase DynamoDB Reserved Capacity for predictable workloads
Optimize RDS storage with autoscaling
Use RDS Reserved Instances for steady-state databases
Configure appropriate backup retention periods
Implement read replicas only when needed
Network Cost Optimization:
Calculate data transfer costs between regions and AZs
Use VPC endpoints to eliminate NAT Gateway data transfer costs
Choose between NAT Gateway and NAT Instance based on cost
Implement CloudFront to reduce data transfer from origin
Select appropriate Direct Connect bandwidth
Optimize load balancer costs (ALB vs NLB)
Use VPC peering instead of Transit Gateway when appropriate
Reserved Capacity for Steady Workloads: Use Reserved Instances or Savings Plans for predictable workloads. Save up to 72% compared to On-Demand.
Spot Instances for Fault-Tolerant Workloads: Use Spot for batch processing, data analysis, and stateless applications. Save up to 90% compared to On-Demand.
Storage Lifecycle Policies: Automatically transition S3 objects to cheaper storage classes. Use Intelligent-Tiering when access patterns are unknown.
Right-Size Everything: Use Compute Optimizer to identify oversized resources. Downsize or stop unused resources.
Data Transfer is Expensive: Use VPC endpoints to avoid data transfer charges. Use CloudFront to reduce origin data transfer. Keep data in the same region when possible.
Serverless for Variable Workloads: Aurora Serverless and DynamoDB On-Demand automatically scale and you only pay for what you use.
Monitor and Alert: Use Cost Explorer to identify trends. Set up AWS Budgets to alert on overspending. Use cost allocation tags to track spending by project.
Delete Unused Resources: Regularly audit and delete unattached EBS volumes, old snapshots, unused Elastic IPs, and idle load balancers.
Self-Assessment Checklist
Test yourself before moving on:
I understand the difference between Reserved Instances and Savings Plans
I know when to use Spot Instances and their limitations
I can design S3 lifecycle policies for cost optimization
I understand S3 storage class selection criteria
I know how to optimize EBS costs (gp3 vs gp2)
I can calculate cost savings with Reserved Instances
I understand DynamoDB pricing modes (On-Demand vs Provisioned)
Cost Allocation Tags: Track costs by project/department
Compute Optimizer: Right-sizing recommendations
Trusted Advisor: Cost optimization checks
Key Decision Points:
Steady-state workload ā Reserved Instances or Savings Plans
Variable workload ā Auto Scaling + On-Demand or Spot
Batch processing ā Spot Instances (up to 90% savings)
Infrequent access (>30 days) ā S3 Standard-IA or One Zone-IA
Long-term archive ā Glacier Flexible or Deep Archive
Variable database workload ā Aurora Serverless or DynamoDB On-Demand
High S3 data transfer ā VPC endpoint (eliminate transfer costs)
Global content delivery ā CloudFront (reduce origin costs)
Next Chapter: 06_integration - Learn how to integrate multiple services and design cross-domain solutions.
Integration & Advanced Topics: Putting It All Together
Chapter Overview
This chapter demonstrates how to combine concepts from all four domains to design complete, production-ready AWS architectures. You'll learn to integrate security, resilience, performance, and cost optimization into cohesive solutions.
What you'll learn:
Design complete three-tier web applications
Build serverless architectures from scratch
Implement event-driven systems
Create hybrid cloud solutions
Design microservices architectures
Build data processing pipelines
Solve complex cross-domain scenarios
Time to complete: 6-8 hours Prerequisites: Chapters 1-5 (all domain chapters)
Section 1: Three-Tier Web Application Architecture
Diagram Explanation (Comprehensive): This diagram illustrates a complete three-tier web application architecture that integrates all four exam domains. The Presentation Tier uses CloudFront CDN (Domain 3: Performance) to cache and deliver static content (HTML, CSS, JavaScript) stored in an S3 bucket configured as a static website. CloudFront provides global low-latency access (10-50ms) and reduces load on the application tier. The S3 bucket uses server-side encryption (Domain 1: Security) and versioning for data protection. The Application Tier consists of an Application Load Balancer distributing traffic across an Auto Scaling Group of EC2 instances deployed across three Availability Zones (Domain 2: Resilience). The ALB performs health checks every 30 seconds and automatically removes unhealthy instances. Auto Scaling maintains 3-10 instances based on CPU utilization (target: 70%), ensuring the application handles traffic spikes while minimizing costs (Domain 4: Cost Optimization). EC2 instances run in private subnets with no direct internet access, using NAT Gateways for outbound connectivity. Security Groups allow only HTTPS traffic from the ALB. The Data Tier includes RDS Multi-AZ for the relational database (Domain 2: Resilience), providing automatic failover in 60-120 seconds if the primary fails. ElastiCache Redis stores user sessions, enabling stateless application servers and improving performance by caching frequently accessed data (Domain 3: Performance). S3 stores user-uploaded files with lifecycle policies to transition old files to Glacier after 90 days (Domain 4: Cost Optimization). All data is encrypted at rest using KMS (Domain 1: Security). This architecture achieves 99.99% availability, handles 10,000 requests per second, and costs approximately $2,000/month for a medium-sized application.
Detailed Example 1: E-commerce Platform Implementation An e-commerce company needs to build a scalable online store that handles 50,000 concurrent users during Black Friday sales. They implement the three-tier architecture as follows: Presentation Tier: CloudFront caches product images, CSS, and JavaScript files for 24 hours (Cache-Control: max-age=86400), reducing origin requests by 95%. The S3 bucket hosts the React single-page application, which makes API calls to the application tier. CloudFront uses Origin Access Identity (OAI) to restrict S3 access, preventing direct bucket access. Application Tier: The ALB routes requests to 20 EC2 instances (m5.large) running Node.js application servers. Auto Scaling is configured with target tracking policy (CPU 70%) and scheduled scaling (scale to 50 instances at 8 AM on Black Friday). EC2 instances use IAM roles to access S3 and RDS without embedded credentials. Security Groups allow HTTPS (443) from ALB only. Data Tier: RDS PostgreSQL (db.r5.2xlarge) Multi-AZ stores product catalog, orders, and customer data. ElastiCache Redis (cache.r5.large) with 3 nodes stores shopping cart sessions and product cache, reducing database queries by 80%. S3 stores product images with CloudFront distribution. During Black Friday, the system handles 100,000 requests per second with 200ms average response time. Auto Scaling adds 30 instances in 10 minutes to handle the spike. Total cost for the day: $500 (mostly EC2 and data transfer), compared to $50,000 potential revenue loss from downtime.
Detailed Example 2: SaaS Application with Multi-Tenancy A SaaS company provides project management software to 1,000 enterprise customers. They use the three-tier architecture with tenant isolation: Presentation Tier: CloudFront serves the Angular application with custom domain names per tenant (customer1.saas.com, customer2.saas.com) using alternate domain names (CNAMEs). Each tenant's static assets are stored in separate S3 prefixes (s3://saas-app/customer1/, s3://saas-app/customer2/). Application Tier: ALB uses host-based routing to route requests to different target groups based on subdomain. EC2 instances (c5.xlarge) run Java Spring Boot applications with tenant context extracted from JWT tokens. Auto Scaling maintains 5-20 instances based on request count (target: 1000 requests per instance). Data Tier: RDS MySQL (db.r5.xlarge) Multi-AZ uses separate databases per tenant (customer1_db, customer2_db) for data isolation. ElastiCache Redis stores tenant-specific cache with key prefixes (customer1:, customer2:). S3 stores tenant files with bucket policies enforcing tenant isolation. The architecture supports 10,000 concurrent users across all tenants with 99.95% uptime SLA. Cost per tenant: $50/month (shared infrastructure), enabling profitable pricing at $200/month per customer.
Detailed Example 3: Media Streaming Platform A video streaming platform serves 1 million users watching videos simultaneously. They implement the three-tier architecture optimized for media delivery: Presentation Tier: CloudFront caches video segments (HLS .ts files) at 400+ edge locations worldwide, reducing latency to 10-30ms. S3 stores video files in multiple resolutions (1080p, 720p, 480p, 360p) using Intelligent-Tiering storage class to optimize costs. CloudFront uses signed URLs with 1-hour expiration to prevent unauthorized access. Application Tier: ALB routes API requests (user authentication, video metadata, playback tracking) to 30 EC2 instances (c5.2xlarge) running Python Flask applications. Auto Scaling uses custom CloudWatch metrics (concurrent streams) to scale from 10 to 100 instances during peak hours (8 PM - 11 PM). Data Tier: Aurora PostgreSQL Serverless (1-16 ACUs) stores user profiles, video metadata, and viewing history, automatically scaling based on load. ElastiCache Redis (cache.r5.2xlarge) with 5 read replicas caches video metadata and user sessions, handling 100,000 requests per second. S3 stores 10 PB of video content with lifecycle policies moving old content to Glacier Deep Archive after 2 years (96% cost savings). The platform delivers 10 Gbps of video traffic with 99.99% availability and costs $50,000/month (mostly CloudFront and S3 storage).
ā Must Know (Critical Facts):
Presentation tier: Use CloudFront + S3 for static content (HTML, CSS, JS, images) - reduces latency and costs
Application tier: Use ALB + Auto Scaling + EC2 in private subnets - provides resilience and scalability
Data tier: Use RDS Multi-AZ + ElastiCache + S3 - ensures data durability and performance
Security: Implement defense in depth (WAF, Security Groups, NACLs, encryption, IAM roles)
Resilience: Deploy across 3+ AZs, use Multi-AZ databases, implement health checks
Performance: Use caching at multiple layers (CloudFront, ElastiCache, application cache)
Cost optimization: Use Auto Scaling, Reserved Instances, S3 lifecycle policies, CloudFront caching
Diagram Explanation (Comprehensive): This diagram shows a complete serverless application architecture that eliminates server management and scales automatically. The Frontend consists of a React single-page application hosted on S3 and delivered via CloudFront CDN. Users access the application through CloudFront, which caches static assets (HTML, CSS, JavaScript) at edge locations worldwide. The API Layer uses API Gateway to expose RESTful endpoints (/items GET, POST, PUT, DELETE) that the frontend calls. API Gateway integrates with Cognito User Pools for authentication - users must include a JWT token in the Authorization header. API Gateway validates tokens and rejects unauthorized requests before invoking Lambda functions. The Compute Layer consists of four Lambda functions, each handling a specific operation (CRUD operations on items). Lambda functions are stateless and scale automatically - AWS can run 1,000 concurrent executions simultaneously to handle traffic spikes. Each function has an IAM execution role granting permissions to access DynamoDB and S3. The Data Layer uses DynamoDB for structured data (items table with partition key: itemId) and S3 for file storage (user-uploaded images). DynamoDB provides single-digit millisecond latency and scales automatically to handle any request volume. This architecture has zero servers to manage, scales from 0 to millions of requests automatically, and costs only for actual usage (no idle costs). A typical application with 1 million requests per month costs approximately $50 (API Gateway: $3.50, Lambda: $20, DynamoDB: $25, S3: $1, CloudFront: $0.50).
Detailed Example 1: Todo List Application A startup builds a todo list application using serverless architecture. Frontend: React application hosted on S3 (s3://todo-app-frontend/) and delivered via CloudFront. The application makes API calls to API Gateway endpoints. Authentication: Cognito User Pool manages user registration, login, and password reset. Users sign up with email/password, receive verification emails, and get JWT tokens upon login. The frontend stores tokens in localStorage and includes them in API requests. API Layer: API Gateway exposes 5 endpoints: GET /todos (list todos), POST /todos (create todo), PUT /todos/{id} (update todo), DELETE /todos/{id} (delete todo), GET /todos/{id} (get single todo). Each endpoint has a Lambda authorizer that validates JWT tokens. Compute Layer: Five Lambda functions (Node.js 18) handle CRUD operations. Each function is allocated 512 MB memory (equivalent to 0.5 vCPU) and has a 30-second timeout. Functions use AWS SDK to interact with DynamoDB. Data Layer: DynamoDB table (todos) with partition key userId and sort key todoId, enabling efficient queries for all todos belonging to a user. The table uses on-demand billing, automatically scaling to handle any request volume. The application supports 10,000 users with 100,000 todos, costs $30/month, and requires zero server management. Deployment uses AWS SAM (Serverless Application Model) with infrastructure as code.
Detailed Example 2: Image Processing Service A company builds an image processing service using serverless architecture. Frontend: Vue.js application on S3 allows users to upload images. Authentication: Cognito User Pool with social identity providers (Google, Facebook) for easy sign-up. API Layer: API Gateway exposes POST /images endpoint for image uploads. The endpoint returns a pre-signed S3 URL, allowing direct upload from browser to S3 (bypassing API Gateway's 10 MB payload limit). Compute Layer: Three Lambda functions: (1) Upload Lambda generates pre-signed URLs for S3 uploads, (2) Process Lambda (triggered by S3 event) creates thumbnails (100x100, 300x300, 600x600) using Sharp library, (3) Metadata Lambda extracts EXIF data and stores it in DynamoDB. Data Layer: S3 bucket (images-original) stores original images, S3 bucket (images-processed) stores thumbnails, DynamoDB table (image-metadata) stores metadata. The Process Lambda is allocated 3 GB memory (2 vCPUs) to handle image processing quickly. The service processes 10,000 images per day, costs $100/month (mostly Lambda compute for image processing), and scales automatically during traffic spikes. Users upload images directly to S3 (no API Gateway bottleneck), and processing completes in 5 seconds on average.
Detailed Example 3: Real-Time Chat Application A company builds a real-time chat application using serverless architecture with WebSocket support. Frontend: React application on S3 uses WebSocket API to maintain persistent connections. Authentication: Cognito User Pool with MFA for secure authentication. API Layer: API Gateway WebSocket API with three routes: $connect (establish connection), $disconnect (close connection), sendMessage (send chat message). Compute Layer: Three Lambda functions: (1) Connect Lambda stores connection ID in DynamoDB when users connect, (2) Disconnect Lambda removes connection ID when users disconnect, (3) SendMessage Lambda receives messages, stores them in DynamoDB, and broadcasts to all connected users using API Gateway Management API. Data Layer: DynamoDB table (connections) stores active WebSocket connections (connectionId, userId, timestamp), DynamoDB table (messages) stores chat history (roomId, timestamp, userId, message). The SendMessage Lambda queries the connections table to find all users in the chat room and sends messages to each connection. The application supports 1,000 concurrent users with 10,000 messages per hour, costs $50/month, and provides real-time messaging with < 100ms latency. WebSocket connections can stay open for up to 2 hours before automatic reconnection.
ā Must Know (Critical Facts):
Serverless benefits: No server management, automatic scaling, pay-per-use pricing, high availability built-in
API Gateway: Exposes REST and WebSocket APIs, handles authentication, throttling, caching, CORS
S3 pre-signed URLs: Allow direct uploads from browser to S3, bypassing API Gateway payload limits
Cold starts: First invocation takes 1-5 seconds (initialize runtime), subsequent invocations take 10-100ms
Cost model: API Gateway ($3.50 per million requests), Lambda ($0.20 per million requests + $0.0000166667 per GB-second), DynamoDB ($1.25 per million writes, $0.25 per million reads)
Section 3: Event-Driven Architecture
Event-Driven Processing Pipeline
š Event-Driven Architecture Diagram:
sequenceDiagram
participant User
participant S3
participant EventBridge
participant Lambda1 as Lambda: Thumbnail
participant Lambda2 as Lambda: Metadata
participant SQS
participant Lambda3 as Lambda: ML Analysis
participant DDB as DynamoDB
User->>S3: Upload image
S3->>EventBridge: ObjectCreated event
EventBridge->>Lambda1: Trigger (sync)
Lambda1->>S3: Create thumbnail
Lambda1->>DDB: Store thumbnail URL
EventBridge->>Lambda2: Trigger (sync)
Lambda2->>DDB: Extract & store metadata
EventBridge->>SQS: Queue for ML processing
SQS->>Lambda3: Batch processing
Lambda3->>Lambda3: ML image analysis
Lambda3->>DDB: Store tags & labels
DDB-->>User: Image fully processed
Diagram Explanation (Comprehensive): This sequence diagram illustrates an event-driven architecture where a single event (image upload) triggers multiple independent processing workflows. When a User uploads an image to S3, S3 emits an ObjectCreated event to EventBridge. EventBridge evaluates the event against multiple rules and routes it to three different targets simultaneously: (1) Lambda Thumbnail function is invoked synchronously to create thumbnail images (100x100, 300x300) and stores thumbnail URLs in DynamoDB, (2) Lambda Metadata function is invoked synchronously to extract EXIF data (camera model, GPS coordinates, timestamp) and stores it in DynamoDB, (3) SQS queue receives the event for asynchronous ML processing. The SQS queue buffers events and Lambda ML Analysis function polls the queue in batches of 10 messages. This function performs computationally expensive ML image analysis (object detection, facial recognition, scene classification) using Amazon Rekognition and stores results in DynamoDB. The event-driven pattern decouples components - if the ML function fails, it doesn't affect thumbnail generation or metadata extraction. Each component scales independently based on its workload. EventBridge provides at-least-once delivery with automatic retries, ensuring no events are lost. The architecture processes 10,000 images per hour with 5-second average latency for thumbnails and 30-second average latency for ML analysis. Cost is approximately $200/month (mostly Lambda compute for ML processing and Rekognition API calls).
Detailed Example 1: E-commerce Order Processing An e-commerce platform uses event-driven architecture to process orders. When a customer places an order, the Order Service publishes an "OrderPlaced" event to EventBridge. EventBridge fans out to multiple subscribers: (1) Payment Lambda charges the credit card and publishes "PaymentCompleted" event, (2) Inventory Lambda reserves items and publishes "InventoryReserved" event, (3) Shipping Lambda creates shipping label and publishes "ShippingLabelCreated" event, (4) Email Lambda sends order confirmation to customer, (5) Analytics SQS queue receives event for business intelligence processing. Each service is independent and can be deployed, scaled, and updated separately. If the email service is down, it doesn't affect payment or shipping. EventBridge's event archive feature stores all events for 90 days, allowing replay for debugging or reprocessing. The system processes 10,000 orders per day with 2-second average order confirmation time (parallel processing) compared to 10 seconds with sequential processing. Event-driven architecture reduces coupling between services and improves resilience - if one service fails, others continue operating.
Detailed Example 2: IoT Data Processing An IoT platform collects sensor data from 100,000 devices and processes it using event-driven architecture. Devices publish temperature readings to AWS IoT Core every minute. IoT Core routes events to EventBridge based on rules (e.g., temperature > 80°F triggers alert rule). EventBridge fans out to multiple targets: (1) Lambda Alert function sends SNS notifications to operations team for high temperatures, (2) Kinesis Firehose streams all data to S3 for long-term storage and analysis, (3) Lambda Aggregation function calculates hourly averages and stores them in DynamoDB, (4) SQS queue buffers events for ML anomaly detection. The ML Lambda function polls SQS in batches of 100 messages and uses Amazon Lookout for Equipment to detect anomalies. The event-driven pattern allows adding new consumers without modifying IoT devices or existing consumers. When the company adds a new dashboard, they simply add another EventBridge rule routing to a new Lambda function. The system processes 6 million events per hour (100,000 devices à 60 minutes) with < 1 second latency for alerts and costs $500/month (mostly IoT Core message processing and S3 storage).
Detailed Example 3: Video Transcoding Pipeline A video platform uses event-driven architecture for video transcoding. When a user uploads a video to S3, S3 emits an ObjectCreated event to EventBridge. EventBridge routes the event to multiple targets: (1) Lambda Validation function checks video format and duration, rejecting invalid videos, (2) Step Functions workflow orchestrates the transcoding process: (a) Lambda Extract function extracts video metadata (resolution, codec, duration), (b) MediaConvert job transcodes video to multiple formats (1080p, 720p, 480p, 360p) and stores outputs in S3, (c) Lambda Thumbnail function generates video thumbnails at 10-second intervals, (d) Lambda Notification function sends completion email to user. (3) DynamoDB Streams captures changes to the video metadata table and triggers Lambda Analytics function to update video statistics. The event-driven pattern allows the transcoding workflow to scale independently - MediaConvert can process 100 videos simultaneously while Lambda functions scale to 1,000 concurrent executions. The system processes 1,000 videos per day with 10-minute average transcoding time and costs $1,000/month (mostly MediaConvert transcoding costs).
Diagram Explanation (Comprehensive): This diagram shows a hybrid cloud architecture connecting on-premises infrastructure to AWS. The On-Premises Data Center contains the corporate network, Active Directory (AD) for user authentication, and legacy applications that can't be migrated to the cloud. Connectivity is established through AWS Direct Connect (10 Gbps dedicated connection) for primary connectivity and Site-to-Site VPN (1.25 Gbps over internet) as backup. Direct Connect provides consistent network performance (1-2ms latency) and reduced data transfer costs ($0.02/GB vs $0.09/GB for internet). The VPN backup ensures connectivity if Direct Connect fails. Directory Services uses AD Connector, which acts as a proxy to the on-premises Active Directory. EC2 instances in AWS can authenticate users against on-premises AD without replicating the directory to AWS. This enables single sign-on (SSO) - users log in with their corporate credentials. Compute consists of EC2 instances running cloud-native applications that need to authenticate users. Storage uses Storage Gateway File Gateway, which presents an NFS/SMB file share to on-premises applications. Files written to the gateway are automatically uploaded to S3 and cached locally for low-latency access. This allows legacy applications to use cloud storage without modification. The hybrid architecture enables gradual cloud migration - new applications run in AWS while legacy applications remain on-premises. Total cost: $5,000/month (Direct Connect: $2,000, VPN: $100, AD Connector: $200, Storage Gateway: $200, EC2: $2,000, S3: $500).
Detailed Example 1: Enterprise File Sharing A company with 5,000 employees uses hybrid cloud for file sharing. On-Premises: Employees access file shares on Windows File Servers (10 TB of data). AWS: Storage Gateway File Gateway is deployed on-premises as a VM. The gateway presents an SMB file share to employees, caching frequently accessed files locally (1 TB cache). Files are automatically uploaded to S3 (s3://company-files/) with lifecycle policies moving old files to Glacier after 90 days. Connectivity: Direct Connect (10 Gbps) provides high-bandwidth connection for file uploads. Benefits: (1) Unlimited cloud storage - no need to provision additional on-premises storage, (2) Disaster recovery - files are replicated to S3 across multiple AZs, (3) Cost savings - Glacier storage costs $0.00099/GB-month vs $0.10/GB-month for on-premises SAN, (4) Remote access - employees can access files from AWS WorkSpaces or EC2 instances. The company saves $50,000/year on storage costs and improves disaster recovery (RPO: 1 hour, RTO: 4 hours).
Detailed Example 2: Hybrid Active Directory A company with 10,000 employees uses hybrid cloud for identity management. On-Premises: Active Directory Domain Services (AD DS) manages user accounts, groups, and policies. AWS: AD Connector proxies authentication requests to on-premises AD. EC2 instances running Windows Server join the domain through AD Connector. Connectivity: Direct Connect (10 Gbps) with VPN backup ensures reliable connectivity. Use Cases: (1) EC2 instances authenticate users against corporate AD, (2) AWS Management Console uses AD credentials for SSO, (3) RDS SQL Server uses Windows Authentication with AD users, (4) Amazon WorkSpaces uses AD credentials for user login. Benefits: (1) Single source of truth - no need to replicate AD to AWS, (2) Centralized management - IT manages users in one place, (3) Compliance - meets requirements for centralized identity management, (4) Cost savings - no need for AWS Managed Microsoft AD ($2/hour). The company saves $15,000/year on directory services costs and simplifies user management.
Detailed Example 3: Disaster Recovery for On-Premises Applications A company uses hybrid cloud for disaster recovery of on-premises applications. On-Premises: Production applications run on VMware vSphere (100 VMs). AWS: AWS Application Migration Service (MGN) continuously replicates VMs to AWS. Replicated VMs are stored as EBS snapshots in a staging area. Connectivity: Direct Connect (10 Gbps) provides high-bandwidth replication. DR Strategy: Pilot Light - only replication infrastructure runs in AWS (cost: $500/month). During a disaster, the company launches EC2 instances from EBS snapshots (RTO: 1 hour, RPO: 15 minutes). Testing: The company performs quarterly DR drills by launching test instances in an isolated VPC. Benefits: (1) Low cost - pay only for EBS snapshots ($0.05/GB-month) and replication, (2) Fast recovery - launch instances in 15 minutes, (3) No data loss - continuous replication with 15-minute RPO, (4) Compliance - meets regulatory requirements for disaster recovery. The company saves $100,000/year compared to maintaining a secondary data center.
ā Must Know (Critical Facts):
Direct Connect: Dedicated connection, 1-100 Gbps, consistent latency, reduced data transfer costs ($0.02/GB)
VPN: Encrypted tunnel over internet, up to 1.25 Gbps per tunnel, $0.05/hour, backup for Direct Connect
AD Connector: Proxy to on-premises AD, $0.05/hour per directory, supports SSO and domain join
Diagram Explanation (Comprehensive): This diagram illustrates a microservices architecture where the application is decomposed into independent services, each with its own database (database per service pattern). API Gateway serves as the single entry point, routing requests to appropriate microservices based on URL path (/users/* ā User Service, /orders/* ā Order Service, /products/* ā Product Service, /payments/* ā Payment Service). Each microservice runs on ECS Fargate (serverless containers), eliminating server management. Services scale independently - the Order Service can scale to 20 tasks during peak hours while the User Service maintains 5 tasks. Each service has its own database optimized for its use case: User Service uses RDS PostgreSQL for relational user data, Order Service uses DynamoDB for high-throughput order processing, Product Service uses Aurora for complex product catalog queries, Payment Service uses RDS MySQL for transactional payment data. Services communicate asynchronously through SNS/SQS for loose coupling. When an order is placed, the Order Service publishes an event to SNS, which fans out to three SQS queues: Inventory queue (reserve items), Shipping queue (create shipping label), Notifications queue (send confirmation email). This event-driven communication prevents cascading failures - if the shipping service is down, it doesn't affect order placement. The architecture enables independent deployment, scaling, and technology choices per service. Cost: $3,000/month (ECS Fargate: $2,000, RDS/Aurora: $800, DynamoDB: $100, API Gateway: $50, SNS/SQS: $50).
Detailed Example 1: E-commerce Platform Microservices An e-commerce company decomposes its monolithic application into microservices. User Service (Node.js, 5 Fargate tasks, 0.5 vCPU, 1 GB RAM each) manages user registration, authentication, and profiles. It uses RDS PostgreSQL (db.t3.medium) for user data. Product Service (Java Spring Boot, 10 Fargate tasks, 1 vCPU, 2 GB RAM each) manages product catalog with complex search and filtering. It uses Aurora PostgreSQL (db.r5.large) with 2 read replicas for read-heavy workload. Order Service (Python Flask, 20 Fargate tasks, 1 vCPU, 2 GB RAM each) handles order placement and tracking. It uses DynamoDB (on-demand billing) for high-throughput writes (1,000 orders per minute). Payment Service (Go, 5 Fargate tasks, 0.5 vCPU, 1 GB RAM each) processes payments through Stripe API. It uses RDS MySQL (db.t3.small) for payment records. Benefits: (1) Independent scaling - Order Service scales to 50 tasks during Black Friday while others remain at baseline, (2) Independent deployment - Product Service can be updated without affecting Order Service, (3) Technology diversity - each service uses the best language/database for its needs, (4) Fault isolation - if Payment Service fails, users can still browse products and add to cart. Challenges: (1) Distributed transactions - order placement involves multiple services (order, payment, inventory), solved using Saga pattern with compensating transactions, (2) Service discovery - services find each other using AWS Cloud Map, (3) Monitoring - distributed tracing using AWS X-Ray to track requests across services.
Diagram Explanation (Comprehensive): This diagram shows a complete data processing pipeline for real-time analytics. Data Sources include application logs (web server access logs), IoT sensors (temperature, humidity readings), and database change data capture (CDC) from RDS. Ingestion uses Kinesis Data Streams to collect data in real-time. Producers send records to Kinesis shards (each shard handles 1 MB/sec input, 2 MB/sec output). Processing uses Lambda functions to transform data (parse logs, enrich with metadata, filter invalid records) and Kinesis Firehose to batch and deliver data to S3. Firehose buffers data for 60 seconds or 5 MB (whichever comes first) before writing to S3, reducing S3 PUT requests and costs. Storage uses S3 as a data lake. Raw data is stored in JSON format (s3://data-lake/raw/), and AWS Glue ETL jobs transform it to Parquet format (s3://data-lake/processed/) for efficient querying. Parquet is columnar format, reducing query costs by 90% compared to JSON. Analytics uses Athena for ad-hoc SQL queries on S3 data (serverless, pay per query), Redshift for complex analytics and aggregations (data warehouse), and QuickSight for interactive dashboards. The pipeline processes 1 million records per hour with < 5 minute latency from ingestion to availability in Athena. Cost: $1,000/month (Kinesis: $400, Lambda: $100, S3: $200, Glue: $100, Athena: $100, Redshift: $100).
Detailed Example 1: Web Analytics Pipeline A media company processes web server logs for real-time analytics. Ingestion: Web servers (100 EC2 instances) send access logs to Kinesis Data Streams (10 shards, 10 MB/sec total throughput). Each log entry contains timestamp, user ID, page URL, response time, user agent. Processing: Lambda function (512 MB, 30-second timeout) parses logs, extracts fields, enriches with geolocation data (from IP address), and filters bot traffic. Kinesis Firehose buffers transformed logs and delivers to S3 every 60 seconds. Storage: S3 stores raw logs (JSON) and processed logs (Parquet). Glue Crawler automatically discovers schema and creates Glue Data Catalog tables. Analytics: Athena queries processed logs for ad-hoc analysis (e.g., "top 10 pages by traffic"). QuickSight dashboards show real-time metrics (page views per minute, average response time, geographic distribution). Redshift loads daily aggregates for historical analysis. Benefits: (1) Real-time visibility - dashboards update every minute, (2) Cost-effective - Athena charges $5 per TB scanned, Parquet reduces scans by 90%, (3) Scalable - handles 10x traffic spikes automatically, (4) Flexible - can add new analytics without changing ingestion. The pipeline processes 100 million log entries per day and costs $500/month.
ā Must Know (Critical Facts):
Kinesis Data Streams: Real-time ingestion, 1 MB/sec per shard, 24-hour to 365-day retention
Kinesis Firehose: Batch delivery to S3/Redshift/Elasticsearch, automatic scaling, 60-second buffer
Lambda: Transform data in real-time, 15-minute timeout, 10 GB memory max
Why it works: Spacing reviews forces your brain to work harder to recall information, strengthening memory pathways.
The Feynman Technique
Step 1: Choose a concept (e.g., "RDS Multi-AZ")
Step 2: Explain it simply (as if teaching a 10-year-old): "RDS Multi-AZ is like having two identical databases in different buildings. If one building has a problem, the other one automatically takes over so your application keeps working."
Step 3: Identify gaps (where you struggled to explain):
How does failover actually work?
How long does it take?
What triggers failover?
Step 4: Review and simplify (go back to study materials, fill gaps, try again)
Step 5: Use analogies (make it relatable): "Multi-AZ is like having a backup generator that automatically kicks in when power fails."
Interleaved Practice
What it is: Mix different topics in one study session instead of focusing on one topic.
Why it works: Forces your brain to discriminate between concepts and choose the right approach for each problem (like the actual exam).
Elaborative Interrogation
Technique: Ask yourself "why" questions about facts.
Example:
Fact: "S3 Standard-IA is cheaper than S3 Standard"
Why?: Because AWS assumes you'll access it less frequently, so they charge less for storage but more for retrieval
Why does that matter?: It helps me choose the right storage class based on access patterns
When would I use it?: For data accessed less than once a month but needs immediate access when requested
Practice questions to ask:
Why does this service exist?
Why would I choose this over alternatives?
Why does this limitation exist?
Why is this the best practice?
Retrieval Practice
What it is: Testing yourself BEFORE you feel ready (not just reviewing notes).
How to implement:
Read a chapter section (e.g., "Lambda Concurrency")
Close the book immediately
Write down everything you remember (no peeking!)
Check your notes (identify what you missed)
Repeat (focus on what you missed)
Why it works: The act of retrieving information strengthens memory more than passive review.
Tools:
Flashcards (physical or digital)
Practice questions (from this package)
Self-quizzing (write questions for yourself)
Teach someone (forces retrieval)
Domain-Specific Study Strategies
Domain 1: Security (30% of exam)
Focus areas:
IAM policies (understand policy evaluation logic)
VPC security (Security Groups vs NACLs)
Encryption (KMS, at-rest, in-transit)
Compliance (AWS services for different frameworks)
Study approach:
Master IAM first (foundation for everything)
Draw VPC diagrams (visualize security layers)
Practice policy writing (hands-on with IAM Policy Simulator)
Memorize encryption options (which services support what)
Common mistakes to avoid:
Confusing Security Groups (stateful) with NACLs (stateless)
Forgetting that IAM is global (not region-specific)
Not understanding policy evaluation order (explicit deny wins)
š Security Study Priority:
graph TD
A[Start Security Study] --> B[IAM Fundamentals]
B --> C[VPC Security]
C --> D[Encryption & KMS]
D --> E[Compliance Services]
E --> F[Practice Questions]
B --> B1[Users, Groups, Roles]
B --> B2[Policies & Permissions]
B --> B3[MFA & Access Keys]
C --> C1[Security Groups]
C --> C2[NACLs]
C --> C3[VPC Flow Logs]
D --> D1[KMS Keys]
D --> D2[S3 Encryption]
D --> D3[EBS/RDS Encryption]
style B fill:#ffcccc
style C fill:#ffddcc
style D fill:#ffeecc
style E fill:#ffffcc
Practice architecture diagrams (draw HA architectures)
Compare DR strategies (backup/restore vs pilot light vs warm standby vs active-active)
Master decoupling patterns (when to use SQS vs SNS vs EventBridge)
Common mistakes to avoid:
Confusing Multi-AZ (HA) with Read Replicas (performance)
Not understanding Auto Scaling cooldown periods
Forgetting that ELB health checks can trigger Auto Scaling
š Resilience Study Progression:
graph LR
A[Week 1-2: HA Basics] --> B[Week 3: Auto Scaling]
B --> C[Week 4: Load Balancing]
C --> D[Week 5: DR Strategies]
D --> E[Week 6: Decoupling]
E --> F[Week 7: Practice]
A --> A1[Multi-AZ]
A --> A2[Availability Zones]
B --> B1[Dynamic Scaling]
B --> B2[Predictive Scaling]
C --> C1[ALB vs NLB]
C --> C2[Health Checks]
D --> D1[RTO/RPO]
D --> D2[4 DR Strategies]
E --> E1[SQS]
E --> E2[SNS]
E --> E3[EventBridge]
style A fill:#c8e6c9
style B fill:#a5d6a7
style C fill:#81c784
style D fill:#66bb6a
style E fill:#4caf50
style F fill:#388e3c
sequenceDiagram
participant Q as Question
participant S as Situation
participant T as Task
participant A as Action
participant R as Result
Q->>S: Read scenario
S->>S: Identify: Company, Current State, Problem
S->>T: Extract requirements
T->>T: List: Business + Technical + Constraints
T->>A: Evaluate options
A->>A: Check each answer against requirements
A->>R: Select best option
R->>R: Verify: Solves problem + Meets requirements + Best choice
R->>Q: Choose answer
See: diagrams/07_study_strategies_star_method.mmd
Keyword Recognition Strategy
Cost keywords (choose cheapest option):
"most cost-effective"
"minimize cost"
"lowest cost"
"reduce expenses"
Performance keywords (choose fastest option):
"lowest latency"
"highest throughput"
"best performance"
"fastest"
Security keywords (choose most secure option):
"most secure"
"comply with"
"encrypt"
"least privilege"
Operational keywords (choose simplest option):
"least operational overhead"
"minimal management"
"fully managed"
"automated"
Availability keywords (choose most resilient option):
"highly available"
"fault-tolerant"
"disaster recovery"
"minimize downtime"
Elimination Strategy
Step 1: Eliminate obviously wrong answers (reduce to 2-3 options)
Technically impossible (service doesn't support that feature)
Doesn't address the problem (solves different issue)
Step 2: Eliminate "almost right" answers (reduce to 1-2 options)
Partially correct (addresses some requirements but not all)
Overengineered (more complex than needed)
Underengineered (doesn't meet scale requirements)
Step 3: Choose the BEST answer (final selection)
Meets ALL requirements
Follows AWS best practices
Most cost-effective among remaining options
Least operational overhead
š Elimination Process:
graph TD
A[4 Answer Options] --> B{Step 1: Obviously Wrong?}
B -->|Yes| C[Eliminate]
B -->|No| D[Keep]
D --> E{Step 2: Partially Correct?}
E -->|Yes| F[Eliminate]
E -->|No| G[Keep]
G --> H{Step 3: Best Option?}
H -->|Meets all requirements| I[SELECT]
H -->|Missing requirements| J[Eliminate]
C --> K[Remaining: 2-3 options]
F --> L[Remaining: 1-2 options]
I --> M[Final Answer]
style C fill:#ffcccc
style F fill:#ffddcc
style I fill:#ccffcc
style M fill:#66bb6a
EC2 Pricing: I can explain On-Demand, Reserved, Spot, and Savings Plans
S3 Storage Classes: I know the cost and retrieval characteristics of each class
S3 Lifecycle: I can design lifecycle policies to transition between storage classes
RDS Pricing: I understand when to use Reserved Instances vs On-Demand
DynamoDB Pricing: I know the difference between On-Demand and Provisioned capacity
Data Transfer: I understand inter-AZ, inter-region, and internet egress costs
NAT Gateway: I know the cost implications vs NAT instance
VPC Endpoints: I understand how they reduce data transfer costs
Cost Tools: I can use Cost Explorer, Budgets, and Cost Allocation Tags
Trusted Advisor: I know what cost optimization checks it provides
If you checked fewer than 80%: Review those specific chapters and take domain-focused practice tests
Practice Test Marathon
š Final Week Practice Schedule:
gantt
title Final Week Practice Test Schedule
dateFormat YYYY-MM-DD
section Practice Tests
Full Practice Test 3 :2025-02-01, 1d
Review & Study Weak Areas :2025-02-02, 1d
Domain-Focused Tests :2025-02-03, 1d
Service-Focused Tests :2025-02-04, 1d
Timed Practice (30Q) :2025-02-05, 1d
Review Summaries :2025-02-06, 1d
Light Review Only :2025-02-07, 1d
section Exam Day
Exam Day :milestone, 2025-02-08, 0d
Choose the BEST answer (not just A correct answer)
If You Get Stuck:
Take a deep breath (5 seconds)
Re-read the question (look for keywords you missed)
Eliminate one wrong answer (builds momentum)
Make an educated guess (no penalty for guessing)
Flag for review (come back if time permits)
Move on (don't waste time)
Common Traps to Avoid:
ā Misreading "NOT", "EXCEPT", "LEAST" in questions
ā Choosing technically correct but not BEST answer
ā Overthinking simple questions
ā Changing answers without good reason (first instinct often correct)
ā Spending too much time on one question
After Exam
Immediately After:
Take a deep breath (you did it!)
Don't discuss answers with others (causes unnecessary stress)
Celebrate your effort (regardless of how you feel about performance)
Waiting for Results:
Results typically available within 5 business days
Check your email for notification
Access results through AWS Certification portal
Passing score: 720/1000 (72%)
If You Pass:
Celebrate! You're now AWS Certified Solutions Architect - Associate!
Update your resume and LinkedIn profile
Download your digital badge
Consider next certification (Professional level or Specialty)
If You Don't Pass:
Don't be discouraged (many people need multiple attempts)
Review your score report (identifies weak domains)
Focus study on weak areas
Take more practice tests
Schedule retake (30-day waiting period)
You've learned a lot and you'll pass next time!
You're Ready When...
Knowledge Indicators:
You score 80%+ on all full practice tests
You can explain key concepts without notes
You recognize question patterns instantly
You make decisions quickly using frameworks
You've completed all self-assessment checklists
You can draw architecture diagrams from memory
You understand WHY answers are correct, not just WHAT they are
Confidence Indicators:
You feel calm and prepared (not anxious)
You trust your preparation
You can manage test anxiety
You have a clear exam day plan
You've visualized success
Practical Indicators:
You've taken at least 3 full practice tests
You've reviewed all incorrect answers
You've strengthened weak areas
You've memorized brain dump items
You know the testing center location and rules
Remember
Trust Your Preparation:
You've studied 60,000+ words of comprehensive content
You've answered 500+ practice questions
You've reviewed 120+ diagrams
You've completed all self-assessments
You're ready!
Manage Your Time:
2 minutes per question average
Don't spend more than 3 minutes on any question initially
Flag and move on if stuck
Save time for review
Read Carefully:
Watch for "NOT", "EXCEPT", "LEAST"
Identify constraint keywords
Read all answer options
Choose the BEST answer
Don't Overthink:
First instinct often correct
Don't change answers without good reason
Simple questions have simple answers
Trust your knowledge
Stay Calm:
Take deep breaths if stressed
Use positive self-talk
Focus on one question at a time
You've got this!
Final Thoughts
You've put in the work. You've studied hard. You've practiced extensively. You understand AWS services and how to apply them to real-world scenarios. You're ready for this exam.
Remember: This certification is a milestone, not the destination. Whether you pass on your first attempt or need to retake, you've learned valuable skills that will serve you throughout your career.
Believe in yourself. Trust your preparation. You've got this! šÆ
Good luck on your AWS Certified Solutions Architect - Associate exam!
Previous Chapter: 07_study_strategies - Study Techniques & Test-Taking Strategies
I consistently score 75%+ on full-length practice tests
I can complete 65 questions in 130 minutes with time to review
I understand all four exam domains thoroughly
I can explain AWS services and when to use them
I recognize common question patterns and traps
I've reviewed all my incorrect practice test answers
I'm confident in my test-taking strategies
I've had adequate rest and am mentally prepared
If you checked all boxes: You're ready! Trust your preparation and go ace that exam!
If you're missing any: Take an extra week to address those areas. It's better to be over-prepared than under-prepared.
Final Words of Encouragement
You've put in the work. You've studied the material. You've practiced the questions. You understand the concepts.
Trust yourself. You're ready for this.
Remember:
Read each question carefully
Eliminate wrong answers systematically
Choose the BEST answer, not just a correct answer
Manage your time wisely
Don't overthink - your first instinct is usually right
Stay calm and confident
Good luck on your AWS Certified Solutions Architect - Associate exam!
You've got this! š
After the exam: Whether you pass or not, be proud of the effort you put in. If you pass, celebrate! If not, review your score report, identify weak areas, and try again. Many successful architects didn't pass on their first attempt.
Exam Day Checklist
Morning of the Exam
3-4 Hours Before Exam:
Wake up at your normal time (don't disrupt sleep schedule)
Eat a healthy breakfast with protein and complex carbs
Avoid excessive caffeine (no more than your normal amount)
Do a light 15-minute review of your cheat sheet
Review your brain dump list one final time
2 Hours Before Exam:
Gather required items:
Two forms of ID (government-issued photo ID + secondary ID)
Confirmation email with exam appointment details
Water bottle (if allowed at test center)
Snack for after the exam
Dress comfortably (layers for temperature control)
Use the restroom before leaving
1 Hour Before Exam:
Arrive at test center 30 minutes early
Turn off phone and store in locker
Complete check-in process
Review test center rules and procedures
Take a few deep breaths to calm nerves
At the Test Station:
Adjust chair and monitor for comfort
Test headphones/earplugs if provided
Verify scratch paper and pen/pencil
Read all on-screen instructions carefully
Start the exam when ready
During the Exam
First 5 Minutes (Brain Dump):
Write down all memorized facts on scratch paper:
Port numbers (22, 80, 443, 3389, etc.)
Service limits (Lambda 15 min, S3 5 TB object, etc.)
Pricing comparisons (RI vs Spot vs On-Demand)
DR strategies (RTO/RPO for each)
Storage classes and costs
Any formulas or calculations
Time Management Strategy:
First Pass (60 minutes): Answer all questions you're confident about
Skip difficult questions (mark for review)
Aim to answer 40-45 questions in first pass
Build confidence with easy wins
Second Pass (40 minutes): Tackle marked questions
Use elimination method
Apply decision frameworks
Make educated guesses
Don't leave any blank
Final Pass (20 minutes): Review all answers
Check for misread questions
Verify you answered what was asked
Look for careless mistakes
Trust your first instinct (don't overthink)
Question-Answering Strategy:
Read the scenario carefully (identify key details)
Identify the question type:
"Most cost-effective" ā Choose cheapest option
"Least operational overhead" ā Choose managed service
Choose the BEST answer (not just a correct answer)
Watch for qualifier words: "MOST", "LEAST", "BEST", "FIRST"
Common Traps to Avoid:
Don't overthink simple questions
Don't assume information not given in the scenario
Don't choose answers with absolute words ("always", "never")
Don't pick the longest answer just because it's detailed
Don't change answers unless you're certain (first instinct usually right)
Mental Strategies
If You Feel Overwhelmed:
Take 3 deep breaths (in through nose, out through mouth)
Close your eyes for 10 seconds
Remind yourself: "I've prepared for this. I know this material."
Skip the current question and come back to it
Answer a few easy questions to rebuild confidence
If You're Running Out of Time:
Don't panic - you have time
Focus on answering remaining questions (don't leave blank)
Use elimination method quickly
Make educated guesses based on patterns
Trust your preparation
If You Don't Know an Answer:
Eliminate obviously wrong answers
Look for AWS best practices in remaining options
Choose the most managed/automated solution
Choose the most secure option if security-related
Choose the most cost-effective if cost-related
Make a guess and move on (don't dwell)
After the Exam
Immediately After:
Take a deep breath - you did it!
Don't discuss questions with others (NDA violation)
Collect your belongings from locker
Review your preliminary pass/fail result (if shown)
Within 5 Business Days:
Check your email for official score report
Review your performance by domain
If you passed: Celebrate! Share your achievement!
If you didn't pass: Review weak areas, schedule retake
If You Passed:
Download your digital badge from AWS Certification portal
Add certification to LinkedIn profile
Update your resume
Request physical certificate (optional)
Consider next certification (SAP-C02, DVA-C02, SOA-C02)
If You Didn't Pass:
Don't be discouraged - many successful architects failed first attempt
Review your score report to identify weak domains
Focus study on domains where you scored lowest
Retake practice tests for those specific domains
Schedule retake after 14-day waiting period
You've got this - try again!
Final Confidence Boosters
You're Ready If...
You've completed all chapters in this study guide
You score 75%+ on practice tests consistently
You can explain concepts without looking at notes
You recognize question patterns instantly
You make decisions quickly using frameworks
You've reviewed all domain summaries
You've practiced with all bundle types
Remember These Truths
You've put in the work - Trust your preparation
The exam is fair - It tests what you've studied
You don't need 100% - 720/1000 is passing (72%)
Educated guesses are okay - No penalty for wrong answers
First instinct is usually right - Don't overthink
You belong here - You've earned this opportunity
Final Mantras
"I am prepared and confident"
"I know this material"
"I will read each question carefully"
"I will choose the BEST answer"
"I trust my preparation"
"I've got this!"
Post-Exam Reflection
Regardless of Result
What You've Accomplished:
ā Studied 60,000+ words of comprehensive material
ā Learned 100+ AWS services and their use cases
ā Practiced 500+ exam-style questions
ā Mastered 4 major domains of cloud architecture
ā Developed critical thinking for cloud solutions
ā Invested weeks/months in professional development
This Knowledge is Valuable:
You now understand cloud architecture principles
You can design secure, resilient, high-performing, cost-optimized solutions
You've gained skills that are in high demand
You've proven your commitment to learning
You're better prepared for real-world AWS projects
Next Steps:
Apply this knowledge in your work
Build projects to reinforce learning
Share knowledge with others
Continue learning (cloud is always evolving)
Pursue additional certifications if desired
Closing Words
You've reached the end of this comprehensive study guide. Whether you're reading this the night before your exam or weeks in advance, know that you've invested significant time and effort into your professional development.
The exam is just one milestone in your cloud journey. The real value is in the knowledge you've gained and the skills you've developed. These will serve you throughout your career.
Trust yourself. You've prepared thoroughly. You understand the concepts. You can do this.
Good luck on your AWS Certified Solutions Architect - Associate exam!
You've got this! š
One Final Reminder:
Read each question carefully
Eliminate wrong answers systematically
Choose the BEST answer, not just a correct answer
Manage your time wisely
Stay calm and confident
Now go ace that exam!
Appendices
Appendix A: Quick Reference Tables
S3 Storage Classes Comparison
Storage Class
Cost/GB-month
Retrieval Time
Retrieval Cost
Min Duration
Use Case
Standard
$0.023
Milliseconds
None
None
Frequent access
Intelligent-Tiering
$0.023 + $0.0025/1K objects
Milliseconds
None
None
Unknown pattern
Standard-IA
$0.0125
Milliseconds
$0.01/GB
30 days
Infrequent access
One Zone-IA
$0.01
Milliseconds
$0.01/GB
30 days
Reproducible data
Glacier Instant
$0.004
Milliseconds
$0.03/GB
90 days
Archive, instant
Glacier Flexible
$0.0036
Minutes-hours
$0.01-0.03/GB
90 days
Archive, flexible
Glacier Deep Archive
$0.00099
12-48 hours
$0.02/GB
180 days
Long-term archive
EC2 Instance Families
Family
Type
vCPU:Memory Ratio
Use Case
Example
T3
Burstable
1:2
Variable workloads
Web servers, dev/test
M5
General Purpose
1:4
Balanced
App servers, databases
C5
Compute Optimized
1:2
High CPU
Batch, gaming, encoding
R5
Memory Optimized
1:8
High memory
In-memory DBs, big data
I3
Storage Optimized
1:8 + NVMe
High I/O
NoSQL, data warehousing
P3
GPU
GPU
ML training
Deep learning, HPC
G4
GPU
GPU
Graphics
ML inference, rendering
RDS vs DynamoDB
Feature
RDS
DynamoDB
Type
Relational (SQL)
NoSQL (key-value)
Scaling
Vertical (instance size)
Horizontal (automatic)
Latency
5-10ms
1-5ms
Throughput
Limited by instance
Unlimited (on-demand)
Transactions
ACID
Eventually consistent (default)
Queries
Complex SQL
Simple key-based
Cost
Instance hours
Request-based
Use Case
Complex queries, joins
High-scale, simple queries
Load Balancer Types
Feature
ALB
NLB
GWLB
Layer
7 (HTTP/HTTPS)
4 (TCP/UDP)
3 (IP)
Performance
Moderate
Ultra-high
High
Routing
Content-based
Connection-based
Transparent
Static IP
No
Yes
Yes
WebSocket
Yes
Yes
No
Use Case
Web apps, microservices
TCP/UDP, extreme performance
Firewalls, IDS/IPS
Appendix B: Key Service Limits
S3 Limits
Buckets per account: 100 (soft limit)
Object size: 5 TB maximum
Single PUT: 5 GB maximum
Multipart upload: 5 TB maximum
Request rate: 5,500 GET/sec, 3,500 PUT/sec per prefix
EC2 Limits
On-Demand instances: 20 per region (soft limit)
Reserved Instances: No limit
Spot Instances: Dynamic (based on capacity)
EBS volumes: 5,000 per region
Elastic IPs: 5 per region (soft limit)
VPC Limits
VPCs per region: 5 (soft limit)
Subnets per VPC: 200
Security Groups per VPC: 2,500
Rules per Security Group: 60 inbound, 60 outbound
NACLs per VPC: 200
Rules per NACL: 20 (soft limit)
RDS Limits
DB instances: 40 per region
Read replicas: 15 per primary
Automated backups: 35 days retention
Manual snapshots: No limit
Storage: 64 TB maximum (most engines)
Lambda Limits
Concurrent executions: 1,000 per region (soft limit)
Choose best answer: Not just correct, but BEST for the scenario
After the Exam
Results available immediately (pass/fail)
Detailed score report within 5 business days
Certificate available in AWS Certification account
Valid for 3 years from exam date
Consider next certification: Solutions Architect Professional, DevOps Engineer, Security Specialty
Final Encouragement
You've completed a comprehensive study guide covering:
ā 60,000+ words of detailed content
ā 129 visual diagrams for complex concepts
ā All four exam domains with deep explanations
ā Hundreds of examples and scenarios
ā Decision frameworks and best practices
ā Quick reference materials and cheat sheets
You are well-prepared. Trust your knowledge. Stay calm. You've got this!
Congratulations on completing this study guide! Best of luck on your AWS Certified Solutions Architect - Associate (SAA-C03) exam! šÆš
Study Guide Complete | Total Word Count: ~85,000 words | Diagrams: 129 files | Ready for Exam ā
Final Words
You're Ready When...
You score 75%+ on all practice tests consistently
You can explain key concepts without notes
You recognize question patterns instantly
You make decisions quickly using frameworks
You understand trade-offs between different solutions
You can design complete architectures from scratch
Remember
On Exam Day:
Trust your preparation - you've put in the work
Read questions carefully - every word matters
Eliminate wrong answers systematically
Choose the BEST answer, not just a correct answer
Manage your time - 2 minutes per question
Don't overthink - your first instinct is usually right
Stay calm and confident throughout
The Exam Tests:
Your ability to design secure, resilient, high-performing, cost-optimized architectures
Your understanding of AWS services and when to use them
Your ability to make trade-off decisions
Your knowledge of best practices and design patterns
You've Learned:
500+ practice questions with detailed explanations
100,000+ words of comprehensive study material
173 visual diagrams covering all key concepts
All four exam domains in depth
Integration patterns and real-world scenarios
Test-taking strategies and time management
You're Prepared!
Go into that exam with confidence. You've studied hard, practiced extensively, and you know this material.
Good luck on your AWS Certified Solutions Architect - Associate exam! šÆ
After Passing: Congratulations! You're now an AWS Certified Solutions Architect - Associate. Update your LinkedIn, celebrate your achievement, and start applying your knowledge to real-world projects.
If You Need to Retake: Don't be discouraged. Review your score report, identify weak areas, study those topics, and try again. Many successful architects didn't pass on their first attempt. Persistence pays off!