CC

SAA-C03 Study Guide & Reviewer

Comprehensive Study Materials & Key Concepts

AWS Certified Solutions Architect - Associate (SAA-C03) Comprehensive Study Guide

Complete Learning Path for Certification Success

Overview

This study guide provides a structured learning path from fundamentals to exam readiness for the AWS Certified Solutions Architect - Associate (SAA-C03) certification. Designed for complete novices, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.

About This Certification

Exam Code: SAA-C03
Exam Duration: 130 minutes
Number of Questions: 65 (50 scored + 15 unscored)
Passing Score: 720 out of 1000
Question Types: Multiple choice (one correct answer) and multiple response (two or more correct answers)
Exam Format: Scenario-based questions testing real-world architecture decisions

Target Candidate: Individuals with at least 1 year of hands-on experience designing cloud solutions using AWS services, though this guide is designed to teach complete beginners from the ground up.

What This Guide Covers

This comprehensive study guide covers all four domains of the SAA-C03 exam:

  1. Domain 1: Design Secure Architectures (30% of exam)

    • Secure access to AWS resources
    • Secure workloads and applications
    • Data security controls
  2. Domain 2: Design Resilient Architectures (26% of exam)

    • Scalable and loosely coupled architectures
    • Highly available and fault-tolerant architectures
  3. Domain 3: Design High-Performing Architectures (24% of exam)

    • High-performing storage solutions
    • Elastic compute solutions
    • High-performing database solutions
    • Scalable network architectures
    • Data ingestion and transformation solutions
  4. Domain 4: Design Cost-Optimized Architectures (20% of exam)

    • Cost-optimized storage solutions
    • Cost-optimized compute solutions
    • Cost-optimized database solutions
    • Cost-optimized network architectures

Section Organization

Study Sections (read in order):

  • Overview (this section) - How to use the guide and study plan
  • Fundamentals - Section 0: Essential background and prerequisites
  • 02_domain1_secure_architectures - Section 1: Security (30% of exam)
  • 03_domain2_resilient_architectures - Section 2: Resilience (26% of exam)
  • 04_domain3_high_performing_architectures - Section 3: Performance (24% of exam)
  • 05_domain4_cost_optimized_architectures - Section 4: Cost Optimization (20% of exam)
  • Integration - Integration & cross-domain scenarios
  • Study strategies - Study techniques & test-taking strategies
  • Final checklist - Final week preparation checklist
  • Appendices - Quick reference tables, glossary, resources
  • diagrams/ - Folder containing all Mermaid diagram files (.mmd)

Study Plan Overview

Total Time: 6-10 weeks (2-3 hours daily)

Week-by-Week Breakdown:

  • Week 1-2: Fundamentals & Domain 1 (Security)

    • Read: 01_fundamentals (8-10 hours)
    • Read: 02_domain1_secure_architectures (12-15 hours)
    • Practice: Domain 1 focused questions
    • Goal: Understand IAM, VPC security, encryption
  • Week 3-4: Domain 2 (Resilience)

    • Read: 03_domain2_resilient_architectures (12-15 hours)
    • Practice: Domain 2 focused questions
    • Goal: Master high availability, disaster recovery, scalability
  • Week 5-6: Domain 3 (Performance)

    • Read: 04_domain3_high_performing_architectures (12-15 hours)
    • Practice: Domain 3 focused questions
    • Goal: Optimize storage, compute, database, network performance
  • Week 7: Domain 4 (Cost Optimization)

    • Read: 05_domain4_cost_optimized_architectures (8-10 hours)
    • Practice: Domain 4 focused questions
    • Goal: Understand pricing models, cost optimization strategies
  • Week 8: Integration & Cross-Domain Scenarios

    • Read: 06_integration (6-8 hours)
    • Practice: Full practice tests
    • Goal: Connect concepts across domains
  • Week 9: Practice & Review

    • Complete all practice test bundles
    • Review weak areas
    • Target: 75%+ on practice tests
  • Week 10: Final Preparation

    • Read: 07_study_strategies
    • Read: 08_final_checklist
    • Final review of 99_appendices
    • Light review, rest, exam day

Learning Approach

1. Read Actively

  • Don't just read - engage with the material
  • Take notes on ⭐ Must Know items
  • Draw your own diagrams to reinforce concepts
  • Explain concepts out loud to yourself

2. Use the Diagrams

  • Study each diagram carefully
  • Understand how components interact
  • Trace data flows and decision paths
  • Recreate diagrams from memory

3. Practice Regularly

  • Complete exercises after each section
  • Use practice questions to validate understanding
  • Review explanations for both correct and incorrect answers
  • Identify patterns in question types

4. Test Yourself

  • Use self-assessment checklists at end of each chapter
  • Take practice tests under timed conditions
  • Aim for 80%+ before moving to next chapter
  • Review mistakes thoroughly

5. Review Strategically

  • Revisit marked sections weekly
  • Focus on weak areas identified in practice tests
  • Use 99_appendices for quick reference
  • Create your own summary notes

Progress Tracking

Use checkboxes to track your completion:

Chapter Completion:

  • Chapter 0: Fundamentals (01_fundamentals)
  • Chapter 1: Domain 1 - Secure Architectures (02_domain1_secure_architectures)
  • Chapter 2: Domain 2 - Resilient Architectures (03_domain2_resilient_architectures)
  • Chapter 3: Domain 3 - High-Performing Architectures (04_domain3_high_performing_architectures)
  • Chapter 4: Domain 4 - Cost-Optimized Architectures (05_domain4_cost_optimized_architectures)
  • Integration Chapter (06_integration)
  • Study Strategies (07_study_strategies)
  • Final Checklist (08_final_checklist)

Practice Test Performance:

  • Beginner Practice Test 1: ___% (target: 70%+)
  • Beginner Practice Test 2: ___% (target: 75%+)
  • Intermediate Practice Test 1: ___% (target: 70%+)
  • Intermediate Practice Test 2: ___% (target: 75%+)
  • Full Practice Test 1: ___% (target: 75%+)
  • Full Practice Test 2: ___% (target: 80%+)
  • Full Practice Test 3: ___% (target: 85%+)

Domain Mastery:

  • Domain 1 (Security): Practice score 80%+
  • Domain 2 (Resilience): Practice score 80%+
  • Domain 3 (Performance): Practice score 80%+
  • Domain 4 (Cost): Practice score 80%+

Legend

Throughout this guide, you'll see these visual markers:

  • ⭐ Must Know: Critical information for the exam - memorize this
  • šŸ’” Tip: Helpful insight, shortcut, or best practice
  • āš ļø Warning: Common mistake or misconception to avoid
  • šŸ”— Connection: Related to other topics in the guide
  • šŸ“ Practice: Hands-on exercise or scenario to work through
  • šŸŽÆ Exam Focus: Frequently tested concept or pattern
  • šŸ“Š Diagram: Visual representation available (see diagrams folder)

How to Navigate

Sequential Learning (Recommended for Beginners):

  1. Start with 01_fundamentals
  2. Progress through domain chapters in order (02 → 03 → 04 → 05)
  3. Complete 06_integration
  4. Review 07_study_strategies before practice tests
  5. Use 08_final_checklist in your final week
  6. Keep 99_appendices open for quick reference

Targeted Learning (For Experienced Users):

  1. Take a practice test to identify weak areas
  2. Jump directly to relevant domain chapters
  3. Focus on sections marked šŸŽÆ Exam Focus
  4. Use 99_appendices for quick refreshers
  5. Complete 06_integration for cross-domain scenarios

Visual Learning (For Diagram-Focused Study):

  1. Browse the diagrams/ folder
  2. Study architecture diagrams first
  3. Read corresponding text sections for context
  4. Recreate diagrams from memory
  5. Use diagrams to explain concepts to others

Study Tips for Success

Before You Start:

  • Set a realistic study schedule (2-3 hours daily)
  • Create a dedicated study space
  • Gather materials: notebook, highlighter, practice tests
  • Set your exam date (6-10 weeks out)

During Your Study:

  • Study in focused 45-60 minute blocks
  • Take 10-15 minute breaks between blocks
  • Review previous day's material before starting new content
  • Create flashcards for ⭐ Must Know items
  • Join AWS study groups or forums for support

Practice Test Strategy:

  • Take first practice test after Week 2 (baseline)
  • Take practice tests weekly to track progress
  • Review ALL explanations, even for correct answers
  • Identify patterns in mistakes
  • Retake missed questions after reviewing concepts

Final Week:

  • No new material - only review
  • Focus on weak areas identified in practice tests
  • Review all ⭐ Must Know items
  • Complete 08_final_checklist
  • Get adequate sleep

What Makes This Guide Different

Comprehensive for Novices:

  • Assumes zero prior AWS knowledge
  • Explains WHY concepts exist, not just WHAT they are
  • Uses real-world analogies for complex topics
  • Progressive learning from simple to complex

Self-Sufficient:

  • No external resources needed
  • All concepts explained in detail
  • Multiple examples for each topic
  • Complete coverage of exam domains

Visually Rich:

  • 120-200 Mermaid diagrams
  • Architecture patterns for all major services
  • Decision trees for service selection
  • Sequence diagrams for workflows

Exam-Focused:

  • Only covers exam-relevant content
  • Highlights frequently tested concepts
  • Provides test-taking strategies
  • Includes question-answering frameworks

Practical:

  • Real-world scenarios throughout
  • Hands-on exercises
  • Troubleshooting guidance
  • Best practices from AWS Well-Architected Framework

Prerequisites

Recommended Background:

  • Basic understanding of networking (IP addresses, DNS, HTTP/HTTPS)
  • Familiarity with operating systems (Linux or Windows)
  • Basic programming or scripting knowledge (helpful but not required)
  • Understanding of databases (SQL vs NoSQL concepts)

If You're Missing Prerequisites:

  • Chapter 01_fundamentals covers essential background
  • Glossary in 99_appendices defines all technical terms
  • Diagrams provide visual explanations of complex concepts
  • Examples use relatable analogies

How to Use Practice Tests

Practice Test Bundles Included:

  1. Difficulty-Based (6 bundles):

    • Beginner 1 & 2: Build confidence with foundational questions
    • Intermediate 1 & 2: Test understanding of core concepts
    • Advanced 1 & 2: Challenge yourself with complex scenarios
  2. Full Practice Tests (3 bundles):

    • Simulate real exam conditions (65 questions, 130 minutes)
    • Domain-balanced like actual exam
    • Mixed difficulty levels
  3. Domain-Focused (9 bundles):

    • Target specific domains for focused practice
    • Identify weak areas by domain
    • Deep dive into domain-specific concepts
  4. Service-Focused (6 bundles):

    • Practice questions by AWS service category
    • Master specific service groups
    • Understand service integrations

When to Use Each Type:

  • Weeks 1-7: Use domain-focused bundles after completing each chapter
  • Week 8: Take full practice tests to simulate exam
  • Week 9: Use difficulty-based and service-focused bundles to target weak areas
  • Week 10: Final full practice test for confidence check

Expected Outcomes

After Completing This Guide:

  • āœ… Understand all four exam domains thoroughly
  • āœ… Design secure, resilient, high-performing, cost-optimized architectures
  • āœ… Select appropriate AWS services for different scenarios
  • āœ… Explain architectural decisions using AWS best practices
  • āœ… Score 75%+ on practice tests consistently
  • āœ… Feel confident on exam day

Skills You'll Develop:

  • Architecture design and evaluation
  • Service selection and comparison
  • Security best practices implementation
  • Cost optimization strategies
  • Performance tuning techniques
  • Disaster recovery planning
  • Troubleshooting and problem-solving

Getting Help

If You're Stuck:

  1. Review the relevant section in the chapter
  2. Study the associated diagrams
  3. Check 99_appendices for quick reference
  4. Review practice question explanations
  5. Revisit 01_fundamentals for foundational concepts

Additional Resources (After Completing This Guide):

  • AWS Documentation (official reference)
  • AWS Whitepapers (Well-Architected Framework)
  • AWS Training and Certification portal
  • AWS re:Invent videos (for deeper dives)

Ready to Begin?

Start with Fundamentals to build your foundation, then progress through each domain chapter. Remember: this is a marathon, not a sprint. Consistent daily study is more effective than cramming.

Your journey to AWS Solutions Architect - Associate certification starts now!


Last Updated: October 2025
Exam Version: SAA-C03
Study Guide Version: 1.0


Quick Start Guide

For Complete Beginners (6-10 weeks):

  1. Week 1: Read 01_fundamentals + take notes
  2. Week 2-3: Read 02_domain1_secure_architectures + practice Domain 1 questions
  3. Week 4-5: Read 03_domain2_resilient_architectures + practice Domain 2 questions
  4. Week 6: Read 04_domain3_high_performing_architectures + practice Domain 3 questions
  5. Week 7: Read 05_domain4_cost_optimized_architectures + practice Domain 4 questions
  6. Week 8: Read 06_integration + take full practice tests
  7. Week 9: Review weak areas + retake practice tests (target: 80%+)
  8. Week 10: Read 07_study_strategies + 08_final_checklist + light review

For Experienced Users (3-4 weeks):

  1. Week 1: Skim all domain chapters + take full practice test (identify weak areas)
  2. Week 2: Deep dive into weak domains + domain-focused practice tests
  3. Week 3: Read 06_integration + take full practice tests (target: 85%+)
  4. Week 4: Read 07_study_strategies + 08_final_checklist + final review

For Last-Minute Review (1 week):

  1. Day 1-5: Review all chapter summaries + 99_appendices
  2. Day 6: Take full practice test + review mistakes
  3. Day 7: Read 08_final_checklist + light review + rest

Next Chapter: 01_fundamentals - Essential Background & Prerequisites

Good luck on your certification journey! šŸš€

Study Plan Overview

Total Time: 6-10 weeks (2-3 hours daily)
Target Audience: Complete novices to AWS certification
Exam: AWS Certified Solutions Architect - Associate (SAA-C03)

Weekly Breakdown

Week 1-2: Fundamentals & Domain 1 (Security)

  • Days 1-3: Read 01_fundamentals (8-10 hours)
    • AWS global infrastructure
    • Well-Architected Framework
    • Core concepts and terminology
    • Complete self-assessment checklist
  • Days 4-10: Read 02_domain1_secure_architectures (12-15 hours)
    • IAM and access management
    • Network security (VPC, security groups, NACLs)
    • Data protection and encryption
    • Complete practice questions (Domain 1 Bundle 1)
    • Target: 70%+ on beginner questions

Week 3-4: Domain 2 (Resilience)

  • Days 11-17: Read 03_domain2_resilient_architectures (12-15 hours)
    • Loose coupling and microservices
    • Messaging services (SQS, SNS, EventBridge)
    • High availability and fault tolerance
    • Disaster recovery strategies
    • Complete practice questions (Domain 2 Bundle 1)
    • Target: 70%+ on beginner questions

Week 5-6: Domain 3 (Performance)

  • Days 18-24: Read 04_domain3_high_performing_architectures (12-15 hours)
    • Storage performance optimization
    • Compute optimization (EC2, Lambda, containers)
    • Database performance (RDS, Aurora, DynamoDB)
    • Network optimization (CloudFront, Global Accelerator)
    • Data ingestion and analytics
    • Complete practice questions (Domain 3 Bundle 1)
    • Target: 70%+ on beginner questions

Week 7: Domain 4 (Cost Optimization)

  • Days 25-28: Read 05_domain4_cost_optimized_architectures (8-10 hours)
    • Storage cost optimization
    • Compute pricing models (Reserved, Spot, Savings Plans)
    • Database cost strategies
    • Network cost optimization
    • Complete practice questions (Domain 4 Bundle 1)
    • Target: 70%+ on beginner questions

Week 8: Integration & Practice

  • Days 29-31: Read 06_integration (6-8 hours)
    • Cross-domain scenarios
    • Multi-service architectures
    • Real-world case studies
  • Days 32-35: Full Practice Tests
    • Take Full Practice Test 1 (65 questions, 130 minutes)
    • Review incorrect answers thoroughly
    • Target: 65%+ overall score
    • Take Full Practice Test 2 (65 questions, 130 minutes)
    • Review incorrect answers thoroughly
    • Target: 70%+ overall score

Week 9: Review & Advanced Practice

  • Days 36-38: Read 07_study_strategies (4-6 hours)
    • Test-taking techniques
    • Question analysis methods
    • Time management strategies
  • Days 39-42: Advanced Practice
    • Take Full Practice Test 3 (65 questions, 130 minutes)
    • Target: 75%+ overall score
    • Review all flagged topics from previous tests
    • Take domain-specific bundles for weak areas
    • Complete intermediate and advanced questions

Week 10: Final Preparation

  • Days 43-45: Read 08_final_checklist (2-3 hours)
    • Final week preparation guide
    • Day-before checklist
    • Exam day tips
  • Days 46-48: Final Review
    • Review 99_appendices (quick reference)
    • Skim chapter summaries and quick reference cards
    • Review all diagrams for visual reinforcement
    • Take one final practice test
    • Target: 80%+ overall score
  • Day 49: Rest and light review
    • Review cheat sheet only (30 minutes)
    • Get good sleep
    • Prepare exam day materials
  • Day 50: Exam Day!

Learning Approach

1. Read: Study each section thoroughly

  • Don't rush - understanding is more important than speed
  • Take notes on ⭐ Must Know items
  • Draw your own diagrams to reinforce concepts
  • Pause to research unfamiliar terms

2. Visualize: Study all diagrams carefully

  • Each chapter has 10-30 Mermaid diagrams
  • Diagrams are in the diagrams/ folder
  • Understand how components interact
  • Recreate diagrams from memory

3. Practice: Complete exercises after each section

  • Hands-on exercises reinforce learning
  • Use AWS Free Tier for practical experience
  • Build simple architectures to test understanding

4. Test: Use practice questions to validate understanding

  • Start with beginner questions (target: 80%+)
  • Progress to intermediate (target: 70%+)
  • Challenge yourself with advanced (target: 60%+)
  • Review explanations for ALL questions (correct and incorrect)

5. Review: Revisit marked sections as needed

  • Use quick reference cards for rapid review
  • Focus on weak areas identified in practice tests
  • Spaced repetition improves retention

Progress Tracking

Use checkboxes to track completion:

Fundamentals & Prerequisites:

  • 01_fundamentals completed
  • Fundamentals self-assessment passed (80%+)
  • Core concepts understood

Domain 1: Secure Architectures (30% of exam):

  • 02_domain1_secure_architectures completed
  • Domain 1 practice questions passed (70%+)
  • IAM concepts mastered
  • Network security understood
  • Data protection strategies clear

Domain 2: Resilient Architectures (26% of exam):

  • 03_domain2_resilient_architectures completed
  • Domain 2 practice questions passed (70%+)
  • Messaging services understood
  • High availability patterns mastered
  • DR strategies clear

Domain 3: High-Performing Architectures (24% of exam):

  • 04_domain3_high_performing_architectures completed
  • Domain 3 practice questions passed (70%+)
  • Storage optimization understood
  • Compute optimization mastered
  • Database performance clear

Domain 4: Cost-Optimized Architectures (20% of exam):

  • 05_domain4_cost_optimized_architectures completed
  • Domain 4 practice questions passed (70%+)
  • Pricing models understood
  • Cost optimization strategies mastered

Integration & Final Preparation:

  • 06_integration completed
  • 07_study_strategies completed
  • 08_final_checklist completed
  • 99_appendices reviewed
  • Full Practice Test 1 passed (65%+)
  • Full Practice Test 2 passed (70%+)
  • Full Practice Test 3 passed (75%+)
  • Final practice test passed (80%+)

Success Criteria

You're ready for the exam when:

  • You score 75%+ on all full practice tests
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You understand WHY answers are correct, not just WHAT they are
  • You can draw architecture diagrams from memory
  • You feel confident in all four domains

Study Tips

Active Learning:

  • Teach concepts to someone else (or explain out loud)
  • Draw diagrams and architectures on paper
  • Write your own practice questions
  • Compare and contrast similar services

Memory Aids:

  • Use mnemonics for lists (e.g., SAML = Security Assertion Markup Language)
  • Create visual patterns and associations
  • Use the quick reference cards for rapid review
  • Review diagrams regularly for visual reinforcement

Time Management:

  • Study at the same time each day (builds habit)
  • Take 10-minute breaks every hour
  • Don't cram - consistent daily study is better
  • Review previous material before starting new content

Avoid Common Mistakes:

  • Don't skip fundamentals - they're the foundation
  • Don't just read - actively engage with material
  • Don't ignore practice tests - they reveal gaps
  • Don't memorize - understand the WHY behind concepts
  • Don't study in isolation - join study groups or forums

Additional Resources

Official AWS Resources:

Practice and Hands-On:

Community:

How to Navigate This Guide

File Organization:

  • Files are numbered for sequential reading (00, 01, 02, etc.)
  • Each domain chapter is self-contained but builds on previous knowledge
  • Diagrams are in the diagrams/ folder, referenced in text
  • Quick reference cards at end of each chapter for rapid review

Reading Strategy:

  • Read chapters in order (01 → 02 → 03 → 04 → 05 → 06)
  • Don't skip ahead - concepts build progressively
  • Use 99_appendices as quick reference during study
  • Return to 08_final_checklist in your last week
  • Review 07_study_strategies before taking practice tests

Visual Learning:

  • 173 Mermaid diagrams throughout the guide
  • Each diagram has detailed text explanation
  • Diagrams show architecture, flows, decisions, and comparisons
  • Study diagrams carefully - they simplify complex concepts

Practice Integration:

  • Practice questions are organized by difficulty and domain
  • Start with beginner questions after reading each chapter
  • Progress to intermediate and advanced as confidence grows
  • Review explanations for ALL questions, not just incorrect ones

Legend

Throughout this guide, you'll see these markers:

  • ⭐ Must Know: Critical for exam success - memorize these
  • šŸ’” Tip: Helpful insight or shortcut to remember concepts
  • āš ļø Warning: Common mistake to avoid - exam traps
  • šŸ”— Connection: Related to other topics - cross-reference
  • šŸ“ Practice: Hands-on exercise to reinforce learning
  • šŸŽÆ Exam Focus: Frequently tested concept - high priority
  • šŸ“Š Diagram: Visual representation available in diagrams folder

Final Words

This comprehensive study guide is designed to take you from complete novice to exam-ready in 6-10 weeks. The key to success is:

  1. Consistency: Study 2-3 hours daily, every day
  2. Understanding: Focus on WHY, not just WHAT
  3. Practice: Take all practice tests and review thoroughly
  4. Patience: Don't rush - mastery takes time
  5. Confidence: Trust your preparation and stay calm

Remember: This guide is self-sufficient. You have everything you need to pass the SAA-C03 exam. Follow the study plan, complete all practice questions, and you'll be ready!

Good luck on your certification journey! šŸš€


Next Step: Begin with 01_fundamentals - Essential Background


Chapter 0: Essential Background and Prerequisites

Chapter Overview

What you'll learn:

  • AWS Global Infrastructure (Regions, Availability Zones, Edge Locations)
  • AWS Shared Responsibility Model
  • Core AWS concepts and terminology
  • AWS Well-Architected Framework fundamentals
  • Basic networking and cloud computing concepts

Time to complete: 8-10 hours
Prerequisites: None - this chapter starts from the basics

Why this matters: Understanding these foundational concepts is critical for the SAA-C03 exam. Every question assumes you know how AWS infrastructure works, what AWS is responsible for versus what you're responsible for, and how to apply architectural best practices. Without this foundation, the domain-specific chapters won't make sense.


Section 1: What is Cloud Computing?

Introduction

The problem: Traditional IT infrastructure requires companies to buy, install, and maintain physical servers in their own data centers. This means:

  • Large upfront capital expenses (buying servers, networking equipment, cooling systems)
  • Long lead times (weeks or months to procure and set up new hardware)
  • Capacity planning challenges (over-provision and waste money, or under-provision and run out of capacity)
  • Ongoing maintenance costs (power, cooling, physical security, hardware failures)
  • Difficulty scaling globally (need to build data centers in every region you serve)

The solution: Cloud computing provides on-demand access to computing resources (servers, storage, databases, networking) over the internet, with pay-as-you-go pricing. Instead of owning and maintaining physical infrastructure, you rent it from a cloud provider like AWS.

Why it's tested: The SAA-C03 exam assumes you understand the fundamental benefits of cloud computing and can design solutions that leverage these benefits. Questions often test whether you can identify when cloud-native solutions are more appropriate than traditional approaches.

Core Concepts

What is Cloud Computing?

What it is: Cloud computing is the on-demand delivery of IT resources over the internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services such as computing power, storage, and databases on an as-needed basis from a cloud provider like Amazon Web Services (AWS).

Why it exists: Before cloud computing, every company that needed IT infrastructure had to build and maintain their own data centers. This was expensive, time-consuming, and required specialized expertise. Cloud computing emerged to solve these problems by allowing companies to rent infrastructure instead of owning it, similar to how you rent an apartment instead of building a house.

Real-world analogy: Think of cloud computing like electricity from a power company. You don't build your own power plant - you plug into the grid and pay for what you use. Similarly, you don't build your own data center - you connect to AWS and pay for the computing resources you consume.

How it works (Detailed step-by-step):

  1. You identify your need: Your application needs a server to run a web application. Instead of buying physical hardware, you decide to use AWS.

  2. You provision resources via API/Console: You log into the AWS Management Console (a web interface) or use the AWS API (programmatic access) and request a virtual server (called an EC2 instance). You specify what type of server you need (CPU, memory, storage).

  3. AWS allocates resources: Within minutes, AWS provisions a virtual server for you from their massive pool of physical servers in their data centers. This virtual server is isolated from other customers' servers using virtualization technology.

  4. You use the resources: Your virtual server is now running and accessible over the internet. You can install your application, configure it, and start serving users. The server behaves just like a physical server you might have in your own data center.

  5. You pay for what you use: AWS meters your usage (how many hours the server runs, how much data you transfer, how much storage you use) and charges you accordingly. If you stop using the server, you stop paying for it.

  6. You scale as needed: If your application becomes popular and needs more servers, you can provision additional servers in minutes. If traffic decreases, you can terminate servers and stop paying for them. This elasticity is a key benefit of cloud computing.

The Six Advantages of Cloud Computing

⭐ Must Know: These six advantages appear frequently in exam questions. You need to recognize scenarios where each advantage applies.

  1. Trade capital expense for variable expense

    • What it means: Instead of paying large upfront costs for data centers and servers (capital expense), you pay only for the computing resources you consume (variable expense).
    • Example: A startup doesn't need $100,000 to buy servers before launching. They can start with $10/month on AWS and scale up as they grow.
    • Exam relevance: Questions test whether you can identify cost optimization opportunities by moving from fixed to variable costs.
  2. Benefit from massive economies of scale

    • What it means: AWS buys hardware and operates data centers at massive scale, achieving lower costs than individual companies could. These savings are passed to customers through lower prices.
    • Example: AWS can negotiate better prices with hardware vendors because they buy millions of servers. You benefit from these bulk discounts.
    • Exam relevance: Questions may ask why cloud solutions are often more cost-effective than on-premises solutions.
  3. Stop guessing capacity

    • What it means: You don't need to predict how much infrastructure you'll need months in advance. You can scale up or down based on actual demand.
    • Example: A retail website doesn't need to buy enough servers to handle Black Friday traffic all year round. They can scale up for Black Friday and scale down afterward.
    • Exam relevance: Questions test your understanding of auto-scaling and elastic architectures.
  4. Increase speed and agility

    • What it means: New IT resources are available in minutes instead of weeks. This allows faster experimentation and innovation.
    • Example: A developer can spin up a test environment in 5 minutes to try a new idea, instead of waiting weeks for IT to procure and configure hardware.
    • Exam relevance: Questions test whether you can design solutions that enable rapid deployment and iteration.
  5. Stop spending money running and maintaining data centers

    • What it means: You can focus on your business and applications instead of managing physical infrastructure (racking servers, managing power and cooling, physical security).
    • Example: A healthcare company can focus on improving patient care instead of hiring data center technicians.
    • Exam relevance: Questions test whether you understand the operational benefits of managed services.
  6. Go global in minutes

    • What it means: You can deploy your application in multiple geographic regions around the world with just a few clicks, providing lower latency to global users.
    • Example: A gaming company can deploy servers in North America, Europe, and Asia simultaneously to provide low-latency gameplay to players worldwide.
    • Exam relevance: Questions test your understanding of multi-region architectures and global deployment strategies.

šŸ’” Tip: When you see exam questions asking "Why should the company move to AWS?" or "What are the benefits of this cloud solution?", think about these six advantages. The correct answer often relates to one or more of them.


Section 2: AWS Global Infrastructure

Introduction

The problem: Applications need to be available to users around the world with low latency (fast response times). If all your servers are in one location, users far away will experience slow performance. Additionally, if that one location experiences a disaster (power outage, natural disaster, network failure), your entire application goes down.

The solution: AWS has built a global infrastructure with data centers distributed around the world. This allows you to deploy your application close to your users for low latency, and across multiple isolated locations for high availability and disaster recovery.

Why it's tested: Understanding AWS global infrastructure is fundamental to the SAA-C03 exam. Questions frequently test your ability to design architectures that leverage Regions, Availability Zones, and Edge Locations for resilience, performance, and compliance.

Core Concepts

AWS Regions

What it is: An AWS Region is a physical geographic area where AWS has multiple data centers. Each Region is completely independent and isolated from other Regions. As of 2025, AWS has 33+ Regions worldwide, with names like us-east-1 (N. Virginia), eu-west-1 (Ireland), and ap-southeast-1 (Singapore).

Why it exists: Regions exist to allow you to deploy applications close to your users (reducing latency), comply with data residency requirements (some countries require data to stay within their borders), and provide geographic redundancy (if one Region fails, your application can continue running in another Region).

Real-world analogy: Think of AWS Regions like different branches of a bank. Each branch operates independently - if the New York branch has a problem, the London branch continues operating normally. You choose which branch to use based on where you live (proximity) and local regulations.

How it works (Detailed step-by-step):

  1. AWS builds data centers in a geographic area: AWS selects a location (like Northern Virginia) and builds multiple data centers in that area. These data centers are connected with high-speed, low-latency networking.

  2. The Region is isolated: Each Region is completely independent. Resources in us-east-1 don't automatically replicate to eu-west-1. This isolation provides fault tolerance - a problem in one Region doesn't affect other Regions.

  3. You choose a Region for your resources: When you create AWS resources (like EC2 instances, S3 buckets, RDS databases), you must specify which Region to create them in. This decision is based on:

    • Proximity to users: Choose a Region close to your users for low latency
    • Compliance requirements: Some regulations require data to stay in specific countries
    • Service availability: Not all AWS services are available in all Regions
    • Cost: Pricing varies slightly between Regions
  4. Resources stay in that Region: Once created, resources remain in that Region unless you explicitly copy or move them. For example, an EC2 instance in us-east-1 cannot be directly moved to eu-west-1 - you would need to create a new instance in eu-west-1.

  5. You can deploy across multiple Regions: For global applications, you can deploy resources in multiple Regions and use services like Route 53 (DNS) and CloudFront (CDN) to route users to the nearest Region.

⭐ Must Know:

  • Each Region is completely isolated and independent
  • Resources don't automatically replicate across Regions
  • You choose the Region based on latency, compliance, service availability, and cost
  • Region names follow the pattern: geographic-area-number (e.g., us-east-1, eu-west-2)

Detailed Example 1: E-commerce Application Deployment

Imagine you're running an e-commerce website that sells products to customers in the United States and Europe. Here's how you would use Regions:

Scenario: Your company is based in the US, but 40% of your customers are in Europe. European customers complain about slow page load times.

Solution using Regions:

  1. Deploy your application in us-east-1 (N. Virginia) to serve US customers
  2. Deploy a copy of your application in eu-west-1 (Ireland) to serve European customers
  3. Use Route 53 with geolocation routing to automatically direct US users to us-east-1 and European users to eu-west-1
  4. Each Region has its own EC2 instances, load balancers, and databases
  5. You replicate product catalog data between Regions so both have the same inventory information

Result: US customers connect to servers in Virginia (low latency), European customers connect to servers in Ireland (low latency). If the Virginia Region experiences an outage, European customers are unaffected because Ireland is completely independent.

Detailed Example 2: Compliance Requirements

Scenario: A German healthcare company must comply with GDPR, which requires patient data to remain within the European Union.

Solution using Regions:

  1. Deploy all application resources in eu-central-1 (Frankfurt, Germany)
  2. Configure S3 buckets with region restrictions to prevent accidental data transfer outside the EU
  3. Use AWS Organizations with Service Control Policies (SCPs) to prevent developers from creating resources in non-EU Regions
  4. Enable CloudTrail logging to audit all data access and ensure compliance

Result: All patient data stays within the EU, satisfying GDPR requirements. The company can prove to regulators that data never leaves the EU Region.

Detailed Example 3: Disaster Recovery Across Regions

Scenario: A financial services company needs to ensure their trading platform remains available even if an entire AWS Region fails.

Solution using Regions:

  1. Primary deployment in us-east-1 (N. Virginia) handles all production traffic
  2. Standby deployment in us-west-2 (Oregon) remains ready but doesn't serve traffic
  3. Database replication from us-east-1 to us-west-2 keeps data synchronized
  4. Route 53 health checks monitor the us-east-1 deployment
  5. If us-east-1 fails, Route 53 automatically redirects traffic to us-west-2

Result: If the entire us-east-1 Region becomes unavailable (extremely rare but possible), the application automatically fails over to us-west-2 within minutes, minimizing downtime.

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Assuming resources automatically replicate across Regions

    • Why it's wrong: AWS Regions are completely isolated. If you create an EC2 instance in us-east-1, it doesn't automatically appear in eu-west-1.
    • Correct understanding: You must explicitly configure cross-region replication for services that support it (like S3, RDS, DynamoDB) or manually deploy resources in multiple Regions.
  • Mistake 2: Thinking all AWS services are available in all Regions

    • Why it's wrong: New AWS services typically launch in a few Regions first, then gradually expand to other Regions over time.
    • Correct understanding: Always check the AWS Regional Services List to confirm a service is available in your chosen Region before designing your architecture.
  • Mistake 3: Choosing a Region based only on cost

    • Why it's wrong: While cost is a factor, choosing a Region far from your users can result in poor performance (high latency), which may cost you more in lost customers than you save on infrastructure.
    • Correct understanding: Prioritize proximity to users and compliance requirements, then consider cost as a secondary factor.

šŸ”— Connections to Other Topics:

  • Relates to Availability Zones (covered next) because: Each Region contains multiple Availability Zones
  • Builds on Disaster Recovery (covered in Domain 2) by: Providing geographic redundancy for business continuity
  • Often used with Route 53 (covered in Domain 3) to: Route users to the nearest Region for optimal performance

Availability Zones (AZs)

What it is: An Availability Zone (AZ) is one or more discrete data centers within an AWS Region, each with redundant power, networking, and connectivity. Each Region has multiple AZs (typically 3-6), and they are physically separated from each other (different buildings, sometimes different flood plains) but connected with high-speed, low-latency networking.

Why it exists: Even within a single geographic region, you need protection against localized failures. A single data center could experience power outages, cooling failures, network issues, or natural disasters. By distributing your application across multiple AZs within a Region, you protect against these single-point-of-failure scenarios while maintaining low latency between components.

Real-world analogy: Think of Availability Zones like different buildings in a corporate campus. All buildings are in the same city (Region) and connected with high-speed fiber optic cables, but each building has its own power supply, cooling system, and network connection. If one building loses power, the others continue operating normally.

How it works (Detailed step-by-step):

  1. AWS builds multiple isolated data centers in a Region: Within each Region, AWS constructs 3-6 separate data center facilities. These are physically separated (typically 10-100 km apart) to protect against localized disasters, but close enough for low-latency communication (typically <2ms latency between AZs).

  2. Each AZ has independent infrastructure: Each AZ has its own:

    • Power supply (with backup generators and UPS systems)
    • Cooling systems
    • Network connectivity (multiple ISPs)
    • Physical security
      This independence means a failure in one AZ (like a power outage) doesn't affect other AZs.
  3. AZs are connected with redundant, high-speed networking: AWS connects AZs within a Region using multiple redundant 100 Gbps fiber optic connections. This allows your application components in different AZs to communicate quickly and reliably.

  4. You distribute resources across AZs: When designing your architecture, you deploy resources (EC2 instances, databases, load balancers) across multiple AZs. For example:

    • Deploy web servers in AZ-1a, AZ-1b, and AZ-1c
    • Use an Application Load Balancer that distributes traffic across all three AZs
    • Use RDS Multi-AZ to automatically replicate your database to a standby in a different AZ
  5. AWS handles failover automatically (for some services): Many AWS services automatically handle AZ failures. For example:

    • Elastic Load Balancers automatically stop sending traffic to unhealthy AZs
    • RDS Multi-AZ automatically fails over to the standby database in another AZ
    • S3 automatically replicates data across multiple AZs
  6. You benefit from high availability: If one AZ fails completely, your application continues running in the remaining AZs with minimal disruption.

⭐ Must Know:

  • Each Region has multiple AZs (minimum 3, typically 3-6)
  • AZs are physically separated but connected with low-latency networking
  • AZ names are Region-specific: us-east-1a, us-east-1b, us-east-1c, etc.
  • Deploying across multiple AZs is the primary way to achieve high availability in AWS
  • Some services (like S3, DynamoDB) automatically use multiple AZs; others (like EC2) require you to explicitly deploy across AZs

šŸ“Š Global Infrastructure Diagram:

graph TB
    subgraph "AWS Global Infrastructure"
        subgraph "Region: us-east-1 (N. Virginia)"
            subgraph "AZ-1a"
                DC1[Data Center 1]
                DC2[Data Center 2]
            end
            subgraph "AZ-1b"
                DC3[Data Center 3]
                DC4[Data Center 4]
            end
            subgraph "AZ-1c"
                DC5[Data Center 5]
                DC6[Data Center 6]
            end
        end
        
        subgraph "Region: eu-west-1 (Ireland)"
            subgraph "AZ-2a"
                DC7[Data Center 7]
            end
            subgraph "AZ-2b"
                DC8[Data Center 8]
            end
            subgraph "AZ-2c"
                DC9[Data Center 9]
            end
        end
        
        subgraph "Edge Locations"
            EDGE1[CloudFront Edge<br/>New York]
            EDGE2[CloudFront Edge<br/>London]
            EDGE3[CloudFront Edge<br/>Tokyo]
        end
    end
    
    DC1 -.Low-latency connection.-> DC3
    DC1 -.Low-latency connection.-> DC5
    DC3 -.Low-latency connection.-> DC5
    
    style DC1 fill:#c8e6c9
    style DC3 fill:#c8e6c9
    style DC5 fill:#c8e6c9
    style EDGE1 fill:#e1f5fe
    style EDGE2 fill:#e1f5fe
    style EDGE3 fill:#e1f5fe

See: diagrams/01_fundamentals_global_infrastructure.mmd

Diagram Explanation (detailed):

This diagram illustrates the hierarchical structure of AWS global infrastructure. At the highest level, we have Regions - completely independent geographic areas like us-east-1 (Northern Virginia) and eu-west-1 (Ireland). Each Region is isolated from other Regions, meaning resources don't automatically replicate between them and a failure in one Region doesn't affect others.

Within each Region, we see multiple Availability Zones (AZ-1a, AZ-1b, AZ-1c in us-east-1). Each AZ contains one or more data centers (shown as DC1, DC2, etc.). The green data centers in us-east-1 represent active data centers within different AZs, connected by low-latency, high-bandwidth networking (shown as dotted lines). This low-latency connection (typically <2ms) allows your application components in different AZs to communicate quickly, enabling you to build highly available architectures without sacrificing performance.

The physical separation between AZs (they're in different buildings, sometimes different flood plains) protects against localized failures. If AZ-1a experiences a power outage, AZ-1b and AZ-1c continue operating normally because they have independent power supplies, cooling systems, and network connections.

At the bottom, we see Edge Locations (shown in blue) - these are separate from Regions and AZs. Edge Locations are part of AWS's content delivery network (CloudFront) and are distributed in major cities worldwide (200+ locations). They cache content close to end users for faster delivery. Unlike Regions and AZs where you deploy your application infrastructure, Edge Locations are managed by AWS and used automatically when you enable CloudFront.

The key architectural principle shown here is defense in depth: Regions protect against geographic disasters, Availability Zones protect against localized failures within a Region, and multiple data centers within each AZ protect against individual data center failures. This multi-layered approach enables AWS to achieve extremely high availability (99.99% or higher for many services).

Detailed Example 1: Multi-AZ Web Application

Imagine you're deploying a three-tier web application (web servers, application servers, database) that needs to be highly available.

Scenario: Your e-commerce application must remain available even if an entire data center fails. Downtime costs $10,000 per minute in lost sales.

Solution using Multiple AZs:

  1. Web Tier (in 3 AZs):

    • Deploy 2 EC2 instances in us-east-1a running your web application
    • Deploy 2 EC2 instances in us-east-1b running your web application
    • Deploy 2 EC2 instances in us-east-1c running your web application
    • Total: 6 web servers distributed across 3 AZs
  2. Load Balancer (automatically multi-AZ):

    • Create an Application Load Balancer (ALB) and enable all 3 AZs
    • The ALB automatically distributes traffic across all 6 web servers
    • The ALB performs health checks every 30 seconds
    • If servers in one AZ become unhealthy, the ALB automatically stops sending traffic to that AZ
  3. Application Tier (in 3 AZs):

    • Deploy 2 EC2 instances in each AZ running your application logic
    • Total: 6 application servers distributed across 3 AZs
  4. Database Tier (Multi-AZ RDS):

    • Create an RDS database with Multi-AZ enabled
    • Primary database runs in us-east-1a
    • Standby database automatically created in us-east-1b
    • AWS synchronously replicates all data from primary to standby
    • If primary fails, AWS automatically promotes standby to primary (1-2 minute failover)

What happens when AZ-1a fails:

  1. The power goes out in the entire us-east-1a Availability Zone
  2. All EC2 instances in us-east-1a become unreachable (2 web servers, 2 app servers)
  3. The ALB detects failed health checks for servers in us-east-1a within 30 seconds
  4. The ALB stops sending new traffic to us-east-1a, routing all traffic to us-east-1b and us-east-1c
  5. RDS detects the primary database is unreachable and automatically fails over to the standby in us-east-1b (takes 1-2 minutes)
  6. Your application continues serving customers with 4 web servers and 4 app servers (instead of 6 each)
  7. Performance may be slightly degraded due to reduced capacity, but the application remains available
  8. When us-east-1a recovers, the ALB automatically starts sending traffic to those servers again

Result: Total downtime is approximately 1-2 minutes (during database failover), compared to potentially hours if you had deployed everything in a single AZ. The cost of running resources in 3 AZs instead of 1 is minimal (no extra charge for using multiple AZs, just the cost of the additional EC2 instances), but the benefit is massive (avoiding $10,000/minute in lost sales).

Detailed Example 2: Multi-AZ Database for Data Durability

Scenario: A financial services company stores transaction records in a database. Losing this data would be catastrophic (regulatory violations, customer lawsuits, loss of trust).

Solution using RDS Multi-AZ:

  1. Enable RDS Multi-AZ: When creating the RDS database, enable the Multi-AZ option

  2. Primary database in AZ-1a: Handles all read and write operations

  3. Standby database in AZ-1b: Receives synchronous replication of every transaction

  4. Synchronous replication: When your application writes data to the primary database:

    • The write is sent to the primary database in AZ-1a
    • The primary database immediately replicates the write to the standby in AZ-1b
    • Only after the standby confirms it has received the data does the primary acknowledge the write to your application
    • This ensures zero data loss - if the primary fails immediately after acknowledging a write, the standby already has that data
  5. Automatic failover: If the primary database fails:

    • RDS detects the failure within 60 seconds
    • RDS automatically promotes the standby to primary
    • RDS updates the DNS record to point to the new primary
    • Your application reconnects and continues operating
    • Total failover time: 1-2 minutes

Result: Even if the entire us-east-1a Availability Zone is destroyed (extremely unlikely but theoretically possible), you lose zero data because every transaction was synchronously replicated to us-east-1b before being acknowledged. The cost is approximately 2x the single-AZ database cost (you're running two database instances), but the benefit is guaranteed data durability and high availability.

Detailed Example 3: Auto Scaling Across AZs

Scenario: A news website experiences unpredictable traffic spikes when breaking news occurs. Traffic can increase from 1,000 requests/second to 50,000 requests/second within minutes.

Solution using Auto Scaling across AZs:

  1. Create an Auto Scaling Group: Configure it to maintain a minimum of 6 EC2 instances (2 per AZ) and scale up to 60 instances (20 per AZ)
  2. Distribute across 3 AZs: Configure the Auto Scaling Group to balance instances evenly across us-east-1a, us-east-1b, and us-east-1c
  3. Set scaling policies: When CPU utilization exceeds 70%, add 3 instances (1 per AZ). When CPU drops below 30%, remove 3 instances (1 per AZ)
  4. Use an ALB: The Application Load Balancer distributes traffic across all instances in all AZs

What happens during a traffic spike:

  1. Breaking news causes traffic to spike from 1,000 to 50,000 requests/second
  2. CPU utilization on existing instances quickly rises above 70%
  3. Auto Scaling detects high CPU and launches 3 new instances (1 in each AZ)
  4. The new instances register with the ALB and start receiving traffic within 2-3 minutes
  5. If CPU remains high, Auto Scaling continues adding instances (3 at a time, distributed across AZs) until traffic is handled or the maximum of 60 instances is reached
  6. When the traffic spike ends and CPU drops below 30%, Auto Scaling gradually terminates instances (3 at a time, maintaining balance across AZs)

Result: The application automatically scales to handle traffic spikes without manual intervention, and the multi-AZ distribution ensures that if one AZ fails during a traffic spike, the other two AZs continue serving traffic. The even distribution across AZs also ensures balanced load and prevents any single AZ from becoming a bottleneck.

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Deploying all resources in a single AZ to save costs

    • Why it's wrong: There's no cost savings - AWS doesn't charge extra for using multiple AZs. You pay for the resources (EC2 instances, storage, etc.), not for the number of AZs you use.
    • Correct understanding: Always deploy across at least 2 AZs (preferably 3) for production workloads. The only "cost" is the additional resources you run for redundancy (e.g., running 6 servers instead of 3), but this is necessary for high availability.
  • Mistake 2: Assuming AZ names are consistent across AWS accounts

    • Why it's wrong: AWS randomizes AZ names across accounts. Your us-east-1a might be a different physical data center than someone else's us-east-1a. This prevents all customers from concentrating resources in the same physical AZ.
    • Correct understanding: Use AZ IDs (like use1-az1) when coordinating across accounts, not AZ names (like us-east-1a).
  • Mistake 3: Thinking data automatically replicates across AZs

    • Why it's wrong: Only certain services automatically replicate across AZs (S3, DynamoDB, EFS). For EC2 instances and EBS volumes, you must explicitly configure replication or deploy resources in multiple AZs.
    • Correct understanding: Check each service's documentation to understand its AZ behavior. For EC2, you must manually launch instances in multiple AZs. For RDS, you must enable Multi-AZ. For S3, replication across AZs is automatic.

šŸ”— Connections to Other Topics:

  • Relates to High Availability (Domain 2) because: Multi-AZ deployments are the foundation of highly available architectures
  • Builds on Load Balancing (Domain 2) by: Using load balancers to distribute traffic across AZs
  • Often used with Auto Scaling (Domain 3) to: Automatically maintain balanced capacity across AZs

šŸ’” Tips for Understanding:

  • Think of AZs as "failure domains" - design your architecture so that the failure of any single AZ doesn't bring down your application
  • The rule of thumb: Always use at least 2 AZs for production workloads, preferably 3
  • Remember: Low latency between AZs (<2ms) means you can treat them almost like a single data center for performance purposes, but they're isolated for fault tolerance

Edge Locations and CloudFront

What it is: Edge Locations are AWS data centers specifically designed to deliver content to end users with the lowest possible latency. They are part of Amazon CloudFront, AWS's Content Delivery Network (CDN). AWS has 400+ Edge Locations in 90+ cities across 48 countries, far more than the 33 Regions.

Why it exists: Even if you deploy your application in multiple Regions, users far from those Regions will still experience high latency. For example, if your application is in us-east-1 and eu-west-1, users in Australia will have high latency to both Regions (200-300ms). Edge Locations solve this by caching content close to users worldwide, reducing latency to 10-50ms.

Real-world analogy: Think of Edge Locations like local convenience stores. The main warehouse (Region) is far away, but the convenience store (Edge Location) in your neighborhood stocks popular items. You can get those items quickly from the local store without traveling to the warehouse. If the store doesn't have what you need, it orders from the warehouse, but most requests are served locally.

How it works (Detailed step-by-step):

  1. You enable CloudFront: You create a CloudFront distribution and point it to your origin (the source of your content, like an S3 bucket or an EC2 web server in a Region).

  2. User requests content: A user in Tokyo requests an image from your website (www.example.com/logo.png).

  3. DNS routes to nearest Edge Location: CloudFront's DNS automatically routes the user to the nearest Edge Location (in this case, Tokyo).

  4. Edge Location checks cache: The Tokyo Edge Location checks if it has logo.png cached locally.

  5. Cache hit (content is cached): If the Edge Location has the content cached and it hasn't expired:

    • The Edge Location immediately returns the content to the user
    • Latency: 10-20ms (very fast)
    • The origin server (in us-east-1) is never contacted
    • This is the most common scenario for popular content
  6. Cache miss (content not cached): If the Edge Location doesn't have the content cached:

    • The Edge Location requests the content from the origin server (in us-east-1)
    • The origin server sends the content to the Edge Location
    • The Edge Location caches the content locally and returns it to the user
    • Latency: 150-200ms for this first request (slower)
    • Subsequent requests from users in Tokyo will be cache hits (fast)
  7. Content expires and refreshes: You configure a Time-To-Live (TTL) for cached content (e.g., 24 hours). After 24 hours, the Edge Location requests fresh content from the origin to ensure users get updated content.

⭐ Must Know:

  • Edge Locations are separate from Regions and AZs - they're specifically for content delivery
  • There are 400+ Edge Locations worldwide, far more than the 33 Regions
  • Edge Locations cache content from your origin (S3, EC2, ALB, etc.)
  • CloudFront is the service that uses Edge Locations
  • Edge Locations can also be used for uploading content (S3 Transfer Acceleration)

Detailed Example 1: Global Website Performance

Scenario: A media company hosts video content in S3 buckets in us-east-1. They have users worldwide, but users in Asia and Australia complain about slow video loading times.

Problem without CloudFront:

  • User in Sydney requests a video from S3 in us-east-1
  • Request travels from Sydney to Virginia (approximately 15,000 km)
  • Latency: 200-250ms per request
  • Video takes 30-60 seconds to start playing
  • Buffering occurs frequently during playback

Solution with CloudFront:

  1. Create a CloudFront distribution with the S3 bucket as the origin
  2. Enable CloudFront in all Edge Locations worldwide
  3. Update the website to use the CloudFront URL instead of the direct S3 URL

What happens:

  1. User in Sydney requests a video
  2. DNS routes the request to the Sydney Edge Location (closest to the user)
  3. First request (cache miss):
    • Sydney Edge Location requests the video from S3 in us-east-1
    • S3 sends the video to Sydney Edge Location
    • Sydney Edge Location caches the video and streams it to the user
    • Latency: 200ms for the initial request, but subsequent chunks stream quickly
  4. Second user in Sydney requests the same video (cache hit):
    • Sydney Edge Location already has the video cached
    • Video streams immediately from Sydney Edge Location
    • Latency: 10-20ms
    • Video starts playing in 2-3 seconds
    • No buffering during playback

Result: Video loading time reduced from 30-60 seconds to 2-3 seconds for users in Sydney. The first user experiences slightly slower loading (cache miss), but all subsequent users in the region benefit from the cached content. The media company's bandwidth costs also decrease because most requests are served from Edge Locations instead of the origin S3 bucket.

Detailed Example 2: Dynamic Content Acceleration

Scenario: An e-commerce application serves dynamic content (personalized product recommendations, shopping cart, user profiles) that can't be cached. Users in Europe experience slow page loads because the application servers are in us-east-1.

Solution with CloudFront (even for dynamic content):

CloudFront can accelerate dynamic content through network optimizations, even though the content isn't cached:

  1. Create a CloudFront distribution with the ALB (Application Load Balancer) in us-east-1 as the origin
  2. Enable CloudFront for dynamic content (set TTL to 0 for non-cacheable content)
  3. CloudFront uses AWS's private backbone network to route requests

What happens:

  1. User in London requests their shopping cart (dynamic, personalized content)
  2. Request goes to London Edge Location
  3. Edge Location forwards the request to us-east-1 using AWS's private backbone network (not the public internet)
  4. AWS's backbone network is optimized for low latency and high reliability
  5. Application server in us-east-1 generates the personalized shopping cart
  6. Response travels back through AWS's backbone network to London Edge Location
  7. Edge Location forwards the response to the user

Result: Even though the content isn't cached, latency is reduced by 20-40% because AWS's private network is faster and more reliable than the public internet. Additionally, CloudFront maintains persistent connections to the origin, reducing the overhead of establishing new connections for each request.

Detailed Example 3: S3 Transfer Acceleration

Scenario: A video production company in Australia needs to upload large video files (5-50 GB each) to S3 in us-east-1. Direct uploads to S3 are slow (taking hours) and frequently fail due to network issues.

Solution with S3 Transfer Acceleration:

S3 Transfer Acceleration uses CloudFront Edge Locations to accelerate uploads:

  1. Enable S3 Transfer Acceleration on the S3 bucket
  2. Use the Transfer Acceleration endpoint instead of the standard S3 endpoint
  3. Upload files using the Transfer Acceleration endpoint

What happens:

  1. Video file upload starts from Sydney
  2. File is uploaded to the Sydney Edge Location (close to the user, low latency)
  3. Sydney Edge Location uses AWS's private backbone network to transfer the file to S3 in us-east-1
  4. AWS's backbone network is optimized for high throughput and reliability
  5. File arrives at S3 in us-east-1

Result: Upload speed increases by 50-500% (depending on distance and network conditions). A 10 GB file that previously took 3 hours to upload now takes 30-45 minutes. Upload reliability also improves because the long-distance transfer happens over AWS's reliable backbone network instead of the public internet.

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Thinking Edge Locations are the same as Regions

    • Why it's wrong: Edge Locations are much smaller and only cache content - you can't deploy EC2 instances or databases in Edge Locations.
    • Correct understanding: Regions are where you deploy your application infrastructure. Edge Locations are where CloudFront caches content to serve users quickly.
  • Mistake 2: Assuming all content should be cached at Edge Locations

    • Why it's wrong: Some content shouldn't be cached (personalized data, real-time data, sensitive data). Caching this content could show users stale or incorrect information.
    • Correct understanding: Use CloudFront for static content (images, videos, CSS, JavaScript) and public content. For dynamic or personalized content, either don't cache it or use very short TTLs.
  • Mistake 3: Forgetting to invalidate cached content after updates

    • Why it's wrong: If you update content at the origin but don't invalidate the CloudFront cache, users will continue seeing old content until the TTL expires.
    • Correct understanding: When you update content, create a CloudFront invalidation to immediately clear the cached content, or use versioned file names (logo-v2.png instead of logo.png) to force cache misses.

šŸ”— Connections to Other Topics:

  • Relates to Performance Optimization (Domain 3) because: CloudFront reduces latency and improves user experience
  • Builds on S3 (Domain 3) by: Caching S3 content at Edge Locations for faster delivery
  • Often used with Route 53 (Domain 3) to: Provide DNS routing to the nearest Edge Location

šŸ’” Tips for Understanding:

  • Think of CloudFront as a global caching layer that sits in front of your application
  • Use CloudFront for any content that's accessed by users in multiple geographic locations
  • Remember: Edge Locations are read-only for most use cases (except S3 Transfer Acceleration, which allows writes)

šŸŽÆ Exam Focus: Questions often test whether you understand when to use CloudFront (global content delivery, reducing latency) versus when to use multi-Region deployments (compliance, disaster recovery). CloudFront is for performance; multi-Region is for availability and compliance.


Section 3: AWS Shared Responsibility Model

Introduction

The problem: When you move to the cloud, security responsibilities are split between you (the customer) and AWS (the cloud provider). If you don't understand who is responsible for what, you might assume AWS is protecting something that you're actually responsible for, leading to security vulnerabilities. Conversely, you might waste time and money protecting things that AWS already handles.

The solution: The AWS Shared Responsibility Model clearly defines which security responsibilities belong to AWS ("Security OF the Cloud") and which belong to you ("Security IN the Cloud"). This model varies depending on the type of service you use (IaaS, PaaS, SaaS).

Why it's tested: The SAA-C03 exam frequently tests your understanding of the Shared Responsibility Model. Questions ask you to identify who is responsible for specific security tasks, or to design solutions that properly address customer responsibilities while leveraging AWS's responsibilities.

Core Concepts

Understanding "Security OF the Cloud" vs "Security IN the Cloud"

What it is: The Shared Responsibility Model divides security and compliance responsibilities between AWS and the customer:

  • AWS Responsibility: "Security OF the Cloud": AWS is responsible for protecting the infrastructure that runs all AWS services. This includes the physical data centers, hardware, software, networking, and facilities.

  • Customer Responsibility: "Security IN the Cloud": Customers are responsible for securing their data, applications, operating systems, and configurations within AWS. The extent of customer responsibility varies based on the service used.

Why it exists: In traditional on-premises IT, you're responsible for everything - from physical security of the building to application security. In the cloud, AWS takes over the lower layers (physical security, hardware, infrastructure), allowing you to focus on your applications and data. However, you still need to secure what you put in the cloud. The Shared Responsibility Model clarifies this division to prevent security gaps.

Real-world analogy: Think of AWS like a secure apartment building. The building owner (AWS) is responsible for:

  • Physical security (locks on the building, security cameras, guards)
  • Building infrastructure (electricity, plumbing, HVAC)
  • Structural integrity (foundation, walls, roof)

You (the tenant) are responsible for:

  • Locking your apartment door
  • Securing your belongings inside the apartment
  • Who you give keys to
  • What you do inside your apartment

The building owner can't enter your apartment to secure your belongings, and you can't modify the building's foundation. Each party has clear responsibilities.

How it works (Detailed step-by-step):

  1. AWS secures the infrastructure: AWS is responsible for:

    • Physical security: Data centers with 24/7 security guards, biometric access controls, video surveillance, and intrusion detection systems
    • Hardware: Servers, storage devices, networking equipment - AWS maintains, patches, and replaces hardware
    • Network infrastructure: AWS manages the network that connects data centers, including DDoS protection at the infrastructure level
    • Virtualization layer: The hypervisor that creates virtual machines is managed and secured by AWS
    • Facilities: Power, cooling, fire suppression, and environmental controls in data centers
  2. You secure your resources: As a customer, you're responsible for:

    • Data: Encrypting sensitive data, classifying data, implementing data retention policies
    • Applications: Securing your application code, patching application vulnerabilities
    • Operating systems: Patching OS vulnerabilities, configuring OS security settings (for IaaS services like EC2)
    • Network configuration: Configuring security groups, network ACLs, VPC settings
    • Access management: Creating IAM users, assigning permissions, implementing MFA
    • Client-side encryption: Encrypting data before sending it to AWS
    • Server-side encryption: Configuring encryption for data at rest in AWS services
  3. Shared controls: Some responsibilities are shared:

    • Patch management: AWS patches the infrastructure and managed services; you patch your OS and applications
    • Configuration management: AWS configures infrastructure; you configure your resources
    • Awareness and training: AWS trains its employees; you train your employees
  4. Responsibility varies by service type:

    • IaaS (Infrastructure as a Service): You have more responsibility (e.g., EC2 - you manage the OS)
    • PaaS (Platform as a Service): AWS handles more (e.g., RDS - AWS manages the OS and database software)
    • SaaS (Software as a Service): AWS handles almost everything (e.g., AWS Managed Services)

⭐ Must Know:

  • AWS is ALWAYS responsible for physical security, hardware, and the global infrastructure
  • Customers are ALWAYS responsible for their data, IAM, and access management
  • For EC2 (IaaS), customers are responsible for the guest OS, applications, and security groups
  • For managed services like RDS (PaaS), AWS handles the OS and database software; customers handle data and access control
  • For S3, customers are responsible for bucket policies, encryption settings, and data classification

šŸ“Š Shared Responsibility Model Diagram:

graph TB
    subgraph "Customer Responsibility: Security IN the Cloud"
        CUST1[Customer Data]
        CUST2[Platform & Application Management]
        CUST3[Operating System, Network & Firewall Config]
        CUST4[Client-Side Data Encryption]
        CUST5[Server-Side Encryption]
        CUST6[Network Traffic Protection]
        CUST7[IAM & Access Management]
    end
    
    subgraph "Shared Controls"
        SHARED1[Patch Management]
        SHARED2[Configuration Management]
        SHARED3[Awareness & Training]
    end
    
    subgraph "AWS Responsibility: Security OF the Cloud"
        AWS1[Software: Compute, Storage, Database, Networking]
        AWS2[Hardware/AWS Global Infrastructure]
        AWS3[Regions]
        AWS4[Availability Zones]
        AWS5[Edge Locations]
        AWS6[Physical Security of Data Centers]
    end
    
    CUST1 --> CUST2
    CUST2 --> CUST3
    CUST3 --> SHARED1
    SHARED1 --> AWS1
    AWS1 --> AWS2
    AWS2 --> AWS3
    AWS3 --> AWS4
    AWS4 --> AWS5
    AWS5 --> AWS6
    
    style CUST1 fill:#ffebee
    style CUST2 fill:#ffebee
    style CUST3 fill:#ffebee
    style CUST4 fill:#ffebee
    style CUST5 fill:#ffebee
    style CUST6 fill:#ffebee
    style CUST7 fill:#ffebee
    style SHARED1 fill:#fff3e0
    style SHARED2 fill:#fff3e0
    style SHARED3 fill:#fff3e0
    style AWS1 fill:#e1f5fe
    style AWS2 fill:#e1f5fe
    style AWS3 fill:#e1f5fe
    style AWS4 fill:#e1f5fe
    style AWS5 fill:#e1f5fe
    style AWS6 fill:#e1f5fe

See: diagrams/01_fundamentals_shared_responsibility.mmd

Diagram Explanation (detailed):

This diagram illustrates the division of security responsibilities between customers and AWS, organized in three layers: Customer Responsibility (red), Shared Controls (orange), and AWS Responsibility (blue).

Customer Responsibility (Top Layer - Red):
At the top, we see customer responsibilities, which represent "Security IN the Cloud." The customer is responsible for everything they put into AWS:

  • Customer Data: This is the most critical customer responsibility. You must classify your data (public, confidential, restricted), implement appropriate encryption, and control who can access it. AWS provides the tools (KMS, encryption options), but you must use them correctly.

  • Platform & Application Management: You're responsible for securing your applications, including patching application vulnerabilities, implementing secure coding practices, and managing application configurations.

  • Operating System, Network & Firewall Configuration: For IaaS services like EC2, you must patch the OS, configure firewalls (security groups), and harden the OS according to security best practices. For managed services like RDS, AWS handles this.

  • Client-Side Data Encryption & Server-Side Encryption: You decide whether to encrypt data and manage encryption keys. AWS provides encryption services (KMS), but you must enable and configure them.

  • Network Traffic Protection: You must configure VPCs, subnets, security groups, and NACLs to control network traffic. You also decide whether to use VPNs or Direct Connect for encrypted connections.

  • IAM & Access Management: You create IAM users, groups, roles, and policies. You implement MFA, rotate credentials, and follow the principle of least privilege. This is entirely your responsibility.

Shared Controls (Middle Layer - Orange):
These responsibilities are shared between AWS and customers, but each party handles different aspects:

  • Patch Management: AWS patches the underlying infrastructure, hypervisor, and managed service software (like RDS database engine). You patch your guest operating systems (EC2) and applications.

  • Configuration Management: AWS configures the infrastructure and provides secure defaults. You configure your resources (security groups, bucket policies, etc.) according to your security requirements.

  • Awareness & Training: AWS trains its employees on security best practices and compliance. You must train your employees on how to use AWS securely and follow your organization's security policies.

AWS Responsibility (Bottom Layer - Blue):
At the bottom, we see AWS responsibilities, which represent "Security OF the Cloud." AWS is responsible for the entire infrastructure:

  • Software Layer: AWS manages and secures the software that provides compute (EC2 hypervisor), storage (S3 software), database (RDS engine), and networking services. AWS patches vulnerabilities, monitors for threats, and ensures service availability.

  • Hardware/AWS Global Infrastructure: AWS maintains all physical hardware - servers, storage devices, networking equipment. AWS replaces failed hardware, upgrades capacity, and ensures hardware security.

  • Regions, Availability Zones, Edge Locations: AWS designs, builds, and operates the global infrastructure. AWS ensures Regions are isolated, AZs are connected with low-latency networking, and Edge Locations are strategically placed.

  • Physical Security of Data Centers: AWS implements multiple layers of physical security - perimeter fencing, security guards, biometric access controls, video surveillance, intrusion detection, and environmental controls. Customers never have physical access to AWS data centers.

The key insight from this diagram is that security is a partnership. AWS provides a secure infrastructure, but you must use it securely. AWS can't access your data to encrypt it for you, and you can't access AWS data centers to verify physical security. Each party must fulfill their responsibilities for the overall system to be secure.

Detailed Example 1: EC2 Instance Security (IaaS)

Scenario: You're deploying a web application on EC2 instances. Who is responsible for what?

AWS Responsibilities:

  • Physical security of the data center where the EC2 instance runs
  • Security of the hypervisor that creates the virtual machine
  • Network infrastructure connecting the data center
  • Hardware maintenance and replacement
  • Patching the hypervisor and underlying infrastructure

Your Responsibilities:

  • Choosing a secure AMI (Amazon Machine Image) to launch the instance
  • Patching the guest operating system (e.g., applying Ubuntu security updates)
  • Configuring the OS securely (disabling unnecessary services, hardening SSH)
  • Installing and patching application software (e.g., Apache, Nginx)
  • Configuring security groups to control inbound/outbound traffic
  • Managing SSH keys and ensuring they're not compromised
  • Implementing application-level security (input validation, authentication)
  • Encrypting sensitive data stored on EBS volumes
  • Configuring IAM roles for the EC2 instance to access other AWS services
  • Monitoring logs and responding to security incidents

What happens if there's a security breach:

  • If the hypervisor is compromised: AWS is responsible and will fix it
  • If your OS is compromised due to unpatched vulnerabilities: You are responsible
  • If your application has a SQL injection vulnerability: You are responsible
  • If someone gains physical access to the data center: AWS is responsible

Result: For EC2 (IaaS), you have significant security responsibilities because you control the operating system and everything above it. This gives you flexibility but requires security expertise.

Detailed Example 2: RDS Database Security (PaaS)

Scenario: You're using Amazon RDS for your database. Who is responsible for what?

AWS Responsibilities:

  • Physical security of the data center
  • Security of the hypervisor and underlying infrastructure
  • Patching the database operating system
  • Patching the database engine (MySQL, PostgreSQL, etc.)
  • Performing automated backups
  • Implementing Multi-AZ replication for high availability
  • Monitoring database health and performance

Your Responsibilities:

  • Configuring database security groups to control network access
  • Creating database users and managing their permissions
  • Encrypting data at rest (enabling RDS encryption)
  • Encrypting data in transit (enforcing SSL/TLS connections)
  • Managing database credentials securely (using Secrets Manager)
  • Configuring automated backups and retention periods
  • Implementing application-level access controls
  • Classifying and protecting sensitive data in the database
  • Monitoring database access logs and responding to suspicious activity

What happens if there's a security breach:

  • If the database engine has a vulnerability: AWS patches it automatically
  • If the database OS has a vulnerability: AWS patches it automatically
  • If database credentials are leaked: You are responsible for rotating them
  • If unauthorized users access the database: You are responsible (check your security groups and IAM policies)

Result: For RDS (PaaS), AWS handles more security responsibilities than EC2. You don't need to patch the OS or database engine, but you're still responsible for access control, encryption, and data protection.

Detailed Example 3: S3 Bucket Security (SaaS-like)

Scenario: You're storing files in Amazon S3. Who is responsible for what?

AWS Responsibilities:

  • Physical security of the data centers storing S3 data
  • Durability of data (S3 automatically replicates data across multiple AZs)
  • Availability of the S3 service
  • Patching and maintaining S3 infrastructure
  • Protecting against infrastructure-level DDoS attacks

Your Responsibilities:

  • Configuring S3 bucket policies to control access
  • Enabling S3 bucket versioning to protect against accidental deletion
  • Enabling S3 encryption (SSE-S3, SSE-KMS, or SSE-C)
  • Configuring S3 Block Public Access to prevent accidental public exposure
  • Implementing S3 Object Lock for compliance requirements
  • Managing IAM policies for users accessing S3
  • Classifying data and applying appropriate security controls
  • Monitoring S3 access logs and responding to suspicious activity
  • Configuring S3 lifecycle policies for data retention
  • Enabling MFA Delete for critical buckets

What happens if there's a security breach:

  • If S3 infrastructure is compromised: AWS is responsible
  • If your bucket is publicly accessible due to misconfigured policies: You are responsible
  • If someone gains access using stolen IAM credentials: You are responsible for rotating credentials
  • If data is lost due to S3 infrastructure failure: AWS is responsible (and will restore from replicas)
  • If data is deleted by an authorized user: You are responsible (use versioning and MFA Delete to prevent this)

Result: For S3, AWS handles almost all infrastructure security, but you're responsible for access control and data protection. Most S3 security breaches are due to misconfigured bucket policies, not AWS infrastructure failures.

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Assuming AWS is responsible for patching your EC2 instances

    • Why it's wrong: EC2 is IaaS - you have full control over the guest OS, which means you're responsible for patching it.
    • Correct understanding: AWS patches the hypervisor and infrastructure, but you must patch the OS and applications on your EC2 instances. Use AWS Systems Manager Patch Manager to automate this.
  • Mistake 2: Thinking AWS can access your data to help with security

    • Why it's wrong: AWS has a strict policy of not accessing customer data without explicit permission. AWS can't encrypt your data, configure your security groups, or fix your application vulnerabilities.
    • Correct understanding: You are solely responsible for your data and configurations. AWS provides tools and services, but you must use them correctly.
  • Mistake 3: Believing that using AWS automatically makes you compliant with regulations

    • Why it's wrong: AWS provides a compliant infrastructure (AWS is responsible for infrastructure compliance), but you're responsible for how you use that infrastructure. You must configure services correctly to meet your compliance requirements.
    • Correct understanding: AWS provides compliance certifications for the infrastructure (SOC 2, ISO 27001, PCI DSS, etc.), but you must implement appropriate controls in your applications and configurations to achieve compliance.
  • Mistake 4: Assuming managed services mean AWS handles all security

    • Why it's wrong: Even with managed services like RDS, you're still responsible for access control, encryption, and data protection.
    • Correct understanding: Managed services reduce your operational burden (AWS handles patching, backups, etc.), but you're always responsible for IAM, encryption, and data security.

šŸ”— Connections to Other Topics:

  • Relates to IAM (Domain 1) because: You're responsible for all access management
  • Builds on Encryption (Domain 1) by: Clarifying that you must enable and configure encryption
  • Often tested with Compliance (Domain 1) to: Verify you understand customer vs. AWS responsibilities for compliance

šŸ’” Tips for Understanding:

  • Remember the simple rule: AWS secures the infrastructure; you secure what you put on the infrastructure
  • For IaaS (EC2), you have more responsibility; for PaaS (RDS), AWS handles more; for SaaS, AWS handles almost everything
  • When in doubt, ask: "Can I configure this?" If yes, you're responsible for configuring it securely

šŸŽÆ Exam Focus: Exam questions often present a security scenario and ask "Who is responsible for fixing this?" or "What should the customer do to secure this?" Always think about whether the issue is in the infrastructure (AWS) or in the customer's configuration/data (customer).


Section 4: AWS Well-Architected Framework

Introduction

The problem: When designing cloud architectures, there are countless decisions to make: which services to use, how to configure them, how to ensure security, how to optimize costs, and how to maintain reliability. Without a structured framework, architects might make suboptimal decisions, leading to systems that are insecure, unreliable, expensive, or difficult to operate.

The solution: The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and implementing designs that scale over time. It consists of six pillars - Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability - each with design principles and best practices.

Why it's tested: The SAA-C03 exam is fundamentally about designing well-architected solutions. Every question tests your ability to apply Well-Architected principles to real-world scenarios. Understanding this framework is essential for passing the exam and for your career as a solutions architect.

Core Concepts

What is the AWS Well-Architected Framework?

What it is: The AWS Well-Architected Framework is a set of best practices, design principles, and questions that help you evaluate and improve your cloud architectures. It was developed by AWS solutions architects based on years of experience designing systems for thousands of customers. The framework is organized into six pillars, each focusing on a different aspect of architecture.

Why it exists: AWS recognized that customers were repeatedly making the same architectural mistakes and facing similar challenges. By codifying best practices into a framework, AWS helps customers avoid common pitfalls and build better systems from the start. The framework also provides a common language for discussing architecture, making it easier for teams to collaborate and for AWS to provide guidance.

Real-world analogy: Think of the Well-Architected Framework like building codes for construction. When building a house, you follow building codes that specify requirements for structural integrity, electrical safety, plumbing, fire safety, etc. These codes are based on decades of experience and prevent common problems. Similarly, the Well-Architected Framework provides "building codes" for cloud architectures, helping you avoid common problems and build robust systems.

How it works (Detailed step-by-step):

  1. You design an architecture: You're planning to build a new application on AWS or evaluating an existing application.

  2. You review against the six pillars: For each pillar, you ask yourself a series of questions:

    • Operational Excellence: How do you operate and monitor your system?
    • Security: How do you protect your data and systems?
    • Reliability: How do you ensure your system recovers from failures?
    • Performance Efficiency: How do you use resources efficiently?
    • Cost Optimization: How do you avoid unnecessary costs?
    • Sustainability: How do you minimize environmental impact?
  3. You identify gaps: As you answer the questions, you identify areas where your architecture doesn't follow best practices. For example, you might discover that you're not using Multi-AZ deployments (Reliability pillar) or that you're not encrypting data at rest (Security pillar).

  4. You implement improvements: You prioritize the gaps based on business impact and implement improvements. For example, you might enable RDS Multi-AZ for your database or enable S3 encryption for your data.

  5. You iterate continuously: Architecture is not a one-time activity. You regularly review your architecture against the framework as your application evolves, new AWS services become available, and best practices change.

  6. You use AWS tools: AWS provides tools to help you apply the framework:

    • AWS Well-Architected Tool: A free service that helps you review your workloads against the framework
    • AWS Trusted Advisor: Provides automated checks for some Well-Architected best practices
    • AWS Well-Architected Labs: Hands-on labs to learn and implement best practices

⭐ Must Know: The six pillars of the Well-Architected Framework:

  1. Operational Excellence: Run and monitor systems to deliver business value
  2. Security: Protect information, systems, and assets
  3. Reliability: Recover from failures and meet demand
  4. Performance Efficiency: Use resources efficiently
  5. Cost Optimization: Avoid unnecessary costs
  6. Sustainability: Minimize environmental impact

šŸ“Š Well-Architected Framework Diagram:

graph TB
    WAF[AWS Well-Architected Framework]
    
    WAF --> OP[Operational Excellence]
    WAF --> SEC[Security]
    WAF --> REL[Reliability]
    WAF --> PERF[Performance Efficiency]
    WAF --> COST[Cost Optimization]
    WAF --> SUS[Sustainability]
    
    OP --> OP1[Perform operations as code]
    OP --> OP2[Make frequent, small, reversible changes]
    OP --> OP3[Refine operations procedures frequently]
    OP --> OP4[Anticipate failure]
    OP --> OP5[Learn from operational failures]
    
    SEC --> SEC1[Implement strong identity foundation]
    SEC --> SEC2[Enable traceability]
    SEC --> SEC3[Apply security at all layers]
    SEC --> SEC4[Automate security best practices]
    SEC --> SEC5[Protect data in transit and at rest]
    SEC --> SEC6[Keep people away from data]
    SEC --> SEC7[Prepare for security events]
    
    REL --> REL1[Automatically recover from failure]
    REL --> REL2[Test recovery procedures]
    REL --> REL3[Scale horizontally]
    REL --> REL4[Stop guessing capacity]
    REL --> REL5[Manage change through automation]
    
    PERF --> PERF1[Democratize advanced technologies]
    PERF --> PERF2[Go global in minutes]
    PERF --> PERF3[Use serverless architectures]
    PERF --> PERF4[Experiment more often]
    PERF --> PERF5[Consider mechanical sympathy]
    
    COST --> COST1[Implement cloud financial management]
    COST --> COST2[Adopt consumption model]
    COST --> COST3[Measure overall efficiency]
    COST --> COST4[Stop spending on undifferentiated heavy lifting]
    COST --> COST5[Analyze and attribute expenditure]
    
    SUS --> SUS1[Understand your impact]
    SUS --> SUS2[Establish sustainability goals]
    SUS --> SUS3[Maximize utilization]
    SUS --> SUS4[Anticipate and adopt new efficient offerings]
    SUS --> SUS5[Use managed services]
    SUS --> SUS6[Reduce downstream impact]
    
    style WAF fill:#e1f5fe
    style OP fill:#f3e5f5
    style SEC fill:#ffebee
    style REL fill:#c8e6c9
    style PERF fill:#fff3e0
    style COST fill:#e8f5e9
    style SUS fill:#e0f2f1

See: diagrams/01_fundamentals_well_architected.mmd

Diagram Explanation (detailed):

This diagram illustrates the AWS Well-Architected Framework's hierarchical structure, with the framework at the center branching into six pillars, each with its own design principles.

The Six Pillars (Color-Coded):

  1. Operational Excellence (Purple): Focuses on running and monitoring systems to deliver business value and continually improving processes. The design principles include:

    • Perform operations as code: Define your infrastructure and operations as code (Infrastructure as Code) so you can version, test, and automate them
    • Make frequent, small, reversible changes: Deploy changes incrementally so failures have minimal impact and can be easily rolled back
    • Refine operations procedures frequently: Continuously improve your operational procedures based on lessons learned
    • Anticipate failure: Perform "pre-mortem" exercises to identify potential failures before they occur
    • Learn from operational failures: Share lessons learned across teams and implement improvements
  2. Security (Red): Focuses on protecting information, systems, and assets while delivering business value. The design principles include:

    • Implement a strong identity foundation: Use IAM with least privilege, eliminate long-term credentials, implement MFA
    • Enable traceability: Monitor and log all actions and changes (CloudTrail, CloudWatch Logs)
    • Apply security at all layers: Defense in depth - secure network, compute, storage, data, and application layers
    • Automate security best practices: Use automation to enforce security controls consistently
    • Protect data in transit and at rest: Encrypt data using TLS for transit and KMS for data at rest
    • Keep people away from data: Reduce direct access to data to minimize risk of human error or malicious activity
    • Prepare for security events: Have incident response plans and practice them regularly
  3. Reliability (Green): Focuses on ensuring a workload performs its intended function correctly and consistently. The design principles include:

    • Automatically recover from failure: Monitor systems and trigger automated recovery when thresholds are breached
    • Test recovery procedures: Regularly test your disaster recovery and failover procedures
    • Scale horizontally: Distribute load across multiple smaller resources instead of one large resource
    • Stop guessing capacity: Use Auto Scaling to match capacity to demand automatically
    • Manage change through automation: Use Infrastructure as Code to make changes predictable and reversible
  4. Performance Efficiency (Orange): Focuses on using computing resources efficiently to meet requirements. The design principles include:

    • Democratize advanced technologies: Use managed services so your team can focus on applications instead of infrastructure
    • Go global in minutes: Deploy in multiple Regions to reduce latency for global users
    • Use serverless architectures: Eliminate operational burden of managing servers
    • Experiment more often: Easy to test different configurations and instance types
    • Consider mechanical sympathy: Understand how cloud services work and choose the right tool for the job
  5. Cost Optimization (Light Green): Focuses on avoiding unnecessary costs. The design principles include:

    • Implement cloud financial management: Establish cost awareness and accountability across the organization
    • Adopt a consumption model: Pay only for what you use; scale down when not needed
    • Measure overall efficiency: Monitor business metrics and costs to understand ROI
    • Stop spending money on undifferentiated heavy lifting: Use managed services instead of managing infrastructure
    • Analyze and attribute expenditure: Use cost allocation tags to understand where money is spent
  6. Sustainability (Teal): Focuses on minimizing environmental impact. The design principles include:

    • Understand your impact: Measure and monitor your carbon footprint
    • Establish sustainability goals: Set targets for reducing environmental impact
    • Maximize utilization: Right-size resources and use Auto Scaling to avoid idle capacity
    • Anticipate and adopt new, more efficient hardware and software offerings: Use latest instance types and services
    • Use managed services: Managed services are more efficient due to economies of scale
    • Reduce the downstream impact of your cloud workloads: Optimize data transfer and storage

The key insight from this diagram is that well-architected systems balance all six pillars. You can't focus only on cost optimization while ignoring security, or prioritize performance while neglecting reliability. The framework helps you make informed trade-offs and ensures you consider all aspects of architecture.

How the Pillars Relate to the SAA-C03 Exam Domains:

  • Security Pillar → Domain 1: Design Secure Architectures (30% of exam)
  • Reliability Pillar → Domain 2: Design Resilient Architectures (26% of exam)
  • Performance Efficiency Pillar → Domain 3: Design High-Performing Architectures (24% of exam)
  • Cost Optimization Pillar → Domain 4: Design Cost-Optimized Architectures (20% of exam)
  • Operational Excellence → Tested across all domains
  • Sustainability → Tested across all domains (newer addition to framework)

The exam is essentially testing your ability to apply Well-Architected principles to real-world scenarios. Every question can be mapped back to one or more pillars of the framework.

Pillar Trade-offs and Balancing

Understanding Trade-offs: In real-world architecture, you often need to make trade-offs between pillars. Understanding these trade-offs is crucial for the exam.

Common Trade-offs:

  1. Performance vs. Cost:

    • Scenario: You can use larger EC2 instances for better performance, but they cost more
    • Trade-off: Balance performance requirements with budget constraints
    • Example: Use c5.2xlarge instances (8 vCPUs, $0.34/hour) for compute-intensive workloads instead of c5.24xlarge (96 vCPUs, $4.08/hour) if 8 vCPUs meet your needs
    • Exam relevance: Questions test whether you can identify the most cost-effective solution that still meets performance requirements
  2. Security vs. Operational Complexity:

    • Scenario: Implementing strict security controls (encryption, MFA, network segmentation) increases operational complexity
    • Trade-off: Balance security requirements with operational overhead
    • Example: Requiring MFA for all users improves security but adds friction to the user experience
    • Exam relevance: Questions test whether you can implement appropriate security without over-engineering
  3. Reliability vs. Cost:

    • Scenario: Multi-AZ and multi-Region deployments improve reliability but increase costs
    • Trade-off: Balance availability requirements with budget
    • Example: Use Multi-AZ RDS for production databases (2x cost) but single-AZ for development databases
    • Exam relevance: Questions test whether you can design appropriately resilient architectures without over-provisioning
  4. Performance vs. Sustainability:

    • Scenario: Over-provisioning resources for peak performance wastes energy during low-utilization periods
    • Trade-off: Balance performance needs with environmental impact
    • Example: Use Auto Scaling to match capacity to demand instead of running maximum capacity 24/7
    • Exam relevance: Questions test whether you can design efficient architectures that scale with demand

šŸ’” Tip for the Exam: When questions present multiple valid solutions, the correct answer usually represents the best balance of the pillars. Look for solutions that meet requirements without over-engineering or under-engineering.


Section 5: Essential Networking Concepts

Introduction

The problem: Cloud architectures rely heavily on networking to connect components, control access, and deliver content to users. Without understanding basic networking concepts, you can't design secure, performant, or reliable architectures.

The solution: This section covers the essential networking concepts you need for the SAA-C03 exam: IP addressing, subnets, routing, DNS, and load balancing. These concepts form the foundation for understanding AWS networking services like VPC, Route 53, and Elastic Load Balancing.

Why it's tested: Networking questions appear throughout the exam, especially in Domain 1 (Security) and Domain 3 (Performance). You need to understand how to design VPCs, configure security groups, route traffic, and optimize network performance.

Core Concepts

IP Addresses and CIDR Notation

What it is: An IP address is a unique identifier for a device on a network. IPv4 addresses are 32-bit numbers typically written as four octets (e.g., 192.168.1.10). CIDR (Classless Inter-Domain Routing) notation specifies a range of IP addresses using a prefix (e.g., 10.0.0.0/16).

Why it exists: Networks need a way to identify and route traffic to specific devices. IP addresses provide this identification. CIDR notation allows efficient allocation of IP address ranges without wasting addresses.

Real-world analogy: Think of IP addresses like street addresses. Just as every house has a unique address (123 Main Street), every device on a network has a unique IP address. CIDR notation is like specifying a neighborhood - "all addresses on Main Street" instead of listing each house individually.

How it works:

  1. IPv4 Address Structure: An IPv4 address consists of 32 bits divided into 4 octets:

    • Example: 192.168.1.10
    • Binary: 11000000.10101000.00000001.00001010
    • Each octet ranges from 0 to 255
  2. CIDR Notation: Specifies a network and the number of bits used for the network portion:

    • Example: 10.0.0.0/16
    • /16 means the first 16 bits are the network portion
    • This leaves 32 - 16 = 16 bits for host addresses
    • Total addresses: 2^16 = 65,536 addresses
  3. Common CIDR Blocks:

    • /32: Single IP address (1 address)
    • /24: 256 addresses (common for small subnets)
    • /16: 65,536 addresses (common for VPCs)
    • /8: 16,777,216 addresses (very large networks)

⭐ Must Know for Exam:

  • /16 provides 65,536 IP addresses (recommended for VPCs)
  • /24 provides 256 IP addresses (common for subnets)
  • AWS reserves 5 IP addresses in each subnet (first 4 and last 1)
  • Private IP ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16

Detailed Example: Planning VPC and Subnet IP Ranges

Scenario: You're designing a VPC for a three-tier application (web, app, database) that needs to run in 3 Availability Zones.

Solution:

  1. VPC CIDR: 10.0.0.0/16 (provides 65,536 addresses)
  2. Subnet allocation (9 subnets total):
    • Public subnets (for web tier):
      • us-east-1a: 10.0.1.0/24 (256 addresses)
      • us-east-1b: 10.0.2.0/24 (256 addresses)
      • us-east-1c: 10.0.3.0/24 (256 addresses)
    • Private subnets (for app tier):
      • us-east-1a: 10.0.11.0/24 (256 addresses)
      • us-east-1b: 10.0.12.0/24 (256 addresses)
      • us-east-1c: 10.0.13.0/24 (256 addresses)
    • Database subnets (for database tier):
      • us-east-1a: 10.0.21.0/24 (256 addresses)
      • us-east-1b: 10.0.22.0/24 (256 addresses)
      • us-east-1c: 10.0.23.0/24 (256 addresses)

Result: Each subnet has 256 addresses (minus 5 reserved by AWS = 251 usable), which is sufficient for most applications. The VPC has room for additional subnets if needed (you've used 9 /24 subnets out of 256 possible /24 subnets in a /16 VPC).

Public vs. Private IP Addresses

What it is: Public IP addresses are routable on the internet and can be accessed from anywhere. Private IP addresses are only routable within a private network (like a VPC) and cannot be accessed directly from the internet.

Why it exists: Not all resources should be accessible from the internet. Private IP addresses allow resources to communicate within a network while remaining isolated from the internet, improving security.

How it works:

  • Public IP: Assigned to resources that need internet access (web servers, NAT gateways)
  • Private IP: Assigned to all resources in a VPC; used for internal communication
  • Elastic IP: A static public IP address that you can associate with resources

⭐ Must Know:

  • All EC2 instances get a private IP address
  • Public IP addresses are optional and can be auto-assigned or manually attached (Elastic IP)
  • Resources in private subnets can access the internet through a NAT Gateway (which has a public IP)

DNS (Domain Name System)

What it is: DNS translates human-readable domain names (www.example.com) into IP addresses (192.0.2.1) that computers use to communicate.

Why it exists: Remembering IP addresses is difficult for humans. DNS allows us to use memorable names instead of numeric addresses.

How it works:

  1. User types www.example.com in browser
  2. Browser queries DNS resolver
  3. DNS resolver queries root DNS servers, then TLD servers (.com), then authoritative name servers
  4. Authoritative name server returns IP address (192.0.2.1)
  5. Browser connects to 192.0.2.1

⭐ Must Know for Exam:

  • Route 53 is AWS's DNS service
  • DNS records types: A (IPv4 address), AAAA (IPv6 address), CNAME (alias), MX (mail), TXT (text)
  • TTL (Time To Live) controls how long DNS records are cached

Chapter Summary

What We Covered

In this chapter, you learned the foundational concepts that underpin all AWS architectures:

āœ… Cloud Computing Fundamentals:

  • The six advantages of cloud computing
  • How cloud computing differs from traditional IT
  • The benefits of on-demand, pay-as-you-go infrastructure

āœ… AWS Global Infrastructure:

  • Regions: Geographic areas with multiple data centers
  • Availability Zones: Isolated data centers within a Region
  • Edge Locations: Content delivery network endpoints
  • How to use multi-AZ and multi-Region architectures for resilience and performance

āœ… Shared Responsibility Model:

  • AWS responsibilities: Security OF the cloud (infrastructure, hardware, facilities)
  • Customer responsibilities: Security IN the cloud (data, applications, access management)
  • How responsibilities vary by service type (IaaS, PaaS, SaaS)

āœ… AWS Well-Architected Framework:

  • Six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability
  • Design principles for each pillar
  • How to balance trade-offs between pillars

āœ… Essential Networking Concepts:

  • IP addressing and CIDR notation
  • Public vs. private IP addresses
  • DNS and domain name resolution

Critical Takeaways

⭐ Must Remember:

  1. Regions are isolated: Resources don't automatically replicate across Regions. You must explicitly configure cross-region replication or deploy resources in multiple Regions.

  2. Availability Zones provide high availability: Always deploy production workloads across at least 2 AZs (preferably 3) to protect against data center failures.

  3. Shared Responsibility varies by service: For EC2 (IaaS), you manage the OS and applications. For RDS (PaaS), AWS manages the OS and database software. Always understand who is responsible for what.

  4. Well-Architected Framework guides all decisions: Every architecture decision should consider all six pillars. The exam tests your ability to apply these principles to real-world scenarios.

  5. Security is always a priority: When in doubt, choose the more secure option. The exam heavily emphasizes security best practices.

Self-Assessment Checklist

Test yourself before moving to the next chapter:

  • I can explain the six advantages of cloud computing and give examples of each
  • I understand the difference between Regions, Availability Zones, and Edge Locations
  • I can design a multi-AZ architecture for high availability
  • I know when to use multi-Region deployments (compliance, disaster recovery, global performance)
  • I understand the Shared Responsibility Model and can identify customer vs. AWS responsibilities
  • I can explain all six pillars of the Well-Architected Framework
  • I understand IP addressing and CIDR notation
  • I know the difference between public and private IP addresses
  • I can explain how DNS works and why it's important

Practice Questions

Try these from your practice test bundles:

  • Fundamentals questions in Domain 1 Bundle 1
  • Global Infrastructure questions in Domain 2 Bundle 1
  • Expected score: 80%+ to proceed

If you scored below 80%:

  • Review Section 2 (AWS Global Infrastructure) for Region/AZ concepts
  • Review Section 3 (Shared Responsibility Model) for security responsibilities
  • Review Section 4 (Well-Architected Framework) for design principles

Quick Reference Card

AWS Global Infrastructure:

  • Region: Geographic area with multiple AZs (e.g., us-east-1)
  • Availability Zone: One or more data centers within a Region (e.g., us-east-1a)
  • Edge Location: CDN endpoint for CloudFront (400+ worldwide)

Shared Responsibility:

  • AWS: Physical security, hardware, infrastructure, managed service software
  • Customer: Data, applications, OS (for EC2), access management, encryption

Well-Architected Pillars:

  1. Operational Excellence: Run and monitor systems
  2. Security: Protect data and systems
  3. Reliability: Recover from failures
  4. Performance Efficiency: Use resources efficiently
  5. Cost Optimization: Avoid unnecessary costs
  6. Sustainability: Minimize environmental impact

Networking Basics:

  • /16 CIDR: 65,536 addresses (VPC)
  • /24 CIDR: 256 addresses (subnet)
  • Private IP ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
  • AWS reserves: 5 IP addresses per subnet

Next Steps

You're now ready to dive into the exam domains! The next chapter covers Domain 1: Design Secure Architectures, which accounts for 30% of the exam. You'll learn about:

  • IAM (users, groups, roles, policies)
  • VPC security (security groups, NACLs)
  • Data encryption (KMS, encryption at rest and in transit)
  • Security services (WAF, Shield, GuardDuty, Macie)

Proceed to: 02_domain1_secure_architectures


Chapter 0 Complete - Total Words: ~11,000
Diagrams Created: 3
Estimated Study Time: 8-10 hours


Chapter Summary

What We Covered

This foundational chapter established the essential knowledge needed for the AWS Certified Solutions Architect - Associate exam. We explored:

  • āœ… AWS Global Infrastructure: Regions, Availability Zones, Edge Locations, and how they enable high availability and low latency
  • āœ… Well-Architected Framework: The six pillars (Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability) that guide architectural decisions
  • āœ… Shared Responsibility Model: Understanding what AWS manages versus what customers manage across different service types
  • āœ… Core AWS Services: Introduction to compute (EC2, Lambda), storage (S3, EBS), networking (VPC), and database services
  • āœ… Key Terminology: Essential terms like elasticity, scalability, fault tolerance, high availability, and disaster recovery
  • āœ… Service Categories: How AWS services are organized and when to use each category

Critical Takeaways

  1. Global Infrastructure Design: AWS has 30+ Regions worldwide, each with multiple isolated Availability Zones. Design for multi-AZ deployments for high availability and multi-Region for disaster recovery.

  2. Well-Architected Framework is Your Guide: Every architectural decision should be evaluated against the six pillars. This framework appears throughout the exam in scenario-based questions.

  3. Shared Responsibility: AWS secures the infrastructure (hardware, facilities, network), while customers secure what they put in the cloud (data, applications, access management). Know the boundaries.

  4. Service Selection Matters: Choose the right service for the job - managed services reduce operational overhead, serverless eliminates infrastructure management, and purpose-built services optimize for specific workloads.

  5. Regions and AZs are Foundational: Understanding how to leverage multiple AZs for fault tolerance and multiple Regions for disaster recovery is critical for 26% of the exam (Domain 2).

Self-Assessment Checklist

Test yourself before moving to Domain 1. You should be able to:

  • Explain AWS Global Infrastructure: Describe the relationship between Regions, Availability Zones, and Edge Locations
  • List the Six Pillars: Name all six pillars of the Well-Architected Framework and give an example of each
  • Draw the Shared Responsibility Model: Sketch what AWS manages vs. what customers manage for IaaS, PaaS, and SaaS
  • Identify Service Categories: Given a requirement, identify which AWS service category to use (compute, storage, database, networking)
  • Define Key Terms: Explain the difference between:
    • High availability vs. fault tolerance
    • Scalability vs. elasticity
    • RPO vs. RTO
    • Vertical scaling vs. horizontal scaling
  • Choose Deployment Strategies: Explain when to use single-AZ, multi-AZ, and multi-Region deployments
  • Understand Service Models: Differentiate between IaaS (EC2), PaaS (Elastic Beanstalk), and SaaS (WorkMail)

Practice Questions

Try these from your practice test bundles:

  • Fundamentals Bundle: Questions 1-20
  • Domain 1 Bundle 1: Questions 1-5 (IAM basics build on fundamentals)

Expected Score: 80%+ to proceed confidently

If you scored below 80%:

  • Review sections on: AWS Global Infrastructure, Well-Architected Framework
  • Focus on: Understanding the shared responsibility model boundaries
  • Revisit diagrams: Global infrastructure diagram, Well-Architected pillars

Quick Reference Card

Copy this to your notes for quick review:

AWS Global Infrastructure:

  • Region: Geographic area with 2+ Availability Zones (e.g., us-east-1)
  • Availability Zone: One or more isolated data centers with redundant power, networking
  • Edge Location: CDN endpoint for CloudFront (200+ locations globally)

Well-Architected Pillars:

  1. Operational Excellence: Run and monitor systems, continually improve
  2. Security: Protect information, systems, and assets
  3. Reliability: Recover from failures, meet demand
  4. Performance Efficiency: Use resources efficiently
  5. Cost Optimization: Avoid unnecessary costs
  6. Sustainability: Minimize environmental impact

Shared Responsibility:

  • AWS: Hardware, facilities, network infrastructure, managed service operations
  • Customer: Data, applications, access management, OS patching (for EC2), encryption

Key Service Categories:

  • Compute: EC2, Lambda, ECS, EKS, Fargate
  • Storage: S3, EBS, EFS, FSx, Storage Gateway
  • Database: RDS, DynamoDB, Aurora, ElastiCache, Redshift
  • Networking: VPC, Route 53, CloudFront, Direct Connect, VPN

Design Principles:

  • Design for failure (assume everything fails)
  • Decouple components (loose coupling)
  • Implement elasticity (scale automatically)
  • Think parallel (horizontal scaling)
  • Use managed services (reduce operational burden)

Next Steps

You're now ready to dive into Domain 1: Design Secure Architectures (Chapter 2). This domain covers:

  • IAM and access management (30% of exam weight)
  • Network security (VPC, security groups, NACLs)
  • Data protection (encryption, key management)

The fundamentals you learned here will be applied throughout all four domains. Keep this chapter as a reference as you progress through the more advanced topics.


Chapter 0 Complete āœ… | Next: Chapter 1 - Domain 1: Secure Architectures


Chapter Summary

What We Covered

  • āœ… AWS Global Infrastructure (Regions, Availability Zones, Edge Locations)
  • āœ… Well-Architected Framework (5 pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization)
  • āœ… Shared Responsibility Model (AWS vs Customer responsibilities)
  • āœ… Core AWS Services (Compute, Storage, Database, Networking)
  • āœ… Essential Terminology and Concepts
  • āœ… Design Principles for Cloud Architecture

Critical Takeaways

  1. AWS Global Infrastructure: Regions contain multiple isolated Availability Zones for fault tolerance; Edge Locations provide low-latency content delivery
  2. Well-Architected Framework: Five pillars guide architectural decisions - always consider all five when designing solutions
  3. Shared Responsibility: AWS secures the infrastructure; customers secure their data, applications, and access management
  4. Design for Failure: Assume everything fails; use multiple AZs, implement health checks, and automate recovery
  5. Loose Coupling: Decouple components using queues, load balancers, and managed services to improve resilience and scalability

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between Regions, Availability Zones, and Edge Locations
  • I understand all five pillars of the Well-Architected Framework
  • I can describe the Shared Responsibility Model and give examples of AWS vs customer responsibilities
  • I know the main AWS service categories (Compute, Storage, Database, Networking)
  • I understand key design principles: design for failure, loose coupling, elasticity, horizontal scaling
  • I can explain when to use EC2 vs Lambda vs containers
  • I understand the difference between S3, EBS, and EFS storage types

Practice Questions

Try these from your practice test bundles:

  • Fundamentals Bundle: Questions covering basic AWS concepts
  • Expected score: 80%+ to proceed

If you scored below 80%:

  • Review sections: AWS Global Infrastructure, Well-Architected Framework
  • Focus on: Understanding the difference between service types and when to use each

Quick Reference Card

AWS Global Infrastructure:

  • Region: Geographic area with multiple AZs (e.g., us-east-1)
  • Availability Zone: Isolated data center within a Region
  • Edge Location: CDN endpoint for CloudFront

Well-Architected Pillars:

  1. Operational Excellence - Run and monitor systems
  2. Security - Protect information and systems
  3. Reliability - Recover from failures, meet demand
  4. Performance Efficiency - Use resources efficiently
  5. Cost Optimization - Avoid unnecessary costs

Shared Responsibility:

  • AWS: Hardware, facilities, network, managed services
  • Customer: Data, applications, IAM, OS patching, encryption

Key Services:

  • Compute: EC2 (VMs), Lambda (serverless), ECS/EKS (containers)
  • Storage: S3 (object), EBS (block), EFS (file)
  • Database: RDS (relational), DynamoDB (NoSQL), Aurora (high-performance)
  • Networking: VPC (private network), Route 53 (DNS), CloudFront (CDN)

Design Principles:

  • Design for failure → Use Multi-AZ
  • Loose coupling → Use SQS, SNS, load balancers
  • Elasticity → Use Auto Scaling
  • Horizontal scaling → Add more instances, not bigger ones


Chapter Summary

What We Covered

This foundational chapter prepared you for the SAA-C03 exam by covering:

  • āœ… AWS Global Infrastructure: Regions, Availability Zones, Edge Locations, and Local Zones
  • āœ… Well-Architected Framework: Six pillars (Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability)
  • āœ… Shared Responsibility Model: AWS responsibilities vs customer responsibilities
  • āœ… Core AWS Services: Compute, storage, database, networking, and security services overview
  • āœ… Exam Structure: 65 questions (50 scored + 15 unscored), 130 minutes, passing score 720/1000
  • āœ… Domain Breakdown: Domain 1 (30%), Domain 2 (26%), Domain 3 (24%), Domain 4 (20%)

Critical Takeaways

  1. Global Infrastructure: 30+ Regions, 90+ AZs, 400+ Edge Locations - design for high availability across AZs
  2. Well-Architected Framework: Use as a guide for all architecture decisions, focus on trade-offs
  3. Shared Responsibility: AWS secures infrastructure, customers secure data and applications
  4. Exam Strategy: Read questions carefully, eliminate wrong answers, manage time (2 minutes per question)
  5. Domain Weights: Focus study time proportionally - Domain 1 (30%) gets most attention

Self-Assessment Checklist

Before proceeding to domain chapters, ensure you can:

  • Explain the difference between Regions, AZs, and Edge Locations?
  • Describe all six pillars of the Well-Architected Framework?
  • Understand the AWS Shared Responsibility Model?
  • Identify core AWS services by category (compute, storage, database, networking)?
  • Explain the exam structure and scoring?
  • Understand how to approach multiple-choice and multiple-response questions?

If you answered "no" to any: Review the relevant sections before proceeding.

If you answered "yes" to all: You're ready to begin Domain 1!


Next Steps: Proceed to 02_domain1_secure_architectures to begin learning about designing secure architectures (30% of exam).


Chapter 1: Design Secure Architectures (30% of exam)

Chapter Overview

What you'll learn:

  • IAM (Identity and Access Management): Users, groups, roles, and policies
  • Secure access patterns: MFA, least privilege, cross-account access
  • VPC security: Security groups, NACLs, network segmentation
  • Data protection: Encryption at rest and in transit, KMS
  • Security services: WAF, Shield, GuardDuty, Macie, Secrets Manager
  • Compliance and governance: Organizations, SCPs, Control Tower

Time to complete: 12-15 hours
Prerequisites: Chapter 0 (Fundamentals)
Exam weight: 30% of scored content

Why this matters: Security is the highest-weighted domain on the SAA-C03 exam. Every architecture you design must be secure by default. This chapter teaches you how to implement defense-in-depth security using AWS services, following the principle of least privilege and the AWS Shared Responsibility Model.


Section 1: IAM (Identity and Access Management) Fundamentals

Introduction

The problem: In any IT system, you need to control who can access what resources and what actions they can perform. Without proper access control, unauthorized users could access sensitive data, malicious actors could compromise systems, and legitimate users might accidentally delete critical resources. Traditional on-premises systems use Active Directory and file permissions, but cloud environments need more flexible, scalable access control.

The solution: AWS Identity and Access Management (IAM) provides centralized control over access to AWS resources. IAM allows you to create users, groups, and roles, and attach policies that define permissions. IAM is free, globally available, and integrates with all AWS services.

Why it's tested: IAM questions appear throughout the SAA-C03 exam, not just in Domain 1. Understanding IAM is fundamental to designing secure architectures. Questions test your ability to implement least privilege, use roles instead of long-term credentials, configure cross-account access, and troubleshoot permission issues.

Core Concepts

What is IAM?

What it is: IAM is a web service that helps you securely control access to AWS resources. You use IAM to control who is authenticated (signed in) and authorized (has permissions) to use resources. IAM is a feature of your AWS account offered at no additional charge.

Why it exists: Before IAM, AWS accounts had only a root user with full access to everything. This was insecure because:

  • You couldn't give different people different levels of access
  • You couldn't revoke access without changing the root password
  • You couldn't audit who did what
  • You couldn't implement least privilege

IAM solves these problems by allowing you to create multiple identities with specific permissions, audit all actions, and implement security best practices.

Real-world analogy: Think of IAM like a corporate office building's security system. The building owner (root user) has master access to everything. IAM users are like employees with ID badges - each badge grants access to specific floors and rooms based on their job role. IAM groups are like departments (all engineers get access to the engineering floor). IAM roles are like temporary visitor badges that grant specific access for a limited time.

How it works (Detailed step-by-step):

  1. You create an AWS account: When you create an AWS account, you start with a root user that has complete access to all AWS services and resources. This root user is identified by the email address used to create the account.

  2. You create IAM users: Instead of using the root user for daily tasks, you create IAM users for each person who needs access to AWS. Each IAM user has:

    • A unique name (e.g., "alice", "bob")
    • Credentials (password for console access, access keys for programmatic access)
    • Permissions (defined by attached policies)
  3. You organize users into groups: To simplify permission management, you create IAM groups (e.g., "Developers", "Administrators", "Auditors") and add users to groups. Policies attached to a group apply to all users in that group.

  4. You create IAM roles: For applications and services (not people), you create IAM roles. Roles are assumed temporarily and don't have long-term credentials. For example, an EC2 instance assumes a role to access S3.

  5. You attach policies: Policies are JSON documents that define permissions. You attach policies to users, groups, or roles to grant permissions. Policies specify:

    • Which actions are allowed (e.g., s3:GetObject, ec2:StartInstances)
    • Which resources the actions apply to (e.g., specific S3 buckets, all EC2 instances)
    • Conditions (e.g., only allow access from specific IP addresses)
  6. AWS evaluates permissions: When a user or role tries to perform an action, AWS evaluates all applicable policies to determine if the action is allowed. By default, all actions are denied unless explicitly allowed.

⭐ Must Know:

  • IAM is global - users, groups, roles, and policies are not Region-specific
  • Root user has complete access and should be secured with MFA and rarely used
  • IAM users are for people; IAM roles are for applications and services
  • Policies define permissions; they can be attached to users, groups, or roles
  • By default, all actions are denied (implicit deny) unless explicitly allowed
  • An explicit deny in any policy overrides all allows

šŸ“Š IAM Architecture Diagram:

graph TB
    subgraph "AWS Account"
        ROOT[Root User<br/>Full Access]
        
        subgraph "IAM Users"
            USER1[IAM User: Alice<br/>Developer]
            USER2[IAM User: Bob<br/>Admin]
            USER3[IAM User: Charlie<br/>Auditor]
        end
        
        subgraph "IAM Groups"
            GRP1[Group: Developers]
            GRP2[Group: Administrators]
            GRP3[Group: Auditors]
        end
        
        subgraph "IAM Roles"
            ROLE1[Role: EC2-S3-Access]
            ROLE2[Role: Lambda-Execution]
            ROLE3[Role: Cross-Account-Access]
        end
        
        subgraph "IAM Policies"
            POL1[Policy: S3-Read-Only]
            POL2[Policy: EC2-Full-Access]
            POL3[Policy: CloudWatch-Logs]
        end
        
        subgraph "AWS Resources"
            EC2[EC2 Instance]
            S3[S3 Bucket]
            LAMBDA[Lambda Function]
        end
    end
    
    ROOT -.Should not use.-> ROOT
    USER1 --> GRP1
    USER2 --> GRP2
    USER3 --> GRP3
    
    GRP1 --> POL1
    GRP2 --> POL2
    GRP3 --> POL3
    
    EC2 --> ROLE1
    LAMBDA --> ROLE2
    
    ROLE1 --> POL1
    ROLE2 --> POL3
    
    USER2 --> EC2
    USER1 --> S3
    
    style ROOT fill:#ffebee
    style USER1 fill:#e1f5fe
    style USER2 fill:#e1f5fe
    style USER3 fill:#e1f5fe
    style GRP1 fill:#f3e5f5
    style GRP2 fill:#f3e5f5
    style GRP3 fill:#f3e5f5
    style ROLE1 fill:#fff3e0
    style ROLE2 fill:#fff3e0
    style ROLE3 fill:#fff3e0
    style POL1 fill:#c8e6c9
    style POL2 fill:#c8e6c9
    style POL3 fill:#c8e6c9

See: diagrams/02_domain1_iam_overview.mmd

Diagram Explanation (detailed):

This diagram illustrates the complete IAM architecture and how different components interact within an AWS account.

Root User (Red - Top):
The root user sits at the top with complete, unrestricted access to all AWS services and resources. The dotted line with "Should not use" emphasizes that the root user should be secured with MFA and used only for tasks that specifically require root access (like changing account settings or closing the account). For day-to-day operations, you should use IAM users or roles instead.

IAM Users (Blue):
Three IAM users are shown: Alice (Developer), Bob (Administrator), and Charlie (Auditor). Each user represents a real person who needs access to AWS. Users have long-term credentials (passwords and/or access keys) and are assigned to groups based on their job function. Notice that users don't have direct policy attachments in this diagram - they inherit permissions from their groups, which is a best practice for easier management.

IAM Groups (Purple):
Groups are collections of users with similar access needs. The diagram shows three groups:

  • Developers: Contains Alice and other developers who need access to development resources
  • Administrators: Contains Bob and other admins who need broad access to manage AWS resources
  • Auditors: Contains Charlie and other auditors who need read-only access to review configurations and logs

Groups simplify permission management - instead of attaching policies to each user individually, you attach policies to groups. When a user joins or leaves a team, you simply add or remove them from the appropriate group.

IAM Roles (Orange):
Roles are shown for non-human entities:

  • EC2-S3-Access: A role that EC2 instances can assume to access S3 buckets
  • Lambda-Execution: A role that Lambda functions assume to write logs to CloudWatch
  • Cross-Account-Access: A role that allows users from another AWS account to access resources in this account

Roles don't have long-term credentials. Instead, they provide temporary security credentials when assumed. This is more secure than embedding access keys in application code.

IAM Policies (Green):
Policies are JSON documents that define permissions. The diagram shows three policies:

  • S3-Read-Only: Allows reading objects from S3 buckets but not writing or deleting
  • EC2-Full-Access: Allows all EC2 actions (start, stop, terminate instances, etc.)
  • CloudWatch-Logs: Allows writing logs to CloudWatch Logs

Policies are attached to groups and roles. The Developers group has the S3-Read-Only policy, meaning all developers can read S3 objects. The EC2-S3-Access role has the S3-Read-Only policy, meaning EC2 instances with this role can read S3 objects.

AWS Resources (Bottom):
The diagram shows how IAM entities interact with AWS resources:

  • The EC2 instance has the EC2-S3-Access role attached, allowing it to access S3
  • The Lambda function has the Lambda-Execution role attached, allowing it to write logs
  • Bob (Administrator) can manage EC2 instances because his Administrators group has the EC2-Full-Access policy
  • Alice (Developer) can read from S3 because her Developers group has the S3-Read-Only policy

Key Architectural Principles Shown:

  1. Least Privilege: Each entity has only the permissions it needs. Developers can read S3 but not delete. Auditors can view but not modify.
  2. Separation of Duties: Different groups have different permissions. Developers can't perform administrative tasks.
  3. Roles for Applications: EC2 and Lambda use roles, not embedded credentials, to access other services.
  4. Group-Based Management: Users inherit permissions from groups, making it easy to manage permissions for many users.
  5. Root User Protection: The root user is not used for daily operations, reducing the risk of compromise.

This architecture represents IAM best practices and is the foundation for secure AWS environments. Understanding this structure is critical for the SAA-C03 exam.

IAM Users

What it is: An IAM user is an entity that represents a person or application that interacts with AWS. Each IAM user has a unique name within the AWS account and can have credentials (password for console access, access keys for programmatic access) and permissions.

Why it exists: You need a way to give individuals access to AWS without sharing the root user credentials. IAM users provide individual identities with specific permissions, enabling accountability (you know who did what) and security (you can revoke access for specific users).

Real-world analogy: Think of IAM users like employee accounts in a company's computer system. Each employee has their own username and password, their own email address, and their own set of permissions based on their role. If an employee leaves, you disable their account without affecting others.

How it works (Detailed step-by-step):

  1. Creating an IAM user:

    • You navigate to the IAM console and click "Add users"
    • You specify a username (e.g., "alice.smith")
    • You choose the type of access:
      • AWS Management Console access: Provides a password for signing into the AWS web console
      • Programmatic access: Provides access keys (Access Key ID and Secret Access Key) for using the AWS CLI, SDKs, or APIs
    • You can enable both types of access for a single user
  2. Setting credentials:

    • Console password: You can auto-generate a password or create a custom password. You can require the user to change their password on first sign-in.
    • Access keys: AWS generates an Access Key ID (like a username) and Secret Access Key (like a password). The Secret Access Key is shown only once - if you lose it, you must create new access keys.
  3. Assigning permissions:

    • You can attach policies directly to the user (not recommended for most cases)
    • You can add the user to one or more groups (recommended - easier to manage)
    • You can set a permissions boundary (advanced - limits the maximum permissions the user can have)
  4. User signs in:

    • For console access: User navigates to the account-specific sign-in URL (https://ACCOUNT-ID.signin.aws.amazon.com/console) and enters their username and password
    • For programmatic access: User configures the AWS CLI or SDK with their access keys
  5. AWS authenticates and authorizes:

    • AWS verifies the credentials (authentication)
    • AWS evaluates all policies attached to the user and their groups to determine what actions are allowed (authorization)
    • The user can perform only the actions explicitly allowed by their policies

⭐ Must Know:

  • IAM users are for long-term credentials (people who need ongoing access)
  • Each user should represent one person - don't share IAM user credentials
  • Users can have console access, programmatic access, or both
  • Access keys should be rotated regularly (every 90 days is a common practice)
  • Users can have up to 2 active access keys (allows rotation without downtime)
  • Enable MFA (Multi-Factor Authentication) for all users, especially those with administrative access

Detailed Example 1: Creating a Developer User

Scenario: You're hiring a new developer, Alice, who needs access to AWS to deploy applications. She needs console access to view resources and programmatic access to deploy code.

Step-by-step implementation:

  1. Create the IAM user:

    aws iam create-user --user-name alice.smith
    
  2. Enable console access:

    aws iam create-login-profile --user-name alice.smith --password 'TempPassword123!' --password-reset-required
    

    This creates a temporary password that Alice must change on first sign-in.

  3. Create access keys for programmatic access:

    aws iam create-access-key --user-name alice.smith
    

    Output:

    {
      "AccessKey": {
        "AccessKeyId": "AKIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "Status": "Active",
        "CreateDate": "2025-01-15T10:30:00Z"
      }
    }
    

    Important: Save the SecretAccessKey immediately - it's shown only once!

  4. Add Alice to the Developers group (which has appropriate policies):

    aws iam add-user-to-group --user-name alice.smith --group-name Developers
    
  5. Enable MFA (Alice does this after first sign-in):

    • Alice signs in to the console
    • Navigates to IAM → Users → alice.smith → Security credentials
    • Clicks "Assign MFA device"
    • Scans QR code with authenticator app (Google Authenticator, Authy, etc.)
    • Enters two consecutive MFA codes to verify
  6. Alice configures her local environment:

    aws configure
    AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
    AWS Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    Default region name: us-east-1
    Default output format: json
    

Result: Alice can now sign into the AWS console with her username and password (plus MFA code), and she can use the AWS CLI with her access keys. Her permissions are determined by the policies attached to the Developers group. If Alice leaves the company, you can delete her IAM user without affecting other developers.

Detailed Example 2: Rotating Access Keys

Scenario: Alice's access keys are 90 days old and need to be rotated for security. You need to rotate them without causing downtime for her applications.

Step-by-step implementation:

  1. Create a second access key (Alice can have up to 2 active keys):

    aws iam create-access-key --user-name alice.smith
    

    Output:

    {
      "AccessKey": {
        "AccessKeyId": "AKIAI44QH8DHBEXAMPLE",
        "SecretAccessKey": "je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
        "Status": "Active",
        "CreateDate": "2025-04-15T10:30:00Z"
      }
    }
    
  2. Update applications to use the new key:

    • Alice updates her AWS CLI configuration:
      aws configure set aws_access_key_id AKIAI44QH8DHBEXAMPLE
      aws configure set aws_secret_access_key je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
      
    • Alice updates any applications or scripts that use the old key
    • Alice tests that everything works with the new key
  3. Deactivate the old key (don't delete yet - keep it as a backup):

    aws iam update-access-key --user-name alice.smith --access-key-id AKIAIOSFODNN7EXAMPLE --status Inactive
    
  4. Monitor for errors (wait 24-48 hours):

    • Check CloudTrail logs for any API calls using the old key
    • If any applications are still using the old key, they'll fail and you can identify them
    • Update those applications to use the new key
  5. Delete the old key (after confirming nothing is using it):

    aws iam delete-access-key --user-name alice.smith --access-key-id AKIAIOSFODNN7EXAMPLE
    

Result: Alice's access keys have been rotated without downtime. The two-key system allows graceful rotation - you create the new key, update applications, verify everything works, then delete the old key.

Detailed Example 3: Troubleshooting Permission Issues

Scenario: Alice tries to terminate an EC2 instance but gets an "Access Denied" error. You need to troubleshoot why.

Step-by-step troubleshooting:

  1. Check what policies are attached to Alice:

    aws iam list-attached-user-policies --user-name alice.smith
    aws iam list-groups-for-user --user-name alice.smith
    

    Output shows Alice is in the "Developers" group.

  2. Check what policies are attached to the Developers group:

    aws iam list-attached-group-policies --group-name Developers
    

    Output shows the group has the "DevelopersPolicy" attached.

  3. View the policy document:

    aws iam get-policy-version --policy-arn arn:aws:iam::123456789012:policy/DevelopersPolicy --version-id v1
    

    Output:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:Describe*",
            "ec2:StartInstances",
            "ec2:StopInstances"
          ],
          "Resource": "*"
        }
      ]
    }
    
  4. Identify the problem:

    • The policy allows ec2:StartInstances and ec2:StopInstances
    • The policy does NOT allow ec2:TerminateInstances
    • This is why Alice gets "Access Denied" when trying to terminate instances
  5. Decide on the fix:

    • Option 1: Add ec2:TerminateInstances to the policy if developers should be able to terminate instances
    • Option 2: Explain to Alice that developers can't terminate instances (this might be intentional to prevent accidental deletion)
    • Option 3: Create a separate policy for senior developers who need terminate permissions
  6. If you decide to grant the permission, update the policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:Describe*",
            "ec2:StartInstances",
            "ec2:StopInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "ec2:ResourceTag/Environment": "Development"
            }
          }
        }
      ]
    }
    

    This updated policy allows terminating instances, but only if they're tagged with Environment=Development. This prevents developers from accidentally terminating production instances.

Result: You've identified the permission issue, understood why it exists, and implemented a solution that grants the necessary permission while maintaining security (developers can only terminate development instances, not production).

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Sharing IAM user credentials among multiple people

    • Why it's wrong: You lose accountability - you can't tell who performed which action. If one person leaves, you have to change credentials for everyone.
    • Correct understanding: Create a separate IAM user for each person. Use IAM groups to manage permissions for multiple users with similar needs.
  • Mistake 2: Embedding access keys in application code

    • Why it's wrong: If the code is shared (e.g., pushed to GitHub), the access keys are exposed. Anyone with the keys can access your AWS account.
    • Correct understanding: Use IAM roles for applications running on AWS (EC2, Lambda, ECS). For applications running outside AWS, use temporary credentials from AWS STS or store credentials in a secrets manager.
  • Mistake 3: Never rotating access keys

    • Why it's wrong: If access keys are compromised, attackers have unlimited time to use them. Old keys might be embedded in forgotten scripts or applications.
    • Correct understanding: Rotate access keys every 90 days. Use AWS IAM Access Analyzer to identify unused access keys and delete them.
  • Mistake 4: Granting overly broad permissions

    • Why it's wrong: If an IAM user is compromised, the attacker has access to everything the user can access. This violates the principle of least privilege.
    • Correct understanding: Grant only the permissions needed for the user's job. Start with minimal permissions and add more as needed, rather than starting with broad permissions and trying to restrict them.

šŸ”— Connections to Other Topics:

  • Relates to IAM Roles (covered next) because: Roles are preferred over users for applications
  • Builds on IAM Policies (covered later) by: Policies define what users can do
  • Often used with MFA (covered later) to: Add an extra layer of security

šŸ’” Tips for Understanding:

  • Think of IAM users as "people accounts" - each person gets their own user
  • Remember: Users have long-term credentials; roles have temporary credentials
  • When troubleshooting permissions, always check: user policies, group policies, and resource policies

šŸŽÆ Exam Focus: Questions often test whether you understand when to use IAM users vs. roles, how to implement least privilege, and how to troubleshoot permission issues. Remember: roles are preferred for applications; users are for people.

IAM Groups

What it is: An IAM group is a collection of IAM users. Groups let you specify permissions for multiple users, making it easier to manage permissions. Users in a group automatically inherit the permissions assigned to the group.

Why it exists: Managing permissions for individual users becomes unmanageable as your organization grows. If you have 50 developers and need to change their permissions, you don't want to update 50 individual users. Groups solve this by allowing you to manage permissions once for the entire group.

Real-world analogy: Think of IAM groups like departments in a company. All employees in the Engineering department get access to the engineering tools and resources. When a new engineer joins, you add them to the Engineering department and they automatically get the appropriate access. When they leave, you remove them from the department.

How it works (Detailed step-by-step):

  1. Creating a group:

    • You create a group with a descriptive name (e.g., "Developers", "DatabaseAdmins", "Auditors")
    • You attach policies to the group that define what members can do
    • You add users to the group
  2. Users inherit permissions:

    • When a user is added to a group, they inherit all policies attached to that group
    • A user can be in multiple groups (e.g., Alice might be in both "Developers" and "OnCallEngineers")
    • The user's effective permissions are the union of all policies from all their groups plus any policies attached directly to the user
  3. Managing permissions at scale:

    • To grant a new permission to all developers, you update the Developers group policy once
    • All users in the group immediately get the new permission
    • To revoke access for a user, you remove them from the group

⭐ Must Know:

  • Groups are collections of users - they simplify permission management
  • Users can be in multiple groups (up to 10 groups per user)
  • Groups cannot be nested (a group cannot contain another group)
  • Groups cannot be used as principals in resource-based policies (you can't grant S3 bucket access to a group directly)
  • Best practice: Attach policies to groups, not individual users

Detailed Example 1: Organizing Users by Job Function

Scenario: Your company has developers, database administrators, and auditors. Each group needs different permissions.

Step-by-step implementation:

  1. Create groups for each job function:

    aws iam create-group --group-name Developers
    aws iam create-group --group-name DatabaseAdmins
    aws iam create-group --group-name Auditors
    
  2. Create policies for each group:

    Developers Policy (allows EC2, S3, Lambda access):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:Describe*",
            "ec2:StartInstances",
            "ec2:StopInstances",
            "s3:GetObject",
            "s3:PutObject",
            "s3:ListBucket",
            "lambda:InvokeFunction",
            "lambda:GetFunction"
          ],
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aws:RequestedRegion": ["us-east-1", "us-west-2"]
            }
          }
        }
      ]
    }
    

    DatabaseAdmins Policy (allows RDS, DynamoDB full access):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "rds:*",
            "dynamodb:*",
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:ListMetrics"
          ],
          "Resource": "*"
        }
      ]
    }
    

    Auditors Policy (read-only access to everything):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:Describe*",
            "s3:GetObject",
            "s3:ListBucket",
            "rds:Describe*",
            "dynamodb:DescribeTable",
            "cloudtrail:LookupEvents",
            "cloudwatch:GetMetricStatistics"
          ],
          "Resource": "*"
        }
      ]
    }
    
  3. Attach policies to groups:

    aws iam put-group-policy --group-name Developers --policy-name DevelopersPolicy --policy-document file://developers-policy.json
    aws iam put-group-policy --group-name DatabaseAdmins --policy-name DatabaseAdminsPolicy --policy-document file://dbadmins-policy.json
    aws iam put-group-policy --group-name Auditors --policy-name AuditorsPolicy --policy-document file://auditors-policy.json
    
  4. Add users to appropriate groups:

    aws iam add-user-to-group --user-name alice.smith --group-name Developers
    aws iam add-user-to-group --user-name bob.jones --group-name DatabaseAdmins
    aws iam add-user-to-group --user-name charlie.brown --group-name Auditors
    

Result: You've organized users by job function. When a new developer joins, you simply add them to the Developers group and they automatically get all developer permissions. When you need to grant developers access to a new service, you update the Developers group policy once instead of updating each developer individually.

Detailed Example 2: Multi-Group Membership

Scenario: Alice is a developer who is also on the on-call rotation. During on-call, she needs additional permissions to restart services and view logs.

Step-by-step implementation:

  1. Create an OnCallEngineers group:

    aws iam create-group --group-name OnCallEngineers
    
  2. Create a policy for on-call permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:RebootInstances",
            "ec2:TerminateInstances",
            "rds:RebootDBInstance",
            "cloudwatch:PutMetricAlarm",
            "cloudwatch:DeleteAlarms",
            "logs:GetLogEvents",
            "logs:FilterLogEvents",
            "sns:Publish"
          ],
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "ec2:ResourceTag/Environment": ["Production", "Staging"]
            }
          }
        }
      ]
    }
    
  3. Attach the policy to the OnCallEngineers group:

    aws iam put-group-policy --group-name OnCallEngineers --policy-name OnCallPolicy --policy-document file://oncall-policy.json
    
  4. Add Alice to both groups:

    aws iam add-user-to-group --user-name alice.smith --group-name Developers
    aws iam add-user-to-group --user-name alice.smith --group-name OnCallEngineers
    
  5. Alice's effective permissions:

    • From Developers group: Can start/stop EC2, read/write S3, invoke Lambda (in us-east-1 and us-west-2)
    • From OnCallEngineers group: Can reboot/terminate EC2, reboot RDS, manage CloudWatch alarms, read logs, publish SNS messages (for Production and Staging resources)
    • Combined: Alice has all permissions from both groups
  6. When Alice's on-call rotation ends:

    aws iam remove-user-from-group --user-name alice.smith --group-name OnCallEngineers
    

    Alice loses the on-call permissions but retains her developer permissions.

Result: Alice has different permissions based on her current responsibilities. During on-call, she has elevated permissions to respond to incidents. When her rotation ends, you simply remove her from the OnCallEngineers group without affecting her developer permissions.

Detailed Example 3: Temporary Project Access

Scenario: Your company is working on a special project that requires access to a specific S3 bucket. Multiple users from different teams need access for 3 months.

Step-by-step implementation:

  1. Create a project-specific group:

    aws iam create-group --group-name ProjectPhoenixTeam
    
  2. Create a policy for the project bucket:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::project-phoenix-data",
            "arn:aws:s3:::project-phoenix-data/*"
          ]
        }
      ]
    }
    
  3. Attach the policy to the group:

    aws iam put-group-policy --group-name ProjectPhoenixTeam --policy-name ProjectPhoenixAccess --policy-document file://project-policy.json
    
  4. Add team members from different departments:

    aws iam add-user-to-group --user-name alice.smith --group-name ProjectPhoenixTeam  # Developer
    aws iam add-user-to-group --user-name bob.jones --group-name ProjectPhoenixTeam    # DBA
    aws iam add-user-to-group --user-name david.lee --group-name ProjectPhoenixTeam    # Data Scientist
    
  5. After 3 months, when the project ends:

    # Remove all users from the group
    aws iam remove-user-from-group --user-name alice.smith --group-name ProjectPhoenixTeam
    aws iam remove-user-from-group --user-name bob.jones --group-name ProjectPhoenixTeam
    aws iam remove-user-from-group --user-name david.lee --group-name ProjectPhoenixTeam
    
    # Delete the group
    aws iam delete-group-policy --group-name ProjectPhoenixTeam --policy-name ProjectPhoenixAccess
    aws iam delete-group --group-name ProjectPhoenixTeam
    

Result: You've granted temporary access to multiple users from different teams without modifying their permanent permissions. When the project ends, you clean up by removing users from the group and deleting the group. Each user retains their original permissions from their primary groups (Developers, DatabaseAdmins, etc.).

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Trying to nest groups (putting a group inside another group)

    • Why it's wrong: IAM doesn't support nested groups. You can't create a "SeniorDevelopers" group that contains the "Developers" group.
    • Correct understanding: If you need hierarchical permissions, create separate groups with different policies. Users can be in multiple groups to get combined permissions.
  • Mistake 2: Attaching policies directly to users instead of using groups

    • Why it's wrong: This becomes unmanageable as your organization grows. If you have 50 developers with individual policies, updating permissions requires 50 changes.
    • Correct understanding: Always use groups for permission management. Attach policies to groups, then add users to groups. Only attach policies directly to users in exceptional cases.
  • Mistake 3: Creating too many groups with overlapping permissions

    • Why it's wrong: This creates confusion and makes it hard to understand what permissions a user has. You might have "Developers", "BackendDevelopers", "FrontendDevelopers", "SeniorDevelopers", etc., with unclear distinctions.
    • Correct understanding: Create groups based on clear job functions or responsibilities. Use descriptive names. Document what each group is for and what permissions it grants.
  • Mistake 4: Forgetting that users can be in multiple groups

    • Why it's wrong: You might create overly broad groups because you think users can only be in one group.
    • Correct understanding: Users can be in up to 10 groups. Use this to your advantage - create focused groups (Developers, OnCallEngineers, ProjectTeam) and add users to multiple groups as needed.

šŸ”— Connections to Other Topics:

  • Relates to IAM Users (covered previously) because: Groups contain users
  • Builds on IAM Policies (covered later) by: Policies attached to groups apply to all group members
  • Often used with Least Privilege (covered later) to: Grant minimum necessary permissions to groups

šŸ’” Tips for Understanding:

  • Think of groups as "permission templates" - create a group for each job function
  • Remember: Groups simplify management but don't provide additional security - they're just a way to organize users
  • When designing groups, think about how people's roles might change over time

šŸŽÆ Exam Focus: Questions often test whether you understand how to use groups effectively, how multi-group membership works, and how to troubleshoot permission issues involving groups. Remember: groups are for management convenience, not security boundaries.

IAM Roles

What it is: An IAM role is an IAM identity with specific permissions, but unlike users, roles are not associated with a specific person. Instead, roles are assumed by entities that need temporary access to AWS resources - such as EC2 instances, Lambda functions, or users from another AWS account. When an entity assumes a role, AWS provides temporary security credentials that expire after a specified time.

Why it exists: Embedding long-term credentials (access keys) in applications is insecure - if the application code is compromised or accidentally shared, the credentials are exposed. Roles solve this by providing temporary credentials that automatically rotate and expire. Roles also enable cross-account access and allow AWS services to access other AWS services on your behalf.

Real-world analogy: Think of IAM roles like temporary security badges at a conference. You don't get a permanent employee badge - instead, you check in at registration, show your ID, and receive a temporary badge that's valid for the day. The badge grants you access to specific areas based on your registration type (speaker, attendee, vendor). At the end of the day, the badge expires automatically. Similarly, when an application assumes a role, it gets temporary credentials that expire automatically.

How it works (Detailed step-by-step):

  1. Creating a role:

    • You create a role and specify who can assume it (the trust policy)
    • You attach permissions policies that define what the role can do
    • You optionally set a maximum session duration (1 hour to 12 hours)
  2. Trust policy (who can assume the role):

    • The trust policy is a JSON document that specifies which entities can assume the role
    • For EC2 instances: Trust policy allows the EC2 service to assume the role
    • For Lambda functions: Trust policy allows the Lambda service to assume the role
    • For cross-account access: Trust policy allows users from another AWS account to assume the role
  3. Assuming the role:

    • An entity (EC2 instance, Lambda function, IAM user) requests to assume the role
    • AWS STS (Security Token Service) validates the request against the trust policy
    • If allowed, STS returns temporary security credentials (Access Key ID, Secret Access Key, Session Token)
    • These credentials are valid for the session duration (default 1 hour, configurable up to 12 hours)
  4. Using temporary credentials:

    • The entity uses the temporary credentials to make AWS API calls
    • AWS validates the credentials and checks the role's permissions policies
    • The entity can perform only the actions allowed by the role's policies
  5. Automatic rotation:

    • Before the credentials expire, AWS automatically provides new credentials
    • For EC2 instances and Lambda functions, this happens transparently - you don't need to do anything
    • The credentials expire automatically after the session duration, limiting the impact if they're compromised

⭐ Must Know:

  • Roles provide temporary credentials that automatically rotate and expire
  • Roles are for applications and services, not for people (though users can assume roles for cross-account access)
  • Roles have two types of policies: trust policy (who can assume) and permissions policy (what they can do)
  • EC2 instances and Lambda functions should always use roles, never embedded access keys
  • Roles can be assumed by: AWS services, IAM users (same or different account), federated users, web identity providers

šŸ“Š IAM Roles Flow Diagram:

sequenceDiagram
    participant APP as Application<br/>(EC2 Instance)
    participant EC2 as EC2 Service
    participant STS as AWS STS<br/>(Security Token Service)
    participant S3 as S3 Service
    
    Note over APP,S3: Application needs to access S3
    
    APP->>EC2: Request temporary credentials<br/>for attached IAM role
    EC2->>STS: AssumeRole request<br/>for EC2-S3-Access role
    STS->>STS: Validate role trust policy<br/>(EC2 is allowed to assume this role)
    STS->>EC2: Return temporary credentials<br/>(Access Key, Secret Key, Session Token)<br/>Valid for 1-12 hours
    EC2->>APP: Provide temporary credentials
    
    Note over APP: Credentials are automatically<br/>rotated before expiration
    
    APP->>S3: GetObject request<br/>using temporary credentials
    S3->>S3: Validate credentials<br/>Check role permissions
    S3->>APP: Return object data
    
    Note over APP,S3: No long-term credentials stored!<br/>Credentials expire automatically

See: diagrams/02_domain1_iam_roles_flow.mmd

Diagram Explanation (detailed):

This sequence diagram illustrates how IAM roles work in practice, showing the complete flow from an application requesting access to receiving temporary credentials and using them to access AWS services.

Step 1: Application Needs Access:
The application running on an EC2 instance needs to access an S3 bucket. Instead of having access keys embedded in the application code, the EC2 instance has an IAM role attached to it (EC2-S3-Access role).

Step 2: Request Temporary Credentials:
The application uses the AWS SDK, which automatically detects that it's running on EC2 and requests temporary credentials from the EC2 metadata service. This happens transparently - the application code doesn't need to explicitly request credentials.

Step 3: AssumeRole Request to STS:
The EC2 service forwards the request to AWS Security Token Service (STS), asking to assume the EC2-S3-Access role on behalf of the instance.

Step 4: Validate Trust Policy:
STS checks the role's trust policy to verify that the EC2 service is allowed to assume this role. The trust policy for this role looks like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

This policy says "Allow the EC2 service to assume this role."

Step 5: Return Temporary Credentials:
STS generates temporary security credentials consisting of:

  • Access Key ID: Like a username (e.g., ASIAXXX...)
  • Secret Access Key: Like a password
  • Session Token: Additional credential that proves these are temporary credentials
  • Expiration Time: When these credentials will expire (default 1 hour, max 12 hours)

These credentials are returned to the EC2 service, which provides them to the application.

Step 6: Automatic Rotation:
The AWS SDK automatically handles credential rotation. Before the credentials expire, the SDK requests new credentials from the metadata service. This happens transparently - the application doesn't need to handle credential rotation.

Step 7: Use Credentials to Access S3:
The application makes an API call to S3 (GetObject) using the temporary credentials. The request includes the Access Key ID, Secret Access Key, and Session Token.

Step 8: Validate and Authorize:
S3 validates the temporary credentials with STS and checks the role's permissions policy to determine if the GetObject action is allowed. The permissions policy for this role looks like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-application-bucket",
        "arn:aws:s3:::my-application-bucket/*"
      ]
    }
  ]
}

This policy allows reading objects from the specific S3 bucket.

Step 9: Return Data:
If the action is allowed, S3 returns the requested object data to the application.

Key Security Benefits Shown:

  1. No Long-Term Credentials: The application never has access keys embedded in its code. If the application code is compromised, there are no permanent credentials to steal.

  2. Automatic Expiration: The temporary credentials expire after 1-12 hours. Even if an attacker obtains the credentials, they have limited time to use them.

  3. Automatic Rotation: The SDK automatically requests new credentials before the old ones expire, ensuring continuous operation without manual intervention.

  4. Least Privilege: The role has permissions only to read from a specific S3 bucket, not all S3 buckets or other AWS services. If the credentials are compromised, the damage is limited.

  5. Auditability: All actions performed using the role are logged in CloudTrail with the role name, making it easy to audit what happened and when.

This pattern is the recommended way to grant AWS services access to other AWS services. It's more secure than embedding access keys and requires no credential management by the application developer.

Detailed Example 2: Cross-Account Access with External ID

Imagine you're a SaaS company providing analytics services. Your customer (Company A) wants you to access their S3 bucket to analyze their data, but they want to ensure that only your application can access their data, not other customers' applications that might also use your service.

The Problem: If you just create an IAM role in Company A's account that trusts your AWS account, any application in your account could potentially assume that role. This is called the "confused deputy problem" - Company A's role might be tricked into granting access to the wrong application.

The Solution: Use an External ID, which acts like a secret password that only you and Company A know.

Setup Process:

  1. You Generate a Unique External ID: Your application generates a random, unique identifier for Company A (e.g., "CompanyA-12345-abcde"). This External ID is stored in your database associated with Company A's account.

  2. Company A Creates a Role: Company A creates an IAM role in their account with this trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::YOUR-ACCOUNT-ID:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "CompanyA-12345-abcde"
        }
      }
    }
  ]
}
  1. Your Application Assumes the Role: When your application needs to access Company A's data, it calls STS AssumeRole with the External ID:
aws sts assume-role \
  --role-arn arn:aws:iam::COMPANY-A-ACCOUNT:role/AnalyticsAccessRole \
  --role-session-name analytics-session \
  --external-id CompanyA-12345-abcde
  1. STS Validates: STS checks that:
    • The request comes from your AWS account (matches the Principal)
    • The External ID in the request matches the External ID in the trust policy
    • Only if both match does STS grant temporary credentials

Why This Works:

  • Even if another customer (Company B) tries to trick your application into accessing Company A's data, they don't know Company A's External ID
  • Each customer has a unique External ID, preventing cross-customer access
  • The External ID acts as a shared secret that proves the request is legitimate

Real-World Scenario: AWS CloudFormation uses this pattern. When you create a stack that needs to access resources in another account, CloudFormation requires an External ID to prevent unauthorized cross-account access.

Detailed Example 3: Service Control Policies (SCPs) in AWS Organizations

Imagine you're managing a large enterprise with 50 AWS accounts organized into different Organizational Units (OUs): Development, Testing, Production, and Security. You need to enforce company-wide security policies that cannot be overridden by individual account administrators.

The Challenge: Even if you create perfect IAM policies in each account, an account administrator could modify or delete those policies. You need a way to enforce policies at a higher level that cannot be bypassed.

The Solution: Service Control Policies (SCPs) in AWS Organizations act as guardrails that define the maximum permissions for all IAM entities in an account, regardless of their IAM policies.

How SCPs Work:

SCPs don't grant permissions - they define boundaries. An IAM entity can only perform actions that are allowed by BOTH:

  1. Their IAM policy (identity-based or resource-based)
  2. The SCPs applied to their account

Think of it like this: IAM policies say "what you can do," while SCPs say "what you're allowed to do." You need both to allow an action.

Example SCP Implementation:

Scenario: You want to prevent anyone in Development accounts from launching expensive EC2 instance types (like p3.16xlarge GPU instances that cost $24/hour), but Production accounts should be able to use them.

Step 1: Create an SCP for Development OU:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:InstanceType": [
            "p3.16xlarge",
            "p3.8xlarge",
            "p2.16xlarge"
          ]
        }
      }
    }
  ]
}

Step 2: Attach SCP to Development OU:
This SCP is attached to the Development OU, which contains 20 development accounts.

What Happens:

In Development Account:

  • A developer has full EC2 permissions via their IAM policy
  • They try to launch a p3.16xlarge instance
  • AWS evaluates: IAM policy says "Allow", but SCP says "Deny"
  • Result: Denied - The SCP overrides the IAM policy
  • Even if the account administrator gives themselves full admin permissions, they still cannot launch these instance types

In Production Account:

  • Production OU doesn't have this restrictive SCP
  • A production engineer with EC2 permissions can launch p3.16xlarge instances
  • AWS evaluates: IAM policy says "Allow", SCP doesn't deny
  • Result: Allowed

Key SCP Characteristics:

  1. Inheritance: SCPs attached to parent OUs apply to all child OUs and accounts. If you attach an SCP to the root of your organization, it applies to ALL accounts.

  2. Explicit Deny Wins: If any SCP denies an action, that action is denied regardless of IAM policies. This is the most powerful feature - it cannot be overridden.

  3. Default Deny: By default, SCPs use a "FullAWSAccess" policy that allows everything. When you create restrictive SCPs, you're adding denies on top of this.

  4. No Effect on Root User: SCPs do not affect the root user of member accounts. This is why you should always secure root users with MFA and avoid using them.

Common SCP Use Cases:

Use Case 1: Prevent Region Usage:
Force all resources to be created in specific regions for data residency compliance:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "us-east-1",
            "us-west-2"
          ]
        }
      }
    }
  ]
}

Use Case 2: Prevent Disabling Security Services:
Ensure CloudTrail and Config cannot be disabled:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "config:DeleteConfigurationRecorder",
        "config:StopConfigurationRecorder"
      ],
      "Resource": "*"
    }
  ]
}

Use Case 3: Require MFA for Sensitive Actions:
Require MFA for deleting resources:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "ec2:TerminateInstances",
        "rds:DeleteDBInstance",
        "s3:DeleteBucket"
      ],
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {
          "aws:MultiFactorAuthPresent": "false"
        }
      }
    }
  ]
}

⭐ Must Know (SCPs):

  • SCPs define maximum permissions - they don't grant permissions
  • Explicit deny in SCP cannot be overridden by any IAM policy
  • SCPs apply to all IAM entities in an account except the root user
  • SCPs are inherited from parent OUs to child OUs and accounts
  • You need both IAM policy Allow AND no SCP Deny for an action to succeed
  • SCPs are evaluated before IAM policies in the authorization flow

šŸ’” Tips for Understanding SCPs:

  • Think of SCPs as "permission boundaries for entire accounts"
  • Use SCPs for organization-wide security requirements that must not be bypassed
  • Start with broad SCPs at the root, then add more specific ones at OU level
  • Test SCPs in a non-production OU first to avoid accidentally blocking critical operations

āš ļø Common Mistakes with SCPs:

  • Mistake: Thinking SCPs grant permissions

    • Why it's wrong: SCPs only restrict permissions. You still need IAM policies to grant permissions.
    • Correct understanding: SCPs set boundaries; IAM policies grant permissions within those boundaries.
  • Mistake: Forgetting SCPs don't affect root user

    • Why it's wrong: Root user can still perform actions denied by SCPs
    • Correct understanding: Always secure root user separately with MFA and avoid using it for daily operations.
  • Mistake: Creating overly restrictive SCPs that block AWS service operations

    • Why it's wrong: Some AWS services need to perform actions on your behalf (like CloudFormation creating resources)
    • Correct understanding: Use condition keys to allow service-to-service calls while restricting user actions.

Section 2: Network Security & VPC Architecture

Introduction

The problem: Applications need to be accessible to users while remaining protected from attacks. Public internet exposure creates security risks, but complete isolation makes applications unusable.

The solution: Amazon Virtual Private Cloud (VPC) provides network isolation with fine-grained control over traffic flow, allowing you to create secure network architectures that balance accessibility with protection.

Why it's tested: Network security is fundamental to the "Design Secure Architectures" domain (30% of exam). Questions test your ability to design VPC architectures with proper segmentation, access controls, and traffic filtering.

Core Concepts

Virtual Private Cloud (VPC) Fundamentals

What it is: A VPC is a logically isolated section of the AWS cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including IP address ranges, subnets, route tables, and network gateways.

Why it exists: When AWS launched, all resources were in a shared network space. Customers needed network isolation for security, compliance, and to replicate their on-premises network architectures in the cloud. VPC provides this isolation while maintaining the flexibility and scalability of cloud computing.

Real-world analogy: Think of a VPC like a private office building within a large business district (AWS Region). The building has its own address range (CIDR block), multiple floors (subnets), security checkpoints (security groups and NACLs), and controlled entry/exit points (internet gateways and NAT gateways). Just as you control who enters your building and which floors they can access, you control network traffic in your VPC.

How it works (Detailed step-by-step):

  1. Create VPC with CIDR Block: You define an IP address range for your VPC using CIDR notation (e.g., 10.0.0.0/16). This gives you 65,536 IP addresses to use within your VPC. AWS reserves 5 IP addresses in each subnet for networking purposes (network address, VPC router, DNS, future use, and broadcast).

  2. Divide into Subnets: You create subnets within your VPC, each in a specific Availability Zone. Each subnet gets a portion of the VPC's IP address range (e.g., 10.0.1.0/24 for public subnet, 10.0.2.0/24 for private subnet). Subnets cannot span multiple Availability Zones.

  3. Configure Route Tables: Each subnet has a route table that determines where network traffic is directed. The route table contains rules (routes) that specify which traffic goes where. For example, a route might say "send traffic destined for 10.0.0.0/16 to local (within VPC)" and "send traffic destined for 0.0.0.0/0 (internet) to the internet gateway."

  4. Attach Internet Gateway (for public access): An internet gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between your VPC and the internet. You attach one internet gateway per VPC. Resources in subnets with routes to the internet gateway can communicate with the internet if they have public IP addresses.

  5. Configure Security Groups: Security groups act as virtual firewalls for your EC2 instances. They control inbound and outbound traffic at the instance level. Security groups are stateful - if you allow inbound traffic, the response traffic is automatically allowed outbound.

  6. Configure Network ACLs: Network Access Control Lists (NACLs) provide an additional layer of security at the subnet level. They control traffic entering and leaving subnets. NACLs are stateless - you must explicitly allow both inbound and outbound traffic.

  7. Launch Resources: You launch EC2 instances, RDS databases, and other resources into your subnets. Each resource gets a private IP address from the subnet's CIDR range. Resources in public subnets can optionally receive public IP addresses or Elastic IPs for internet communication.

  8. Traffic Flow: When an instance sends traffic, AWS evaluates security groups, NACLs, and route tables to determine if the traffic is allowed and where it should go. This evaluation happens at wire speed without impacting performance.

šŸ“Š VPC Architecture Diagram:

graph TB
    subgraph "AWS Cloud"
        subgraph "VPC 10.0.0.0/16"
            IGW[Internet Gateway]
            
            subgraph "Availability Zone A"
                subgraph "Public Subnet 10.0.1.0/24"
                    WEB1[Web Server<br/>Public IP: 54.x.x.x<br/>Private IP: 10.0.1.10]
                    NAT1[NAT Gateway<br/>Elastic IP: 52.x.x.x]
                end
                
                subgraph "Private Subnet 10.0.2.0/24"
                    APP1[App Server<br/>Private IP: 10.0.2.10]
                    DB1[RDS Primary<br/>Private IP: 10.0.2.20]
                end
            end
            
            subgraph "Availability Zone B"
                subgraph "Public Subnet 10.0.3.0/24"
                    WEB2[Web Server<br/>Public IP: 54.x.x.y<br/>Private IP: 10.0.3.10]
                    NAT2[NAT Gateway<br/>Elastic IP: 52.x.x.y]
                end
                
                subgraph "Private Subnet 10.0.4.0/24"
                    APP2[App Server<br/>Private IP: 10.0.4.10]
                    DB2[RDS Standby<br/>Private IP: 10.0.4.20]
                end
            end
        end
    end
    
    INTERNET[Internet Users]
    INTERNET -->|HTTPS 443| IGW
    IGW --> WEB1
    IGW --> WEB2

See: diagrams/02_domain1_vpc_architecture.mmd

Diagram Explanation (Comprehensive):

This diagram shows a production-ready, highly available VPC architecture spanning two Availability Zones (AZ-A and AZ-B) within a single AWS Region. Let me explain each component and how they work together:

VPC Foundation (10.0.0.0/16):
The entire VPC uses the 10.0.0.0/16 CIDR block, providing 65,536 IP addresses. This is a private IP range (RFC 1918) that won't conflict with public internet addresses. The /16 subnet mask means the first 16 bits are fixed (10.0), and the remaining 16 bits can vary, giving us flexibility to create many subnets.

Internet Gateway (IGW):
The Internet Gateway is the entry and exit point for internet traffic. It's a highly available, horizontally scaled AWS-managed component attached to the VPC. The IGW performs Network Address Translation (NAT) for instances with public IP addresses, translating between private IPs (10.0.x.x) and public IPs (54.x.x.x). It's the only way for resources in public subnets to communicate directly with the internet.

Public Subnets (10.0.1.0/24 and 10.0.3.0/24):
These subnets are "public" because their route tables have a route sending internet-bound traffic (0.0.0.0/0) to the Internet Gateway. Each public subnet provides 256 IP addresses (actually 251 usable, as AWS reserves 5). Resources in public subnets can have public IP addresses and communicate directly with the internet. In this architecture, we place web servers and NAT Gateways in public subnets because they need to accept connections from or initiate connections to the internet.

Web Servers (WEB1 and WEB2):
Each web server has two IP addresses: a private IP from the subnet range (10.0.1.10 and 10.0.3.10) and a public IP (54.x.x.x and 54.x.x.y) for internet communication. When internet users send HTTPS requests to the public IP, the Internet Gateway translates it to the private IP and forwards it to the web server. The web server processes the request and sends the response back through the IGW. Having web servers in both AZs provides high availability - if AZ-A fails, WEB2 in AZ-B continues serving traffic.

NAT Gateways (NAT1 and NAT2):
NAT Gateways enable instances in private subnets to initiate outbound connections to the internet (for software updates, API calls, etc.) while preventing inbound connections from the internet. Each NAT Gateway has an Elastic IP address (a static public IP) and is placed in a public subnet. When an app server in a private subnet sends traffic to the internet, the traffic is routed to the NAT Gateway, which translates the private IP to its Elastic IP, sends the traffic to the internet, receives the response, and forwards it back to the app server. Having separate NAT Gateways in each AZ provides high availability and reduces cross-AZ data transfer costs.

Private Subnets (10.0.2.0/24 and 10.0.4.0/24):
These subnets are "private" because their route tables send internet-bound traffic to a NAT Gateway instead of directly to the Internet Gateway. Resources in private subnets only have private IP addresses and cannot be directly accessed from the internet. This provides an additional security layer - even if an attacker compromises the web server, they cannot directly access the app servers or databases. The private subnets can still initiate outbound connections through the NAT Gateway for updates and external API calls.

Application Servers (APP1 and APP2):
These servers run the business logic and are placed in private subnets for security. They only have private IPs (10.0.2.10 and 10.0.4.10) and cannot be accessed directly from the internet. Web servers communicate with app servers using private IPs within the VPC. The app servers can make outbound internet connections through their respective NAT Gateways for tasks like calling external APIs or downloading updates.

RDS Database Instances (DB1 and DB2):
The database instances are also in private subnets with only private IPs (10.0.2.20 and 10.0.4.20). DB1 is the primary instance handling all read and write operations, while DB2 is a standby replica in a different AZ for high availability. RDS automatically performs synchronous replication from DB1 to DB2, ensuring zero data loss. If DB1 fails, RDS automatically promotes DB2 to primary within 1-2 minutes. The databases are the most critical and sensitive components, so they're placed in the most protected layer with no internet access.

Route Tables:

  • Public Route Table: Contains two routes: (1) 10.0.0.0/16 → local (traffic within VPC stays in VPC), and (2) 0.0.0.0/0 → IGW (all other traffic goes to internet). This table is associated with both public subnets.
  • Private Route Table AZ-A: Contains (1) 10.0.0.0/16 → local, and (2) 0.0.0.0/0 → NAT1 (internet traffic goes through NAT Gateway in AZ-A). Associated with private subnets in AZ-A.
  • Private Route Table AZ-B: Same as AZ-A but routes to NAT2. Associated with private subnets in AZ-B.

Traffic Flow Examples:

  1. User Request Flow: Internet user → IGW → WEB1 (public subnet) → APP1 (private subnet) → DB1 (private subnet) → response back through same path.

  2. Outbound Update Flow: APP1 needs to download updates → traffic routed to NAT1 (via route table) → NAT1 translates private IP to Elastic IP → IGW → Internet → response back through same path.

  3. Cross-AZ Communication: WEB1 (AZ-A) can communicate with APP2 (AZ-B) using private IPs because both are in the same VPC (10.0.0.0/16 → local route).

  4. Database Replication: DB1 → DB2 synchronous replication happens over private IPs within the VPC, never leaving AWS's network.

Security Layers:
This architecture implements defense in depth with multiple security layers:

  1. Network Segmentation: Public and private subnets separate internet-facing and internal resources
  2. No Direct Internet Access: App servers and databases cannot be accessed from internet
  3. Controlled Outbound Access: Private resources can only reach internet through NAT Gateways
  4. High Availability: Resources in multiple AZs ensure service continuity during failures
  5. Least Privilege: Each tier only has the network access it needs

This is the recommended architecture pattern for production workloads on AWS, balancing security, availability, and operational requirements.

Detailed Example 1: Three-Tier Web Application VPC Design

Let's design a VPC for an e-commerce application with web servers, application servers, and databases. The application needs to be highly available, secure, and scalable.

Requirements:

  • Support 100 web servers, 200 application servers, 10 database instances
  • High availability across 2 Availability Zones
  • Web servers accessible from internet
  • Application servers and databases not directly accessible from internet
  • Application servers need to call external payment APIs
  • Comply with PCI-DSS requirements for payment processing

Design Solution:

Step 1: Choose VPC CIDR Block
We'll use 10.0.0.0/16 (65,536 IPs) to ensure we have enough addresses for growth.

Step 2: Plan Subnet Structure
We need 6 subnets (3 tiers Ɨ 2 AZs):

  • Public Subnet AZ-A: 10.0.1.0/24 (256 IPs) - Web servers
  • Public Subnet AZ-B: 10.0.2.0/24 (256 IPs) - Web servers
  • Private Subnet AZ-A: 10.0.11.0/24 (256 IPs) - App servers
  • Private Subnet AZ-B: 10.0.12.0/24 (256 IPs) - App servers
  • Database Subnet AZ-A: 10.0.21.0/24 (256 IPs) - Databases
  • Database Subnet AZ-B: 10.0.22.0/24 (256 IPs) - Databases

Step 3: Configure Internet Gateway
Attach one Internet Gateway to the VPC for internet connectivity.

Step 4: Configure NAT Gateways
Deploy NAT Gateway in each public subnet (one per AZ) for high availability:

  • NAT Gateway 1 in 10.0.1.0/24 (AZ-A)
  • NAT Gateway 2 in 10.0.2.0/24 (AZ-B)

Step 5: Configure Route Tables

  • Public Route Table: 0.0.0.0/0 → IGW, 10.0.0.0/16 → local
    • Associate with public subnets in both AZs
  • Private Route Table AZ-A: 0.0.0.0/0 → NAT Gateway 1, 10.0.0.0/16 → local
    • Associate with private subnet and database subnet in AZ-A
  • Private Route Table AZ-B: 0.0.0.0/0 → NAT Gateway 2, 10.0.0.0/16 → local
    • Associate with private subnet and database subnet in AZ-B

Step 6: Configure Security Groups

Web Server Security Group:

  • Inbound: Allow HTTPS (443) from 0.0.0.0/0 (internet)
  • Inbound: Allow HTTP (80) from 0.0.0.0/0 (for redirect to HTTPS)
  • Outbound: Allow all traffic (default)

Application Server Security Group:

  • Inbound: Allow port 8080 from Web Server Security Group only
  • Outbound: Allow HTTPS (443) to 0.0.0.0/0 (for payment API calls)
  • Outbound: Allow port 3306 to Database Security Group (MySQL)

Database Security Group:

  • Inbound: Allow port 3306 from Application Server Security Group only
  • Outbound: Allow all traffic to 10.0.0.0/16 (for replication)

Step 7: Configure Network ACLs
Use default NACL (allow all) for simplicity, or create custom NACLs for additional security:

  • Public Subnet NACL: Allow inbound 80, 443, ephemeral ports (1024-65535)
  • Private Subnet NACL: Allow inbound from VPC CIDR only
  • Database Subnet NACL: Allow inbound 3306 from private subnets only

Step 8: Deploy Resources

  • Launch web servers in public subnets with public IPs
  • Launch application servers in private subnets (no public IPs)
  • Launch RDS Multi-AZ database with primary in AZ-A, standby in AZ-B
  • Configure Application Load Balancer in public subnets to distribute traffic to web servers

Security Benefits of This Design:

  1. Network Isolation: Each tier is in separate subnets with different security controls
  2. Least Privilege Access: Security groups enforce minimum necessary access between tiers
  3. No Direct Database Access: Databases cannot be reached from internet, only from app servers
  4. Controlled Outbound Access: App servers can only reach specific external endpoints
  5. High Availability: Resources in multiple AZs survive AZ failures
  6. Defense in Depth: Multiple security layers (subnets, security groups, NACLs)
  7. PCI-DSS Compliance: Payment processing servers isolated in private subnets

Cost Considerations:

  • NAT Gateways: $0.045/hour Ɨ 2 = $65/month + data processing charges
  • Data Transfer: Cross-AZ traffic costs $0.01/GB (minimize by using same-AZ NAT)
  • Elastic IPs: Free when attached to running NAT Gateways

Detailed Example 2: Security Group vs NACL - When to Use Each

Understanding the difference between Security Groups and Network ACLs is critical for the exam. Let's explore a scenario that demonstrates when to use each.

Scenario: You're securing a web application where you've noticed suspicious traffic patterns. Some IP addresses are making thousands of requests per second (potential DDoS), and you need to block them. You also need to ensure that only your application servers can access your database.

Security Groups Approach:

Security Groups are stateful, instance-level firewalls. When you allow inbound traffic, the response is automatically allowed outbound.

Example Security Group for Web Server:

Inbound Rules:
- Type: HTTPS, Protocol: TCP, Port: 443, Source: 0.0.0.0/0
- Type: HTTP, Protocol: TCP, Port: 80, Source: 0.0.0.0/0

Outbound Rules:
- Type: All traffic, Protocol: All, Port: All, Destination: 0.0.0.0/0

Problem with Security Groups for DDoS:
Security Groups cannot block specific IP addresses. They can only allow traffic from specific sources. To block the malicious IPs, you would need to:

  1. Remove the rule allowing 0.0.0.0/0
  2. Add rules allowing only legitimate IP ranges
  3. This is impractical when you need to allow all internet users except specific attackers

Network ACL Approach:

Network ACLs are stateless, subnet-level firewalls. You must explicitly allow both inbound and outbound traffic. NACLs support both ALLOW and DENY rules, and rules are evaluated in order by rule number.

Example NACL for Public Subnet:

Inbound Rules:
Rule #  Type        Protocol  Port    Source          Allow/Deny
10      HTTP        TCP       80      0.0.0.0/0       ALLOW
20      HTTPS       TCP       443     0.0.0.0/0       ALLOW
30      Custom      TCP       1024-   0.0.0.0/0       ALLOW (ephemeral ports)
                              65535
50      All traffic All       All     198.51.100.5/32 DENY (malicious IP)
60      All traffic All       All     198.51.100.6/32 DENY (malicious IP)
100     All traffic All       All     0.0.0.0/0       DENY (default deny)

Outbound Rules:
Rule #  Type        Protocol  Port    Destination     Allow/Deny
10      HTTP        TCP       80      0.0.0.0/0       ALLOW
20      HTTPS       TCP       443     0.0.0.0/0       ALLOW
30      Custom      TCP       1024-   0.0.0.0/0       ALLOW (ephemeral ports)
                              65535
100     All traffic All       All     0.0.0.0/0       DENY (default deny)

How NACL Blocks Malicious IPs:

  1. Traffic from 198.51.100.5 arrives at the subnet
  2. NACL evaluates rules in order (10, 20, 30, 50...)
  3. Rule 50 matches (source IP 198.51.100.5) and denies the traffic
  4. Traffic is blocked before reaching any instance in the subnet
  5. This protects all instances in the subnet simultaneously

Why NACLs Are Better for IP Blocking:

  • Can explicitly DENY specific IPs or ranges
  • Evaluated before traffic reaches instances (reduces load)
  • Protects entire subnet, not just individual instances
  • Rules evaluated in order, allowing fine-grained control

Database Security Group Example:

For the database tier, Security Groups are ideal because you want to allow access only from specific sources (application servers), not block specific sources.

Database Security Group:
Inbound Rules:
- Type: MySQL/Aurora, Protocol: TCP, Port: 3306, Source: sg-app-servers
- Type: Custom TCP, Protocol: TCP, Port: 3306, Source: sg-bastion (for admin access)

Outbound Rules:
- Type: All traffic, Protocol: All, Port: All, Destination: 0.0.0.0/0

Why Security Groups Are Better for Database Access:

  • Stateful: Response traffic automatically allowed
  • Can reference other security groups (sg-app-servers) instead of IP ranges
  • Automatically updates when instances are added/removed from app server group
  • Simpler to manage than maintaining IP lists in NACLs

Decision Framework:

Use Security Groups when:

  • āœ… Controlling access between application tiers (web → app → database)
  • āœ… Allowing traffic from specific sources (other security groups, IP ranges)
  • āœ… You want stateful firewall behavior (automatic response traffic)
  • āœ… You need instance-level granularity
  • āœ… You want to reference other security groups dynamically

Use Network ACLs when:

  • āœ… Blocking specific IP addresses or ranges (DDoS mitigation)
  • āœ… Adding an additional layer of defense (defense in depth)
  • āœ… Enforcing subnet-level policies that apply to all resources
  • āœ… You need explicit control over both inbound and outbound traffic
  • āœ… Compliance requires stateless firewall rules

Use Both (Defense in Depth):

  • āœ… NACL blocks known malicious IPs at subnet boundary
  • āœ… Security Group allows only legitimate application traffic at instance level
  • āœ… Provides multiple layers of protection

Common Exam Scenario:
"A web application is experiencing a DDoS attack from specific IP addresses. How can you quickly block these IPs?"

Answer: Use Network ACL DENY rules. Security Groups cannot deny traffic, only allow it. NACLs can explicitly deny specific IPs and are evaluated before traffic reaches instances.

VPN and Direct Connect for Hybrid Connectivity

What they are: AWS Site-to-Site VPN and AWS Direct Connect are services that securely connect your on-premises data center or office network to your AWS VPC, enabling hybrid cloud architectures.

Why they exist: Many organizations cannot move all their infrastructure to the cloud immediately. They need secure, reliable connections between on-premises systems and AWS resources. Public internet connections are insecure and unreliable for production workloads. VPN and Direct Connect provide secure, private connectivity options.

Real-world analogy: Think of your on-premises network and AWS VPC as two office buildings in different cities. VPN is like making a secure phone call over the public phone network - it's encrypted and private, but uses public infrastructure. Direct Connect is like having a dedicated private fiber optic cable between the buildings - it's more expensive but provides better performance, reliability, and security.

AWS Site-to-Site VPN:

A VPN connection creates an encrypted tunnel over the public internet between your on-premises network and your VPC. It uses IPsec (Internet Protocol Security) to encrypt all traffic.

How VPN Works (Step-by-step):

  1. Create Virtual Private Gateway (VGW): Attach a VGW to your VPC. This is the VPN endpoint on the AWS side. The VGW is highly available across multiple AZs automatically.

  2. Create Customer Gateway: Define your on-premises VPN device's public IP address in AWS. This tells AWS where to establish the VPN tunnel.

  3. Create VPN Connection: AWS generates VPN configuration including pre-shared keys, tunnel IP addresses, and routing information. You download this configuration.

  4. Configure On-Premises Device: Apply the AWS-provided configuration to your on-premises VPN device (firewall, router, or VPN appliance).

  5. Establish Tunnels: AWS creates two VPN tunnels (for redundancy) to different AWS endpoints. Your device establishes IPsec tunnels to both endpoints.

  6. Configure Routing: Update your VPC route tables to send traffic destined for your on-premises network (e.g., 192.168.0.0/16) to the VGW. Update your on-premises routing to send AWS-bound traffic through the VPN tunnels.

  7. Traffic Flow: When an EC2 instance sends traffic to an on-premises IP, the VPC route table directs it to the VGW, which encrypts it and sends it through the VPN tunnel. Your on-premises device decrypts it and forwards it to the destination.

VPN Characteristics:

  • Bandwidth: Up to 1.25 Gbps per tunnel (2.5 Gbps total with both tunnels)
  • Latency: Variable, depends on internet path (typically 50-200ms)
  • Cost: $0.05/hour per VPN connection + data transfer charges
  • Setup Time: Minutes to hours
  • Encryption: IPsec encryption (AES-256)
  • Availability: Two tunnels for redundancy

When to Use VPN:

  • āœ… Quick setup needed (hours, not weeks)
  • āœ… Budget-conscious (low monthly cost)
  • āœ… Bandwidth requirements under 1 Gbps
  • āœ… Temporary or backup connectivity
  • āœ… Multiple remote offices need AWS access
  • āœ… Encryption required by compliance

AWS Direct Connect:

Direct Connect provides a dedicated network connection from your on-premises data center to AWS through a Direct Connect location (AWS partner facility). Traffic never traverses the public internet.

How Direct Connect Works (Step-by-step):

  1. Choose Direct Connect Location: Select an AWS Direct Connect location near your data center. These are facilities operated by AWS partners (like Equinix, CoreSite).

  2. Order Cross-Connect: Work with the facility provider to establish a physical fiber connection from your equipment to AWS's equipment in the same facility. This is called a "cross-connect."

  3. Create Direct Connect Connection: In AWS console, create a Direct Connect connection specifying the location and bandwidth (1 Gbps or 10 Gbps).

  4. Create Virtual Interface (VIF): Create a private VIF to access your VPC, or a public VIF to access AWS public services (S3, DynamoDB) without going through the internet.

  5. Configure BGP: Direct Connect uses Border Gateway Protocol (BGP) for dynamic routing. You configure BGP on your router to exchange routes with AWS.

  6. Attach to Virtual Private Gateway or Direct Connect Gateway: Connect your VIF to a VGW (for single VPC) or Direct Connect Gateway (for multiple VPCs/regions).

  7. Update Route Tables: Configure VPC route tables to send on-premises traffic to the VGW. BGP automatically advertises your VPC routes to your on-premises network.

  8. Traffic Flow: Traffic flows over the dedicated fiber connection, never touching the public internet. AWS routes it directly to your VPC.

Direct Connect Characteristics:

  • Bandwidth: 1 Gbps, 10 Gbps, or 100 Gbps dedicated connections
  • Latency: Consistent, low latency (typically 10-50ms)
  • Cost: Port hour charges ($0.30/hour for 1 Gbps) + data transfer out charges
  • Setup Time: Weeks to months (physical installation required)
  • Encryption: Not encrypted by default (use VPN over Direct Connect for encryption)
  • Availability: Single connection (use two for redundancy)

When to Use Direct Connect:

  • āœ… High bandwidth requirements (>1 Gbps)
  • āœ… Consistent, low latency needed
  • āœ… Large data transfers (cheaper than internet transfer)
  • āœ… Predictable network performance required
  • āœ… Long-term connectivity (justify setup time/cost)
  • āœ… Accessing AWS public services without internet

Comparison Table:

Feature Site-to-Site VPN Direct Connect
Connection Type Encrypted tunnel over internet Dedicated private connection
Bandwidth Up to 1.25 Gbps per tunnel 1/10/100 Gbps dedicated
Latency Variable (50-200ms) Consistent, low (10-50ms)
Setup Time Minutes to hours Weeks to months
Cost $0.05/hour + data transfer Port hours + data transfer
Encryption IPsec (built-in) Not encrypted (add VPN if needed)
Availability 2 tunnels (redundant) Single connection (order 2 for HA)
Use Case Quick setup, backup, low bandwidth High bandwidth, consistent performance

Hybrid Architecture Pattern: VPN + Direct Connect:

For maximum reliability, many organizations use both:

  • Primary: Direct Connect for production traffic (high bandwidth, low latency)
  • Backup: VPN for failover if Direct Connect fails
  • Configuration: Use BGP to prefer Direct Connect (lower BGP metric), automatically failover to VPN if Direct Connect is unavailable

Detailed Example: Hybrid Cloud Architecture with Direct Connect

Scenario: A financial services company has a data center in New York with 500 TB of customer data. They're migrating applications to AWS us-east-1 region but must keep the database on-premises for compliance. Applications in AWS need low-latency access to the on-premises database.

Requirements:

  • Consistent latency under 20ms for database queries
  • Bandwidth for 10 Gbps peak traffic
  • Highly available (99.99% uptime)
  • Secure connection (encrypted)
  • Access to multiple VPCs in us-east-1

Solution Design:

Step 1: Order Two Direct Connect Connections

  • Order two 10 Gbps Direct Connect connections at different Direct Connect locations (e.g., Equinix NY5 and CoreSite NY1) for redundancy
  • Each connection costs $2.25/hour = $1,620/month

Step 2: Create Direct Connect Gateway

  • Create a Direct Connect Gateway to connect multiple VPCs to the Direct Connect connections
  • This allows all VPCs to share the same Direct Connect connections

Step 3: Create Private Virtual Interfaces

  • Create two private VIFs, one on each Direct Connect connection
  • Associate both VIFs with the Direct Connect Gateway
  • Configure BGP with AS numbers and BGP keys

Step 4: Attach VPCs to Direct Connect Gateway

  • Attach Virtual Private Gateways from Production VPC, Development VPC, and Testing VPC to the Direct Connect Gateway
  • All three VPCs can now communicate with on-premises over Direct Connect

Step 5: Configure VPN for Encryption

  • Create Site-to-Site VPN connections over each Direct Connect connection
  • This provides IPsec encryption for data in transit (compliance requirement)
  • VPN over Direct Connect combines Direct Connect's performance with VPN's encryption

Step 6: Configure BGP Routing

  • On-premises router advertises 192.168.0.0/16 (on-premises network) to AWS via BGP
  • AWS advertises VPC CIDR blocks (10.0.0.0/16, 10.1.0.0/16, 10.2.0.0/16) to on-premises
  • Configure BGP weights to prefer primary Direct Connect connection, failover to secondary if primary fails

Step 7: Update Route Tables

  • VPC route tables: 192.168.0.0/16 → Virtual Private Gateway
  • On-premises routing: 10.0.0.0/8 → Direct Connect router

Traffic Flow:

  1. Application in AWS Production VPC (10.0.1.10) queries on-premises database (192.168.1.50)
  2. VPC route table sends traffic to VGW
  3. VGW sends traffic to Direct Connect Gateway
  4. Direct Connect Gateway sends traffic over primary Direct Connect connection
  5. VPN encrypts traffic over Direct Connect
  6. Traffic arrives at on-premises data center, VPN decrypts
  7. On-premises router forwards to database server
  8. Response follows same path in reverse

Availability:

  • If primary Direct Connect fails, BGP automatically reroutes traffic to secondary Direct Connect
  • If both Direct Connect connections fail, traffic can failover to internet-based VPN (not shown, but recommended)
  • Achieves 99.99% availability with dual Direct Connect + VPN backup

Performance:

  • Latency: 10-15ms (Direct Connect) vs 50-100ms (internet VPN)
  • Bandwidth: 10 Gbps per connection, 20 Gbps total
  • Consistent performance (no internet congestion)

Cost Analysis:

  • Direct Connect: $2.25/hour Ɨ 2 connections Ɨ 730 hours = $3,285/month
  • Data Transfer Out: $0.02/GB for first 10 TB = $200/month (for 10 TB)
  • VPN: $0.05/hour Ɨ 2 connections Ɨ 730 hours = $73/month
  • Total: ~$3,560/month

Compared to Internet Transfer:

  • Transferring 10 TB/month over internet: $0.09/GB Ɨ 10,000 GB = $900/month
  • Direct Connect saves money at high data volumes (>40 TB/month)
  • Plus benefits of consistent performance and lower latency

AWS Security Services

AWS WAF (Web Application Firewall):

What it is: AWS WAF is a web application firewall that protects your web applications from common web exploits and bots that could affect availability, compromise security, or consume excessive resources.

Why it exists: Traditional network firewalls (security groups, NACLs) operate at the network layer (Layer 3/4) and cannot inspect HTTP/HTTPS request content. Web applications face application-layer attacks (Layer 7) like SQL injection, cross-site scripting (XSS), and bot attacks that require deep packet inspection. WAF provides this application-layer protection.

Real-world analogy: Think of WAF like a security guard at a nightclub entrance who checks IDs and searches bags. Network firewalls are like the fence around the building - they control who can approach, but WAF inspects what people are carrying and what they're trying to do once they're at the door.

How WAF Works:

  1. Deploy WAF: Attach WAF to CloudFront distribution, Application Load Balancer, API Gateway, or AppSync GraphQL API.

  2. Create Web ACL: A Web Access Control List (Web ACL) contains rules that define what traffic to allow, block, or count.

  3. Add Rules: Rules inspect HTTP/HTTPS requests for patterns like:

    • SQL injection attempts (e.g., ' OR 1=1-- in query parameters)
    • Cross-site scripting (e.g., <script> tags in input fields)
    • Requests from specific countries (geo-blocking)
    • Requests from known malicious IP addresses
    • Rate limiting (e.g., max 2000 requests per 5 minutes from single IP)
  4. Rule Evaluation: When a request arrives, WAF evaluates rules in priority order. First matching rule determines the action (allow, block, count).

  5. Action:

    • Allow: Request passes through to your application
    • Block: WAF returns 403 Forbidden to the client
    • Count: WAF logs the match but allows the request (for testing rules)
  6. Logging: WAF logs all requests to CloudWatch Logs, S3, or Kinesis Data Firehose for analysis.

WAF Rule Examples:

SQL Injection Protection:

Rule: Block requests where query string contains SQL keywords
Pattern: (union|select|insert|update|delete|drop|create|alter)
Action: Block

Rate Limiting:

Rule: Block IPs making more than 2000 requests in 5 minutes
Rate: 2000 requests per 5 minutes
Action: Block for 10 minutes

Geo-Blocking:

Rule: Block requests from countries not in allowed list
Countries: Allow only US, CA, UK, DE, FR
Action: Block all others

Managed Rule Groups:
AWS provides managed rule groups maintained by AWS and AWS Marketplace sellers:

  • Core Rule Set: Protection against OWASP Top 10 vulnerabilities
  • Known Bad Inputs: Blocks patterns known to be malicious
  • SQL Database: Protects against SQL injection
  • Linux/Windows Operating System: Blocks OS-specific exploits
  • PHP/WordPress: Protects PHP and WordPress applications

When to Use WAF:

  • āœ… Protecting web applications from OWASP Top 10 attacks
  • āœ… Blocking bot traffic and scrapers
  • āœ… Rate limiting to prevent DDoS
  • āœ… Geo-blocking for compliance or business reasons
  • āœ… Custom rules for application-specific threats
  • āœ… Protecting APIs from abuse

AWS Shield:

What it is: AWS Shield is a managed DDoS (Distributed Denial of Service) protection service that safeguards applications running on AWS.

Why it exists: DDoS attacks attempt to make applications unavailable by overwhelming them with traffic. These attacks can cost thousands of dollars per hour in bandwidth charges and lost revenue. Shield provides automatic protection against common DDoS attacks.

Two Tiers:

Shield Standard (Free, automatic):

  • Protects against most common Layer 3/4 DDoS attacks (SYN floods, UDP floods, reflection attacks)
  • Automatically enabled for all AWS customers
  • Protects CloudFront, Route 53, Elastic Load Balancing
  • No configuration required

Shield Advanced ($3,000/month):

  • Protection against larger, more sophisticated attacks
  • 24/7 access to AWS DDoS Response Team (DRT)
  • Real-time attack notifications and forensics
  • DDoS cost protection (credits for scaling costs during attacks)
  • Protection for EC2, ELB, CloudFront, Route 53, Global Accelerator
  • Integration with WAF at no additional cost

How Shield Works:

  1. Traffic Analysis: Shield continuously analyzes traffic patterns to establish baselines for normal traffic.

  2. Anomaly Detection: When traffic deviates from normal patterns (sudden spike, unusual packet types), Shield detects potential DDoS attack.

  3. Automatic Mitigation: Shield automatically applies mitigation techniques:

    • Traffic scrubbing (filtering malicious packets)
    • Rate limiting (throttling excessive requests)
    • Traffic shaping (prioritizing legitimate traffic)
  4. Scaling: AWS infrastructure automatically scales to absorb attack traffic without impacting your application.

  5. Notification (Shield Advanced): DRT notifies you of attacks and provides forensics.

Common DDoS Attack Types Shield Protects Against:

SYN Flood: Attacker sends many SYN packets (TCP connection requests) but never completes the handshake, exhausting server connection table.

  • Shield Mitigation: Filters incomplete connections, uses SYN cookies

UDP Flood: Attacker sends large volumes of UDP packets to random ports, consuming bandwidth and server resources.

  • Shield Mitigation: Rate limits UDP traffic, filters packets to unused ports

DNS Query Flood: Attacker sends massive DNS queries to Route 53, attempting to overwhelm DNS service.

  • Shield Mitigation: Route 53 scales automatically, Shield filters malicious queries

HTTP Flood: Attacker sends legitimate-looking HTTP requests at high volume to exhaust application resources.

  • Shield Mitigation: Works with WAF to rate limit and filter malicious requests

When to Use Shield Advanced:

  • āœ… Business-critical applications that cannot tolerate downtime
  • āœ… Applications that have been targeted by DDoS attacks before
  • āœ… Need for 24/7 expert support during attacks
  • āœ… Concern about DDoS-related AWS charges
  • āœ… Compliance requirements for DDoS protection

AWS GuardDuty:

What it is: Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data.

Why it exists: Traditional security tools require manual log analysis and correlation across multiple sources. GuardDuty uses machine learning to automatically analyze billions of events across AWS CloudTrail, VPC Flow Logs, and DNS logs to identify threats without requiring you to deploy or manage any infrastructure.

Real-world analogy: GuardDuty is like a security operations center (SOC) analyst who monitors security cameras, access logs, and network traffic 24/7, looking for suspicious patterns. Instead of you having to watch all the logs, GuardDuty does it automatically and alerts you only when it finds something suspicious.

How GuardDuty Works:

  1. Enable GuardDuty: One-click enable in AWS console. No agents or sensors to deploy.

  2. Data Sources: GuardDuty automatically analyzes:

    • CloudTrail Events: API calls and management events (who did what, when)
    • VPC Flow Logs: Network traffic patterns (who talked to whom)
    • DNS Logs: DNS queries (what domains were resolved)
    • S3 Data Events: S3 object-level API activity
    • EKS Audit Logs: Kubernetes API calls
  3. Threat Intelligence: GuardDuty uses threat intelligence feeds from:

    • AWS Security
    • CrowdStrike
    • Proofpoint
    • Known malicious IPs, domains, and patterns
  4. Machine Learning: GuardDuty builds baselines of normal behavior for your environment and detects anomalies.

  5. Findings: When GuardDuty detects a threat, it generates a finding with:

    • Severity (Low, Medium, High)
    • Threat type and description
    • Affected resources
    • Recommended remediation
  6. Integration: Findings are sent to:

    • GuardDuty console
    • EventBridge (for automated response)
    • Security Hub (for centralized security view)

Example GuardDuty Findings:

UnauthorizedAccess:EC2/SSHBruteForce:

  • What it detected: An EC2 instance is being targeted by SSH brute force attack (many failed login attempts from external IP)
  • Why it matters: Attacker is trying to guess SSH passwords to gain access
  • Remediation: Block the source IP in security group or NACL, review SSH key management

CryptoCurrency:EC2/BitcoinTool.B!DNS:

  • What it detected: An EC2 instance is querying a domain associated with Bitcoin mining
  • Why it matters: Instance may be compromised and used for cryptocurrency mining
  • Remediation: Investigate the instance, check for unauthorized processes, consider terminating and rebuilding

Trojan:EC2/DNSDataExfiltration:

  • What it detected: An EC2 instance is making DNS queries that appear to be exfiltrating data
  • Why it matters: Attacker may be stealing data by encoding it in DNS queries
  • Remediation: Isolate the instance, investigate for malware, review data access logs

UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration:

  • What it detected: IAM credentials from an EC2 instance are being used from an external IP
  • Why it matters: Instance credentials were stolen and are being used outside AWS
  • Remediation: Revoke the credentials, investigate how they were stolen, rotate all credentials

Recon:IAMUser/MaliciousIPCaller:

  • What it detected: API calls are being made from a known malicious IP address
  • Why it matters: Attacker may have compromised IAM credentials and is reconnaissance
  • Remediation: Review CloudTrail for unauthorized actions, rotate credentials, enable MFA

When to Use GuardDuty:

  • āœ… Continuous threat detection without managing infrastructure
  • āœ… Detecting compromised instances and credentials
  • āœ… Identifying reconnaissance and data exfiltration
  • āœ… Compliance requirements for threat monitoring
  • āœ… Automated security monitoring across multiple accounts

Cost: $4.50 per million CloudTrail events analyzed + $1.00 per GB of VPC Flow Logs + $0.50 per million DNS queries. Typical cost: $50-200/month per account.


Section 3: Data Security & Encryption

Introduction

The problem: Data is the most valuable asset for most organizations. Data breaches can result in millions of dollars in losses, regulatory fines, and reputational damage. Data must be protected both when stored (at rest) and when transmitted (in transit).

The solution: AWS provides comprehensive encryption services and key management tools to protect data throughout its lifecycle. Encryption transforms readable data into unreadable ciphertext that can only be decrypted with the correct key.

Why it's tested: Data protection is a core component of the "Design Secure Architectures" domain. The exam tests your understanding of when and how to use encryption, key management best practices, and compliance requirements.

Core Concepts

AWS Key Management Service (KMS)

What it is: AWS KMS is a managed service that makes it easy to create and control the cryptographic keys used to encrypt your data. KMS uses Hardware Security Modules (HSMs) to protect the security of your keys.

Why it exists: Managing encryption keys is complex and risky. If you lose keys, you lose access to your data. If keys are compromised, your data is exposed. KMS provides secure, auditable key management without requiring you to operate your own HSM infrastructure.

Real-world analogy: Think of KMS like a bank's safe deposit box system. The bank (AWS) provides the secure vault (HSM) and manages access controls, but only you have the key to your specific box. You can authorize others to access your box, and the bank keeps detailed records of every access.

How KMS Works (Detailed step-by-step):

  1. Create Customer Master Key (CMK): You create a CMK in KMS, which is a logical representation of a master key. The actual key material never leaves the HSM. You can choose:

    • AWS-managed CMK: AWS creates and manages the key (free, automatic rotation)
    • Customer-managed CMK: You create and manage the key ($1/month, optional rotation)
    • Custom key store: Keys stored in CloudHSM cluster you control (advanced use case)
  2. Define Key Policy: The key policy is a resource-based policy that controls who can use and manage the key. It's similar to an IAM policy but attached to the key itself. Example policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allow use of the key for encryption",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/EC2-S3-Access"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "*"
    }
  ]
}
  1. Encrypt Data: When you need to encrypt data, you call KMS Encrypt API with your data and the CMK ID. KMS uses the CMK to encrypt your data and returns the ciphertext. The CMK never leaves KMS.

  2. Store Ciphertext: You store the encrypted data (ciphertext) in your storage service (S3, EBS, RDS, etc.). The ciphertext is useless without the CMK to decrypt it.

  3. Decrypt Data: When you need to access the data, you call KMS Decrypt API with the ciphertext. KMS verifies you have permission to use the CMK, decrypts the data, and returns the plaintext.

  4. Audit: Every KMS API call is logged in CloudTrail, providing a complete audit trail of who used which keys, when, and for what purpose.

Envelope Encryption:

For large data (>4 KB), KMS uses envelope encryption to improve performance:

  1. Generate Data Key: Call KMS GenerateDataKey API. KMS generates a data encryption key (DEK), encrypts it with your CMK, and returns both the plaintext DEK and encrypted DEK.

  2. Encrypt Data Locally: Use the plaintext DEK to encrypt your data locally (in your application or AWS service). This is fast because it doesn't require network calls to KMS.

  3. Store Encrypted Data + Encrypted DEK: Store both the encrypted data and the encrypted DEK together. Delete the plaintext DEK from memory.

  4. Decrypt Data: To decrypt, send the encrypted DEK to KMS. KMS decrypts it with your CMK and returns the plaintext DEK. Use the plaintext DEK to decrypt your data locally.

Why Envelope Encryption:

  • KMS can only encrypt/decrypt up to 4 KB directly
  • Encrypting large data locally is faster than sending it to KMS
  • You only need to call KMS once per data key, not once per data block
  • Most AWS services (S3, EBS, RDS) use envelope encryption automatically

Detailed Example 1: S3 Bucket Encryption with KMS

Scenario: You're storing customer financial records in S3. Compliance requires that all data be encrypted at rest with keys you control, and you must be able to audit all access to the encryption keys.

Solution: Use S3 with SSE-KMS (Server-Side Encryption with KMS).

Step 1: Create Customer-Managed CMK

aws kms create-key \
  --description "S3 encryption key for financial records" \
  --key-policy file://key-policy.json

Step 2: Create Alias for Easy Reference

aws kms create-alias \
  --alias-name alias/financial-records-key \
  --target-key-id <key-id-from-step-1>

Step 3: Configure S3 Bucket Default Encryption

aws s3api put-bucket-encryption \
  --bucket financial-records-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/financial-records-key"
      },
      "BucketKeyEnabled": true
    }]
  }'

Step 4: Upload Object
When you upload an object, S3 automatically:

  1. Calls KMS GenerateDataKey with your CMK
  2. Receives plaintext DEK and encrypted DEK
  3. Encrypts the object with the plaintext DEK (AES-256)
  4. Stores the encrypted object and encrypted DEK as metadata
  5. Deletes the plaintext DEK from memory

Step 5: Download Object
When you download an object, S3 automatically:

  1. Retrieves the encrypted DEK from object metadata
  2. Calls KMS Decrypt with the encrypted DEK
  3. Receives the plaintext DEK
  4. Decrypts the object with the plaintext DEK
  5. Returns the plaintext object to you
  6. Deletes the plaintext DEK from memory

What You Get:

  • Encryption at Rest: All objects encrypted with AES-256
  • Key Control: You control the CMK, can disable or delete it
  • Audit Trail: CloudTrail logs every KMS API call (who accessed which objects)
  • Compliance: Meets requirements for customer-managed encryption keys
  • Performance: Bucket Key feature reduces KMS API calls by 99% (lower cost)

Cost:

  • CMK: $1/month
  • KMS API calls: $0.03 per 10,000 requests
  • With Bucket Key: ~$1-5/month for typical workload
  • Without Bucket Key: Could be $100s/month for high-volume workloads

Detailed Example 2: EBS Volume Encryption

Scenario: You're launching EC2 instances that process sensitive healthcare data (PHI). HIPAA compliance requires that all data on disk be encrypted.

Solution: Use EBS encryption with KMS.

Step 1: Enable EBS Encryption by Default

aws ec2 enable-ebs-encryption-by-default --region us-east-1

This ensures all new EBS volumes are automatically encrypted.

Step 2: Specify Custom CMK (Optional)

aws ec2 modify-ebs-default-kms-key-id \
  --kms-key-id alias/ebs-encryption-key \
  --region us-east-1

Step 3: Launch Instance with Encrypted Volume

aws ec2 run-instances \
  --image-id ami-12345678 \
  --instance-type t3.medium \
  --block-device-mappings '[{
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 100,
      "VolumeType": "gp3",
      "Encrypted": true,
      "KmsKeyId": "alias/ebs-encryption-key"
    }
  }]'

How EBS Encryption Works:

  1. Volume Creation: When you create an encrypted EBS volume, AWS generates a unique data key for that volume using your CMK.

  2. Data Encryption: All data written to the volume is encrypted using AES-256 with the data key. This happens in the EC2 hypervisor, transparent to your instance.

  3. Data Key Storage: The encrypted data key is stored with the volume metadata. The plaintext data key is stored in memory on the EC2 host (never on disk).

  4. Snapshots: When you create a snapshot of an encrypted volume, the snapshot is automatically encrypted with the same data key. You can copy the snapshot to another region and re-encrypt with a different CMK.

  5. Volume Attachment: When you attach an encrypted volume to an instance, the EC2 service calls KMS to decrypt the data key. The plaintext data key is loaded into the EC2 host's memory.

  6. Performance: Encryption/decryption happens in hardware on the EC2 host, with no performance impact compared to unencrypted volumes.

What You Get:

  • Transparent Encryption: No application changes required
  • Data at Rest: All data on volume encrypted
  • Snapshots: Automatically encrypted
  • Data in Transit: Data moving between EC2 and EBS is encrypted
  • No Performance Impact: Hardware-accelerated encryption
  • Compliance: Meets HIPAA, PCI-DSS encryption requirements

Important Notes:

  • You cannot encrypt an existing unencrypted volume directly
  • To encrypt existing volume: Create snapshot → Copy snapshot with encryption → Create volume from encrypted snapshot
  • Root volumes can be encrypted (requires encrypted AMI or encryption during launch)
  • Encrypted volumes can only be attached to instance types that support EBS encryption

Detailed Example 3: RDS Database Encryption

Scenario: You're running a PostgreSQL database in RDS that stores customer credit card information. PCI-DSS requires encryption of cardholder data at rest.

Solution: Enable RDS encryption with KMS.

Step 1: Create Encrypted RDS Instance

aws rds create-db-instance \
  --db-instance-identifier payments-db \
  --db-instance-class db.r5.large \
  --engine postgres \
  --master-username admin \
  --master-user-password <password> \
  --allocated-storage 100 \
  --storage-encrypted \
  --kms-key-id alias/rds-encryption-key \
  --backup-retention-period 7 \
  --multi-az

What Gets Encrypted:

  • DB Instance Storage: All data files encrypted
  • Automated Backups: Encrypted with same key
  • Read Replicas: Encrypted with same key (or different key if cross-region)
  • Snapshots: Encrypted with same key
  • Logs: CloudWatch Logs encrypted

How RDS Encryption Works:

  1. Instance Creation: RDS generates a unique data key for the instance using your CMK.

  2. Storage Encryption: All data written to storage is encrypted using AES-256 with the data key. This includes:

    • Database files
    • Transaction logs
    • Temporary files
  3. Backup Encryption: Automated backups and snapshots are encrypted with the same data key.

  4. Read Replica Encryption: Read replicas in the same region use the same CMK. Cross-region replicas can use a different CMK in the destination region.

  5. Transparent to Application: Your application connects to RDS normally. Encryption/decryption happens transparently in the RDS service.

Important Limitations:

  • Cannot enable encryption on existing unencrypted DB instance
  • To encrypt existing DB: Create snapshot → Copy snapshot with encryption → Restore from encrypted snapshot
  • Cannot disable encryption once enabled
  • Cannot change the CMK after creation (must create new instance)

Encryption in Transit:

In addition to encryption at rest, you should encrypt data in transit between your application and RDS:

Step 1: Download RDS Certificate Bundle

wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

Step 2: Configure Application to Use SSL

import psycopg2

conn = psycopg2.connect(
    host="payments-db.abc123.us-east-1.rds.amazonaws.com",
    port=5432,
    database="payments",
    user="admin",
    password="<password>",
    sslmode="verify-full",
    sslrootcert="/path/to/global-bundle.pem"
)

Step 3: Enforce SSL Connections (PostgreSQL)

ALTER USER admin SET ssl TO on;

What You Get:

  • Data at Rest: All database files encrypted
  • Data in Transit: SSL/TLS encryption between application and database
  • Backup Encryption: Automated backups and snapshots encrypted
  • Compliance: Meets PCI-DSS, HIPAA encryption requirements
  • Audit Trail: CloudTrail logs all KMS key usage

AWS Certificate Manager (ACM)

What it is: AWS Certificate Manager is a service that lets you easily provision, manage, and deploy SSL/TLS certificates for use with AWS services and your internal connected resources.

Why it exists: Managing SSL/TLS certificates is complex and error-prone. Certificates expire and must be renewed, private keys must be securely stored, and certificate deployment must be coordinated across multiple servers. ACM automates certificate provisioning and renewal, eliminating these operational burdens.

Real-world analogy: Think of ACM like a passport office that issues and renews passports automatically. Instead of you having to remember to renew your passport every 10 years and go through the application process, the passport office automatically sends you a new passport before the old one expires.

How ACM Works:

  1. Request Certificate: You request a certificate for your domain (e.g., www.example.com) through ACM console or API.

  2. Domain Validation: ACM must verify you own the domain. Two methods:

    • DNS Validation: Add a CNAME record to your DNS (recommended, automatic renewal)
    • Email Validation: Click link in email sent to domain owner
  3. Certificate Issuance: Once validated, ACM issues the certificate signed by Amazon's Certificate Authority.

  4. Deploy Certificate: Attach the certificate to:

    • CloudFront distribution
    • Application Load Balancer
    • Network Load Balancer
    • API Gateway
    • Elastic Beanstalk
  5. Automatic Renewal: ACM automatically renews certificates before they expire (60 days before expiration). No action required from you.

  6. Private Key Security: ACM stores private keys securely in AWS. You never have access to the private key, reducing risk of compromise.

Detailed Example: HTTPS for Web Application

Scenario: You're deploying a web application on EC2 instances behind an Application Load Balancer. You need to enable HTTPS with a valid SSL certificate for www.example.com.

Step 1: Request Certificate

aws acm request-certificate \
  --domain-name www.example.com \
  --subject-alternative-names example.com \
  --validation-method DNS

Step 2: Validate Domain Ownership
ACM provides a CNAME record to add to your DNS:

Name: _abc123.www.example.com
Value: _xyz789.acm-validations.aws

Add this record to your Route 53 hosted zone:

aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://dns-validation.json

Step 3: Wait for Validation
ACM automatically validates the domain (usually within minutes) and issues the certificate.

Step 4: Attach Certificate to ALB

aws elbv2 create-listener \
  --load-balancer-arn <alb-arn> \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=<acm-certificate-arn> \
  --default-actions Type=forward,TargetGroupArn=<target-group-arn>

Step 5: Configure HTTP to HTTPS Redirect

aws elbv2 create-listener \
  --load-balancer-arn <alb-arn> \
  --protocol HTTP \
  --port 80 \
  --default-actions Type=redirect,RedirectConfig='{
    "Protocol": "HTTPS",
    "Port": "443",
    "StatusCode": "HTTP_301"
  }'

What You Get:

  • Valid SSL Certificate: Trusted by all browsers
  • Automatic Renewal: No manual renewal required
  • Free: No cost for ACM certificates used with AWS services
  • Secure Key Storage: Private keys never exposed
  • Easy Deployment: One-click attachment to AWS services

Traffic Flow:

  1. User visits http://www.example.com
  2. ALB redirects to https://www.example.com (HTTP 301)
  3. User's browser connects to ALB on port 443
  4. ALB presents ACM certificate
  5. Browser validates certificate (trusted by Amazon CA)
  6. TLS handshake completes, encrypted connection established
  7. ALB decrypts HTTPS traffic, forwards HTTP to EC2 instances
  8. EC2 instances process request, return response to ALB
  9. ALB encrypts response, sends HTTPS to user

Important Notes:

  • ACM certificates are free when used with AWS services
  • ACM certificates cannot be exported (private key stays in AWS)
  • For use outside AWS (on-premises servers), use imported certificates or AWS Private CA
  • Certificates are regional (must request in same region as ALB/CloudFront)
  • CloudFront requires certificates in us-east-1 region

Comparison Tables

Encryption Options Comparison

Service Encryption Method Key Management Use Case Cost
S3 SSE-S3 AES-256 AWS-managed keys Simple encryption, no key control needed Free
S3 SSE-KMS AES-256 Customer-managed CMK Audit trail, key rotation, compliance $1/month + API calls
S3 SSE-C AES-256 Customer-provided keys You manage keys outside AWS Free (you manage keys)
S3 Client-Side Your choice You manage Encrypt before upload, maximum control Free (you manage)
EBS Encryption AES-256 AWS or customer CMK Transparent EC2 volume encryption $1/month (if custom CMK)
RDS Encryption AES-256 AWS or customer CMK Database encryption at rest $1/month (if custom CMK)

Security Services Comparison

Service Layer Purpose Cost When to Use
Security Groups Instance (L3/L4) Allow traffic to instances Free Control access between tiers
NACLs Subnet (L3/L4) Allow/deny traffic to subnets Free Block specific IPs, subnet-level rules
AWS WAF Application (L7) Block web exploits, bots $5/month + rules Protect web apps from OWASP Top 10
AWS Shield Network (L3/L4) DDoS protection Free (Standard) Automatic DDoS protection
GuardDuty Account-wide Threat detection ~$50-200/month Detect compromised resources
Macie S3 data Sensitive data discovery ~$1/GB scanned Find PII/PHI in S3

IAM Authentication Methods

Method Use Case Pros Cons
IAM Users Long-term credentials for people Simple, direct access Hard to manage at scale, credentials can leak
IAM Roles Temporary credentials for services Secure, automatic rotation Requires trust relationship setup
IAM Identity Center SSO for multiple accounts Centralized, SAML/OIDC support Requires setup, additional service
Cognito User Pools Application user authentication Built for web/mobile apps Not for AWS resource access
Cognito Identity Pools Temporary AWS credentials for app users Federated access, mobile-friendly Complex setup for advanced scenarios

Decision Frameworks

Choosing Encryption Method

When choosing S3 encryption:

šŸ“Š Decision Tree:

Start: Need S3 encryption?
ā”œā”€ Need audit trail of key usage?
│  ā”œā”€ Yes → Use SSE-KMS (customer-managed CMK)
│  └─ No → Continue
ā”œā”€ Need to control key rotation?
│  ā”œā”€ Yes → Use SSE-KMS (customer-managed CMK)
│  └─ No → Continue
ā”œā”€ Need to manage keys outside AWS?
│  ā”œā”€ Yes → Use SSE-C or Client-Side Encryption
│  └─ No → Continue
ā”œā”€ Want simplest solution?
│  └─ Yes → Use SSE-S3 (AWS-managed keys)

Decision Logic Explained:

  • SSE-KMS: Choose when you need compliance audit trails, key rotation control, or ability to disable keys. Costs $1/month per CMK + API calls.
  • SSE-S3: Choose for simple encryption without key management overhead. Free and automatic.
  • SSE-C: Choose when you must manage keys in your own key management system. You provide keys with each request.
  • Client-Side: Choose when you need to encrypt data before it leaves your application. Maximum control but most complex.

Choosing Network Security Controls

When securing a multi-tier application:

Layer 1: Network Segmentation

  • āœ… Use separate subnets for each tier (web, app, database)
  • āœ… Public subnets for internet-facing resources only
  • āœ… Private subnets for internal resources
  • āœ… Separate subnets per Availability Zone

Layer 2: Security Groups

  • āœ… Web tier: Allow 80/443 from 0.0.0.0/0
  • āœ… App tier: Allow app port from web tier security group only
  • āœ… Database tier: Allow database port from app tier security group only
  • āœ… Use security group references instead of IP addresses

Layer 3: Network ACLs (optional, for additional security)

  • āœ… Block known malicious IPs at subnet boundary
  • āœ… Enforce subnet-level policies (e.g., no outbound to internet from database subnet)
  • āœ… Add explicit deny rules for compliance

Layer 4: AWS WAF (for web tier)

  • āœ… Attach to Application Load Balancer or CloudFront
  • āœ… Enable managed rule groups (Core Rule Set, Known Bad Inputs)
  • āœ… Add rate limiting rules
  • āœ… Enable logging for analysis

Layer 5: GuardDuty (account-wide)

  • āœ… Enable in all accounts and regions
  • āœ… Configure EventBridge rules for automated response
  • āœ… Integrate with Security Hub for centralized view

Choosing Hybrid Connectivity

When connecting on-premises to AWS:

Requirement VPN Direct Connect Both
Quick setup (hours) āœ… āŒ āœ… (VPN first, DX later)
Low cost (<$100/month) āœ… āŒ āŒ
High bandwidth (>1 Gbps) āŒ āœ… āœ…
Consistent latency āŒ āœ… āœ…
Encryption required āœ… āŒ (add VPN) āœ…
High availability āœ… (2 tunnels) āŒ (order 2) āœ…
Temporary/backup āœ… āŒ āœ… (VPN as backup)

Recommendation:

  • Start with VPN if you need connectivity quickly or have budget constraints
  • Upgrade to Direct Connect when you need consistent performance or high bandwidth
  • Use both for production workloads requiring high availability and encryption

Key Facts & Figures

IAM Limits:

  • Users per account: 5,000 (soft limit, can be increased)
  • Groups per account: 300
  • Roles per account: 1,000
  • Policies per user/group/role: 10 managed policies
  • Policy size: 6,144 characters (managed), 10,240 characters (inline)
  • MFA devices per user: 8

VPC Limits:

  • VPCs per region: 5 (default, can be increased to 100s)
  • Subnets per VPC: 200
  • Internet Gateways per VPC: 1
  • NAT Gateways per AZ: 5
  • Security Groups per VPC: 2,500
  • Rules per Security Group: 60 inbound, 60 outbound
  • Security Groups per network interface: 5
  • NACLs per VPC: 200
  • Rules per NACL: 20 (default, can be increased to 40)

KMS Limits:

  • CMKs per region: 10,000 (customer-managed)
  • API request rate: 5,500/second (shared across all CMKs in region)
  • Encrypt/Decrypt: 4 KB maximum data size
  • GenerateDataKey: Returns 256-bit key (32 bytes)

Important Numbers to Remember:

  • ⭐ Security Group: Stateful, allow rules only, evaluated as a whole
  • ⭐ NACL: Stateless, allow and deny rules, evaluated in order by rule number
  • ⭐ KMS API rate: 5,500 requests/second (use S3 Bucket Keys to reduce calls)
  • ⭐ VPN bandwidth: 1.25 Gbps per tunnel, 2 tunnels per connection
  • ⭐ Direct Connect: 1 Gbps, 10 Gbps, or 100 Gbps dedicated connections
  • ⭐ WAF rate limit: Can configure per IP (e.g., 2000 requests per 5 minutes)

šŸŽÆ Exam Focus: Questions often test:

  • Difference between Security Groups (stateful) and NACLs (stateless)
  • When to use SSE-KMS vs SSE-S3 for S3 encryption
  • How to block specific IP addresses (use NACL, not Security Group)
  • Cross-account access patterns (IAM roles with trust policies)
  • VPN vs Direct Connect selection criteria
  • WAF use cases for application-layer protection

Chapter Summary

What We Covered

This chapter covered the "Design Secure Architectures" domain, which represents 30% of the SAA-C03 exam. We explored three major areas:

āœ… Section 1: Identity and Access Management

  • IAM users, groups, roles, and policies
  • IAM policy evaluation logic and best practices
  • Cross-account access with IAM roles and external IDs
  • AWS Organizations and Service Control Policies (SCPs)
  • IAM Identity Center for SSO
  • Federation with SAML and OIDC
  • Cognito for application user authentication

āœ… Section 2: Network Security & VPC Architecture

  • VPC fundamentals and subnet design
  • Security Groups vs Network ACLs
  • Multi-tier VPC architectures
  • NAT Gateways for private subnet internet access
  • VPN and Direct Connect for hybrid connectivity
  • AWS WAF for application-layer protection
  • AWS Shield for DDoS protection
  • GuardDuty for threat detection

āœ… Section 3: Data Security & Encryption

  • AWS KMS for key management
  • Encryption at rest (S3, EBS, RDS)
  • Encryption in transit (TLS/SSL)
  • AWS Certificate Manager for SSL certificates
  • Envelope encryption patterns
  • Compliance and audit requirements

Critical Takeaways

  1. IAM Best Practices: Always use IAM roles for AWS services instead of embedding access keys. Enable MFA for all users. Follow principle of least privilege. Use SCPs to enforce organization-wide policies.

  2. Network Segmentation: Separate public and private subnets. Place only internet-facing resources in public subnets. Use Security Groups for instance-level control and NACLs for subnet-level control.

  3. Defense in Depth: Use multiple security layers (network segmentation + security groups + NACLs + WAF + GuardDuty). No single security control is sufficient.

  4. Encryption Everywhere: Encrypt data at rest with KMS. Encrypt data in transit with TLS. Use customer-managed CMKs when you need audit trails or key rotation control.

  5. Hybrid Connectivity: Use VPN for quick setup and low cost. Use Direct Connect for high bandwidth and consistent performance. Use both for high availability.

  6. Stateful vs Stateless: Security Groups are stateful (return traffic automatically allowed). NACLs are stateless (must explicitly allow both directions). This is a frequent exam question.

  7. Key Management: KMS provides secure, auditable key management. Use envelope encryption for large data. Enable automatic key rotation for compliance.

  8. Application Security: Use WAF to protect against OWASP Top 10 vulnerabilities. Use Shield for DDoS protection. Use GuardDuty for threat detection.

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between IAM users, groups, and roles
  • I understand when to use IAM roles vs IAM users
  • I can describe how IAM policy evaluation works (explicit deny > explicit allow > implicit deny)
  • I understand the difference between Security Groups and NACLs
  • I can design a multi-tier VPC architecture with public and private subnets
  • I know when to use VPN vs Direct Connect
  • I understand how KMS encryption works (envelope encryption)
  • I can explain the difference between SSE-S3, SSE-KMS, and SSE-C
  • I know when to use AWS WAF vs AWS Shield
  • I understand how GuardDuty detects threats
  • I can describe how to implement cross-account access with IAM roles
  • I know how Service Control Policies (SCPs) work in AWS Organizations

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-20 (IAM and access management)
  • Domain 1 Bundle 2: Questions 21-40 (Network security)
  • Domain 1 Bundle 3: Questions 41-50 (Data security and encryption)
  • Full Practice Test 1: Questions 1-20 (Domain 1 questions)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • IAM policy evaluation logic
    • Security Groups vs NACLs (stateful vs stateless)
    • KMS encryption patterns
    • VPC architecture design
    • Cross-account access patterns

Quick Reference Card

IAM Key Concepts:

  • User: Long-term credentials for people
  • Group: Collection of users with same permissions
  • Role: Temporary credentials for services or cross-account access
  • Policy: JSON document defining permissions
  • SCP: Organization-wide permission boundaries

Network Security:

  • Security Group: Stateful, instance-level, allow rules only
  • NACL: Stateless, subnet-level, allow and deny rules
  • WAF: Application-layer (L7) firewall for web apps
  • Shield: DDoS protection (L3/L4)

Encryption:

  • SSE-S3: AWS-managed keys, free, simple
  • SSE-KMS: Customer-managed keys, audit trail, $1/month
  • SSE-C: Customer-provided keys, you manage
  • Client-Side: Encrypt before upload, maximum control

Hybrid Connectivity:

  • VPN: Encrypted tunnel over internet, up to 1.25 Gbps, $0.05/hour
  • Direct Connect: Dedicated connection, 1/10/100 Gbps, consistent latency

Decision Points:

  • Block specific IPs → Use NACL (not Security Group)
  • Need audit trail for encryption → Use SSE-KMS (not SSE-S3)
  • Cross-account access → Use IAM role with trust policy
  • Protect web app from SQL injection → Use AWS WAF
  • Detect compromised instances → Use GuardDuty

Next Chapter: 03_domain2_resilient_architectures - Design Resilient Architectures (26% of exam)


Chapter Summary

What We Covered

This chapter covered Domain 1: Design Secure Architectures (30% of the exam), the highest-weighted domain. We explored three major task areas:

  • āœ… Task 1.1 - Secure Access to AWS Resources: IAM users, groups, roles, policies, MFA, cross-account access, federation, AWS Organizations, SCPs, IAM Identity Center
  • āœ… Task 1.2 - Secure Workloads and Applications: VPC security architecture, security groups, NACLs, WAF, Shield, GuardDuty, Macie, Secrets Manager, VPN, Direct Connect, network segmentation
  • āœ… Task 1.3 - Data Security Controls: KMS encryption, data at rest and in transit, ACM certificates, S3 encryption options, backup strategies, compliance frameworks

Critical Takeaways

  1. IAM is the Foundation of AWS Security: Every AWS interaction requires authentication and authorization through IAM. Master the principle of least privilege, use roles instead of access keys, and always enable MFA for privileged accounts.

  2. Defense in Depth with Multiple Security Layers: Combine security groups (stateful, instance-level), NACLs (stateless, subnet-level), WAF (application-level), and Shield (DDoS protection) for comprehensive security.

  3. Encryption Everywhere: Encrypt data at rest using KMS, encrypt data in transit using TLS/SSL with ACM certificates. AWS provides encryption options for every storage service - use them.

  4. Network Segmentation is Critical: Use public subnets for internet-facing resources, private subnets for application/database tiers, and isolated subnets for highly sensitive data. Control traffic flow with route tables and security groups.

  5. Automate Security Monitoring: Use GuardDuty for threat detection, Macie for sensitive data discovery, Security Hub for centralized security findings, and Config for compliance monitoring.

  6. Cross-Account Access Patterns: Use IAM roles with trust policies for cross-account access, not IAM users with access keys. Implement SCPs in AWS Organizations to enforce security boundaries.

  7. Secrets Management: Never hardcode credentials. Use Secrets Manager for automatic rotation or Systems Manager Parameter Store for simple configuration data.

Self-Assessment Checklist

Test yourself before moving to Domain 2. You should be able to:

IAM and Access Management:

  • Explain the difference between IAM users, groups, roles, and policies
  • Design a cross-account access strategy using IAM roles
  • Implement MFA for root and privileged users
  • Create IAM policies with conditions and resource-level permissions
  • Configure AWS Organizations with SCPs to enforce security boundaries
  • Set up IAM Identity Center (SSO) for multi-account access
  • Understand when to use SAML federation vs. Cognito

Network Security:

  • Design a multi-tier VPC architecture with public and private subnets
  • Configure security groups with proper ingress/egress rules
  • Implement NACLs for subnet-level traffic control
  • Explain the difference between security groups (stateful) and NACLs (stateless)
  • Set up VPC endpoints to avoid internet traffic for AWS services
  • Configure AWS WAF rules to protect against common attacks
  • Implement AWS Shield Advanced for DDoS protection
  • Use GuardDuty findings to respond to threats

Data Protection:

  • Encrypt S3 buckets using SSE-S3, SSE-KMS, or SSE-C
  • Create and manage KMS customer managed keys (CMKs)
  • Implement key rotation policies
  • Configure RDS encryption at rest and in transit
  • Use ACM to provision and manage SSL/TLS certificates
  • Set up S3 bucket policies to enforce encryption
  • Implement S3 Object Lock for compliance requirements
  • Configure AWS Backup for automated backup management

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-50 (comprehensive security coverage)
  • Domain 1 Bundle 2: Questions 1-50 (additional security scenarios)
  • Security Services Bundle: Questions 1-50 (IAM, KMS, WAF, Shield, GuardDuty)

Expected Score: 75%+ to proceed

If you scored below 75%:

  • IAM weak: Review IAM policy evaluation logic, cross-account roles, federation
  • Network security weak: Review VPC architecture, security groups vs. NACLs, WAF rules
  • Data protection weak: Review KMS encryption, S3 encryption options, certificate management
  • Revisit diagrams: IAM policy evaluation, VPC security layers, KMS encryption flow

Common Exam Traps

Watch out for these in Domain 1 questions:

  1. IAM Policy Evaluation: Remember explicit DENY always wins, even over explicit ALLOW
  2. Security Group vs. NACL: Security groups are stateful (return traffic automatic), NACLs are stateless (must allow both directions)
  3. KMS Key Policies: Both key policy AND IAM policy must allow access (not just one)
  4. S3 Encryption: SSE-S3 uses AWS-managed keys, SSE-KMS uses customer-managed keys with audit trail
  5. Cross-Account Access: Use roles with trust policies, not IAM users with access keys
  6. VPC Endpoints: Gateway endpoints (S3, DynamoDB) are free, Interface endpoints cost money
  7. WAF vs. Shield: WAF protects Layer 7 (HTTP/HTTPS), Shield protects Layer 3/4 (network/transport)

Quick Reference Card

IAM Best Practices:

  • Enable MFA for root and privileged users
  • Use roles for applications, not access keys
  • Apply least privilege principle
  • Use IAM Access Analyzer to identify overly permissive policies
  • Rotate credentials regularly
  • Use SCPs to enforce organizational policies

Network Security Layers:

  1. VPC: Network isolation
  2. Subnets: Public (internet-facing) vs. Private (internal)
  3. Route Tables: Control traffic routing
  4. NACLs: Stateless subnet-level firewall (allow/deny rules)
  5. Security Groups: Stateful instance-level firewall (allow rules only)
  6. WAF: Application-level protection (SQL injection, XSS)
  7. Shield: DDoS protection (Standard free, Advanced paid)

Encryption Options:

  • S3: SSE-S3 (AWS-managed), SSE-KMS (customer-managed), SSE-C (customer-provided), Client-side
  • EBS: KMS encryption (enabled by default for new volumes)
  • RDS: KMS encryption at rest, SSL/TLS in transit
  • DynamoDB: KMS encryption at rest
  • In Transit: TLS/SSL certificates from ACM

Key Services by Use Case:

  • Identity Management: IAM, IAM Identity Center, Cognito, Directory Service
  • Network Security: Security Groups, NACLs, WAF, Shield, Network Firewall
  • Threat Detection: GuardDuty, Inspector, Detective, Security Hub
  • Data Protection: KMS, Secrets Manager, Certificate Manager, Macie
  • Compliance: Config, CloudTrail, Audit Manager, Artifact

Decision Frameworks

When to use which IAM identity:

  • IAM User: Individual person needing long-term AWS access
  • IAM Group: Collection of users with similar permissions
  • IAM Role: Applications, AWS services, or temporary access
  • Federated Identity: Enterprise users with existing identity provider

When to use which encryption:

  • SSE-S3: Simple encryption, AWS manages everything
  • SSE-KMS: Need audit trail, key rotation, fine-grained access control
  • SSE-C: Must control encryption keys outside AWS
  • Client-side: Encrypt before sending to AWS

When to use which network security:

  • Security Group: Instance-level protection, allow rules only
  • NACL: Subnet-level protection, explicit deny rules needed
  • WAF: Protect against application-layer attacks (SQL injection, XSS)
  • Shield Standard: Free DDoS protection for all AWS customers
  • Shield Advanced: Enhanced DDoS protection with cost protection

Integration with Other Domains

Security concepts from Domain 1 integrate with:

  • Domain 2 (Resilient Architectures): Security groups in multi-AZ deployments, encrypted backups
  • Domain 3 (High-Performing Architectures): VPC endpoints for performance, encryption overhead considerations
  • Domain 4 (Cost-Optimized Architectures): KMS key costs, VPC endpoint pricing, Shield Advanced costs

Next Steps

You're now ready for Domain 2: Design Resilient Architectures (Chapter 3). This domain covers:

  • Scalable and loosely coupled architectures (26% of exam weight)
  • High availability and fault tolerance
  • Disaster recovery strategies
  • Auto Scaling and load balancing

Security principles from this chapter will be applied throughout Domain 2, especially in designing secure, resilient architectures.


Chapter 1 Complete āœ… | Next: Chapter 2 - Domain 2: Resilient Architectures


Chapter Summary

What We Covered

  • āœ… IAM: Users, Groups, Roles, Policies, and Access Management
  • āœ… IAM Identity Center (AWS SSO) for centralized access
  • āœ… Multi-Account Strategy with AWS Organizations and Control Tower
  • āœ… VPC Security: Security Groups, NACLs, VPC Flow Logs
  • āœ… Network Protection: AWS WAF, Shield, Network Firewall
  • āœ… Threat Detection: GuardDuty, Macie, Security Hub, Inspector
  • āœ… Data Encryption: KMS, CloudHSM, ACM, Secrets Manager
  • āœ… Secure Connectivity: VPN, Direct Connect, PrivateLink
  • āœ… Application Security: Cognito, API Gateway authorization

Critical Takeaways

  1. IAM Best Practices: Enable MFA for all users, use roles instead of access keys, apply least privilege principle, use IAM policies with conditions
  2. Security Groups vs NACLs: Security groups are stateful (return traffic automatic), NACLs are stateless (must allow both directions); use security groups for instance-level, NACLs for subnet-level
  3. Encryption Everywhere: Encrypt data at rest with KMS, encrypt in transit with TLS/SSL (ACM), rotate keys regularly, use envelope encryption for large data
  4. Defense in Depth: Layer multiple security controls - WAF at edge, security groups at instance, encryption at rest, IAM for access, GuardDuty for threats
  5. Zero Trust: Never trust, always verify - use IAM roles with temporary credentials, implement MFA, monitor with CloudTrail, detect threats with GuardDuty

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between IAM users, groups, and roles
  • I understand when to use identity-based vs resource-based policies
  • I can design a multi-account strategy using Organizations and SCPs
  • I know the difference between security groups and NACLs
  • I can explain how to protect against DDoS attacks using Shield and WAF
  • I understand KMS key types (AWS managed vs customer managed)
  • I can describe when to use VPN vs Direct Connect vs PrivateLink
  • I know how to implement encryption at rest and in transit
  • I understand how GuardDuty detects threats
  • I can explain the shared responsibility model for security

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-20 (IAM and access management)
  • Domain 1 Bundle 2: Questions 1-20 (Network security)
  • Domain 1 Bundle 3: Questions 1-20 (Data protection)
  • Security Services Bundle: Questions 1-25
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: IAM policies and roles, Security groups vs NACLs, KMS encryption
  • Focus on: Understanding when to use each security service and how they integrate

Quick Reference Card

IAM Essentials:

  • User: Long-term credentials (person or application)
  • Group: Collection of users with shared permissions
  • Role: Temporary credentials (for services or federated users)
  • Policy: JSON document defining permissions

Network Security:

  • Security Group: Stateful, instance-level, allow rules only
  • NACL: Stateless, subnet-level, allow and deny rules
  • WAF: Application-layer protection (Layer 7)
  • Shield: DDoS protection (Standard free, Advanced paid)

Encryption Services:

  • KMS: Managed encryption keys (CMK)
  • CloudHSM: Dedicated hardware security module
  • ACM: SSL/TLS certificates for HTTPS
  • Secrets Manager: Rotate secrets automatically

Threat Detection:

  • GuardDuty: Intelligent threat detection (ML-based)
  • Macie: Discover and protect sensitive data in S3
  • Security Hub: Centralized security findings
  • Inspector: Vulnerability scanning for EC2 and containers

Secure Connectivity:

  • VPN: Encrypted connection over internet
  • Direct Connect: Dedicated private connection
  • PrivateLink: Private access to AWS services

Decision Points:

  • Need DDoS protection? → Shield Standard (free) or Shield Advanced (paid)
  • Need application firewall? → WAF
  • Need to detect threats? → GuardDuty
  • Need to find sensitive data? → Macie
  • Need encryption keys? → KMS (managed) or CloudHSM (dedicated)
  • Need SSL certificates? → ACM
  • Need to rotate secrets? → Secrets Manager
  • Need private connectivity? → VPN (quick), Direct Connect (dedicated), PrivateLink (service-specific)


Chapter Summary

What We Covered

This chapter covered Domain 1: Design Secure Architectures (30% of the exam), the most heavily weighted domain. We explored three major task areas:

āœ… Task 1.1: Design Secure Access to AWS Resources

  • IAM fundamentals: Users, groups, roles, and policies
  • Multi-account strategies with AWS Organizations and Control Tower
  • Federated access with IAM Identity Center and external identity providers
  • Cross-account access patterns and role assumption
  • Security best practices: MFA, least privilege, password policies

āœ… Task 1.2: Design Secure Workloads and Applications

  • VPC security architecture: Security groups, NACLs, flow logs
  • Network segmentation with public/private subnets
  • Secure connectivity: VPN, Direct Connect, PrivateLink
  • Application security: WAF, Shield, GuardDuty, Macie
  • User authentication and authorization with Cognito

āœ… Task 1.3: Determine Appropriate Data Security Controls

  • Encryption at rest with KMS and CloudHSM
  • Encryption in transit with ACM and TLS
  • Data lifecycle management and retention policies
  • Backup and disaster recovery strategies
  • Compliance and governance with Config, CloudTrail, and Audit Manager

Critical Takeaways

  1. IAM is the foundation of AWS security: Master users, groups, roles, and policies. Always apply least privilege principle. Use roles for applications, not access keys.

  2. Defense in depth: Layer multiple security controls (security groups + NACLs + WAF + Shield). No single point of failure in security.

  3. Encryption everywhere: Encrypt data at rest (KMS), in transit (TLS/ACM), and in use when possible. Use AWS managed keys for simplicity, customer managed keys for control.

  4. Network segmentation is critical: Use public subnets for internet-facing resources, private subnets for backend systems. Control traffic flow with route tables and security groups.

  5. Automate security: Use Config for compliance monitoring, GuardDuty for threat detection, Security Hub for centralized findings. Don't rely on manual checks.

  6. Shared responsibility model: AWS secures the infrastructure, you secure your data, applications, and configurations. Know where the line is drawn.

  7. Audit everything: Enable CloudTrail in all regions, use CloudWatch Logs for centralized logging, set up alerts for suspicious activity.

  8. Secrets management: Never hardcode credentials. Use Secrets Manager for automatic rotation, Systems Manager Parameter Store for configuration.

  9. Multi-account strategy: Use AWS Organizations for centralized management, SCPs for guardrails, Control Tower for automated account setup.

  10. Compliance is continuous: Use AWS Artifact for compliance reports, Config for continuous monitoring, Audit Manager for audit readiness.

Key Services Quick Reference

Identity & Access Management:

  • IAM: Users, groups, roles, policies (identity-based and resource-based)
  • IAM Identity Center: Centralized SSO for multiple accounts
  • AWS Organizations: Multi-account management with SCPs
  • Control Tower: Automated account setup with guardrails
  • Cognito: User authentication for web/mobile apps

Network Security:

  • VPC: Isolated network with subnets, route tables, gateways
  • Security Groups: Stateful firewall at instance level
  • NACLs: Stateless firewall at subnet level
  • WAF: Web application firewall (Layer 7)
  • Shield: DDoS protection (Standard free, Advanced paid)
  • Network Firewall: Managed firewall for VPC

Data Protection:

  • KMS: Managed encryption keys (CMKs)
  • CloudHSM: Dedicated hardware security module
  • ACM: SSL/TLS certificates for HTTPS
  • Secrets Manager: Automatic secret rotation
  • Macie: Discover and protect sensitive data in S3

Threat Detection & Monitoring:

  • GuardDuty: Intelligent threat detection using ML
  • Security Hub: Centralized security findings
  • Inspector: Vulnerability scanning for EC2 and containers
  • Detective: Security investigation with ML
  • CloudTrail: API call logging and auditing
  • Config: Resource configuration tracking and compliance

Secure Connectivity:

  • VPN: Encrypted connection over internet
  • Direct Connect: Dedicated private connection (1-100 Gbps)
  • PrivateLink: Private access to AWS services without internet
  • Transit Gateway: Hub-and-spoke network topology

Decision Frameworks

When to use IAM Users vs Roles:

  • IAM Users: For human administrators who need console access
  • IAM Roles: For applications, EC2 instances, Lambda functions, cross-account access
  • Never: Hardcode access keys in code or share credentials

Choosing Encryption Solutions:

  • KMS with AWS managed keys: Simplest, AWS handles rotation
  • KMS with customer managed keys: More control, you manage rotation
  • CloudHSM: Regulatory compliance requiring dedicated hardware
  • Client-side encryption: Maximum control, you manage everything

Network Security Layers:

  1. Edge: CloudFront + WAF + Shield (DDoS protection)
  2. VPC: NACLs (subnet level) + Security Groups (instance level)
  3. Application: WAF rules, API Gateway throttling
  4. Data: Encryption at rest (KMS) and in transit (TLS)

Secure Connectivity Options:

Requirement Solution Use Case
Quick setup, encrypted Site-to-Site VPN Dev/test, temporary, <1 Gbps
Dedicated, high bandwidth Direct Connect Production, 1-100 Gbps, consistent latency
Private AWS service access PrivateLink Access S3, DynamoDB without internet
Multiple VPC connectivity Transit Gateway Hub-and-spoke, centralized routing

Common Exam Patterns

Pattern 1: "Most Secure" Questions

  • Look for: MFA, encryption, least privilege, private subnets
  • Eliminate: Public access, hardcoded credentials, overly permissive policies
  • Choose: Defense in depth with multiple layers

Pattern 2: "Compliance Requirements"

  • Look for: Audit trails (CloudTrail), compliance monitoring (Config), encryption (KMS)
  • Eliminate: Solutions without logging or encryption
  • Choose: Automated compliance checking and reporting

Pattern 3: "Secure Application Access"

  • Look for: Cognito (user pools), IAM roles (not users), API Gateway with authorization
  • Eliminate: Hardcoded credentials, IAM users for applications
  • Choose: Temporary credentials with automatic rotation

Pattern 4: "Data Protection"

  • Look for: Encryption at rest and in transit, KMS, ACM, Secrets Manager
  • Eliminate: Unencrypted data, plaintext secrets
  • Choose: AWS managed encryption services with automatic key rotation

Pattern 5: "Network Isolation"

  • Look for: Private subnets, security groups, NACLs, VPC endpoints
  • Eliminate: Public subnets for databases, overly permissive security groups
  • Choose: Layered security with least privilege network access

Self-Assessment Checklist

Test yourself before moving to the next chapter:

IAM & Access Management:

  • I can explain the difference between IAM users, groups, and roles
  • I understand identity-based vs resource-based policies
  • I can design a multi-account strategy with Organizations and SCPs
  • I know when to use IAM Identity Center for SSO
  • I can implement cross-account access with role assumption

Network Security:

  • I can design a VPC with public and private subnets
  • I understand the difference between security groups and NACLs
  • I know when to use VPN vs Direct Connect vs PrivateLink
  • I can implement WAF rules to protect web applications
  • I understand how to use VPC Flow Logs for security analysis

Data Protection:

  • I can explain when to use KMS vs CloudHSM
  • I understand how to encrypt data at rest and in transit
  • I know how to use Secrets Manager for credential rotation
  • I can implement S3 bucket policies for data access control
  • I understand data lifecycle and retention policies

Threat Detection:

  • I know what GuardDuty detects and how it works
  • I can explain when to use Macie for sensitive data discovery
  • I understand how to use Security Hub for centralized findings
  • I know how to enable and analyze CloudTrail logs
  • I can use Config for compliance monitoring

Compliance & Governance:

  • I understand the AWS shared responsibility model
  • I can use AWS Artifact to access compliance reports
  • I know how to implement automated compliance checking
  • I can design audit-ready architectures
  • I understand how to use Control Tower for governance

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-20 (IAM and access management)
  • Domain 1 Bundle 2: Questions 21-40 (Network security)
  • Domain 1 Bundle 3: Questions 41-60 (Data protection and compliance)
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you missed questions
  • Below 60%: Re-read the entire chapter and take detailed notes
  • Focus on:
    • IAM policy evaluation logic (explicit deny > explicit allow > implicit deny)
    • Security group vs NACL differences (stateful vs stateless)
    • Encryption key management (AWS managed vs customer managed)
    • VPC connectivity options (VPN vs Direct Connect vs PrivateLink)
    • Threat detection services (GuardDuty vs Macie vs Inspector)

Quick Reference Card

Copy this to your notes for quick review:

IAM Policy Evaluation:

  1. Explicit DENY (always wins)
  2. Explicit ALLOW (if no deny)
  3. Implicit DENY (default)

Security Groups vs NACLs:

Feature Security Group NACL
Level Instance Subnet
State Stateful Stateless
Rules Allow only Allow + Deny
Evaluation All rules Numbered order

Encryption Services:

  • At Rest: KMS (managed keys), CloudHSM (dedicated), EBS encryption, S3 encryption
  • In Transit: TLS/SSL, ACM (certificates), VPN (IPsec)
  • Secrets: Secrets Manager (rotation), Parameter Store (config)

Threat Detection:

  • GuardDuty: Intelligent threat detection (VPC Flow Logs, CloudTrail, DNS logs)
  • Macie: Sensitive data discovery in S3 (PII, credentials)
  • Inspector: Vulnerability scanning (EC2, containers, Lambda)
  • Security Hub: Centralized security findings from all services

Secure Connectivity:

  • VPN: $0.05/hour, up to 1.25 Gbps, encrypted over internet
  • Direct Connect: $0.30/hour (1 Gbps), dedicated, consistent latency
  • PrivateLink: $0.01/hour + data, private AWS service access
  • Transit Gateway: $0.05/hour + data, hub-and-spoke for multiple VPCs

Must Memorize:

  • Default VPC CIDR: 172.31.0.0/16
  • Security groups: Stateful, allow only, all rules evaluated
  • NACLs: Stateless, allow + deny, numbered order (lowest first)
  • IAM policy size limit: 2,048 characters (inline), 6,144 characters (managed)
  • KMS key rotation: Automatic every 365 days (AWS managed), manual (customer managed)
  • CloudTrail: 90 days in Event History (free), S3 for longer retention

Congratulations! You've completed Domain 1 (30% of exam). This is the most heavily weighted domain, so mastering this content is critical for exam success.

Next Chapter: 03_domain2_resilient_architectures - Design Resilient Architectures (26% of exam)


Chapter Summary

What We Covered

This chapter covered Domain 1: Design Secure Architectures (30% of exam), the most heavily weighted domain. You learned:

  • āœ… IAM Fundamentals: Users, groups, roles, policies, and the principle of least privilege
  • āœ… Access Management: Cross-account access, federation, IAM Identity Center, and STS
  • āœ… Multi-Account Strategy: AWS Organizations, SCPs, Control Tower, and account isolation
  • āœ… Network Security: VPC architecture, security groups, NACLs, and network segmentation
  • āœ… Secure Connectivity: VPN, Direct Connect, PrivateLink, and Transit Gateway
  • āœ… Threat Detection: GuardDuty, Macie, Inspector, and Security Hub
  • āœ… Application Security: WAF, Shield, API Gateway security, and ALB authentication
  • āœ… Data Protection: KMS encryption, Secrets Manager, ACM, and data lifecycle
  • āœ… Compliance: CloudTrail, Config, Audit Manager, and compliance frameworks

Critical Takeaways

  1. IAM Policy Evaluation: Explicit DENY always wins → Explicit ALLOW → Implicit DENY (default)
  2. Security Groups vs NACLs: Security groups are stateful (instance-level), NACLs are stateless (subnet-level)
  3. Encryption Strategy: Use KMS for at-rest encryption, TLS/SSL for in-transit, Secrets Manager for rotation
  4. Multi-Account Security: Use Organizations + SCPs for centralized governance, Control Tower for guardrails
  5. Network Segmentation: Public subnets for internet-facing resources, private subnets for backend, isolated subnets for data
  6. Threat Detection: GuardDuty for intelligent threats, Macie for sensitive data, Inspector for vulnerabilities
  7. Secure Connectivity: VPN for encrypted internet, Direct Connect for dedicated, PrivateLink for AWS services
  8. Zero Trust Principles: Never trust, always verify, least privilege, assume breach

Self-Assessment Checklist

Test yourself before moving on. Can you:

IAM & Access Management:

  • Explain the difference between IAM users, groups, and roles?
  • Describe how IAM policy evaluation works (deny, allow, default)?
  • Configure cross-account access using IAM roles?
  • Implement MFA for root and IAM users?
  • Use IAM Identity Center for SSO across multiple accounts?
  • Explain when to use resource-based vs identity-based policies?

Network Security:

  • Design a multi-tier VPC architecture with public and private subnets?
  • Configure security groups and NACLs correctly?
  • Explain the difference between stateful and stateless firewalls?
  • Implement VPC endpoints for private AWS service access?
  • Design secure connectivity using VPN or Direct Connect?
  • Use Transit Gateway for hub-and-spoke network topology?

Threat Detection & Response:

  • Configure GuardDuty for threat detection?
  • Use Macie to discover sensitive data in S3?
  • Set up Inspector for vulnerability scanning?
  • Aggregate findings in Security Hub?
  • Automate remediation using EventBridge and Lambda?

Data Protection:

  • Encrypt data at rest using KMS?
  • Implement encryption in transit using TLS/SSL?
  • Manage secrets using Secrets Manager with automatic rotation?
  • Configure S3 bucket encryption and access controls?
  • Use CloudTrail for audit logging and compliance?

Application Security:

  • Configure WAF rules to protect against common attacks?
  • Use Shield for DDoS protection?
  • Implement API Gateway authorization (IAM, Cognito, Lambda)?
  • Configure ALB authentication with Cognito?

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-50 (Expected score: 70%+ to proceed)
  • Domain 1 Bundle 2: Questions 51-100 (Expected score: 75%+ to proceed)
  • Domain 1 Bundle 3: Questions 101-150 (Expected score: 80%+ to proceed)

If you scored below 70%:

  • Review sections on IAM policy evaluation and cross-account access
  • Focus on security groups vs NACLs differences
  • Study encryption services (KMS, Secrets Manager, ACM)
  • Practice threat detection service selection (GuardDuty, Macie, Inspector)

If you scored 70-80%:

  • Review advanced topics: SCPs, Control Tower, PrivateLink
  • Study WAF rule configuration and DDoS mitigation
  • Practice multi-account architecture design
  • Focus on compliance and audit logging

If you scored 80%+:

  • Excellent! You're ready to move to Domain 2
  • Continue practicing with full practice tests
  • Review any specific topics where you made mistakes

Next Steps: Proceed to 03_domain2_resilient_architectures to learn about designing resilient architectures (26% of exam).


Chapter Summary

What We Covered

This comprehensive chapter explored the critical domain of designing secure architectures on AWS, covering 30% of the SAA-C03 exam content. We examined three major task areas:

Task 1.1: Design Secure Access to AWS Resources

  • āœ… IAM fundamentals: users, groups, roles, and policies
  • āœ… Multi-factor authentication and root account security
  • āœ… Cross-account access patterns and role switching
  • āœ… AWS Organizations and Service Control Policies
  • āœ… IAM Identity Center for centralized access management
  • āœ… Federation with SAML and OIDC providers
  • āœ… AWS Control Tower for multi-account governance

Task 1.2: Design Secure Workloads and Applications

  • āœ… VPC security architecture with security groups and NACLs
  • āœ… Network segmentation strategies (public/private subnets)
  • āœ… AWS WAF for application-layer protection
  • āœ… AWS Shield for DDoS protection
  • āœ… Amazon GuardDuty for threat detection
  • āœ… Amazon Macie for sensitive data discovery
  • āœ… VPN and Direct Connect for hybrid connectivity
  • āœ… VPC endpoints and PrivateLink for private connectivity

Task 1.3: Determine Appropriate Data Security Controls

  • āœ… AWS KMS for encryption key management
  • āœ… Encryption at rest for S3, EBS, RDS, and other services
  • āœ… Encryption in transit with TLS/SSL and ACM
  • āœ… Data lifecycle management and retention policies
  • āœ… AWS Backup for centralized backup management
  • āœ… Compliance frameworks and AWS Config
  • āœ… CloudTrail for audit logging and governance

Critical Takeaways

Security Best Practices:

  1. Principle of Least Privilege: Always grant minimum permissions necessary - start with deny all, then add specific allows
  2. Defense in Depth: Use multiple layers of security (IAM + security groups + NACLs + encryption + monitoring)
  3. Enable MFA Everywhere: Especially for root accounts and privileged users
  4. Encrypt Everything: Data at rest and in transit - use AWS KMS for centralized key management
  5. Monitor Continuously: Enable CloudTrail, GuardDuty, Config, and Security Hub for comprehensive visibility

IAM Key Concepts:

  • Identity-based policies attach to users/groups/roles; resource-based policies attach to resources
  • Explicit deny always wins in policy evaluation
  • Use roles for EC2 instances and Lambda functions (never embed credentials)
  • Cross-account access requires both trust policy and permissions policy
  • SCPs provide guardrails but don't grant permissions

Network Security Essentials:

  • Security groups are stateful (return traffic automatically allowed)
  • NACLs are stateless (must explicitly allow both inbound and outbound)
  • Use private subnets for databases and application servers
  • VPC endpoints eliminate internet gateway dependency for AWS services
  • PrivateLink enables private connectivity to third-party services

Encryption Fundamentals:

  • AWS-managed keys (SSE-S3, SSE-RDS) are easiest but least flexible
  • Customer-managed keys (CMK) in KMS provide full control and audit trail
  • Envelope encryption protects data encryption keys with master keys
  • Enable encryption by default for all new resources
  • Use ACM for SSL/TLS certificate management (automatic renewal)

Compliance and Governance:

  • AWS Config tracks resource configuration changes and compliance
  • CloudTrail logs all API calls for audit and forensics
  • AWS Backup provides centralized backup management across services
  • Use AWS Artifact to access compliance reports and agreements
  • Implement data residency controls with region restrictions

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

IAM and Access Management:

  • Explain the difference between IAM users, groups, and roles
  • Describe how IAM policy evaluation works (explicit deny, explicit allow, implicit deny)
  • Configure cross-account access using IAM roles
  • Implement MFA for root and privileged users
  • Design an AWS Organizations structure with SCPs
  • Set up IAM Identity Center for SSO
  • Configure SAML federation with an external identity provider

Network Security:

  • Design a multi-tier VPC architecture with public and private subnets
  • Configure security groups and NACLs correctly
  • Explain the difference between security groups (stateful) and NACLs (stateless)
  • Implement VPC endpoints for S3 and DynamoDB
  • Set up PrivateLink for private service connectivity
  • Configure AWS WAF rules to protect against common attacks
  • Design a hybrid network with VPN or Direct Connect

Data Protection:

  • Enable encryption at rest for S3, EBS, and RDS
  • Configure AWS KMS customer-managed keys
  • Implement encryption in transit with TLS/SSL
  • Set up AWS Certificate Manager for SSL certificates
  • Configure S3 bucket policies to enforce encryption
  • Implement data lifecycle policies for compliance
  • Set up AWS Backup for centralized backup management

Monitoring and Compliance:

  • Enable CloudTrail for all regions and validate log files
  • Configure AWS Config rules for compliance monitoring
  • Set up GuardDuty for threat detection
  • Use Macie to discover sensitive data in S3
  • Implement Security Hub for centralized security findings
  • Create AWS Config remediation actions
  • Design a compliance architecture for specific frameworks (HIPAA, PCI-DSS)

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 1 Bundle 1: Questions 1-20 (IAM basics, security groups, encryption fundamentals)
  • Security Services Bundle: Questions 1-15 (foundational security concepts)

Intermediate Level (Target: 70%+ correct):

  • Domain 1 Bundle 2: Questions 21-40 (cross-account access, VPC security, KMS)
  • Full Practice Test 1: Domain 1 questions (mixed difficulty)

Advanced Level (Target: 60%+ correct):

  • Domain 1 Bundle 3: Questions 41-50 (complex architectures, compliance, advanced IAM)
  • Full Practice Test 2: Domain 1 questions (challenging scenarios)

If you scored below target:

  • Below 60%: Review the entire chapter, focus on fundamentals
  • 60-70%: Review specific weak areas identified in practice tests
  • 70-80%: Focus on advanced topics and edge cases
  • Above 80%: You're ready! Move to next domain

Quick Reference Card

Copy this to your notes for quick review:

IAM Quick Facts

  • Policy Evaluation: Explicit Deny > Explicit Allow > Implicit Deny
  • Root Account: Enable MFA, don't use for daily tasks, lock away credentials
  • Roles: Use for EC2, Lambda, cross-account access (never embed credentials)
  • SCPs: Guardrails only, don't grant permissions, affect all accounts in OU
  • Federation: SAML for enterprise, OIDC for web/mobile, Cognito for app users

Network Security Quick Facts

  • Security Groups: Stateful, allow rules only, instance-level
  • NACLs: Stateless, allow + deny rules, subnet-level, numbered rules (lowest first)
  • VPC Endpoints: Gateway (S3, DynamoDB), Interface (most other services)
  • PrivateLink: Private connectivity without internet gateway or NAT
  • WAF: Layer 7 protection, rate limiting, geo-blocking, SQL injection prevention

Encryption Quick Facts

  • At Rest: S3 (SSE-S3, SSE-KMS, SSE-C), EBS (KMS), RDS (KMS)
  • In Transit: TLS/SSL, ACM for certificates, HTTPS for S3
  • KMS: Customer-managed keys, automatic rotation, audit trail, key policies
  • Envelope Encryption: Data key encrypts data, master key encrypts data key
  • Default Encryption: Enable for S3 buckets, EBS volumes, RDS instances

Monitoring Quick Facts

  • CloudTrail: API call logging, 90-day history, S3 for long-term storage
  • Config: Resource configuration tracking, compliance rules, remediation
  • GuardDuty: Threat detection, ML-based, VPC Flow Logs + DNS logs + CloudTrail
  • Macie: Sensitive data discovery in S3, PII detection, data classification
  • Security Hub: Centralized security findings, compliance checks, integrations

Common Exam Scenarios

  • Scenario: Least privilege access → Solution: Start with deny all, add specific allows, use roles
  • Scenario: Cross-account access → Solution: IAM role with trust policy + permissions policy
  • Scenario: Encrypt data at rest → Solution: Enable KMS encryption for all storage services
  • Scenario: DDoS protection → Solution: Shield Standard (free) + Shield Advanced (paid) + WAF
  • Scenario: Audit API calls → Solution: Enable CloudTrail in all regions, validate log files
  • Scenario: Compliance monitoring → Solution: AWS Config rules + Security Hub + automated remediation
  • Scenario: Private connectivity → Solution: VPC endpoints (AWS services) or PrivateLink (third-party)

Next Chapter: 03_domain2_resilient_architectures - Design Resilient Architectures (26% of exam)

Chapter Summary

What We Covered

This chapter covered Domain 1: Design Secure Architectures (30% of the exam), focusing on three critical task areas:

āœ… Task 1.1: Design secure access to AWS resources

  • IAM users, groups, roles, and policies
  • Multi-factor authentication (MFA) and root user security
  • Cross-account access and role switching
  • AWS Organizations and Service Control Policies (SCPs)
  • Federation with SAML and OIDC
  • IAM Identity Center (AWS SSO) for centralized access
  • Principle of least privilege and permissions boundaries

āœ… Task 1.2: Design secure workloads and applications

  • VPC security architecture (security groups, NACLs, subnets)
  • Network segmentation and isolation strategies
  • AWS WAF for application-layer protection
  • AWS Shield for DDoS protection
  • GuardDuty for threat detection
  • Macie for sensitive data discovery
  • Secrets Manager for credential management
  • VPN and Direct Connect for hybrid connectivity
  • VPC endpoints and PrivateLink for private connectivity

āœ… Task 1.3: Determine appropriate data security controls

  • Encryption at rest with AWS KMS
  • Encryption in transit with TLS/SSL and ACM
  • S3 encryption options (SSE-S3, SSE-KMS, SSE-C)
  • EBS and RDS encryption
  • Data backup and replication strategies
  • CloudTrail for API logging and audit trails
  • AWS Config for compliance monitoring
  • Data lifecycle management and retention policies

Critical Takeaways

Security is a shared responsibility between AWS and you:

  • AWS: Physical security, infrastructure, managed service security
  • You: Data encryption, access control, network configuration, application security

Key Security Principles:

  1. Least Privilege: Grant only the minimum permissions needed
  2. Defense in Depth: Multiple layers of security controls
  3. Encryption Everywhere: Encrypt data at rest and in transit
  4. Audit Everything: Enable logging and monitoring for all resources
  5. Automate Security: Use AWS Config, Security Hub, and automation for compliance

Most Important Services to Master:

  • IAM: Foundation of all AWS security - roles, policies, MFA
  • KMS: Encryption key management for all AWS services
  • VPC: Network isolation and security controls
  • CloudTrail: Audit trail for all API calls
  • GuardDuty: Automated threat detection
  • Secrets Manager: Secure credential storage and rotation

Common Exam Patterns:

  • Questions about least privilege → Use roles with specific permissions, not broad access
  • Questions about cross-account access → IAM roles with trust policies
  • Questions about encryption → Enable KMS encryption for all storage services
  • Questions about DDoS protection → Shield Standard (free) + Shield Advanced + WAF
  • Questions about compliance → CloudTrail + Config + Security Hub
  • Questions about private connectivity → VPC endpoints or PrivateLink

Self-Assessment Checklist

Test yourself before moving to the next chapter. You should be able to:

IAM and Access Management

  • Explain the difference between IAM users, groups, and roles
  • Describe when to use identity-based vs resource-based policies
  • Configure cross-account access using IAM roles
  • Implement MFA for root and IAM users
  • Design a multi-account strategy with AWS Organizations
  • Explain how Service Control Policies (SCPs) work
  • Configure federation with SAML or OIDC
  • Use IAM Identity Center for centralized access management

Network Security

  • Design a VPC with public and private subnets
  • Configure security groups and NACLs correctly
  • Explain the difference between security groups and NACLs
  • Implement network segmentation strategies
  • Configure VPC endpoints for AWS services
  • Set up AWS PrivateLink for third-party services
  • Design VPN or Direct Connect for hybrid connectivity
  • Implement AWS WAF rules for application protection

Data Protection

  • Enable encryption at rest for S3, EBS, and RDS
  • Configure AWS KMS customer-managed keys
  • Implement encryption in transit with TLS/SSL
  • Use AWS Certificate Manager for SSL/TLS certificates
  • Configure S3 bucket policies for encryption enforcement
  • Set up Secrets Manager for credential rotation
  • Enable CloudTrail for API logging
  • Configure AWS Config for compliance monitoring

Threat Detection and Monitoring

  • Enable GuardDuty for threat detection
  • Configure Macie for sensitive data discovery
  • Set up Security Hub for centralized security findings
  • Use CloudWatch for security monitoring and alerting
  • Implement VPC Flow Logs for network traffic analysis
  • Configure AWS Config rules for compliance checks

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-25 (IAM and access management)
  • Domain 1 Bundle 2: Questions 26-50 (Network security)
  • Domain 1 Bundle 3: Questions 51-75 (Data protection and monitoring)
  • Security Services Bundle: All questions (comprehensive security review)

Expected Score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you struggled, then retry
  • Below 60%: Re-read this entire chapter, focusing on diagrams and examples
  • Focus on understanding WHY certain solutions are correct, not just memorizing

Quick Reference Card

Copy this to your notes for quick review:

IAM Quick Facts

  • Users: Long-term credentials, for humans
  • Roles: Temporary credentials, for services and cross-account
  • Groups: Collection of users with same permissions
  • Policies: JSON documents defining permissions
  • MFA: Required for root user and privileged users
  • Policy Evaluation: Explicit deny > Explicit allow > Implicit deny

Network Security Quick Facts

  • Security Groups: Stateful, allow rules only, instance-level
  • NACLs: Stateless, allow + deny rules, subnet-level
  • VPC Endpoints: Private connectivity to AWS services (no internet)
  • PrivateLink: Private connectivity to third-party services
  • WAF: Layer 7 protection, rate limiting, SQL injection prevention
  • Shield: DDoS protection (Standard free, Advanced paid)

Encryption Quick Facts

  • At Rest: KMS for S3, EBS, RDS, DynamoDB
  • In Transit: TLS/SSL, ACM for certificates
  • KMS: Customer-managed keys, automatic rotation, audit trail
  • Envelope Encryption: Data key encrypts data, master key encrypts data key
  • S3 Encryption: SSE-S3 (AWS-managed), SSE-KMS (customer-managed), SSE-C (customer-provided)

Monitoring Quick Facts

  • CloudTrail: API call logging, 90-day history, S3 for long-term
  • Config: Resource configuration tracking, compliance rules
  • GuardDuty: Threat detection, ML-based, VPC Flow + DNS + CloudTrail
  • Macie: Sensitive data discovery in S3, PII detection
  • Security Hub: Centralized security findings, compliance checks

Decision Points

  • Least privilege access → Start with deny all, add specific allows, use roles
  • Cross-account access → IAM role with trust policy + permissions policy
  • Encrypt data at rest → Enable KMS encryption for all storage services
  • DDoS protection → Shield Standard (free) + Shield Advanced (paid) + WAF
  • Audit API calls → Enable CloudTrail in all regions, validate log files
  • Compliance monitoring → AWS Config rules + Security Hub + automated remediation
  • Private connectivity → VPC endpoints (AWS services) or PrivateLink (third-party)

Congratulations! You've completed Domain 1: Design Secure Architectures. This is the largest domain (30% of the exam), so mastering this content is critical for exam success.

Next Chapter: 03_domain2_resilient_architectures - Design Resilient Architectures (26% of exam)


Chapter Summary

What We Covered

This chapter covered the three major task areas of Domain 1: Design Secure Architectures (30% of exam):

Task 1.1: Design Secure Access to AWS Resources

  • āœ… IAM users, groups, roles, and policies
  • āœ… Multi-factor authentication (MFA) and root user security
  • āœ… Cross-account access and role switching
  • āœ… AWS Organizations and Service Control Policies (SCPs)
  • āœ… IAM Identity Center (AWS SSO) for centralized access
  • āœ… Federation with SAML 2.0 and OIDC
  • āœ… AWS Control Tower for multi-account governance

Task 1.2: Design Secure Workloads and Applications

  • āœ… VPC security architecture (security groups, NACLs, subnets)
  • āœ… Network segmentation strategies
  • āœ… AWS WAF for application protection
  • āœ… AWS Shield for DDoS protection
  • āœ… Amazon GuardDuty for threat detection
  • āœ… Amazon Macie for sensitive data discovery
  • āœ… VPN and Direct Connect for hybrid connectivity
  • āœ… VPC endpoints and PrivateLink for private connectivity

Task 1.3: Determine Appropriate Data Security Controls

  • āœ… AWS KMS for encryption key management
  • āœ… Encryption at rest (S3, EBS, RDS, DynamoDB)
  • āœ… Encryption in transit (TLS/SSL, ACM)
  • āœ… Data backup and replication strategies
  • āœ… AWS Backup for centralized backup management
  • āœ… CloudTrail for API logging and audit trails
  • āœ… AWS Config for compliance monitoring

Critical Takeaways

  1. Principle of Least Privilege: Always start with minimum permissions and add only what's needed. Use IAM roles instead of long-term credentials whenever possible.

  2. Defense in Depth: Layer multiple security controls (security groups + NACLs + WAF + Shield) for comprehensive protection.

  3. Encryption Everywhere: Enable encryption at rest for all storage services and encryption in transit for all data transfers. Use AWS KMS for centralized key management.

  4. Audit and Monitor: Enable CloudTrail in all regions, use Config for compliance, and GuardDuty for threat detection. Centralize findings in Security Hub.

  5. Secure by Default: Use AWS managed services that provide built-in security features. Enable MFA for all privileged accounts, especially root users.

  6. Network Isolation: Use private subnets for backend resources, public subnets only for internet-facing components. Use VPC endpoints to avoid internet traffic.

  7. Identity Federation: For enterprise environments, federate with existing identity providers (Active Directory, Okta) rather than creating duplicate IAM users.

  8. Compliance Automation: Use AWS Config rules and Security Hub to continuously monitor compliance and automatically remediate violations.

Self-Assessment Checklist

Test yourself before moving on. Can you:

IAM and Access Management

  • Explain the difference between IAM users, groups, and roles?
  • Describe how to implement cross-account access securely?
  • Configure MFA for root and IAM users?
  • Create an IAM policy with conditions and variables?
  • Explain when to use resource-based vs identity-based policies?
  • Implement least privilege access using permissions boundaries?
  • Set up AWS Organizations with SCPs for multi-account governance?

Network Security

  • Design a multi-tier VPC architecture with proper security?
  • Explain the difference between security groups and NACLs?
  • Configure AWS WAF rules to protect against common attacks?
  • Implement DDoS protection using Shield and WAF?
  • Set up VPC endpoints for private AWS service access?
  • Design a hybrid network with VPN or Direct Connect?
  • Explain when to use PrivateLink vs VPC peering?

Data Protection

  • Enable encryption at rest for S3, EBS, RDS, and DynamoDB?
  • Configure KMS customer-managed keys with proper key policies?
  • Implement encryption in transit using TLS/SSL and ACM?
  • Set up automated backup strategies using AWS Backup?
  • Configure S3 Object Lock for compliance requirements?
  • Enable CloudTrail logging and log file validation?
  • Use AWS Config to monitor resource compliance?

Threat Detection and Response

  • Enable GuardDuty for threat detection?
  • Configure Macie to discover sensitive data in S3?
  • Set up Security Hub for centralized security findings?
  • Implement automated remediation using EventBridge and Lambda?
  • Use Systems Manager Session Manager for secure instance access?

Practice Questions

Try these from your practice test bundles:

Beginner Level (Build Confidence):

  • Domain 1 Bundle 1: Questions 1-20
  • Security Services Bundle: Questions 1-15
  • Expected score: 70%+ to proceed

Intermediate Level (Test Understanding):

  • Domain 1 Bundle 2: Questions 1-20
  • Full Practice Test 1: Domain 1 questions
  • Expected score: 75%+ to proceed

Advanced Level (Challenge Yourself):

  • Domain 1 Bundle 3: Questions 1-20
  • Expected score: 70%+ to proceed

If you scored below target:

  • Below 60%: Review the entire chapter again, focus on fundamentals
  • 60-70%: Review specific sections where you struggled
  • 70-80%: Review quick facts and decision points
  • 80%+: You're ready! Move to next domain

Quick Reference Card

Copy this to your notes for quick review:

IAM Essentials

  • Users: Long-term credentials, use for humans
  • Roles: Temporary credentials, use for services and cross-account
  • Groups: Collection of users, attach policies to groups
  • Policies: JSON documents defining permissions
  • MFA: Required for root and privileged users
  • STS: Temporary credentials, 15 min - 12 hours

Network Security

  • Security Groups: Stateful, allow rules only, instance-level
  • NACLs: Stateless, allow + deny rules, subnet-level
  • WAF: Layer 7 protection, rate limiting, geo-blocking
  • Shield Standard: Free DDoS protection (Layer 3/4)
  • Shield Advanced: $3,000/month, Layer 7 protection, DDoS Response Team

Encryption Services

  • KMS: Key management, automatic rotation, audit trail
  • ACM: Free SSL/TLS certificates, automatic renewal
  • CloudHSM: Dedicated hardware, FIPS 140-2 Level 3
  • Secrets Manager: Automatic rotation, RDS integration

Monitoring Services

  • CloudTrail: API call logging, 90-day free history
  • Config: Resource configuration tracking, compliance rules
  • GuardDuty: Threat detection, ML-based, $4.50/million events
  • Macie: Sensitive data discovery, PII detection
  • Security Hub: Centralized findings, compliance frameworks

Key Decision Points

Scenario Solution
Cross-account access IAM role with trust policy
Encrypt data at rest Enable KMS encryption
DDoS protection Shield Standard + WAF
Private AWS service access VPC endpoints (Gateway or Interface)
Audit API calls CloudTrail in all regions
Compliance monitoring AWS Config rules + Security Hub
Threat detection GuardDuty + automated remediation
Sensitive data discovery Macie for S3 buckets

Chapter Summary

What We Covered

This chapter explored the critical domain of Design Secure Architectures (30% of the exam), covering three major task areas:

āœ… Task 1.1: Design secure access to AWS resources

  • IAM users, groups, roles, and policies
  • Multi-factor authentication (MFA) and password policies
  • Cross-account access and role switching
  • AWS Organizations and Service Control Policies (SCPs)
  • Federation with SAML and OIDC
  • IAM Identity Center (AWS SSO)
  • Least privilege principle and permissions boundaries

āœ… Task 1.2: Design secure workloads and applications

  • VPC security architecture (security groups, NACLs)
  • Network segmentation (public/private subnets)
  • AWS WAF, Shield, and DDoS protection
  • GuardDuty threat detection and Macie data discovery
  • Secrets Manager and Parameter Store
  • VPN and Direct Connect for hybrid connectivity
  • VPC endpoints and PrivateLink

āœ… Task 1.3: Determine appropriate data security controls

  • Encryption at rest with KMS
  • Encryption in transit with ACM/TLS
  • Key management and rotation
  • S3 encryption options and bucket policies
  • RDS and EBS encryption
  • Backup strategies and compliance
  • CloudTrail logging and Config rules

Critical Takeaways

  1. IAM Best Practices: Always use roles for applications, enable MFA for privileged users, follow least privilege, and never share credentials.

  2. Defense in Depth: Layer multiple security controls (security groups + NACLs + WAF + Shield) for comprehensive protection.

  3. Encryption Everywhere: Encrypt data at rest (KMS) and in transit (TLS/SSL), with proper key management and rotation.

  4. Network Segmentation: Use public subnets for internet-facing resources, private subnets for backend, and VPC endpoints for AWS service access.

  5. Monitoring and Compliance: Enable CloudTrail in all regions, use Config for compliance, GuardDuty for threats, and Security Hub for centralized visibility.

  6. Cross-Account Access: Use IAM roles with trust policies, not access keys, for secure cross-account access.

  7. Secrets Management: Never hardcode credentials - use Secrets Manager with automatic rotation or Parameter Store for configuration.

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between IAM users, groups, and roles
  • I understand when to use security groups vs NACLs
  • I can design a multi-tier VPC with proper security controls
  • I know how to implement encryption at rest and in transit
  • I understand cross-account access patterns with IAM roles
  • I can explain the purpose of WAF, Shield, GuardDuty, and Macie
  • I know when to use VPC endpoints vs internet gateway
  • I understand KMS key policies and grants
  • I can design a compliant architecture with proper logging
  • I know how to implement least privilege access

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-50 (Expected score: 70%+)
  • Security Services Bundle: Questions 1-50 (Expected score: 70%+)
  • Full Practice Test 1: Domain 1 questions (Expected score: 75%+)

If you scored below 70%:

  • Review sections on IAM policies and evaluation logic
  • Focus on VPC security architecture patterns
  • Study encryption options and when to use each
  • Practice identifying security requirements from scenarios

Quick Reference Card

IAM Essentials:

  • Users: Long-term credentials for people
  • Roles: Temporary credentials for applications/services
  • Groups: Collection of users with common permissions
  • Policies: JSON documents defining permissions

VPC Security:

  • Security Groups: Stateful, instance-level, allow rules only
  • NACLs: Stateless, subnet-level, allow and deny rules
  • VPC Endpoints: Private access to AWS services (Gateway for S3/DynamoDB, Interface for others)

Encryption:

  • At Rest: KMS (CMK or AWS-managed), S3 SSE, EBS encryption
  • In Transit: TLS/SSL with ACM certificates
  • Key Rotation: Automatic for AWS-managed, manual for customer-managed

Security Services:

  • WAF: Layer 7 protection, rate limiting, SQL injection/XSS blocking
  • Shield: DDoS protection (Standard free, Advanced $3K/month)
  • GuardDuty: Threat detection using ML ($4.50/million events)
  • Macie: Sensitive data discovery in S3
  • Security Hub: Centralized security findings

Decision Points:

  • Need cross-account access? → IAM role with trust policy
  • Need to encrypt data? → Enable KMS encryption
  • Need DDoS protection? → Shield Standard + WAF
  • Need private AWS access? → VPC endpoints
  • Need to audit API calls? → CloudTrail
  • Need compliance monitoring? → Config rules
  • Need threat detection? → GuardDuty
  • Need to find sensitive data? → Macie

Next Chapter: Proceed to 03_domain2_resilient_architectures to learn about designing resilient and highly available architectures.

Chapter Summary

What We Covered

This chapter covered the critical security concepts for AWS Solutions Architect certification, representing 30% of the exam content. You learned:

  • āœ… IAM Fundamentals: Users, groups, roles, policies, and the principle of least privilege
  • āœ… Advanced IAM: Cross-account access, federation, IAM Identity Center, and SCPs
  • āœ… Network Security: VPC architecture, security groups, NACLs, and network segmentation
  • āœ… Application Security: WAF, Shield, GuardDuty, Macie, and threat protection
  • āœ… Data Protection: KMS encryption, ACM certificates, and data lifecycle management
  • āœ… Compliance: CloudTrail, Config, Audit Manager, and governance frameworks

Critical Takeaways

  1. IAM Best Practices: Always use roles for applications, enable MFA for privileged users, implement least privilege, and rotate credentials regularly
  2. Defense in Depth: Layer security controls (IAM + Security Groups + NACLs + WAF + encryption) for comprehensive protection
  3. Encryption Everywhere: Encrypt data at rest with KMS, encrypt in transit with TLS/SSL, and manage keys with proper access controls
  4. Network Segmentation: Use public subnets for internet-facing resources, private subnets for backend, and VPC endpoints for AWS service access
  5. Automated Security: Leverage GuardDuty for threat detection, Macie for data discovery, and Config for compliance monitoring
  6. Cross-Account Strategy: Use Organizations with SCPs for centralized governance, IAM roles for access, and Control Tower for multi-account management

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

IAM & Access Control:

  • Explain the difference between IAM users, groups, and roles
  • Design a cross-account access strategy using IAM roles
  • Implement federation with SAML or OIDC providers
  • Configure SCPs to restrict actions across an organization
  • Troubleshoot IAM policy evaluation and permission issues

Network Security:

  • Design a multi-tier VPC architecture with proper segmentation
  • Configure security groups and NACLs for defense in depth
  • Implement VPC endpoints for private AWS service access
  • Set up VPN or Direct Connect for hybrid connectivity
  • Analyze VPC Flow Logs to identify security issues

Application & Data Security:

  • Configure WAF rules to protect against common attacks
  • Implement Shield Advanced for DDoS protection
  • Set up GuardDuty and respond to security findings
  • Use Secrets Manager for credential rotation
  • Encrypt data at rest with KMS and in transit with ACM

Compliance & Governance:

  • Enable CloudTrail for API logging and log file validation
  • Create Config rules for compliance monitoring
  • Implement backup strategies with AWS Backup
  • Design architectures that meet regulatory requirements
  • Use Security Hub for centralized security management

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-25 (IAM and access control)
  • Domain 1 Bundle 2: Questions 26-50 (Network and application security)
  • Domain 1 Bundle 3: Questions 51-75 (Data protection and compliance)
  • Security Services Bundle: All questions
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review sections on IAM policy evaluation and cross-account access
  • Practice designing VPC architectures with proper security layers
  • Focus on understanding when to use each security service (WAF vs Shield vs GuardDuty)
  • Revisit encryption concepts and key management best practices

Quick Reference Card

Key IAM Concepts:

  • Users: Long-term credentials for people
  • Groups: Collections of users with common permissions
  • Roles: Temporary credentials for applications/services
  • Policies: JSON documents defining permissions
  • Trust Policy: Who can assume a role
  • Permissions Boundary: Maximum permissions limit

Network Security Layers:

  1. Security Groups: Stateful, instance-level, allow only
  2. NACLs: Stateless, subnet-level, allow and deny
  3. WAF: Layer 7 application protection
  4. Shield: DDoS protection
  5. Network Firewall: Advanced traffic filtering

Encryption Services:

  • KMS: Key management, envelope encryption
  • ACM: SSL/TLS certificate management
  • Secrets Manager: Credential rotation
  • S3 SSE: Server-side encryption (SSE-S3, SSE-KMS, SSE-C)
  • EBS Encryption: Transparent encryption for volumes

Security Monitoring:

  • CloudTrail: API call logging
  • GuardDuty: Threat detection ($4.50/million events)
  • Macie: Sensitive data discovery ($1/GB scanned)
  • Security Hub: Centralized findings
  • Config: Resource compliance tracking

Common Exam Scenarios:

  • Cross-account access → IAM role with trust policy
  • Encrypt S3 bucket → Enable SSE-KMS with bucket policy
  • DDoS protection → Shield Standard + WAF rate limiting
  • Private AWS access → VPC endpoints (Gateway or Interface)
  • Audit API calls → CloudTrail with log file validation
  • Compliance monitoring → Config rules + Security Hub
  • Threat detection → GuardDuty + EventBridge automation

You're ready to proceed when you can:

  • Design secure multi-tier architectures from scratch
  • Troubleshoot IAM permission issues using policy evaluation logic
  • Choose the right security service for each threat scenario
  • Implement encryption for data at rest and in transit
  • Configure network security with defense in depth

Next: Move to Chapter 2: Resilient Architectures to learn about high availability and fault tolerance.


Chapter Summary

What We Covered

This chapter covered the essential concepts for designing secure architectures on AWS, which accounts for 30% of the SAA-C03 exam (the largest domain). We explored three major task areas:

Task 1.1: Design Secure Access to AWS Resources

  • āœ… IAM users, groups, roles, and policies
  • āœ… Multi-factor authentication (MFA) and password policies
  • āœ… IAM Identity Center (AWS SSO) for centralized access
  • āœ… Cross-account access and role switching
  • āœ… AWS Organizations and Service Control Policies (SCPs)
  • āœ… AWS Control Tower for multi-account governance
  • āœ… Federation with SAML 2.0 and OIDC
  • āœ… AWS STS for temporary credentials
  • āœ… Resource-based policies and permissions boundaries
  • āœ… Least privilege principle and policy evaluation logic

Task 1.2: Design Secure Workloads and Applications

  • āœ… VPC security architecture (public/private subnets)
  • āœ… Security groups and Network ACLs
  • āœ… NAT Gateway and Internet Gateway
  • āœ… VPC endpoints (Gateway and Interface)
  • āœ… AWS PrivateLink for private connectivity
  • āœ… VPN and Direct Connect for hybrid connectivity
  • āœ… AWS WAF for web application protection
  • āœ… AWS Shield for DDoS protection
  • āœ… Amazon GuardDuty for threat detection
  • āœ… Amazon Macie for sensitive data discovery
  • āœ… AWS Secrets Manager for credential management
  • āœ… AWS Network Firewall for advanced filtering
  • āœ… VPC Flow Logs for network monitoring

Task 1.3: Determine Appropriate Data Security Controls

  • āœ… AWS KMS for encryption key management
  • āœ… Encryption at rest (S3, EBS, RDS, DynamoDB)
  • āœ… Encryption in transit (TLS/SSL with ACM)
  • āœ… S3 bucket encryption and policies
  • āœ… S3 Object Lock for compliance
  • āœ… S3 Versioning and MFA Delete
  • āœ… AWS CloudTrail for API logging
  • āœ… AWS Config for compliance monitoring
  • āœ… AWS Backup for centralized backup management
  • āœ… Key rotation and certificate renewal
  • āœ… Data classification and lifecycle policies

Critical Takeaways

  1. Least Privilege: Always grant the minimum permissions necessary. Start with deny-all, then add specific permissions. Use IAM Access Analyzer to identify overly permissive policies.

  2. IAM Policy Evaluation: Explicit Deny > Explicit Allow > Implicit Deny. If any policy has an explicit deny, access is denied regardless of allows.

  3. MFA Everywhere: Enable MFA for root user (mandatory), IAM users with console access, and privileged operations (like S3 MFA Delete).

  4. Root User Protection: Don't use root user for daily tasks. Enable MFA, delete access keys, use only for account-level tasks (billing, account closure).

  5. Cross-Account Access: Use IAM roles with trust policies, not IAM users with access keys. Roles provide temporary credentials and are more secure.

  6. Service Control Policies: SCPs set permission guardrails for entire AWS Organizations. They don't grant permissions, only limit what IAM policies can grant.

  7. Security Groups vs NACLs: Security groups are stateful (return traffic automatic), NACLs are stateless (must allow both directions). Security groups support allow rules only, NACLs support both allow and deny.

  8. VPC Endpoints: Gateway endpoints (S3, DynamoDB) are free and use route tables. Interface endpoints (most services) cost $0.01/hour + data transfer but provide private IPs.

  9. AWS WAF: Protects against common web exploits (SQL injection, XSS). Use managed rules for quick deployment, custom rules for specific needs. Costs $5/month + $1/rule + $0.60/million requests.

  10. AWS Shield: Standard (free, automatic DDoS protection), Advanced ($3,000/month, enhanced protection + DDoS Response Team + cost protection).

  11. GuardDuty: Threat detection using ML, analyzes VPC Flow Logs, CloudTrail, DNS logs. Costs $4.50/million events. Findings can trigger automated remediation via EventBridge.

  12. Secrets Manager: Automatic rotation for RDS, Redshift, DocumentDB. Costs $0.40/secret/month + $0.05/10,000 API calls. Use for database credentials, API keys, OAuth tokens.

  13. KMS Encryption: Customer Managed Keys (CMK) give full control, AWS Managed Keys are free but limited control. CMK costs $1/month + $0.03/10,000 requests.

  14. S3 Encryption: SSE-S3 (free, AWS-managed keys), SSE-KMS (CMK control, audit trail), SSE-C (customer-provided keys). Enable default encryption on buckets.

  15. S3 Object Lock: WORM (Write Once Read Many) for compliance. Governance mode (can be overridden with permissions), Compliance mode (cannot be deleted even by root).

  16. CloudTrail: Logs all API calls, essential for security auditing. Enable log file validation to detect tampering. Store logs in separate security account.

  17. Encryption in Transit: Use TLS 1.2+ for all connections. ACM provides free SSL/TLS certificates with automatic renewal. Use ALB or CloudFront for TLS termination.

  18. Defense in Depth: Layer multiple security controls (IAM + Security Groups + NACLs + WAF + Encryption). If one layer fails, others provide protection.

  19. Shared Responsibility Model: AWS secures infrastructure (physical, network, hypervisor). You secure data, applications, IAM, OS, network configuration.

  20. Compliance: Use AWS Artifact for compliance reports, Config for continuous compliance monitoring, Security Hub for centralized security findings.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

IAM and Access Management:

  • Create IAM policies with appropriate permissions
  • Explain IAM policy evaluation logic (Deny > Allow > Implicit Deny)
  • Configure cross-account access using IAM roles
  • Set up IAM Identity Center for SSO
  • Implement MFA for users and root account
  • Design least privilege access policies
  • Use IAM Access Analyzer to identify overly permissive policies
  • Configure Service Control Policies in AWS Organizations

Network Security:

  • Design multi-tier VPC architecture with public/private subnets
  • Configure security groups and NACLs correctly
  • Explain the difference between stateful and stateless firewalls
  • Implement VPC endpoints to secure AWS service access
  • Configure AWS PrivateLink for private connectivity
  • Set up VPN or Direct Connect for hybrid connectivity
  • Use VPC Flow Logs for network monitoring
  • Implement AWS Network Firewall for advanced filtering

Application Security:

  • Configure AWS WAF to protect against web exploits
  • Implement AWS Shield for DDoS protection
  • Set up GuardDuty for threat detection
  • Configure Macie for sensitive data discovery
  • Use Secrets Manager for credential rotation
  • Implement API Gateway authorization (IAM, Cognito, Lambda)
  • Configure ALB authentication with Cognito
  • Use Systems Manager Session Manager for secure instance access

Data Security:

  • Configure KMS customer managed keys
  • Implement encryption at rest for S3, EBS, RDS, DynamoDB
  • Enable encryption in transit with TLS/SSL
  • Configure S3 bucket policies for encryption enforcement
  • Implement S3 Object Lock for compliance
  • Set up S3 Versioning and MFA Delete
  • Configure CloudTrail for API logging
  • Use AWS Config for compliance monitoring
  • Implement AWS Backup for centralized backup management
  • Configure automatic key rotation

Security Monitoring:

  • Enable CloudTrail with log file validation
  • Configure GuardDuty findings and automated remediation
  • Use Security Hub for centralized security findings
  • Implement Config rules for compliance monitoring
  • Analyze VPC Flow Logs for security incidents
  • Use CloudWatch Logs for application security monitoring
  • Configure EventBridge for security automation

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-25 (Focus: IAM and access management)
  • Domain 1 Bundle 2: Questions 26-50 (Focus: Network security)
  • Domain 1 Bundle 3: Questions 51-75 (Focus: Data security)
  • Full Practice Test 1: Domain 1 questions (Mixed difficulty)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review IAM policy evaluation logic
  • Focus on security groups vs NACLs differences
  • Study VPC endpoint types and use cases
  • Practice KMS encryption scenarios
  • Review AWS WAF and Shield features

Quick Reference Card

Copy this to your notes for quick review:

IAM Essentials:

  • Root User: Enable MFA, delete access keys, use only for account tasks
  • IAM Users: For individual people, enable MFA for console access
  • IAM Groups: Assign permissions to groups, add users to groups
  • IAM Roles: For AWS services, cross-account access, temporary credentials
  • IAM Policies: JSON documents, Effect (Allow/Deny), Action, Resource, Condition
  • Policy Evaluation: Explicit Deny > Explicit Allow > Implicit Deny

Network Security:

  • Security Groups: Stateful, allow rules only, instance-level
  • NACLs: Stateless, allow and deny rules, subnet-level
  • VPC Endpoints (Gateway): S3, DynamoDB, free, use route tables
  • VPC Endpoints (Interface): Most services, $0.01/hour, private IPs
  • NAT Gateway: Outbound internet for private subnets, $0.045/hour
  • Internet Gateway: Bidirectional internet for public subnets, free

Security Services:

  • AWS WAF: Web application firewall, $5/month + $1/rule + $0.60/million requests
  • AWS Shield Standard: Free, automatic DDoS protection
  • AWS Shield Advanced: $3,000/month, enhanced DDoS protection + DRT
  • GuardDuty: Threat detection, $4.50/million events
  • Macie: Sensitive data discovery, $1/GB scanned
  • Security Hub: Centralized findings, $0.0010/check/region/month
  • Inspector: Vulnerability scanning, $0.30/assessment

Encryption:

  • KMS CMK: $1/month + $0.03/10,000 requests
  • S3 SSE-S3: Free, AWS-managed keys
  • S3 SSE-KMS: CMK control, audit trail
  • S3 SSE-C: Customer-provided keys
  • EBS Encryption: Transparent, uses KMS
  • RDS Encryption: At-rest encryption, uses KMS
  • ACM: Free SSL/TLS certificates, automatic renewal

Secrets Management:

  • Secrets Manager: $0.40/secret/month, automatic rotation
  • Parameter Store: Free (Standard), $0.05/advanced parameter/month
  • KMS: Encrypt secrets, $1/CMK/month

Monitoring & Compliance:

  • CloudTrail: API logging, $2/100,000 events (after first copy)
  • Config: Compliance monitoring, $0.003/configuration item
  • VPC Flow Logs: Network traffic logging, CloudWatch Logs pricing
  • AWS Backup: Centralized backup, storage + restore costs

Common Security Patterns:

  • Cross-account access → IAM role with trust policy
  • Encrypt S3 bucket → Enable SSE-KMS with bucket policy
  • DDoS protection → Shield Standard + WAF rate limiting
  • Private AWS access → VPC endpoints (Gateway or Interface)
  • Audit API calls → CloudTrail with log file validation
  • Compliance monitoring → Config rules + Security Hub
  • Threat detection → GuardDuty + EventBridge automation
  • Credential rotation → Secrets Manager with Lambda
  • Secure instance access → Systems Manager Session Manager
  • Web application protection → WAF + Shield + CloudFront

Congratulations! You've completed Chapter 1: Design Secure Architectures. You now understand how to implement comprehensive security controls for AWS resources, workloads, and data.

Next Steps:

  1. Complete the self-assessment checklist above
  2. Practice with Domain 1 test bundles
  3. Review any weak areas identified
  4. When ready, proceed to Chapter 2: Resilient Architectures

Chapter Summary

What We Covered

Task 1.1: Design Secure Access to AWS Resources

  • āœ… IAM users, groups, roles, and policies
  • āœ… Multi-factor authentication (MFA) and root user security
  • āœ… Cross-account access and role switching
  • āœ… AWS Organizations and Service Control Policies (SCPs)
  • āœ… IAM Identity Center (AWS SSO) for centralized access
  • āœ… Federation with SAML and OIDC providers
  • āœ… Resource-based policies and permissions boundaries

Task 1.2: Design Secure Workloads and Applications

  • āœ… VPC security architecture (security groups, NACLs)
  • āœ… Network segmentation (public/private subnets)
  • āœ… AWS WAF, Shield, and DDoS protection
  • āœ… GuardDuty for threat detection
  • āœ… Secrets Manager for credential management
  • āœ… VPN and Direct Connect for secure connectivity
  • āœ… VPC endpoints for private AWS service access

Task 1.3: Determine Appropriate Data Security Controls

  • āœ… Encryption at rest (KMS, S3, EBS, RDS)
  • āœ… Encryption in transit (TLS, ACM certificates)
  • āœ… Key management and rotation strategies
  • āœ… Data backup and replication
  • āœ… Compliance monitoring with Config and CloudTrail
  • āœ… Data lifecycle and retention policies

Critical Takeaways

  1. Least Privilege: Always grant minimum permissions needed, use IAM policies with specific actions and resources
  2. Defense in Depth: Layer security controls (IAM + security groups + NACLs + encryption)
  3. Encryption Everywhere: Encrypt data at rest (KMS) and in transit (TLS/SSL)
  4. Audit Everything: Enable CloudTrail, Config, and VPC Flow Logs for comprehensive auditing
  5. Automate Security: Use GuardDuty, Security Hub, and EventBridge for automated threat response
  6. Secure by Default: Enable MFA, use IAM roles instead of access keys, rotate credentials regularly
  7. Network Isolation: Use private subnets, VPC endpoints, and PrivateLink to minimize internet exposure
  8. Compliance First: Use Config rules, AWS Artifact, and Audit Manager for compliance requirements

Self-Assessment Checklist

Test yourself before moving on:

IAM & Access Management

  • I can explain the difference between IAM users, groups, and roles
  • I understand when to use identity-based vs resource-based policies
  • I can design a cross-account access strategy using IAM roles
  • I know how to implement least privilege with IAM policies
  • I understand how SCPs work in AWS Organizations
  • I can explain when to use IAM Identity Center vs traditional IAM

Network Security

  • I can design a multi-tier VPC architecture with proper security
  • I understand the difference between security groups and NACLs
  • I know when to use VPC endpoints vs NAT gateways
  • I can explain how to protect against DDoS attacks
  • I understand how to implement WAF rules for web applications
  • I know how to secure VPN and Direct Connect connections

Data Protection

  • I can explain the difference between SSE-S3, SSE-KMS, and SSE-C
  • I understand how to implement encryption at rest for all AWS services
  • I know how to manage KMS keys and implement key rotation
  • I can design a backup and disaster recovery strategy
  • I understand how to implement data lifecycle policies
  • I know how to use CloudTrail for audit logging

Scenario-Based Questions

  • I can choose the right security service for a given scenario
  • I understand how to combine multiple security services
  • I can identify security vulnerabilities in architectures
  • I know how to implement compliance requirements
  • I can design secure hybrid cloud architectures

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-50 (all security topics)
  • Domain 1 Bundle 2: Questions 51-100 (advanced security)
  • Domain 1 Bundle 3: Questions 101-150 (security scenarios)
  • Security Services Bundle: 50 questions focused on IAM, KMS, WAF, Shield, GuardDuty

Expected Score: 70%+ to proceed confidently

If you scored below 70%:

  • Review sections where you missed questions
  • Focus on understanding WHY wrong answers are incorrect
  • Practice with additional domain-focused bundles
  • Revisit diagrams and decision frameworks

Quick Reference Card

Copy this to your notes for quick review:

IAM Best Practices:

  • Enable MFA for root and privileged users
  • Use IAM roles for EC2, Lambda, and cross-account access
  • Implement least privilege with specific policies
  • Rotate credentials regularly (90 days)
  • Use IAM Access Analyzer to identify external access

Network Security:

  • Security groups: Stateful, allow rules only, instance-level
  • NACLs: Stateless, allow/deny rules, subnet-level
  • VPC endpoints: Private access to AWS services (no internet)
  • PrivateLink: Private access to third-party services
  • WAF: Protect against SQL injection, XSS, rate limiting

Encryption:

  • S3: SSE-S3 (free), SSE-KMS (auditable), SSE-C (customer-managed)
  • EBS: Encrypted by default, uses KMS
  • RDS: Encrypt at creation, can't encrypt existing DB
  • In-transit: Use TLS/SSL, ACM for certificate management

Monitoring & Compliance:

  • CloudTrail: API call logging (who did what when)
  • Config: Resource configuration tracking and compliance
  • GuardDuty: Threat detection using ML
  • Security Hub: Centralized security findings
  • Macie: Sensitive data discovery in S3

Decision Points:

  • Need to audit API calls? → CloudTrail
  • Need to detect threats? → GuardDuty
  • Need to protect web app? → WAF + Shield
  • Need to rotate secrets? → Secrets Manager
  • Need cross-account access? → IAM role with trust policy
  • Need to encrypt data? → KMS with appropriate key policy

Chapter Summary

What We Covered

This chapter covered the three critical task areas for designing secure architectures on AWS:

āœ… Task 1.1: Secure Access to AWS Resources

  • IAM fundamentals: users, groups, roles, and policies
  • Multi-factor authentication (MFA) and credential management
  • Cross-account access patterns and role switching
  • AWS Organizations and Service Control Policies (SCPs)
  • Federation with SAML and OIDC identity providers
  • AWS IAM Identity Center for centralized SSO
  • Least privilege principle and permissions boundaries

āœ… Task 1.2: Secure Workloads and Applications

  • VPC security architecture with security groups and NACLs
  • Network segmentation with public and private subnets
  • AWS WAF for application-layer protection
  • AWS Shield for DDoS protection
  • Amazon GuardDuty for threat detection
  • AWS Secrets Manager for credential rotation
  • VPN and Direct Connect for hybrid connectivity
  • VPC endpoints and PrivateLink for private AWS service access

āœ… Task 1.3: Data Security Controls

  • Encryption at rest with AWS KMS
  • Encryption in transit with TLS/SSL and ACM
  • S3 encryption options (SSE-S3, SSE-KMS, SSE-C)
  • EBS and RDS encryption
  • Data backup strategies with AWS Backup
  • Compliance frameworks and AWS Config
  • CloudTrail for audit logging
  • Data lifecycle and retention policies

Critical Takeaways

  1. IAM Best Practices: Always use IAM roles for applications, never embed credentials. Enable MFA on root and privileged accounts. Apply least privilege principle to all policies.

  2. Defense in Depth: Layer security controls - use security groups AND NACLs, encrypt data at rest AND in transit, implement WAF AND Shield for web applications.

  3. Encryption Everywhere: Encrypt all sensitive data. Use KMS for centralized key management. Enable encryption by default on new resources.

  4. Network Segmentation: Isolate resources in private subnets. Use VPC endpoints to avoid internet traffic. Implement bastion hosts or Systems Manager Session Manager for secure access.

  5. Monitoring and Compliance: Enable CloudTrail in all regions. Use Config for compliance tracking. Set up GuardDuty for threat detection. Centralize findings in Security Hub.

  6. Cross-Account Security: Use IAM roles with trust policies for cross-account access. Implement SCPs at the organization level. Use AWS Control Tower for multi-account governance.

  7. Secret Management: Never hardcode credentials. Use Secrets Manager or Parameter Store. Enable automatic rotation for database credentials.

  8. Compliance Automation: Use AWS Config rules to enforce compliance. Implement AWS Backup for automated backups. Use S3 Object Lock for WORM compliance.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

IAM and Access Management:

  • Explain the difference between IAM users, groups, and roles
  • Describe how to implement cross-account access securely
  • Configure MFA for root and IAM users
  • Write IAM policies using least privilege principle
  • Explain when to use resource-based vs identity-based policies
  • Implement federation with SAML or OIDC
  • Configure AWS Organizations with SCPs
  • Use IAM Access Analyzer to identify external access

Network Security:

  • Design a multi-tier VPC architecture with security groups and NACLs
  • Explain the difference between security groups (stateful) and NACLs (stateless)
  • Configure VPC endpoints for S3 and DynamoDB
  • Implement AWS PrivateLink for third-party services
  • Set up AWS WAF rules to protect against common attacks
  • Configure AWS Shield Advanced for DDoS protection
  • Design hybrid connectivity with VPN or Direct Connect
  • Implement network segmentation with public and private subnets

Data Protection:

  • Configure S3 bucket encryption with SSE-S3, SSE-KMS, or SSE-C
  • Enable EBS encryption by default
  • Encrypt RDS databases at creation
  • Implement encryption in transit with TLS/SSL
  • Manage certificates with AWS Certificate Manager
  • Configure KMS key policies and grants
  • Implement automatic key rotation
  • Set up cross-region replication with encryption

Monitoring and Compliance:

  • Enable CloudTrail for API logging across all regions
  • Configure AWS Config rules for compliance checking
  • Set up Amazon GuardDuty for threat detection
  • Use Amazon Macie to discover sensitive data in S3
  • Centralize security findings in AWS Security Hub
  • Implement automated remediation with EventBridge and Lambda
  • Configure AWS Backup for automated backups
  • Use S3 Object Lock for compliance retention

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 1 Bundle 1: Questions 1-20 (IAM basics, security groups, encryption fundamentals)
  • Security Services Bundle: Questions 1-15 (GuardDuty, WAF, Shield basics)

Intermediate Level (Target: 70%+ correct):

  • Domain 1 Bundle 2: Questions 21-40 (Cross-account access, federation, advanced networking)
  • Full Practice Test 1: Domain 1 questions (Mixed difficulty, realistic scenarios)

Advanced Level (Target: 60%+ correct):

  • Domain 1 Bundle 3: Questions 41-50 (Complex architectures, policy optimization, compliance)
  • Full Practice Test 2: Domain 1 questions (Advanced scenarios)

If You Scored Below Target

Below 60% on Beginner Questions:

  • Review sections: IAM Fundamentals, Security Groups vs NACLs, Basic Encryption
  • Focus on: Understanding IAM policy structure, stateful vs stateless firewalls, encryption at rest vs in transit
  • Practice: Create IAM policies in AWS console, configure security groups, enable S3 encryption

Below 60% on Intermediate Questions:

  • Review sections: Cross-Account Access, Federation, VPC Endpoints, KMS Key Policies
  • Focus on: IAM role trust policies, SAML/OIDC integration, PrivateLink architecture, envelope encryption
  • Practice: Set up cross-account role switching, configure VPC endpoints, create KMS keys with policies

Below 50% on Advanced Questions:

  • Review sections: Complex Multi-Account Architectures, Advanced IAM Policies, Compliance Frameworks
  • Focus on: SCP inheritance, attribute-based access control, zero-trust architecture, automated compliance
  • Practice: Design multi-account security architecture, optimize IAM policies, implement Config rules

Quick Reference Card

Copy this to your notes for quick review

IAM Essentials

  • Root User: Enable MFA, don't use for daily tasks, lock away credentials
  • IAM Users: For individual people, enable MFA, rotate access keys every 90 days
  • IAM Groups: Assign permissions to groups, add users to groups
  • IAM Roles: For applications and services, use for cross-account access
  • Policies: JSON documents, explicit deny overrides allow, least privilege principle

Network Security

  • Security Groups: Stateful, allow rules only, instance-level, default deny inbound
  • NACLs: Stateless, allow/deny rules, subnet-level, process rules in order
  • VPC Endpoints: Gateway (S3, DynamoDB) or Interface (other services)
  • PrivateLink: Private access to third-party SaaS, uses interface endpoints
  • WAF: Protect against SQL injection, XSS, rate limiting, geo-blocking
  • Shield: Standard (free, automatic) or Advanced (paid, 24/7 DDoS response)

Encryption

  • At Rest: KMS (managed keys), SSE-S3 (S3-managed), SSE-C (customer-managed)
  • In Transit: TLS/SSL, ACM for certificate management, automatic renewal
  • KMS: Customer Master Keys (CMK), automatic rotation, key policies, grants
  • Envelope Encryption: Encrypt data with data key, encrypt data key with CMK

Monitoring & Compliance

  • CloudTrail: API call logging, enable in all regions, log file validation
  • Config: Resource configuration tracking, compliance rules, automatic remediation
  • GuardDuty: Threat detection using ML, analyzes VPC Flow Logs, DNS logs, CloudTrail
  • Macie: Sensitive data discovery in S3, PII detection, data classification
  • Security Hub: Centralized security findings, compliance checks, automated remediation

Decision Points

Scenario Solution
Need to audit API calls CloudTrail
Need to detect threats GuardDuty
Need to protect web app WAF + Shield
Need to rotate secrets Secrets Manager
Need cross-account access IAM role with trust policy
Need to encrypt data KMS with key policy
Need private AWS service access VPC endpoint
Need to discover sensitive data Macie
Need compliance tracking Config
Need centralized security view Security Hub

Common Exam Traps

  • āŒ Using root user for daily tasks → āœ… Create IAM users/roles
  • āŒ Hardcoding credentials → āœ… Use IAM roles or Secrets Manager
  • āŒ Overly permissive policies → āœ… Apply least privilege
  • āŒ Not encrypting sensitive data → āœ… Enable encryption by default
  • āŒ Exposing resources to internet → āœ… Use private subnets + VPC endpoints
  • āŒ Not enabling MFA → āœ… Enable MFA on all privileged accounts
  • āŒ Not logging API calls → āœ… Enable CloudTrail in all regions
  • āŒ Manual security checks → āœ… Automate with Config rules

Next Chapter: 03_domain2_resilient_architectures - Learn how to design highly available and fault-tolerant architectures.


Chapter Summary

What We Covered

This chapter covered the three critical task areas for designing secure architectures on AWS:

āœ… Task 1.1: Secure Access to AWS Resources

  • IAM fundamentals: users, groups, roles, policies
  • Multi-factor authentication (MFA) and root user security
  • Cross-account access and role switching
  • AWS Organizations and Service Control Policies (SCPs)
  • Federation with SAML and OIDC
  • IAM Identity Center (AWS SSO) for centralized access
  • Least privilege principle and permissions boundaries

āœ… Task 1.2: Secure Workloads and Applications

  • VPC security architecture with security groups and NACLs
  • Network segmentation with public and private subnets
  • AWS WAF for application protection
  • AWS Shield for DDoS protection
  • GuardDuty for threat detection
  • Secrets Manager for credential management
  • VPN and Direct Connect for hybrid connectivity
  • VPC endpoints and PrivateLink for private connectivity

āœ… Task 1.3: Data Security Controls

  • Encryption at rest with AWS KMS
  • Encryption in transit with TLS/SSL and ACM
  • S3 encryption options (SSE-S3, SSE-KMS, SSE-C)
  • RDS and EBS encryption
  • Key rotation and certificate management
  • Data backup and replication strategies
  • CloudTrail for audit logging
  • AWS Config for compliance monitoring

Critical Takeaways

  1. IAM Best Practices: Always use IAM roles for applications, never embed credentials. Enable MFA on all accounts, especially root. Apply least privilege principle to all policies.

  2. Defense in Depth: Use multiple layers of security - security groups, NACLs, WAF, Shield. No single point of failure in security architecture.

  3. Encryption Everywhere: Encrypt data at rest with KMS, encrypt data in transit with TLS. Use envelope encryption for large data sets.

  4. Audit and Monitor: Enable CloudTrail in all regions, use Config for compliance, GuardDuty for threats, and Security Hub for centralized visibility.

  5. Shared Responsibility: AWS secures the infrastructure, you secure what you put in the cloud. Understand where your responsibilities begin.

  6. Network Isolation: Use VPC endpoints to keep traffic within AWS network. Use PrivateLink for private access to services. Segment networks with multiple subnets.

  7. Secrets Management: Never hardcode credentials. Use Secrets Manager or Parameter Store with automatic rotation.

  8. Cross-Account Access: Use IAM roles with trust policies, not IAM users. Implement SCPs at organization level for guardrails.

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between IAM users, groups, and roles
  • I understand when to use resource-based vs identity-based policies
  • I can design a multi-account architecture with Organizations and SCPs
  • I know how to implement cross-account access securely
  • I understand the difference between security groups and NACLs
  • I can design a VPC with proper network segmentation
  • I know when to use WAF, Shield, and GuardDuty
  • I understand the different S3 encryption options
  • I can explain how KMS works and when to use it
  • I know how to implement encryption in transit
  • I understand CloudTrail, Config, and their use cases
  • I can design a secure hybrid architecture with VPN or Direct Connect

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-20 (IAM and access control)
  • Domain 1 Bundle 2: Questions 1-20 (Network security)
  • Domain 1 Bundle 3: Questions 1-20 (Data protection)
  • Security Services Bundle: Questions 1-25

Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review sections on IAM policies and evaluation logic
  • Focus on understanding security groups vs NACLs (stateful vs stateless)
  • Study KMS key policies and grants
  • Practice cross-account access scenarios

Quick Reference Card

IAM Essentials:

  • Users: Long-term credentials for people
  • Groups: Collection of users with common permissions
  • Roles: Temporary credentials for applications/services
  • Policies: JSON documents defining permissions
  • Trust Policy: Who can assume a role
  • Permissions Boundary: Maximum permissions limit

Network Security:

  • Security Groups: Stateful, allow only, instance-level
  • NACLs: Stateless, allow/deny, subnet-level
  • VPC Endpoints: Gateway (S3, DynamoDB) or Interface (other services)
  • PrivateLink: Private access to third-party services

Encryption:

  • At Rest: KMS (CMK), SSE-S3, SSE-KMS, SSE-C
  • In Transit: TLS/SSL, ACM for certificates
  • KMS: Customer Master Keys, automatic rotation, key policies

Monitoring:

  • CloudTrail: API call logging
  • Config: Resource configuration tracking
  • GuardDuty: Threat detection
  • Macie: Sensitive data discovery
  • Security Hub: Centralized security findings

Key Decision Points:

  • Need to audit API calls → CloudTrail
  • Need to detect threats → GuardDuty
  • Need to protect web app → WAF + Shield
  • Need to rotate secrets → Secrets Manager
  • Need cross-account access → IAM role with trust policy
  • Need to encrypt data → KMS for at rest, TLS for in transit
  • Need private connectivity → VPC endpoints or PrivateLink

Next Chapter: 03_domain2_resilient_architectures - Learn how to design highly available and fault-tolerant architectures.


Chapter 2: Design Resilient Architectures (26% of exam)

Chapter Overview

What you'll learn:

  • Scalable and loosely coupled architecture patterns
  • High availability and fault tolerance strategies
  • Multi-AZ and multi-region deployments
  • Disaster recovery planning and implementation
  • Auto scaling and load balancing
  • Microservices and event-driven architectures

Time to complete: 10-12 hours

Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (Secure Architectures)

Exam Weight: 26% of exam questions (approximately 17 out of 65 questions)


Section 1: Scalable and Loosely Coupled Architectures

Introduction

The problem: Traditional monolithic applications are tightly coupled, making them difficult to scale, update, and maintain. When one component fails, the entire application can fail. When traffic increases, you must scale the entire application even if only one component needs more capacity.

The solution: Loosely coupled architectures separate components so they can scale independently, fail independently, and be updated independently. Components communicate through well-defined interfaces (APIs, message queues, event buses) rather than direct dependencies.

Why it's tested: This is a core principle of cloud architecture and represents 26% of the exam. Questions test your ability to design systems that scale automatically, handle failures gracefully, and minimize dependencies between components.

Core Concepts

Loose Coupling Fundamentals

What it is: Loose coupling is an architectural principle where components are designed to have minimal dependencies on each other. Components interact through standardized interfaces and don't need to know the internal implementation details of other components.

Why it exists: Tightly coupled systems are fragile. If Component A directly calls Component B, and B fails, A fails. If B needs to be updated, A might break. If B is overloaded, A must wait. Loose coupling solves these problems by introducing intermediaries (queues, load balancers, event buses) that buffer and route requests.

Real-world analogy: Think of a restaurant. In a tightly coupled system, customers would go directly into the kitchen and tell the chef what to cook. If the chef is busy, customers wait. If the chef is sick, no one eats. In a loosely coupled system, customers place orders with a waiter (queue), the waiter gives orders to the kitchen (producer), and the kitchen prepares food at its own pace (consumer). If one chef is busy, another chef can take the order. If a chef is sick, orders queue up until another chef is available.

How loose coupling works (Detailed step-by-step):

  1. Identify Components: Break your application into logical components (web tier, application tier, database tier, background processing, etc.).

  2. Define Interfaces: Each component exposes a well-defined interface (REST API, message format, event schema) that other components use to interact with it.

  3. Introduce Intermediaries: Place intermediaries between components:

    • Load Balancers: Distribute requests across multiple instances
    • Message Queues: Buffer requests between producers and consumers
    • Event Buses: Route events from publishers to subscribers
    • API Gateways: Provide a single entry point for multiple backend services
  4. Implement Asynchronous Communication: Instead of synchronous request-response (Component A waits for Component B), use asynchronous messaging (Component A sends message and continues, Component B processes when ready).

  5. Handle Failures Gracefully: Design components to handle failures of other components:

    • Retry with exponential backoff
    • Circuit breaker pattern (stop calling failing service)
    • Fallback to cached data or default responses
    • Dead letter queues for failed messages
  6. Scale Independently: Each component can scale based on its own load, not the load of other components.

Benefits of Loose Coupling:

  • Independent Scaling: Scale components based on their individual needs
  • Fault Isolation: Failure in one component doesn't cascade to others
  • Independent Deployment: Update components without affecting others
  • Technology Flexibility: Use different technologies for different components
  • Easier Testing: Test components in isolation
  • Better Resource Utilization: Don't over-provision entire application

Amazon SQS (Simple Queue Service)

What it is: Amazon SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. SQS eliminates the complexity and overhead of managing message-oriented middleware.

Why it exists: When Component A produces work faster than Component B can process it, you need a buffer. Without a queue, Component A must either wait (wasting resources) or drop requests (losing data). SQS provides a reliable, scalable buffer that holds messages until consumers are ready to process them.

Real-world analogy: SQS is like a post office mailbox. You (producer) drop letters (messages) in the mailbox at any time, even if the mail carrier (consumer) isn't there. The mail carrier picks up letters when they're ready and delivers them. If you drop 100 letters at once, they wait in the mailbox until the carrier can handle them. If the carrier is sick, letters wait until another carrier is available.

How SQS works (Detailed step-by-step):

  1. Create Queue: You create an SQS queue with a name and configuration (standard or FIFO, visibility timeout, message retention period).

  2. Producer Sends Messages: Your application (producer) sends messages to the queue using the SQS SendMessage API. Each message can be up to 256 KB and contains:

    • Message Body: The actual data (JSON, XML, plain text)
    • Message Attributes: Metadata about the message (optional)
    • Message ID: Unique identifier assigned by SQS
  3. Messages Stored: SQS stores messages redundantly across multiple Availability Zones for durability. Messages are retained for 4 days by default (configurable from 1 minute to 14 days).

  4. Consumer Polls Queue: Your application (consumer) polls the queue using the SQS ReceiveMessage API. SQS returns up to 10 messages per request.

  5. Visibility Timeout: When a consumer receives a message, SQS makes it invisible to other consumers for a visibility timeout period (default 30 seconds, configurable up to 12 hours). This prevents multiple consumers from processing the same message simultaneously.

  6. Process Message: The consumer processes the message (e.g., resize image, send email, update database).

  7. Delete Message: After successfully processing, the consumer deletes the message using the SQS DeleteMessage API. If the consumer doesn't delete the message before the visibility timeout expires, the message becomes visible again and another consumer can process it.

  8. Failure Handling: If a consumer fails to process a message (crashes, throws exception), it doesn't delete the message. After the visibility timeout, the message becomes visible again for retry. After a configurable number of receive attempts (default 5), SQS can move the message to a Dead Letter Queue (DLQ) for investigation.

SQS Queue Types:

Standard Queue:

  • Throughput: Unlimited, nearly unlimited transactions per second
  • Ordering: Best-effort ordering (messages usually delivered in order, but not guaranteed)
  • Delivery: At-least-once delivery (message might be delivered more than once)
  • Use Case: High throughput, order doesn't matter, can handle duplicates

FIFO Queue:

  • Throughput: 300 transactions per second (3,000 with batching)
  • Ordering: Strict ordering (messages delivered in exact order sent)
  • Delivery: Exactly-once processing (no duplicates)
  • Use Case: Order matters, cannot handle duplicates (e.g., financial transactions)

Detailed Example 1: Image Processing Pipeline with SQS

Scenario: You're building a photo sharing application. Users upload photos that need to be resized into multiple sizes (thumbnail, medium, large) and have metadata extracted (location, date, camera model). Uploads are bursty - sometimes 10 photos per minute, sometimes 1,000 photos per minute.

Without SQS (Tightly Coupled):

  • Web server receives upload
  • Web server resizes images (CPU-intensive, takes 5 seconds per image)
  • Web server extracts metadata (takes 2 seconds per image)
  • User waits 7+ seconds for upload to complete
  • During traffic spikes, web servers become overloaded
  • Users experience timeouts and failed uploads

With SQS (Loosely Coupled):

Architecture:

  1. Upload Service: Web servers receive uploads, store original image in S3, send message to SQS queue
  2. SQS Queue: Buffers resize requests
  3. Resize Workers: Auto Scaling group of EC2 instances polls queue, processes images
  4. S3: Stores original and resized images

Step-by-Step Flow:

  1. User Uploads Photo:

    • User uploads photo to web server
    • Web server stores original in S3: s3://photos/originals/photo123.jpg
    • Web server sends message to SQS queue:
    {
      "photoId": "photo123",
      "userId": "user456",
      "s3Bucket": "photos",
      "s3Key": "originals/photo123.jpg",
      "sizes": ["thumbnail", "medium", "large"]
    }
    
    • Web server immediately returns success to user (< 100ms)
    • User doesn't wait for processing
  2. Message Queued:

    • SQS stores message redundantly across multiple AZs
    • Message is available for consumers to retrieve
    • If no consumers are available, message waits (up to 14 days)
  3. Resize Worker Polls Queue:

    • Resize worker (EC2 instance) polls SQS every 1 second
    • SQS returns message and makes it invisible for 5 minutes (visibility timeout)
    • Worker has 5 minutes to process before message becomes visible again
  4. Worker Processes Image:

    • Worker downloads original from S3
    • Worker resizes to thumbnail (200x200), medium (800x800), large (1600x1600)
    • Worker uploads resized images to S3:
      • s3://photos/thumbnails/photo123.jpg
      • s3://photos/medium/photo123.jpg
      • s3://photos/large/photo123.jpg
    • Worker extracts metadata and stores in DynamoDB
    • Processing takes 5 seconds
  5. Worker Deletes Message:

    • Worker calls SQS DeleteMessage API
    • Message is permanently removed from queue
    • Worker polls for next message
  6. Auto Scaling:

    • CloudWatch monitors queue depth (ApproximateNumberOfMessages metric)
    • If queue depth > 100, Auto Scaling adds more workers
    • If queue depth < 10, Auto Scaling removes workers
    • Workers scale from 2 (minimum) to 20 (maximum) based on load

Failure Scenarios:

Scenario 1: Worker Crashes During Processing:

  • Worker receives message, starts processing
  • Worker crashes before deleting message
  • After 5 minutes (visibility timeout), message becomes visible again
  • Another worker picks up the message and processes it
  • Result: Image eventually processed, no data loss

Scenario 2: Image Processing Fails:

  • Worker receives message, downloads image
  • Image is corrupted, processing fails
  • Worker doesn't delete message
  • Message becomes visible again after visibility timeout
  • After 5 failed attempts, SQS moves message to Dead Letter Queue
  • Operations team investigates DLQ messages

Scenario 3: Traffic Spike:

  • 1,000 photos uploaded in 1 minute
  • Web servers quickly send 1,000 messages to SQS (< 1 second)
  • SQS buffers all 1,000 messages
  • Auto Scaling detects high queue depth, adds 10 more workers
  • Workers process messages over 10 minutes
  • Users don't experience slowdowns (upload returns immediately)

Benefits of This Architecture:

  • Fast Response: Users get immediate response (< 100ms vs 7+ seconds)
  • Scalability: Workers scale independently of web servers
  • Fault Tolerance: Worker failures don't affect uploads
  • Cost Efficiency: Only pay for workers when processing images
  • Flexibility: Can add new processing steps (watermarking, face detection) without changing upload service

Cost Analysis:

  • SQS: $0.40 per million requests (1M uploads = $0.40)
  • EC2 Workers: t3.medium $0.0416/hour Ɨ 2 instances Ɨ 730 hours = $61/month (baseline)
  • Auto Scaling: Additional instances only during spikes
  • S3: Storage and transfer costs
  • Total: ~$100-200/month for millions of photos

Amazon SNS (Simple Notification Service)

What it is: Amazon SNS is a fully managed pub/sub (publish/subscribe) messaging service that enables you to decouple microservices, distributed systems, and event-driven serverless applications. SNS provides topics for high-throughput, push-based, many-to-many messaging.

Why it exists: Sometimes you need to send the same message to multiple recipients (fan-out pattern). With point-to-point messaging (like SQS), you'd need to send the message multiple times. SNS allows you to publish once and deliver to many subscribers simultaneously.

Real-world analogy: SNS is like a news broadcaster. The broadcaster (publisher) sends news (messages) to a channel (topic). Anyone interested (subscribers) can tune in to that channel. When news is broadcast, all subscribers receive it simultaneously. Subscribers can be TV viewers (Lambda functions), radio listeners (SQS queues), or newspaper readers (email addresses).

How SNS works (Detailed step-by-step):

  1. Create Topic: You create an SNS topic, which is a communication channel with a unique ARN (Amazon Resource Name).

  2. Subscribe Endpoints: You subscribe endpoints to the topic:

    • SQS Queue: Messages delivered to queue for processing
    • Lambda Function: Function invoked with message as input
    • HTTP/HTTPS Endpoint: POST request sent to your web server
    • Email/Email-JSON: Email sent to address
    • SMS: Text message sent to phone number
    • Mobile Push: Notification sent to mobile app
  3. Publish Message: Your application publishes a message to the topic using the SNS Publish API. The message contains:

    • Subject: Brief description (optional)
    • Message: The actual content (up to 256 KB)
    • Message Attributes: Metadata for filtering (optional)
  4. Fan-Out: SNS immediately delivers the message to all subscribed endpoints in parallel. Each subscriber receives a copy of the message.

  5. Retry Logic: If delivery fails (e.g., Lambda function throttled, HTTP endpoint unavailable), SNS retries with exponential backoff. After multiple failures, SNS can send failed messages to a Dead Letter Queue.

  6. Message Filtering: Subscribers can specify filter policies to receive only messages matching certain criteria. SNS evaluates filters and delivers only matching messages.

SNS vs SQS:

Feature SNS (Pub/Sub) SQS (Queue)
Pattern Publish/Subscribe (1-to-many) Point-to-Point (1-to-1)
Delivery Push (SNS pushes to subscribers) Pull (consumers poll queue)
Persistence No (messages not stored) Yes (messages stored up to 14 days)
Subscribers Multiple (fan-out) Single consumer per message
Use Case Notify multiple systems of event Decouple producer and consumer

SNS + SQS Fan-Out Pattern:

The most powerful pattern combines SNS and SQS: publish to SNS topic, which fans out to multiple SQS queues. Each queue has its own consumer that processes messages independently.

Detailed Example 2: Order Processing with SNS Fan-Out

Scenario: You're building an e-commerce platform. When a customer places an order, multiple systems need to be notified:

  • Inventory Service: Reduce stock levels
  • Shipping Service: Create shipping label
  • Email Service: Send confirmation email
  • Analytics Service: Record order for reporting
  • Fraud Detection Service: Check for suspicious activity

Architecture:

  1. Order Service: Publishes order event to SNS topic
  2. SNS Topic: "OrderPlaced" topic
  3. SQS Queues: One queue per service (5 queues total)
  4. Consumers: Each service has workers polling its queue

Step-by-Step Flow:

  1. Customer Places Order:

    • Order service validates order, charges credit card
    • Order service publishes message to SNS topic "OrderPlaced":
    {
      "orderId": "ORD-12345",
      "customerId": "CUST-789",
      "items": [
        {"productId": "PROD-001", "quantity": 2, "price": 29.99},
        {"productId": "PROD-002", "quantity": 1, "price": 49.99}
      ],
      "total": 109.97,
      "shippingAddress": {
        "street": "123 Main St",
        "city": "Seattle",
        "state": "WA",
        "zip": "98101"
      },
      "timestamp": "2024-01-15T10:30:00Z"
    }
    
  2. SNS Fans Out to Queues:

    • SNS delivers message to all 5 subscribed SQS queues simultaneously
    • Each queue receives a copy of the message
    • Delivery takes < 100ms
  3. Inventory Service Processes:

    • Inventory worker polls InventoryQueue
    • Receives order message
    • Reduces stock: PROD-001 quantity -2, PROD-002 quantity -1
    • Updates inventory database
    • Deletes message from queue
    • Processing time: 200ms
  4. Shipping Service Processes:

    • Shipping worker polls ShippingQueue
    • Receives order message
    • Calls shipping API to create label
    • Stores tracking number in database
    • Deletes message from queue
    • Processing time: 1 second (external API call)
  5. Email Service Processes:

    • Email worker polls EmailQueue
    • Receives order message
    • Generates confirmation email HTML
    • Sends email via Amazon SES
    • Deletes message from queue
    • Processing time: 500ms
  6. Analytics Service Processes:

    • Analytics worker polls AnalyticsQueue
    • Receives order message
    • Writes order data to data warehouse (Redshift)
    • Deletes message from queue
    • Processing time: 100ms
  7. Fraud Detection Processes:

    • Fraud worker polls FraudQueue
    • Receives order message
    • Runs fraud detection algorithms
    • If suspicious, creates alert
    • Deletes message from queue
    • Processing time: 2 seconds (ML inference)

Key Benefits:

Independent Processing:

  • Each service processes at its own pace
  • Slow fraud detection (2 seconds) doesn't delay fast inventory update (200ms)
  • If shipping service is down, other services continue processing

Independent Scaling:

  • Inventory service: 2 workers (fast processing)
  • Shipping service: 5 workers (slow external API)
  • Email service: 3 workers (moderate load)
  • Each service scales based on its queue depth

Fault Isolation:

  • If email service fails, order still processed by other services
  • Failed messages go to email service's Dead Letter Queue
  • Operations team fixes email service, reprocesses DLQ messages
  • Customer still gets order, just delayed email

Easy to Add Services:

  • Want to add loyalty points service? Subscribe new queue to SNS topic
  • No changes to order service or existing services
  • New service starts receiving order events immediately

Failure Scenarios:

Scenario 1: Shipping Service Down:

  • SNS delivers message to all queues
  • Shipping workers are down (deployment, crash)
  • Messages accumulate in ShippingQueue
  • Other services continue processing normally
  • When shipping service recovers, workers process backlog
  • Result: Orders processed, shipping delayed but not lost

Scenario 2: Fraud Detection Overloaded:

  • Traffic spike: 1,000 orders per minute
  • Fraud detection takes 2 seconds per order
  • FraudQueue depth increases to 2,000 messages
  • Auto Scaling adds more fraud workers
  • Fraud detection catches up over 10 minutes
  • Other services unaffected (processing in real-time)

Scenario 3: SNS Topic Unavailable (extremely rare):

  • Order service tries to publish to SNS
  • SNS returns error (service issue)
  • Order service retries with exponential backoff
  • After 3 retries, order service writes to local queue
  • When SNS recovers, order service publishes from local queue
  • Result: Temporary delay, no data loss

Message Filtering Example:

You can use SNS message filtering to send only relevant messages to each subscriber.

Scenario: High-value orders (>$1,000) need special fraud review. Low-value orders use automated fraud detection.

SNS Message with Attributes:

{
  "Message": "{order data}",
  "MessageAttributes": {
    "orderValue": {
      "Type": "Number",
      "Value": "1250.00"
    },
    "priority": {
      "Type": "String",
      "Value": "high"
    }
  }
}

Fraud Queue Subscription Filter:

{
  "orderValue": [{"numeric": [">=", 1000]}]
}

Result: Only orders >= $1,000 delivered to fraud queue. Low-value orders filtered out, reducing processing load.

Amazon EventBridge

What it is: Amazon EventBridge is a serverless event bus service that makes it easy to connect applications using events. EventBridge receives events from AWS services, custom applications, and SaaS applications, and routes them to targets based on rules.

Why it exists: Modern applications are event-driven - things happen (user signs up, file uploaded, payment processed) and other systems need to react. EventBridge provides a central event bus where all events flow, with powerful routing and filtering capabilities.

Real-world analogy: EventBridge is like a smart mail sorting facility. Letters (events) arrive from many sources (AWS services, your apps, SaaS apps). The facility reads the address and contents (event pattern matching), then routes each letter to the correct destination (targets) based on rules. Some letters might go to multiple destinations (fan-out).

How EventBridge works (Detailed step-by-step):

  1. Event Bus: You use the default event bus (receives AWS service events) or create custom event buses for your applications.

  2. Event Sources: Events come from:

    • AWS Services: EC2 state changes, S3 object uploads, CloudWatch alarms
    • Custom Applications: Your apps send events via PutEvents API
    • SaaS Partners: Zendesk, Datadog, Auth0, etc.
  3. Event Structure: Events are JSON documents with standard structure:

{
  "version": "0",
  "id": "unique-event-id",
  "detail-type": "EC2 Instance State-change Notification",
  "source": "aws.ec2",
  "account": "123456789012",
  "time": "2024-01-15T10:30:00Z",
  "region": "us-east-1",
  "resources": ["arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0"],
  "detail": {
    "instance-id": "i-1234567890abcdef0",
    "state": "running"
  }
}
  1. Rules: You create rules that match event patterns and route to targets:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}
  1. Targets: When an event matches a rule, EventBridge sends it to targets:

    • Lambda function
    • SQS queue
    • SNS topic
    • Step Functions state machine
    • Kinesis stream
    • ECS task
    • And 20+ other AWS services
  2. Transformation: EventBridge can transform events before sending to targets, extracting only needed fields or reformatting.

EventBridge vs SNS:

Feature EventBridge SNS
Pattern Matching Advanced (JSON path, content filtering) Basic (message attributes)
Event Schema Schema registry, validation No schema
Targets 20+ AWS services 6 endpoint types
Archive/Replay Yes (archive events, replay later) No
SaaS Integration Built-in (Zendesk, Datadog, etc.) No
Use Case Complex event routing, AWS service events Simple pub/sub, mobile push

Detailed Example 3: Automated Incident Response with EventBridge

Scenario: You need to automatically respond to security events. When GuardDuty detects a threat, you want to:

  1. Send alert to security team (Slack)
  2. Isolate affected EC2 instance (change security group)
  3. Create incident ticket (Jira)
  4. Capture forensics (create EBS snapshot)
  5. Log event for compliance (S3)

Architecture:

  1. GuardDuty: Detects threat, sends event to EventBridge
  2. EventBridge Rule: Matches GuardDuty findings
  3. Targets: Lambda functions for each response action

Step-by-Step Flow:

  1. GuardDuty Detects Threat:
    • GuardDuty detects EC2 instance communicating with known malicious IP
    • GuardDuty generates finding
    • GuardDuty sends event to EventBridge:
{
  "version": "0",
  "id": "finding-12345",
  "detail-type": "GuardDuty Finding",
  "source": "aws.guardduty",
  "account": "123456789012",
  "time": "2024-01-15T10:30:00Z",
  "region": "us-east-1",
  "detail": {
    "severity": 8,
    "type": "Backdoor:EC2/C&CActivity.B!DNS",
    "resource": {
      "instanceDetails": {
        "instanceId": "i-1234567890abcdef0"
      }
    },
    "service": {
      "action": {
        "networkConnectionAction": {
          "remoteIpDetails": {
            "ipAddressV4": "198.51.100.1"
          }
        }
      }
    }
  }
}
  1. EventBridge Matches Rule:
    • Rule pattern:
{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [{"numeric": [">=", 7]}]
  }
}
  • Event matches (severity 8 >= 7)
  • EventBridge routes to 5 targets
  1. Target 1: Slack Notification Lambda:

    • Lambda receives event
    • Extracts: instance ID, threat type, severity
    • Formats Slack message:
    🚨 SECURITY ALERT
    Severity: HIGH (8/10)
    Instance: i-1234567890abcdef0
    Threat: C&C Activity Detected
    Action: Instance isolated, forensics captured
    
    • Posts to Slack webhook
    • Security team notified in < 5 seconds
  2. Target 2: Isolate Instance Lambda:

    • Lambda receives event
    • Extracts instance ID
    • Creates new security group "quarantine-sg" (no inbound/outbound rules)
    • Changes instance security group to quarantine-sg
    • Instance is now isolated (cannot communicate)
    • Takes 2 seconds
  3. Target 3: Create Jira Ticket Lambda:

    • Lambda receives event
    • Calls Jira API to create incident ticket
    • Ticket includes: instance ID, threat details, timeline
    • Assigns to security team
    • Takes 1 second
  4. Target 4: Forensics Lambda:

    • Lambda receives event
    • Creates EBS snapshot of instance volumes
    • Tags snapshot with incident ID
    • Snapshot preserved for investigation
    • Takes 5 seconds (snapshot creation is async)
  5. Target 5: Compliance Logging:

    • EventBridge sends event directly to S3 (no Lambda needed)
    • Event stored in S3: s3://security-logs/guardduty/2024/01/15/finding-12345.json
    • Retained for 7 years (compliance requirement)

Timeline:

  • T+0s: GuardDuty detects threat
  • T+1s: EventBridge receives event, matches rule
  • T+2s: Instance isolated
  • T+3s: Slack notification sent
  • T+4s: Jira ticket created
  • T+5s: Forensics snapshot initiated
  • T+5s: Event logged to S3

Total Response Time: 5 seconds (vs hours for manual response)

Benefits:

  • Fast Response: Automated response in seconds
  • Consistent: Same response every time, no human error
  • Comprehensive: Multiple actions in parallel
  • Auditable: All events logged to S3
  • Scalable: Handles 1 or 1,000 incidents identically

Cost:

  • EventBridge: $1 per million events (1,000 incidents = $0.001)
  • Lambda: $0.20 per million requests + compute time
  • Total: < $1/month for typical incident volume

AWS Lambda for Event-Driven Processing

What it is: AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the compute resources. You don't provision or manage servers - Lambda handles everything.

Why it exists: Traditional servers require provisioning, patching, scaling, and monitoring. For event-driven workloads (process file upload, respond to API call, handle queue message), you pay for idle time when no events occur. Lambda eliminates this waste by running code only when triggered and charging only for compute time used.

Real-world analogy: Lambda is like hiring a contractor instead of a full-time employee. You only pay when they're working on your project (per-request billing). You don't pay for their idle time, vacation, or benefits. When you need more work done, you hire more contractors (automatic scaling). When work is done, contractors leave (no idle resources).

How Lambda works (Detailed step-by-step):

  1. Create Function: You upload your code (Python, Node.js, Java, Go, etc.) and specify:

    • Runtime: Programming language and version
    • Handler: Function to invoke (e.g., lambda_function.lambda_handler)
    • Memory: 128 MB to 10,240 MB (CPU scales proportionally)
    • Timeout: Maximum execution time (1 second to 15 minutes)
    • IAM Role: Permissions for function to access AWS services
  2. Configure Trigger: You specify what invokes the function:

    • API Gateway: HTTP request
    • S3: Object upload
    • DynamoDB: Table update
    • SQS: Message in queue
    • EventBridge: Event pattern match
    • CloudWatch Events: Schedule (cron)
    • And 20+ other event sources
  3. Event Occurs: When the trigger event happens, AWS invokes your Lambda function.

  4. Lambda Execution:

    • Lambda finds an available execution environment (or creates new one)
    • Lambda loads your code into the environment
    • Lambda invokes your handler function with event data
    • Your code executes (processes event, calls AWS services, returns response)
    • Lambda captures logs and sends to CloudWatch Logs
  5. Scaling: If multiple events occur simultaneously, Lambda automatically creates multiple execution environments and runs them in parallel. Lambda can scale to thousands of concurrent executions.

  6. Billing: You pay for:

    • Requests: $0.20 per million requests
    • Compute Time: $0.0000166667 per GB-second (memory Ɨ duration)
    • Free Tier: 1 million requests and 400,000 GB-seconds per month

Detailed Example 4: Thumbnail Generation with Lambda

Scenario: Users upload images to S3. You need to automatically generate thumbnails (200x200) for each uploaded image.

Architecture:

  1. S3 Bucket: Users upload images
  2. S3 Event: Triggers Lambda on object creation
  3. Lambda Function: Generates thumbnail
  4. S3 Bucket: Stores thumbnail

Lambda Function Code (Python):

import boto3
import os
from PIL import Image
import io

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Extract bucket and key from S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Don't process thumbnails (avoid infinite loop)
    if key.startswith('thumbnails/'):
        return
    
    # Download image from S3
    response = s3.get_object(Bucket=bucket, Key=key)
    image_data = response['Body'].read()
    
    # Open image with Pillow
    image = Image.open(io.BytesIO(image_data))
    
    # Resize to thumbnail (200x200)
    image.thumbnail((200, 200))
    
    # Save to bytes buffer
    buffer = io.BytesIO()
    image.save(buffer, format=image.format)
    buffer.seek(0)
    
    # Upload thumbnail to S3
    thumbnail_key = f'thumbnails/{key}'
    s3.put_object(
        Bucket=bucket,
        Key=thumbnail_key,
        Body=buffer,
        ContentType=response['ContentType']
    )
    
    return {
        'statusCode': 200,
        'body': f'Thumbnail created: {thumbnail_key}'
    }

Step-by-Step Flow:

  1. User Uploads Image:

    • User uploads vacation.jpg to S3 bucket my-photos
    • S3 stores object: s3://my-photos/vacation.jpg
  2. S3 Triggers Lambda:

    • S3 sends event to Lambda:
{
  "Records": [{
    "s3": {
      "bucket": {"name": "my-photos"},
      "object": {"key": "vacation.jpg", "size": 2048000}
    }
  }]
}
  1. Lambda Execution Starts:

    • Lambda finds available execution environment (or creates new one)
    • Lambda loads function code
    • Lambda invokes handler with S3 event
  2. Function Processes Image:

    • Downloads vacation.jpg from S3 (2 MB)
    • Resizes to 200x200 thumbnail (20 KB)
    • Uploads thumbnail to s3://my-photos/thumbnails/vacation.jpg
    • Execution time: 500ms
  3. Lambda Completes:

    • Function returns success
    • Lambda logs execution to CloudWatch
    • Execution environment kept warm for 5-10 minutes (for next invocation)
  4. Billing:

    • Memory: 1024 MB
    • Duration: 500ms = 0.5 seconds
    • GB-seconds: 1 GB Ɨ 0.5 seconds = 0.5 GB-seconds
    • Cost: 0.5 Ɨ $0.0000166667 = $0.0000083 (less than 1 cent)

Scaling Example:

Scenario: 1,000 users upload images simultaneously.

  1. 1,000 S3 Events: S3 sends 1,000 events to Lambda
  2. Lambda Scales: Lambda creates 1,000 execution environments
  3. Parallel Processing: All 1,000 images processed simultaneously
  4. Total Time: 500ms (same as single image)
  5. Cost: 1,000 Ɨ $0.0000083 = $0.0083 (less than 1 cent)

Without Lambda (EC2 approach):

  • Need to provision enough EC2 instances to handle peak load (1,000 concurrent)
  • Instances idle most of the time (waste money)
  • Need to implement scaling, monitoring, patching
  • Cost: $100s/month for idle capacity

Lambda Benefits:

  • No Servers: No provisioning, patching, or management
  • Automatic Scaling: Handles 1 or 1,000,000 requests
  • Pay Per Use: Only pay for actual compute time
  • High Availability: Runs across multiple AZs automatically
  • Integrated: Native integration with 20+ AWS services

Section 2: High Availability and Fault Tolerance

Introduction

The problem: Hardware fails, software crashes, networks partition, and entire data centers can go offline. Traditional architectures with single points of failure experience downtime when components fail, resulting in lost revenue, poor user experience, and SLA violations.

The solution: High availability (HA) architectures eliminate single points of failure by deploying redundant components across multiple Availability Zones. When one component fails, traffic automatically shifts to healthy components. Fault tolerance goes further by ensuring the system continues operating correctly even during failures.

Why it's tested: This is a core AWS architectural principle and represents a significant portion of the exam. Questions test your ability to design systems that achieve 99.9%, 99.99%, or 99.999% availability using AWS services.

Core Concepts

Availability Zones and Regions

What they are: AWS Regions are geographic areas (e.g., us-east-1 in Virginia, eu-west-1 in Ireland) that contain multiple isolated Availability Zones (AZs). Each AZ is one or more discrete data centers with redundant power, networking, and connectivity.

Why they exist: A single data center can fail due to power outages, network issues, natural disasters, or human error. By distributing resources across multiple physically separated data centers (AZs), you can survive individual data center failures. Regions provide geographic diversity for disaster recovery and data residency requirements.

Real-world analogy: Think of a Region as a city (e.g., New York) and Availability Zones as different neighborhoods in that city (Manhattan, Brooklyn, Queens). Each neighborhood has its own power grid, water supply, and infrastructure. If Manhattan loses power, Brooklyn and Queens continue operating. If you need disaster recovery, you also have resources in a different city (e.g., Los Angeles).

How AZs work (Detailed):

  1. Physical Separation: AZs are physically separated by meaningful distances (miles apart) to reduce risk of simultaneous failure from natural disasters, power outages, or network issues.

  2. Independent Infrastructure: Each AZ has:

    • Independent power supply (multiple utility providers, backup generators)
    • Independent cooling systems
    • Independent network connectivity (multiple ISPs)
    • Independent physical security
  3. Low-Latency Interconnection: AZs are connected with high-bandwidth, low-latency private fiber networks. Latency between AZs in the same Region is typically < 2ms, enabling synchronous replication.

  4. Fault Isolation: Failures in one AZ don't affect other AZs. AWS designs services to isolate faults within a single AZ.

Availability Zone Naming:

  • AZ names are account-specific (your us-east-1a might be different from another account's us-east-1a)
  • This distributes load across physical AZs
  • Use AZ IDs (use1-az1, use1-az2) for consistent identification across accounts

Detailed Example 1: Multi-AZ RDS Deployment

Scenario: You're running a MySQL database for a critical e-commerce application. The database must be available 99.95% of the time (< 4.5 hours downtime per year). Single-AZ deployment doesn't meet this requirement because AZ failures occur occasionally.

Solution: RDS Multi-AZ deployment.

Architecture:

  • Primary DB Instance: In AZ-A (us-east-1a), handles all read and write operations
  • Standby DB Instance: In AZ-B (us-east-1b), synchronously replicates from primary
  • DNS Endpoint: Single endpoint (mydb.abc123.us-east-1.rds.amazonaws.com) that points to current primary

How Multi-AZ Works:

  1. Normal Operation:

    • Application connects to DNS endpoint
    • DNS resolves to primary instance IP in AZ-A
    • Application sends queries to primary
    • Primary processes queries and returns results
    • Primary synchronously replicates every transaction to standby in AZ-B
    • Standby acknowledges replication before primary commits transaction
    • This ensures zero data loss (RPO = 0)
  2. Synchronous Replication:

    • Application writes data: INSERT INTO orders VALUES (...)
    • Primary writes to its storage
    • Primary sends transaction to standby
    • Standby writes to its storage
    • Standby sends acknowledgment to primary
    • Primary commits transaction and returns success to application
    • Replication adds < 5ms latency (AZs are close)
  3. Failure Detection:

    • RDS continuously monitors primary instance health
    • Health checks every 1-2 seconds:
      • Network connectivity
      • Instance responsiveness
      • Storage availability
      • Database process status
    • If 3 consecutive health checks fail (3-6 seconds), RDS initiates failover
  4. Automatic Failover:

    • RDS detects primary failure (e.g., AZ-A power outage)
    • RDS promotes standby in AZ-B to primary
    • RDS updates DNS record to point to new primary IP
    • DNS TTL is 30 seconds, but RDS forces immediate update
    • Applications reconnect and resume operations
    • Total failover time: 60-120 seconds
  5. Post-Failover:

    • New primary (formerly standby) handles all traffic
    • RDS automatically creates new standby in another AZ (AZ-C)
    • Synchronous replication resumes
    • System returns to fully redundant state

Failure Scenarios:

Scenario 1: AZ-A Power Outage:

  • T+0s: Power outage in AZ-A, primary instance becomes unreachable
  • T+3s: RDS detects failure (3 failed health checks)
  • T+5s: RDS initiates failover, promotes standby
  • T+30s: DNS propagates to most clients
  • T+60s: Applications reconnect to new primary
  • T+120s: All applications operational
  • Downtime: 60-120 seconds
  • Data Loss: Zero (synchronous replication)

Scenario 2: Primary Instance Crash:

  • T+0s: Database process crashes on primary
  • T+2s: RDS detects failure
  • T+5s: RDS initiates failover
  • T+60s: Applications reconnect
  • Downtime: 60 seconds
  • Data Loss: Zero

Scenario 3: Storage Failure:

  • T+0s: EBS volume fails on primary
  • T+3s: RDS detects failure
  • T+5s: RDS initiates failover
  • T+60s: Applications operational on standby
  • Downtime: 60 seconds
  • Data Loss: Zero

Scenario 4: Planned Maintenance:

  • You need to upgrade database version
  • RDS performs maintenance on standby first
  • RDS fails over to upgraded standby (60 seconds downtime)
  • RDS upgrades old primary (now standby)
  • Downtime: 60 seconds (vs hours for single-AZ)

What You Get:

  • High Availability: 99.95% uptime SLA
  • Zero Data Loss: Synchronous replication (RPO = 0)
  • Fast Recovery: 60-120 second failover (RTO = 1-2 minutes)
  • Automatic: No manual intervention required
  • Transparent: Same endpoint before and after failover

Cost:

  • Multi-AZ doubles database cost (2 instances)
  • db.r5.large: $0.24/hour Ɨ 2 = $0.48/hour = $350/month
  • Worth it for production workloads requiring high availability

Important Notes:

  • Standby is not accessible for reads (use read replicas for read scaling)
  • Failover is automatic, but applications must handle reconnection
  • Use connection pooling with retry logic for seamless failover
  • Multi-AZ is within a single Region (use cross-region read replicas for DR)

Elastic Load Balancing

What it is: Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple targets (EC2 instances, containers, IP addresses, Lambda functions) in multiple Availability Zones.

Why it exists: Without a load balancer, you'd need to manually distribute traffic across instances, handle instance failures, and manage scaling. Load balancers automate this, providing high availability, fault tolerance, and automatic scaling.

Real-world analogy: A load balancer is like a restaurant host who seats customers. Instead of customers choosing their own table (which could overload some servers while others are idle), the host distributes customers evenly across all servers. If a server is busy or unavailable, the host sends customers to other servers. If the restaurant gets crowded, the host calls in more servers.

Load Balancer Types:

Application Load Balancer (ALB) - Layer 7 (HTTP/HTTPS):

  • Routes based on content (URL path, hostname, headers, query parameters)
  • Supports WebSocket and HTTP/2
  • Integrates with AWS WAF for application security
  • Best for web applications and microservices

Network Load Balancer (NLB) - Layer 4 (TCP/UDP):

  • Ultra-high performance (millions of requests per second)
  • Static IP addresses (Elastic IPs)
  • Preserves source IP address
  • Best for TCP/UDP traffic, extreme performance requirements

Gateway Load Balancer (GWLB) - Layer 3 (IP):

  • Deploys, scales, and manages third-party virtual appliances
  • Transparent network gateway + load balancer
  • Best for firewalls, intrusion detection, deep packet inspection

How ALB Works (Detailed step-by-step):

  1. Create Load Balancer:

    • Choose subnets in multiple AZs (minimum 2)
    • ALB creates load balancer nodes in each subnet
    • Each node has its own IP address
    • DNS name resolves to all node IPs (round-robin)
  2. Configure Target Groups:

    • Target group is a logical grouping of targets (EC2 instances, IPs, Lambda functions)
    • Define health check: protocol, path, interval, timeout, thresholds
    • Example: HTTP GET /health every 30 seconds, timeout 5 seconds, 2 consecutive successes = healthy
  3. Register Targets:

    • Add EC2 instances to target group
    • ALB starts sending health checks to each target
    • Targets must pass health checks before receiving traffic
  4. Configure Listeners:

    • Listener checks for connection requests on specified protocol and port
    • Example: HTTPS listener on port 443
    • Listener rules route requests to target groups based on conditions
  5. Traffic Flow:

    • Client sends request to ALB DNS name
    • DNS resolves to ALB node IPs (multiple IPs for redundancy)
    • Client connects to ALB node
    • ALB terminates TLS connection (if HTTPS)
    • ALB selects healthy target using routing algorithm (round-robin, least outstanding requests)
    • ALB forwards request to target
    • Target processes request and returns response
    • ALB forwards response to client
  6. Health Checks:

    • ALB continuously sends health checks to all targets
    • If target fails health check (returns non-200 status, times out), ALB marks it unhealthy
    • ALB stops sending traffic to unhealthy targets
    • When target passes health checks again, ALB resumes sending traffic
  7. Auto Scaling Integration:

    • Auto Scaling group launches/terminates instances based on load
    • New instances automatically registered with target group
    • ALB starts health checking new instances
    • Once healthy, ALB sends traffic to new instances
    • Terminated instances automatically deregistered

Detailed Example 2: High-Availability Web Application with ALB

Scenario: You're deploying a web application that must handle 10,000 requests per second with 99.99% availability. The application runs on EC2 instances and must survive AZ failures.

Architecture:

  • ALB: In 3 AZs (us-east-1a, us-east-1b, us-east-1c)
  • Auto Scaling Group: Launches EC2 instances across 3 AZs
  • Target Group: Contains all EC2 instances
  • Minimum Instances: 6 (2 per AZ)
  • Maximum Instances: 30 (10 per AZ)

Step-by-Step Flow:

  1. Initial Deployment:

    • Auto Scaling launches 6 t3.medium instances (2 per AZ)
    • Instances install application, start web server
    • ALB health checks instances (GET /health)
    • After 2 successful health checks (60 seconds), instances marked healthy
    • ALB starts sending traffic
  2. Normal Traffic (1,000 req/sec):

    • Clients send requests to ALB DNS: myapp-123456.us-east-1.elb.amazonaws.com
    • DNS returns 3 IP addresses (one per AZ)
    • Clients connect to ALB nodes
    • ALB distributes traffic evenly: ~167 req/sec per instance
    • All instances healthy, handling load comfortably
  3. Traffic Spike (10,000 req/sec):

    • Traffic increases 10x
    • CloudWatch alarm triggers: CPU > 70%
    • Auto Scaling adds 12 instances (4 per AZ)
    • New instances launch, install application (5 minutes)
    • ALB health checks new instances
    • Once healthy, ALB includes in rotation
    • Traffic distributed across 18 instances: ~556 req/sec per instance
    • CPU drops to 50%, system stable
  4. AZ Failure (us-east-1a):

    • Power outage in us-east-1a
    • 6 instances in us-east-1a become unreachable
    • ALB health checks fail for us-east-1a instances
    • After 2 failed health checks (60 seconds), ALB marks them unhealthy
    • ALB stops sending traffic to us-east-1a
    • ALB redistributes traffic to us-east-1b and us-east-1c (12 instances)
    • Traffic per instance: ~833 req/sec
    • CPU increases to 65%, still acceptable
    • Auto Scaling detects high CPU, adds 6 more instances in us-east-1b and us-east-1c
    • System returns to normal load distribution
  5. AZ Recovery:

    • Power restored in us-east-1a
    • Instances in us-east-1a restart
    • ALB health checks pass
    • ALB resumes sending traffic to us-east-1a
    • Traffic redistributes across all 3 AZs

Failure Scenarios:

Scenario 1: Single Instance Failure:

  • Instance crashes (application bug, out of memory)
  • ALB health check fails
  • After 60 seconds, ALB marks instance unhealthy
  • ALB stops sending traffic to failed instance
  • Traffic redistributed to healthy instances
  • Auto Scaling detects failed instance, terminates it
  • Auto Scaling launches replacement instance
  • Impact: None (other instances handle traffic)
  • Recovery: 5 minutes (new instance launch time)

Scenario 2: Entire AZ Failure:

  • AZ-A fails (power, network, AWS issue)
  • All instances in AZ-A unreachable
  • ALB marks all AZ-A instances unhealthy
  • ALB sends traffic only to AZ-B and AZ-C
  • Impact: Minimal (60 seconds to detect, traffic redistributed)
  • Capacity: Reduced by 33%, but Auto Scaling adds instances
  • Recovery: Automatic when AZ recovers

Scenario 3: ALB Node Failure:

  • ALB node in AZ-A fails (extremely rare)
  • Clients connecting to that node experience errors
  • Clients retry, connect to ALB nodes in AZ-B or AZ-C
  • Impact: Minimal (clients retry automatically)
  • Recovery: Immediate (other ALB nodes available)

Scenario 4: Deployment Gone Wrong:

  • You deploy new application version
  • New version has bug, returns 500 errors
  • ALB health checks fail for new instances
  • ALB keeps sending traffic to old instances (still healthy)
  • You rollback deployment
  • Impact: None (ALB prevented bad deployment from affecting users)

ALB Features for High Availability:

Cross-Zone Load Balancing (enabled by default):

  • Distributes traffic evenly across all targets in all AZs
  • Without it: Traffic distributed evenly to AZs, then to targets within AZ
  • With it: Traffic distributed evenly to all targets regardless of AZ
  • Example: 2 instances in AZ-A, 4 instances in AZ-B
    • Without cross-zone: AZ-A instances get 25% each, AZ-B instances get 12.5% each
    • With cross-zone: All instances get 16.67% each

Connection Draining (deregistration delay):

  • When instance is deregistered (terminating, unhealthy), ALB stops sending new requests
  • ALB waits for in-flight requests to complete (default 300 seconds)
  • Prevents abrupt connection termination
  • Ensures graceful shutdown

Sticky Sessions (session affinity):

  • Routes requests from same client to same target
  • Uses cookie to track client-target mapping
  • Useful for applications that store session state locally
  • Duration: 1 second to 7 days

Slow Start Mode:

  • Gradually increases traffic to newly registered targets
  • Gives targets time to warm up (load caches, establish connections)
  • Duration: 30 to 900 seconds
  • Prevents overwhelming new instances

What You Get:

  • High Availability: 99.99% SLA (ALB itself is highly available)
  • Fault Tolerance: Survives instance and AZ failures
  • Automatic Scaling: Integrates with Auto Scaling
  • Health Checks: Automatic detection and removal of unhealthy targets
  • SSL Termination: Offloads TLS processing from instances
  • Content-Based Routing: Route based on URL, headers, etc.

Cost:

  • ALB: $0.0225/hour = $16.43/month
  • LCU (Load Balancer Capacity Unit): $0.008 per LCU-hour
  • LCU measures: new connections, active connections, processed bytes, rule evaluations
  • Typical cost: $50-200/month depending on traffic

Auto Scaling

What it is: Amazon EC2 Auto Scaling automatically adjusts the number of EC2 instances in response to changing demand. It ensures you have the right number of instances to handle your application load while minimizing costs.

Why it exists: Manual scaling is slow, error-prone, and inefficient. You either over-provision (waste money on idle instances) or under-provision (poor performance during spikes). Auto Scaling automates this, scaling out during high demand and scaling in during low demand.

Real-world analogy: Auto Scaling is like a restaurant manager who adjusts staffing based on customer volume. During lunch rush, the manager calls in more servers. During slow periods, the manager sends servers home. The manager monitors wait times (performance metrics) and adjusts staffing to maintain service quality while controlling labor costs.

How Auto Scaling Works (Detailed step-by-step):

  1. Create Launch Template:

    • Defines instance configuration: AMI, instance type, security groups, user data
    • Like a blueprint for launching instances
    • Can have multiple versions for easy updates
  2. Create Auto Scaling Group (ASG):

    • Specify launch template
    • Choose VPC subnets (multiple AZs for high availability)
    • Set capacity:
      • Minimum: Minimum number of instances (always running)
      • Desired: Target number of instances
      • Maximum: Maximum number of instances (cost control)
    • Example: Min=2, Desired=4, Max=10
  3. Configure Health Checks:

    • EC2 Health Check: Instance running and reachable
    • ELB Health Check: Instance passing load balancer health checks
    • Unhealthy instances automatically replaced
  4. Create Scaling Policies:

    • Target Tracking: Maintain metric at target value (e.g., CPU at 50%)
    • Step Scaling: Add/remove instances based on CloudWatch alarms
    • Scheduled Scaling: Scale at specific times (e.g., scale up at 9 AM)
    • Predictive Scaling: Use ML to predict future load and scale proactively
  5. Auto Scaling Monitors:

    • CloudWatch collects metrics (CPU, network, custom metrics)
    • Auto Scaling evaluates scaling policies every 60 seconds
    • When policy conditions met, Auto Scaling adjusts capacity
  6. Scale Out (add instances):

    • Policy triggers (e.g., CPU > 70%)
    • Auto Scaling launches new instances from launch template
    • Instances distributed across AZs for balance
    • Instances register with load balancer
    • After health checks pass, instances receive traffic
    • Launch time: 3-5 minutes
  7. Scale In (remove instances):

    • Policy triggers (e.g., CPU < 30%)
    • Auto Scaling selects instances to terminate (oldest launch template, closest to billing hour)
    • Auto Scaling deregisters instances from load balancer
    • Load balancer drains connections (waits for in-flight requests)
    • Auto Scaling terminates instances
    • Termination time: 5-10 minutes (connection draining)

Detailed Example 3: Auto Scaling for Variable Workload

Scenario: You're running a news website. Traffic patterns:

  • Overnight (12 AM - 6 AM): 100 req/sec (low)
  • Morning (6 AM - 12 PM): 1,000 req/sec (medium)
  • Afternoon (12 PM - 6 PM): 5,000 req/sec (high)
  • Evening (6 PM - 12 AM): 2,000 req/sec (medium)
  • Breaking News: Unpredictable spikes to 20,000 req/sec

Requirements:

  • Handle all traffic without performance degradation
  • Minimize cost (don't over-provision)
  • Survive AZ failures

Solution: Auto Scaling with multiple policies.

Configuration:

  • Launch Template: t3.medium instances, application AMI
  • Auto Scaling Group:
    • Min: 4 (2 per AZ, for high availability)
    • Desired: 8 (initial capacity)
    • Max: 40 (cost control)
    • Subnets: us-east-1a, us-east-1b
  • Scaling Policies:
    1. Target Tracking: Maintain average CPU at 50%
    2. Scheduled Scaling: Scale up at 11:30 AM (before lunch rush)
    3. Scheduled Scaling: Scale down at 6:30 PM (after afternoon peak)

Daily Scaling Pattern:

12 AM - 6 AM (Overnight):

  • Traffic: 100 req/sec
  • Instances: 4 (minimum)
  • CPU: 20% (low utilization)
  • Cost: 4 Ɨ $0.0416/hour = $0.17/hour

6 AM - 12 PM (Morning):

  • Traffic increases to 1,000 req/sec
  • CPU increases to 60%
  • Target tracking policy triggers (CPU > 50%)
  • Auto Scaling adds 4 instances (total: 8)
  • CPU drops to 40%
  • Cost: 8 Ɨ $0.0416/hour = $0.33/hour

11:30 AM (Scheduled Scale-Up):

  • Scheduled policy adds 8 instances (total: 16)
  • Proactive scaling before lunch rush
  • Instances ready when traffic increases at 12 PM

12 PM - 6 PM (Afternoon Peak):

  • Traffic increases to 5,000 req/sec
  • 16 instances handle load comfortably
  • CPU: 55%
  • If traffic exceeds expectations, target tracking adds more instances
  • Cost: 16 Ɨ $0.0416/hour = $0.67/hour

6:30 PM (Scheduled Scale-Down):

  • Scheduled policy reduces to 12 instances
  • Traffic decreasing, don't need full capacity

6 PM - 12 AM (Evening):

  • Traffic: 2,000 req/sec
  • Instances: 12
  • CPU: 45%
  • Cost: 12 Ɨ $0.0416/hour = $0.50/hour

Breaking News Spike:

  • Traffic suddenly spikes to 20,000 req/sec
  • CPU jumps to 90%
  • Target tracking policy triggers aggressively
  • Auto Scaling adds instances rapidly (every 60 seconds)
  • Scales to 40 instances (maximum) in 5 minutes
  • CPU drops to 50%
  • After spike ends, Auto Scaling gradually scales in

Cost Savings:

  • Without Auto Scaling: Need 40 instances 24/7 to handle peak
    • Cost: 40 Ɨ $0.0416 Ɨ 24 = $39.94/day = $1,198/month
  • With Auto Scaling: Average 10 instances
    • Cost: 10 Ɨ $0.0416 Ɨ 24 = $9.98/day = $299/month
  • Savings: $899/month (75% reduction)

High Availability Benefits:

  • Minimum 4 instances (2 per AZ) ensures service during AZ failure
  • Auto Scaling automatically replaces failed instances
  • Distributes instances evenly across AZs
  • Integrates with load balancer for seamless failover

Disaster Recovery Strategies

What it is: Disaster Recovery (DR) is the process of preparing for and recovering from events that negatively affect business operations. DR strategies define how quickly you can recover (RTO) and how much data you can afford to lose (RPO).

Why it exists: Disasters happen - natural disasters, cyber attacks, human errors, hardware failures. Without a DR plan, these events can cause permanent data loss, extended downtime, and business failure. DR strategies provide a roadmap for recovery.

Real-world analogy: DR is like having insurance and emergency plans for your house. You have smoke detectors (monitoring), fire extinguishers (immediate response), insurance (financial protection), and a plan for where your family will stay if the house burns down (recovery strategy). The level of preparation depends on risk tolerance and budget.

Key Metrics:

Recovery Time Objective (RTO):

  • How long can your business survive without the system?
  • Time from disaster to full recovery
  • Example: RTO = 4 hours means system must be operational within 4 hours

Recovery Point Objective (RPO):

  • How much data can your business afford to lose?
  • Time between last backup and disaster
  • Example: RPO = 1 hour means you can lose up to 1 hour of data

DR Strategies (from least to most expensive):

1. Backup and Restore (Lowest Cost, Highest RTO/RPO)

What it is: Regularly back up data to AWS (S3, Glacier). When disaster occurs, provision infrastructure and restore data from backups.

RTO: Hours to days (time to provision infrastructure + restore data)
RPO: Hours (time since last backup)
Cost: Very low (only pay for backup storage)

How it works:

  1. Normal Operation: Application runs on-premises or in primary AWS region
  2. Backup: Daily/hourly backups to S3 using AWS Backup, snapshots, or custom scripts
  3. Disaster: Primary site fails
  4. Recovery:
    • Provision infrastructure (EC2, RDS, etc.) using CloudFormation
    • Restore data from S3/Glacier
    • Update DNS to point to new infrastructure
    • Resume operations

Example:

  • Primary: On-premises data center
  • Backup: Daily database backups to S3, weekly full backups to Glacier
  • Disaster: Data center floods
  • Recovery:
    • Day 1: Provision EC2 instances and RDS in AWS (4 hours)
    • Day 1: Restore database from last night's backup (2 hours)
    • Day 1: Update DNS, test application (2 hours)
    • Total RTO: 8 hours
    • RPO: 24 hours (lost 1 day of data)

When to use:

  • āœ… Non-critical applications (can tolerate hours of downtime)
  • āœ… Budget-constrained (minimal ongoing cost)
  • āœ… Infrequent data changes (low RPO acceptable)
  • āœ… Compliance requires backups but not high availability

Cost: $50-500/month (backup storage only)

2. Pilot Light (Low Cost, Medium RTO/RPO)

What it is: Maintain minimal infrastructure in DR site (database replication only). When disaster occurs, quickly scale up remaining infrastructure.

RTO: Minutes to hours (infrastructure already exists, just needs scaling)
RPO: Minutes (continuous database replication)
Cost: Low (only critical components running)

How it works:

  1. Normal Operation: Full application in primary region
  2. Pilot Light: Database continuously replicates to DR region (RDS read replica, DynamoDB global tables)
  3. Disaster: Primary region fails
  4. Recovery:
    • Promote read replica to primary (minutes)
    • Launch application servers from AMIs (minutes)
    • Update DNS to point to DR region
    • Resume operations

Example:

  • Primary: us-east-1 (full application: ALB, EC2, RDS)
  • Pilot Light: us-west-2 (RDS read replica only)
  • Disaster: us-east-1 region failure
  • Recovery:
    • Minute 1: Promote us-west-2 read replica to primary
    • Minute 5: Launch EC2 instances from AMIs (Auto Scaling)
    • Minute 10: Create ALB, register instances
    • Minute 15: Update Route 53 to point to us-west-2 ALB
    • Total RTO: 15 minutes
    • RPO: 5 minutes (replication lag)

When to use:

  • āœ… Business-critical applications (need quick recovery)
  • āœ… Moderate budget (can afford database replication)
  • āœ… Data changes frequently (need low RPO)
  • āœ… Can tolerate brief downtime (minutes to hours)

Cost: $200-1,000/month (database replication + minimal infrastructure)

3. Warm Standby (Medium Cost, Low RTO/RPO)

What it is: Maintain scaled-down but fully functional version of production environment in DR site. When disaster occurs, scale up to production capacity.

RTO: Minutes (infrastructure running, just needs scaling)
RPO: Seconds to minutes (continuous replication)
Cost: Medium (running infrastructure at reduced capacity)

How it works:

  1. Normal Operation: Full production in primary region
  2. Warm Standby: Scaled-down version in DR region (e.g., 25% capacity)
    • Database replicating continuously
    • Application servers running (fewer instances)
    • Load balancer configured
  3. Disaster: Primary region fails
  4. Recovery:
    • Promote database to primary
    • Scale up application servers to 100% capacity
    • Update DNS to point to DR region
    • Resume operations

Example:

  • Primary: us-east-1 (20 EC2 instances, RDS Multi-AZ)
  • Warm Standby: us-west-2 (5 EC2 instances, RDS read replica)
  • Disaster: us-east-1 region failure
  • Recovery:
    • Minute 1: Promote us-west-2 read replica to primary
    • Minute 2: Auto Scaling increases from 5 to 20 instances
    • Minute 5: All instances healthy and receiving traffic
    • Minute 6: Update Route 53 to point to us-west-2
    • Total RTO: 6 minutes
    • RPO: 30 seconds (replication lag)

When to use:

  • āœ… Mission-critical applications (need fast recovery)
  • āœ… Can afford higher DR costs
  • āœ… Need to test DR regularly (environment always running)
  • āœ… Minimal data loss acceptable (seconds to minutes)

Cost: $1,000-5,000/month (25-50% of production cost)

4. Multi-Site Active-Active (Highest Cost, Lowest RTO/RPO)

What it is: Run full production capacity in multiple regions simultaneously. Traffic distributed across all regions. When disaster occurs, remaining regions absorb traffic.

RTO: Zero to seconds (no recovery needed, automatic failover)
RPO: Zero to seconds (synchronous or near-synchronous replication)
Cost: High (2x+ production cost)

How it works:

  1. Normal Operation: Full production in multiple regions
    • Route 53 distributes traffic (latency-based, geolocation, weighted)
    • Database replicates across regions (DynamoDB global tables, Aurora global database)
    • All regions actively serving traffic
  2. Disaster: One region fails
  3. Recovery:
    • Route 53 health checks detect failure
    • Route 53 automatically stops sending traffic to failed region
    • Remaining regions absorb traffic (may need to scale up)
    • Total RTO: 30-60 seconds (health check detection)
    • RPO: 0-1 second (near-synchronous replication)

Example:

  • Primary: us-east-1 (20 EC2 instances, DynamoDB global table)
  • Secondary: eu-west-1 (20 EC2 instances, DynamoDB global table)
  • Tertiary: ap-southeast-1 (20 EC2 instances, DynamoDB global table)
  • Normal: Each region handles 33% of global traffic
  • Disaster: us-east-1 region failure
  • Recovery:
    • Second 1: Route 53 health checks fail for us-east-1
    • Second 30: Route 53 stops sending traffic to us-east-1
    • Second 31: eu-west-1 and ap-southeast-1 each handle 50% of traffic
    • Minute 5: Auto Scaling adds instances in eu-west-1 and ap-southeast-1
    • Total RTO: 30 seconds
    • RPO: 1 second (DynamoDB global table replication)

When to use:

  • āœ… Zero-downtime requirement (financial trading, healthcare)
  • āœ… Global user base (low latency worldwide)
  • āœ… Can afford 2x+ infrastructure cost
  • āœ… Zero data loss requirement

Cost: $10,000-50,000+/month (2-3x production cost)

DR Strategy Comparison:

Strategy RTO RPO Cost Use Case
Backup & Restore Hours-Days Hours $ Non-critical, budget-constrained
Pilot Light Minutes-Hours Minutes $$ Business-critical, moderate budget
Warm Standby Minutes Seconds $$$ Mission-critical, need fast recovery
Active-Active Seconds Seconds $$$$ Zero-downtime, global applications

Amazon Route 53 for High Availability

What it is: Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service. Route 53 connects user requests to infrastructure running in AWS or on-premises.

Why it exists: DNS is critical infrastructure - if DNS fails, users can't reach your application even if it's running perfectly. Route 53 provides 100% availability SLA and advanced routing policies for high availability and disaster recovery.

Real-world analogy: Route 53 is like a GPS navigation system. When you want to go somewhere (access a website), GPS (Route 53) tells you the best route based on current conditions (traffic, road closures). If your usual route is blocked (server down), GPS automatically reroutes you to an alternate path (healthy server).

Route 53 Routing Policies:

1. Simple Routing:

  • Returns single resource (one IP address)
  • No health checks
  • Use case: Single server, no failover needed

2. Weighted Routing:

  • Distributes traffic across multiple resources based on weights
  • Example: 70% to us-east-1, 30% to us-west-2
  • Use case: A/B testing, gradual migration, traffic distribution

3. Latency-Based Routing:

  • Routes to resource with lowest latency for user
  • Route 53 measures latency from user's location to each region
  • Use case: Global applications, optimize user experience

4. Failover Routing:

  • Routes to primary resource, fails over to secondary if primary unhealthy
  • Requires health checks
  • Use case: Active-passive DR, simple failover

5. Geolocation Routing:

  • Routes based on user's geographic location
  • Example: EU users → eu-west-1, US users → us-east-1
  • Use case: Content localization, data residency compliance

6. Geoproximity Routing:

  • Routes based on geographic location with bias
  • Can shift traffic toward or away from resources
  • Use case: Gradual traffic migration, load balancing with geographic preference

7. Multi-Value Answer Routing:

  • Returns multiple IP addresses (up to 8)
  • Client chooses which to use
  • Health checks ensure only healthy IPs returned
  • Use case: Simple load balancing, multiple healthy resources

Health Checks:

Route 53 health checks monitor endpoint health and automatically route traffic away from unhealthy endpoints.

Health Check Types:

  1. Endpoint Health Check: Monitors specific IP or domain

    • Protocol: HTTP, HTTPS, TCP
    • Interval: 30 seconds (standard) or 10 seconds (fast)
    • Failure threshold: 3 consecutive failures = unhealthy
    • Success threshold: 3 consecutive successes = healthy
  2. Calculated Health Check: Combines multiple health checks with AND, OR, NOT logic

    • Example: Healthy if (us-east-1 healthy) OR (us-west-2 healthy)
  3. CloudWatch Alarm Health Check: Based on CloudWatch alarm state

    • Example: Healthy if ALB target count > 0

Detailed Example 4: Multi-Region Failover with Route 53

Scenario: You're running a global e-commerce platform. Requirements:

  • Serve users from nearest region (low latency)
  • Automatically failover if region fails
  • Zero manual intervention

Architecture:

  • Primary: us-east-1 (ALB, EC2, RDS)
  • Secondary: eu-west-1 (ALB, EC2, RDS read replica)
  • Route 53: Latency-based routing with health checks

Configuration:

  1. Create Health Checks:

  2. Create Route 53 Records:

    • Record 1: www.example.com → us-east-1 ALB
      • Type: A record (Alias to ALB)
      • Routing: Latency-based (us-east-1)
      • Health Check: Health Check 1
      • Evaluate Target Health: Yes
    • Record 2: www.example.com → eu-west-1 ALB
      • Type: A record (Alias to ALB)
      • Routing: Latency-based (eu-west-1)
      • Health Check: Health Check 2
      • Evaluate Target Health: Yes

Normal Operation:

  • User in New York queries www.example.com

  • Route 53 measures latency: us-east-1 (20ms), eu-west-1 (100ms)

  • Route 53 returns us-east-1 ALB IP (lowest latency)

  • User connects to us-east-1

  • User in London queries www.example.com

  • Route 53 measures latency: us-east-1 (80ms), eu-west-1 (15ms)

  • Route 53 returns eu-west-1 ALB IP (lowest latency)

  • User connects to eu-west-1

Disaster Scenario - us-east-1 Fails:

  • T+0s: us-east-1 region failure (all instances down)
  • T+30s: Route 53 health check fails (first failure)
  • T+60s: Route 53 health check fails (second failure)
  • T+90s: Route 53 health check fails (third failure)
  • T+90s: Route 53 marks us-east-1 unhealthy
  • T+90s: New York user queries www.example.com
  • T+90s: Route 53 skips unhealthy us-east-1, returns eu-west-1 IP
  • T+90s: User connects to eu-west-1 (higher latency but working)
  • RTO: 90 seconds (health check detection time)

Recovery:

  • us-east-1 recovers
  • Route 53 health checks pass (3 consecutive successes)
  • After 90 seconds, Route 53 marks us-east-1 healthy
  • New York users automatically routed back to us-east-1

Benefits:

  • Automatic Failover: No manual DNS updates
  • Low Latency: Users routed to nearest region
  • Fast Detection: 90 seconds to detect and failover
  • Transparent: Users don't notice failover (just slightly higher latency)

Cost:

  • Hosted Zone: $0.50/month
  • Queries: $0.40 per million queries
  • Health Checks: $0.50/month per health check
  • Total: ~$2-10/month depending on traffic

Chapter Summary

What We Covered

This chapter covered the "Design Resilient Architectures" domain, which represents 26% of the SAA-C03 exam. We explored two major areas:

āœ… Section 1: Scalable and Loosely Coupled Architectures

  • Loose coupling principles and benefits
  • Amazon SQS for message queuing (standard vs FIFO)
  • Amazon SNS for pub/sub messaging
  • SNS + SQS fan-out pattern
  • Amazon EventBridge for event-driven architectures
  • AWS Lambda for serverless event processing
  • Microservices and event-driven design patterns

āœ… Section 2: High Availability and Fault Tolerance

  • Availability Zones and Regions
  • Multi-AZ deployments (RDS, ELB, Auto Scaling)
  • Elastic Load Balancing (ALB, NLB, GWLB)
  • Auto Scaling strategies and policies
  • Disaster recovery strategies (Backup & Restore, Pilot Light, Warm Standby, Active-Active)
  • RTO and RPO concepts
  • Amazon Route 53 routing policies and health checks

Critical Takeaways

  1. Loose Coupling: Decouple components using queues (SQS), pub/sub (SNS), and event buses (EventBridge). This enables independent scaling, fault isolation, and easier maintenance.

  2. Multi-AZ for High Availability: Always deploy across multiple Availability Zones. Use RDS Multi-AZ for databases, ALB across multiple AZs, and Auto Scaling with minimum 2 instances per AZ.

  3. SQS vs SNS: Use SQS for point-to-point messaging (producer → queue → consumer). Use SNS for fan-out (publisher → topic → multiple subscribers). Combine them for powerful patterns.

  4. Auto Scaling: Use target tracking policies for dynamic scaling, scheduled policies for predictable patterns, and set appropriate min/max/desired capacity for cost control and availability.

  5. DR Strategy Selection: Choose based on RTO/RPO requirements and budget. Backup & Restore (cheapest, slowest), Pilot Light (moderate), Warm Standby (faster), Active-Active (fastest, most expensive).

  6. Health Checks: Always configure health checks for load balancers, Auto Scaling, and Route 53. Health checks enable automatic detection and recovery from failures.

  7. Route 53 Routing: Use latency-based routing for global applications, failover routing for DR, weighted routing for A/B testing, and geolocation for compliance.

  8. Lambda for Events: Use Lambda for event-driven processing (S3 uploads, SQS messages, EventBridge events). Lambda scales automatically and you only pay for execution time.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand the difference between loose coupling and tight coupling
  • I can explain when to use SQS vs SNS
  • I know how SQS visibility timeout works
  • I understand the SNS + SQS fan-out pattern
  • I can describe how EventBridge routes events
  • I know when to use Lambda vs EC2
  • I understand Multi-AZ deployments for RDS
  • I can explain how ALB health checks work
  • I know how Auto Scaling policies work (target tracking, step, scheduled)
  • I understand the 4 DR strategies and when to use each
  • I can calculate RTO and RPO for different scenarios
  • I know Route 53 routing policies and their use cases
  • I understand how Route 53 health checks enable failover

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-25 (Scalable architectures)
  • Domain 2 Bundle 2: Questions 26-50 (High availability)
  • Full Practice Test 1: Questions 21-37 (Domain 2 questions)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • SQS vs SNS use cases
    • Multi-AZ deployment patterns
    • Auto Scaling policies
    • DR strategy selection (RTO/RPO)
    • Route 53 routing policies

Quick Reference Card

Messaging Services:

  • SQS Standard: Unlimited throughput, best-effort ordering, at-least-once delivery
  • SQS FIFO: 300 TPS (3,000 with batching), strict ordering, exactly-once
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Advanced event routing, schema registry, SaaS integration

High Availability:

  • Multi-AZ: Deploy across multiple AZs for fault tolerance
  • RDS Multi-AZ: Synchronous replication, automatic failover (60-120 seconds)
  • ALB: Distributes traffic across AZs, health checks, auto scaling integration
  • Auto Scaling: Dynamic scaling based on metrics, scheduled scaling for predictable patterns

DR Strategies:

  • Backup & Restore: RTO hours-days, RPO hours, lowest cost
  • Pilot Light: RTO minutes-hours, RPO minutes, low cost
  • Warm Standby: RTO minutes, RPO seconds, medium cost
  • Active-Active: RTO seconds, RPO seconds, highest cost

Route 53 Routing:

  • Simple: Single resource, no health checks
  • Weighted: Traffic distribution (A/B testing)
  • Latency: Route to lowest latency region
  • Failover: Active-passive DR
  • Geolocation: Route based on user location

Decision Points:

  • Decouple components → Use SQS (queue) or SNS (pub/sub)
  • Fan-out to multiple services → Use SNS + SQS
  • Event-driven processing → Use EventBridge + Lambda
  • High availability database → Use RDS Multi-AZ
  • Distribute traffic → Use ALB with health checks
  • Scale automatically → Use Auto Scaling with target tracking
  • DR with fast recovery → Use Warm Standby or Active-Active
  • Global application → Use Route 53 latency-based routing

Next Chapter: 04_domain3_high_performing_architectures - Design High-Performing Architectures (24% of exam)


Section 3: Advanced Messaging Patterns

SQS Message Flow Patterns

SQS Standard Queue Flow

šŸ“Š SQS Standard Message Flow Diagram:

sequenceDiagram
    participant P as Producer
    participant SQS as SQS Queue
    participant C1 as Consumer 1
    participant C2 as Consumer 2

    P->>SQS: Send Message 1
    P->>SQS: Send Message 2
    P->>SQS: Send Message 3
    
    Note over SQS: Messages stored<br/>redundantly across AZs
    
    C1->>SQS: Poll for messages
    SQS-->>C1: Return Message 1
    Note over C1: Processing...<br/>(Visibility timeout: 30s)
    
    C2->>SQS: Poll for messages
    SQS-->>C2: Return Message 2
    
    C1->>SQS: Delete Message 1
    Note over SQS: Message 1 removed
    
    C2->>SQS: Delete Message 2
    Note over SQS: Message 2 removed

See: diagrams/03_domain2_sqs_standard_flow.mmd

Diagram Explanation (Detailed):
This sequence diagram illustrates how SQS Standard queues handle message processing with multiple consumers. The Producer sends three messages to the SQS queue, which stores them redundantly across multiple Availability Zones for durability (99.999999999% durability). When Consumer 1 polls the queue, it receives Message 1, which immediately becomes invisible to other consumers for the visibility timeout period (default 30 seconds). This prevents duplicate processing. Meanwhile, Consumer 2 can poll and receive Message 2 simultaneously, enabling parallel processing. The visibility timeout gives each consumer time to process and delete the message. If a consumer fails to delete the message within the timeout, it becomes visible again for retry. After successful processing, consumers explicitly delete messages from the queue. This pattern enables horizontal scaling - you can add more consumers to process messages faster. The at-least-once delivery guarantee means messages might be delivered multiple times, so your processing logic should be idempotent. Standard queues provide unlimited throughput (thousands of messages per second) and best-effort ordering, making them ideal for high-throughput scenarios where strict ordering isn't required.

Detailed Example 1: E-commerce Order Processing
An e-commerce platform receives 10,000 orders per minute during Black Friday sales. Each order needs to be validated, charged, and fulfilled. The system uses an SQS Standard queue to decouple order submission from processing. When a customer places an order, the web application sends a message to the SQS queue containing order details (order ID, customer ID, items, total). The message is immediately acknowledged, and the customer sees "Order received" within 100ms. Behind the scenes, 50 EC2 instances running order processing workers continuously poll the queue using long polling (20-second wait time to reduce empty responses). Each worker receives a batch of up to 10 messages, processes them in parallel, and deletes successfully processed messages. If a worker crashes while processing, the visibility timeout (set to 5 minutes) ensures the message becomes visible again for another worker to retry. The system handles the traffic spike without losing orders, and customers don't experience delays because order submission is decoupled from processing.

Detailed Example 2: Image Processing Pipeline
A photo-sharing application allows users to upload images that need to be resized into multiple formats (thumbnail, medium, large). When a user uploads an image to S3, an S3 event notification sends a message to an SQS queue. The message contains the S3 bucket name and object key. A fleet of Lambda functions (configured with SQS as an event source) automatically polls the queue and processes images in parallel. Each Lambda function downloads the original image from S3, creates three resized versions using ImageMagick, uploads them back to S3, and deletes the message from the queue. If a Lambda function times out (15-minute limit), the message becomes visible again after the visibility timeout (10 minutes) and another Lambda function retries it. The system automatically scales based on queue depth - AWS Lambda can scale to 1,000 concurrent executions, processing 1,000 images simultaneously. This architecture handles traffic spikes without provisioning servers and only charges for actual processing time.

Detailed Example 3: Log Aggregation System
A distributed application running on 500 EC2 instances needs to centralize logs for analysis. Each instance sends log entries to an SQS queue (up to 256 KB per message). A log aggregation service with 10 consumer instances polls the queue, batches log entries, and writes them to S3 in compressed format every 5 minutes. The visibility timeout is set to 10 minutes to allow time for batching and S3 upload. If a consumer crashes, another consumer picks up the messages after the timeout. The system uses SQS's at-least-once delivery, so the log aggregation service deduplicates entries based on a unique log ID before writing to S3. This architecture handles 100,000 log entries per second without losing data, and the decoupled design allows the log aggregation service to be updated without affecting the application instances.

⭐ Must Know (Critical Facts):

  • Unlimited throughput: SQS Standard can handle thousands of messages per second per API action (SendMessage, ReceiveMessage, DeleteMessage)
  • At-least-once delivery: Messages are delivered at least once, but occasionally more than once (design for idempotency)
  • Best-effort ordering: Messages are generally delivered in the order sent, but not guaranteed (use FIFO for strict ordering)
  • Visibility timeout: Default 30 seconds, configurable 0 seconds to 12 hours (set based on processing time)
  • Message retention: Default 4 days, configurable 1 minute to 14 days (messages auto-delete after retention period)
  • Message size: Maximum 256 KB per message (use S3 for larger payloads with Extended Client Library)
  • Long polling: Reduces empty responses and costs by waiting up to 20 seconds for messages (recommended over short polling)
  • Dead Letter Queue: Automatically moves messages that fail processing after maxReceiveCount attempts (useful for debugging)

SQS FIFO Queue Flow

šŸ“Š SQS FIFO Message Flow Diagram:

sequenceDiagram
    participant P as Producer
    participant SQS as SQS FIFO Queue
    participant C as Consumer

    P->>SQS: Send Message 1 (Group A)
    P->>SQS: Send Message 2 (Group A)
    P->>SQS: Send Message 3 (Group B)
    P->>SQS: Send Message 4 (Group A)
    
    Note over SQS: Strict ordering<br/>within message groups
    
    C->>SQS: Poll for messages
    SQS-->>C: Message 1 (Group A)
    C->>SQS: Delete Message 1
    
    C->>SQS: Poll for messages
    SQS-->>C: Message 2 (Group A)
    Note over C: Must process in order<br/>within Group A
    C->>SQS: Delete Message 2
    
    C->>SQS: Poll for messages
    SQS-->>C: Message 3 (Group B)
    Note over SQS: Group B can be processed<br/>in parallel with Group A

See: diagrams/03_domain2_sqs_fifo_flow.mmd

Diagram Explanation (Detailed):
This sequence diagram demonstrates SQS FIFO (First-In-First-Out) queue behavior with message groups. The Producer sends four messages, with Messages 1, 2, and 4 belonging to Group A, and Message 3 belonging to Group B. FIFO queues guarantee strict ordering within each message group - Messages 1, 2, and 4 will be delivered to consumers in exactly that order. The Consumer must process and delete Message 1 before receiving Message 2 from Group A. However, Message 3 from Group B can be processed in parallel because it's in a different message group. This allows for parallelism while maintaining ordering where it matters. Message groups are defined by the MessageGroupId attribute set by the producer. FIFO queues also provide exactly-once processing using MessageDeduplicationId - if the same message is sent twice within the 5-minute deduplication interval, SQS automatically discards the duplicate. This is critical for financial transactions or inventory updates where duplicate processing would cause errors. FIFO queues have a throughput limit of 300 messages per second (3,000 with batching), which is lower than Standard queues but sufficient for most ordered processing scenarios. The queue name must end with .fifo suffix.

Detailed Example 1: Stock Trading Order Processing
A stock trading platform receives buy and sell orders that must be processed in the exact order received to ensure fair pricing. Each user's orders are assigned a MessageGroupId based on their user ID. When User A places three orders (Buy 100 shares, Sell 50 shares, Buy 25 shares), they're sent to an SQS FIFO queue with MessageGroupId="UserA". The order processing system polls the queue and receives orders in exact sequence. It processes "Buy 100" first, updating the user's portfolio, then "Sell 50", then "Buy 25". Meanwhile, User B's orders (MessageGroupId="UserB") are processed in parallel by another consumer, maintaining ordering per user while allowing concurrent processing across users. The exactly-once delivery guarantee ensures that if the producer retries due to a network error, duplicate orders aren't created. The system uses MessageDeduplicationId based on a hash of order details (user ID + timestamp + order type + quantity). This architecture ensures regulatory compliance (orders must be processed in sequence) while maintaining high throughput (thousands of users trading simultaneously).

Detailed Example 2: Banking Transaction Processing
A banking system processes account transactions (deposits, withdrawals, transfers) that must be applied in order to maintain accurate balances. Each account's transactions use MessageGroupId based on account number. When Account 12345 has three transactions (Deposit $1000, Withdraw $500, Deposit $200), they're sent to an SQS FIFO queue. The transaction processor receives them in exact order, updating the account balance sequentially: $0 → $1000 → $500 → $700. If the processor crashes after the first transaction, the visibility timeout ensures the second transaction isn't processed until the first is confirmed deleted. The exactly-once processing prevents duplicate transactions - if a deposit message is sent twice due to a retry, SQS deduplicates it using MessageDeduplicationId (transaction ID). This prevents the dreaded "double deposit" bug. The system processes 10,000 accounts concurrently (each account is a message group), achieving 300 transactions per second per account while maintaining strict ordering and exactly-once semantics.

⭐ Must Know (Critical Facts):

  • Strict ordering: Messages within a message group are delivered in exact FIFO order (guaranteed)
  • Exactly-once processing: Deduplication prevents duplicate messages within 5-minute window (use MessageDeduplicationId)
  • Message groups: Enable parallel processing while maintaining order within groups (use MessageGroupId)
  • Throughput limit: 300 messages per second (3,000 with batching of 10 messages) per FIFO queue
  • Queue naming: Must end with .fifo suffix (e.g., orders.fifo)
  • Content-based deduplication: Can auto-generate deduplication ID from message body SHA-256 hash
  • High throughput mode: Increases limit to 3,000 messages per second (30,000 with batching) but requires message groups

SNS Fan-Out Pattern

šŸ“Š SNS Fan-Out Architecture Diagram:

graph TB
    P[Producer Application] -->|Publish| SNS[SNS Topic]
    
    SNS -->|Subscribe| SQS1[SQS Queue 1<br/>Order Processing]
    SNS -->|Subscribe| SQS2[SQS Queue 2<br/>Inventory Update]
    SNS -->|Subscribe| Lambda[Lambda Function<br/>Email Notification]
    SNS -->|Subscribe| HTTP[HTTP Endpoint<br/>External System]
    
    SQS1 --> C1[Consumer 1]
    SQS2 --> C2[Consumer 2]
    
    style SNS fill:#ff9800
    style SQS1 fill:#4caf50
    style SQS2 fill:#4caf50
    style Lambda fill:#9c27b0
    style HTTP fill:#2196f3

See: diagrams/03_domain2_sns_fanout.mmd

Diagram Explanation (Detailed):
This architecture diagram illustrates the SNS fan-out pattern, where a single message published to an SNS topic is automatically delivered to multiple subscribers simultaneously. The Producer Application publishes one message to the SNS Topic (e.g., "Order Placed" event). SNS immediately fans out this message to all four subscribers: SQS Queue 1 for order processing, SQS Queue 2 for inventory updates, a Lambda function for sending email notifications, and an HTTP endpoint for an external system. Each subscriber receives the same message independently and processes it according to its own logic. This pattern decouples the producer from consumers - the producer doesn't need to know how many systems need the data or how they process it. If a new system needs order data, you simply add another subscription without changing the producer. SNS provides at-least-once delivery to each subscriber with automatic retries (up to 100,015 retries over 23 days for HTTP endpoints). The fan-out pattern is ideal for event-driven architectures where multiple systems need to react to the same event. SNS supports up to 12.5 million subscriptions per topic and 100,000 topics per account, enabling massive scale. Message filtering allows subscribers to receive only relevant messages based on message attributes, reducing unnecessary processing.

Detailed Example 1: E-commerce Order Workflow
When a customer places an order on an e-commerce website, multiple backend systems need to be notified simultaneously. The order service publishes an "OrderPlaced" message to an SNS topic containing order details (order ID, customer ID, items, total, shipping address). SNS fans out to five subscribers: (1) SQS queue for payment processing - charges the customer's credit card, (2) SQS queue for inventory management - reserves items and updates stock levels, (3) SQS queue for shipping - creates shipping label and schedules pickup, (4) Lambda function - sends order confirmation email to customer, (5) HTTP endpoint - notifies external analytics platform for business intelligence. Each system processes the order independently and at its own pace. If the email service is down, it doesn't affect payment or shipping. The SQS queues buffer messages, so if inventory management is slow, messages wait in the queue without blocking other systems. This architecture reduces order processing time from 5 seconds (sequential) to 1 second (parallel) and improves reliability - if one system fails, others continue working.

Detailed Example 2: IoT Sensor Data Distribution
An IoT platform collects temperature data from 10,000 sensors deployed in warehouses. Each sensor publishes temperature readings to an SNS topic every minute. SNS fans out to multiple subscribers: (1) Kinesis Data Firehose - stores all readings in S3 for long-term analysis, (2) Lambda function - checks for temperature anomalies and triggers alerts if temperature exceeds thresholds, (3) SQS queue - feeds real-time dashboard showing current temperatures, (4) HTTP endpoint - sends data to third-party monitoring service. The fan-out pattern allows adding new consumers without modifying sensor code. When the company adds a machine learning system to predict equipment failures, they simply add another subscription. SNS handles 10,000 messages per minute (167 per second) easily, and each subscriber processes data independently. Message filtering is used so the alert Lambda only receives messages where temperature > 80°F, reducing unnecessary invocations and costs.

⭐ Must Know (Critical Facts):

  • Fan-out pattern: One message published to SNS is delivered to all subscribers simultaneously (parallel processing)
  • Subscriber types: SQS, Lambda, HTTP/HTTPS, Email, SMS, Mobile push notifications (6 types)
  • Message filtering: Subscribers can filter messages based on message attributes (reduces unnecessary processing)
  • Delivery retries: Automatic retries with exponential backoff (up to 100,015 retries for HTTP)
  • Message size: Maximum 256 KB per message (same as SQS)
  • Throughput: Unlimited (can handle millions of messages per second)
  • Durability: Messages stored redundantly across multiple AZs (99.999999999% durability)
  • SNS + SQS pattern: Combine for reliable fan-out with buffering and retry logic (best practice)

EventBridge Event Routing

šŸ“Š EventBridge Event Routing Diagram:

graph TB
    subgraph "Event Sources"
        EC2[EC2 State Change]
        S3[S3 Object Created]
        Custom[Custom Application]
    end
    
    subgraph "EventBridge"
        Bus[Event Bus]
        Rule1[Rule 1: EC2 Stopped]
        Rule2[Rule 2: S3 Upload]
        Rule3[Rule 3: Custom Event]
    end
    
    subgraph "Targets"
        Lambda1[Lambda: Notify Team]
        Lambda2[Lambda: Process File]
        SQS[SQS: Queue for Processing]
        SNS[SNS: Alert Topic]
    end
    
    EC2 --> Bus
    S3 --> Bus
    Custom --> Bus
    
    Bus --> Rule1
    Bus --> Rule2
    Bus --> Rule3
    
    Rule1 --> Lambda1
    Rule1 --> SNS
    Rule2 --> Lambda2
    Rule3 --> SQS
    
    style Bus fill:#ff9800
    style Rule1 fill:#e1f5fe
    style Rule2 fill:#e1f5fe
    style Rule3 fill:#e1f5fe

See: diagrams/03_domain2_eventbridge_routing.mmd

Diagram Explanation (Detailed):
This diagram shows EventBridge's powerful event routing capabilities. EventBridge receives events from three sources: EC2 state changes (AWS service events), S3 object creation (AWS service events), and custom application events. All events flow into the Event Bus, which acts as a central router. EventBridge Rules evaluate each event against pattern matching criteria and route matching events to appropriate targets. Rule 1 matches EC2 "stopped" events and routes them to both a Lambda function (to notify the operations team) and an SNS topic (to send alerts). Rule 2 matches S3 "ObjectCreated" events and routes them to a Lambda function for file processing. Rule 3 matches custom application events and routes them to an SQS queue for asynchronous processing. EventBridge supports complex pattern matching using JSON-based event patterns, allowing you to filter events by specific attributes (e.g., only EC2 instances in production environment, only S3 uploads to specific bucket prefix). Each rule can have up to 5 targets, and EventBridge automatically retries failed deliveries with exponential backoff. EventBridge also provides schema registry to discover event structures and generate code bindings, making it easier to work with events. The service integrates with 90+ AWS services and SaaS applications (Salesforce, Zendesk, etc.), making it the central nervous system for event-driven architectures.

Detailed Example 1: Automated Security Response
A company uses EventBridge to automatically respond to security events. When an EC2 instance's security group is modified (CloudTrail event), EventBridge receives the event and evaluates it against a rule that matches "ModifySecurityGroup" actions. The rule routes the event to three targets: (1) Lambda function that checks if the change violates security policies (e.g., opening port 22 to 0.0.0.0/0) and automatically reverts unauthorized changes, (2) SNS topic that notifies the security team via email and Slack, (3) SQS queue that feeds a security audit dashboard. The entire response happens within 5 seconds of the security group change, preventing potential breaches. EventBridge's pattern matching allows filtering to only trigger on high-risk changes (e.g., only alert if port 22, 3389, or 3306 is opened to the internet). This automated response reduces security incident response time from hours (manual detection) to seconds (automated).

Detailed Example 2: Multi-Account Event Aggregation
An enterprise with 50 AWS accounts uses EventBridge to centralize monitoring. Each account has an Event Bus that forwards events to a central monitoring account's Event Bus using cross-account event routing. The central account has rules that process events from all accounts: (1) Rule for EC2 state changes routes to Lambda for inventory tracking, (2) Rule for RDS failures routes to SNS for immediate alerts, (3) Rule for S3 access denied events routes to SQS for security analysis. EventBridge's schema registry automatically discovers event structures from all accounts, making it easy to write rules. The central monitoring team can see events from all accounts in one place, reducing operational complexity. EventBridge handles 10,000 events per second across all accounts without performance degradation.

⭐ Must Know (Critical Facts):

  • Event pattern matching: JSON-based patterns filter events by attributes (more flexible than SNS filtering)
  • Multiple targets: Each rule can route to up to 5 targets simultaneously (Lambda, SQS, SNS, Step Functions, etc.)
  • Schema registry: Automatically discovers event structures and generates code bindings (reduces development time)
  • Cross-account routing: Events can be routed across AWS accounts (centralized monitoring)
  • SaaS integration: Built-in integration with 90+ SaaS applications (Salesforce, Zendesk, Datadog, etc.)
  • Archive and replay: Can archive events and replay them later (useful for debugging and testing)
  • Throughput: Handles millions of events per second (unlimited scale)
  • Event transformation: Can transform event structure before sending to target (using input transformers)

Section 4: Load Balancing and Traffic Distribution

Application Load Balancer Architecture

šŸ“Š ALB Multi-AZ Architecture Diagram:

graph TB
    subgraph "Users"
        U1[User 1]
        U2[User 2]
        U3[User 3]
    end
    
    subgraph "AWS Cloud"
        R53[Route 53<br/>DNS]
        
        subgraph "VPC"
            subgraph "Public Subnets"
                ALB[Application Load Balancer<br/>Layer 7]
            end
            
            subgraph "AZ-1a Private Subnet"
                TG1A[Target Group 1]
                EC2_1A[EC2 Instance]
                TG1A --> EC2_1A
            end
            
            subgraph "AZ-1b Private Subnet"
                TG1B[Target Group 1]
                EC2_1B[EC2 Instance]
                TG1B --> EC2_1B
            end
            
            subgraph "AZ-1c Private Subnet"
                TG1C[Target Group 1]
                EC2_1C[EC2 Instance]
                TG1C --> EC2_1C
            end
        end
    end
    
    U1 --> R53
    U2 --> R53
    U3 --> R53
    R53 --> ALB
    
    ALB -->|Health Check| TG1A
    ALB -->|Health Check| TG1B
    ALB -->|Health Check| TG1C
    
    ALB -->|Route Traffic| EC2_1A
    ALB -->|Route Traffic| EC2_1B
    ALB -->|Route Traffic| EC2_1C
    
    style ALB fill:#ff9800
    style R53 fill:#4caf50
    style EC2_1A fill:#2196f3
    style EC2_1B fill:#2196f3
    style EC2_1C fill:#2196f3

See: diagrams/03_domain2_alb_architecture.mmd

Diagram Explanation (Detailed):
This architecture diagram shows a highly available Application Load Balancer (ALB) deployment across three Availability Zones. Users access the application through Route 53, which resolves the domain name to the ALB's DNS name. The ALB is deployed in public subnets across all three AZs (us-east-1a, us-east-1b, us-east-1c), providing automatic failover if an entire AZ fails. Behind the ALB, EC2 instances run in private subnets (no direct internet access) across all three AZs, registered with a Target Group. The ALB continuously performs health checks on each instance (default: every 30 seconds, checking /health endpoint). If an instance fails two consecutive health checks (unhealthy threshold), the ALB stops routing traffic to it and marks it unhealthy. When the instance passes two consecutive health checks (healthy threshold), traffic resumes. The ALB uses round-robin or least outstanding requests algorithm to distribute traffic across healthy instances. If an entire AZ fails (e.g., power outage in us-east-1a), the ALB automatically routes all traffic to instances in the remaining two AZs within seconds. The ALB operates at Layer 7 (HTTP/HTTPS), allowing advanced routing based on URL path, hostname, HTTP headers, and query strings. It also provides SSL/TLS termination, reducing CPU load on backend instances. The ALB supports WebSocket and HTTP/2, making it suitable for modern web applications.

Detailed Example 1: Microservices Routing
A company runs a microservices application with three services: user service (/users/), order service (/orders/), and product service (/products/). A single ALB routes traffic to different target groups based on URL path. Requests to example.com/users/ route to the user service target group (5 EC2 instances), requests to /orders/* route to the order service target group (10 EC2 instances - higher traffic), and requests to /products/* route to the product service target group (3 EC2 instances). Each target group has instances across three AZs for high availability. The ALB performs health checks on each service's /health endpoint. When the order service deploys a new version, the ALB's connection draining feature (default 300 seconds) ensures in-flight requests complete before instances are terminated. The ALB handles 10,000 requests per second, automatically scaling its capacity without manual intervention. This architecture reduces costs (one ALB instead of three) and simplifies management (single entry point).

Detailed Example 2: Blue-Green Deployment
A company uses ALB for zero-downtime deployments. The production environment (blue) has 10 EC2 instances in one target group receiving 100% of traffic. When deploying a new version, they launch 10 new instances (green) in a second target group. The ALB is configured with weighted target groups: blue (100%), green (0%). After the green instances pass health checks, they gradually shift traffic: blue (90%), green (10%) for 10 minutes to monitor for errors. If metrics look good, they continue: blue (50%), green (50%), then blue (0%), green (100%). If errors occur, they instantly roll back by setting blue (100%), green (0%). The entire deployment takes 30 minutes with zero downtime. The ALB's health checks ensure only healthy instances receive traffic, and connection draining ensures no requests are dropped during the transition.

⭐ Must Know (Critical Facts):

  • Layer 7 load balancing: Routes based on HTTP/HTTPS content (URL path, hostname, headers, query strings)
  • Target types: EC2 instances, IP addresses, Lambda functions, containers (ECS/EKS)
  • Health checks: Configurable interval (5-300 seconds), timeout (2-120 seconds), thresholds (2-10 checks)
  • Connection draining: Completes in-flight requests before deregistering targets (0-3600 seconds, default 300)
  • Cross-zone load balancing: Enabled by default, distributes traffic evenly across all AZs (no extra charge)
  • SSL/TLS termination: Offloads encryption/decryption from backend instances (reduces CPU usage)
  • Sticky sessions: Routes requests from same client to same target (using cookies, duration 1 second to 7 days)
  • WebSocket support: Maintains persistent connections for real-time applications (chat, gaming)

Section 5: Disaster Recovery Strategies

DR Strategy Comparison

šŸ“Š DR Strategies Comparison Diagram:

graph TB
    subgraph "Backup & Restore"
        BR1[Production Region]
        BR2[S3 Backups]
        BR3[Restore on Failure]
        BR1 -.Backup.-> BR2
        BR2 -.Restore.-> BR3
        BRCost[Cost: $]
        BRRTO[RTO: Hours-Days]
        BRRPO[RPO: Hours]
    end
    
    subgraph "Pilot Light"
        PL1[Production Region<br/>Full Environment]
        PL2[DR Region<br/>Core Services Only]
        PL3[Scale Up on Failure]
        PL1 -.Replicate Data.-> PL2
        PL2 -.Scale.-> PL3
        PLCost[Cost: $$]
        PLRTO[RTO: Minutes-Hours]
        PLRPO[RPO: Minutes]
    end
    
    subgraph "Warm Standby"
        WS1[Production Region<br/>Full Capacity]
        WS2[DR Region<br/>Minimum Capacity]
        WS3[Scale to Full on Failure]
        WS1 -.Replicate.-> WS2
        WS2 -.Scale.-> WS3
        WSCost[Cost: $$$]
        WSRTO[RTO: Minutes]
        WSRPO[RPO: Seconds]
    end
    
    subgraph "Active-Active"
        AA1[Region 1<br/>Full Capacity]
        AA2[Region 2<br/>Full Capacity]
        AA3[Route 53<br/>Traffic Distribution]
        AA3 --> AA1
        AA3 --> AA2
        AA1 <-.Bidirectional Replication.-> AA2
        AACost[Cost: $$$$]
        AARTO[RTO: Seconds]
        AARPO[RPO: Seconds]
    end
    
    style BR1 fill:#e8f5e9
    style PL1 fill:#fff3e0
    style WS1 fill:#fff3e0
    style AA1 fill:#ffebee

See: diagrams/03_domain2_dr_strategies_comparison.mmd

Diagram Explanation (Detailed):
This comprehensive diagram compares four disaster recovery strategies, showing the trade-offs between cost, Recovery Time Objective (RTO), and Recovery Point Objective (RPO). Backup & Restore (green) is the most cost-effective strategy, where production data is regularly backed up to S3 in another region. During a disaster, you restore from backups and rebuild infrastructure using CloudFormation or Terraform. This approach has the highest RTO (hours to days) and RPO (hours) because you must restore data and provision resources. Cost is minimal - only S3 storage ($0.023/GB-month) and occasional data transfer. Pilot Light (light orange) maintains core infrastructure components (database with replication) in the DR region but keeps compute resources minimal or stopped. During a disaster, you scale up compute resources (launch EC2 instances, increase RDS capacity). RTO improves to minutes-hours, and RPO to minutes because data is continuously replicated. Cost is moderate - running a small RDS instance and minimal compute. Warm Standby (orange) runs a scaled-down but fully functional environment in the DR region. All components are running but at minimum capacity (e.g., 2 instances instead of 20). During a disaster, you scale up to full capacity using Auto Scaling. RTO is minutes, and RPO is seconds because data replication is real-time. Cost is higher - running all services at reduced capacity. Active-Active (red) runs full production capacity in both regions simultaneously, with Route 53 distributing traffic between them. Both regions serve production traffic, so there's no "failover" - if one region fails, the other continues serving 100% of traffic. RTO and RPO are both seconds. Cost is highest - running full infrastructure in two regions. The choice depends on business requirements: e-commerce might use Warm Standby (RTO < 1 hour), while banking might require Active-Active (RTO < 1 minute).

Detailed Example 1: E-commerce Platform - Warm Standby
An e-commerce company generates $10,000 per minute in revenue and can tolerate 15 minutes of downtime (RTO: 15 minutes, RPO: 1 minute). They implement Warm Standby DR strategy. Production Region (us-east-1): 50 EC2 instances behind ALB, RDS Multi-AZ database (db.r5.4xlarge), ElastiCache cluster (3 nodes), S3 for images. DR Region (us-west-2): 5 EC2 instances behind ALB (10% capacity), RDS read replica (db.r5.4xlarge) with automated promotion, ElastiCache cluster (1 node), S3 cross-region replication. The RDS read replica continuously replicates data from production (replication lag < 1 second). During normal operations, the DR region serves no traffic. When us-east-1 fails (detected by Route 53 health checks in 60 seconds), the company executes the DR plan: (1) Promote RDS read replica to primary (2 minutes), (2) Update Route 53 to point to us-west-2 ALB (1 minute), (3) Auto Scaling scales EC2 instances from 5 to 50 (10 minutes). Total RTO: 13 minutes. Data loss is minimal (RPO: 1 minute) because the read replica was nearly synchronized. Monthly DR cost: $2,000 (5 EC2 instances + RDS replica + ElastiCache + data transfer) vs $150,000 potential revenue loss from 15 minutes downtime.

Detailed Example 2: Financial Services - Active-Active
A stock trading platform requires zero downtime (RTO: 0 seconds) and zero data loss (RPO: 0 seconds) due to regulatory requirements. They implement Active-Active DR strategy. Region 1 (us-east-1): 100 EC2 instances, Aurora Global Database (primary), ElastiCache, S3. Region 2 (eu-west-1): 100 EC2 instances, Aurora Global Database (secondary with < 1 second replication lag), ElastiCache, S3. Route 53 uses latency-based routing to direct users to the nearest region. Both regions serve production traffic simultaneously. Aurora Global Database replicates data bidirectionally with conflict resolution. When us-east-1 fails, Route 53 health checks detect the failure within 30 seconds and automatically route all traffic to eu-west-1. Users experience no downtime - they're simply routed to the other region. The Aurora secondary is promoted to primary (< 1 minute), and the system continues operating. Data loss is zero because replication lag was < 1 second. Monthly cost: $50,000 (double infrastructure) vs potential $1 million regulatory fines and reputation damage from downtime.

Detailed Example 3: SaaS Application - Pilot Light
A SaaS company with 1,000 customers can tolerate 2 hours of downtime (RTO: 2 hours, RPO: 15 minutes). They implement Pilot Light DR strategy. Production Region (us-east-1): 20 EC2 instances, RDS Multi-AZ (db.m5.large), ElastiCache, S3. DR Region (us-west-2): RDS read replica (db.m5.large) continuously replicating, S3 cross-region replication, AMIs for EC2 instances, but no running EC2 instances. During normal operations, only the RDS read replica runs in DR region ($200/month). When us-east-1 fails, the DR plan executes: (1) Promote RDS read replica to primary (2 minutes), (2) Launch 20 EC2 instances from AMIs using CloudFormation (15 minutes), (3) Update Route 53 to point to new ALB (1 minute), (4) Warm up ElastiCache (30 minutes). Total RTO: 48 minutes. Data loss is 15 minutes (last RDS snapshot). Monthly DR cost: $200 vs $5,000 for Warm Standby - significant savings for acceptable RTO.

⭐ Must Know (Critical Facts):

  • RTO (Recovery Time Objective): Maximum acceptable downtime (how long to recover)
  • RPO (Recovery Point Objective): Maximum acceptable data loss (how much data can be lost)
  • Backup & Restore: Lowest cost ($), highest RTO (hours-days), highest RPO (hours)
  • Pilot Light: Low cost ($$), medium RTO (minutes-hours), medium RPO (minutes)
  • Warm Standby: Medium cost ($$$), low RTO (minutes), low RPO (seconds)
  • Active-Active: Highest cost ($$$$), lowest RTO (seconds), lowest RPO (seconds)
  • Aurora Global Database: Best for Active-Active, < 1 second replication lag across regions
  • RDS Cross-Region Read Replica: Good for Pilot Light/Warm Standby, asynchronous replication

Route 53 Failover Routing

šŸ“Š Route 53 Failover Diagram:

graph TB
    subgraph "Normal Operation"
        U1[Users] --> R53_1[Route 53]
        R53_1 -->|Primary Record| Primary[Primary Region<br/>us-east-1<br/>Active]
        R53_1 -.Health Check OK.-> Primary
        R53_1 -.Health Check.-> Secondary[Secondary Region<br/>us-west-2<br/>Standby]
    end
    
    subgraph "Failover Scenario"
        U2[Users] --> R53_2[Route 53]
        R53_2 -.Health Check FAIL.-> Primary2[Primary Region<br/>us-east-1<br/>Failed]
        R53_2 -->|Failover to Secondary| Secondary2[Secondary Region<br/>us-west-2<br/>Active]
        R53_2 -.Health Check OK.-> Secondary2
    end
    
    style Primary fill:#4caf50
    style Secondary fill:#ff9800
    style Primary2 fill:#f44336
    style Secondary2 fill:#4caf50
    style R53_1 fill:#2196f3
    style R53_2 fill:#2196f3

See: diagrams/03_domain2_route53_failover.mmd

Diagram Explanation (Detailed):
This diagram illustrates Route 53's failover routing policy for disaster recovery. During normal operation (top), Route 53 continuously performs health checks on the Primary Region (us-east-1) every 30 seconds. When health checks pass, Route 53 returns the primary record's IP address to users, directing all traffic to us-east-1. The Secondary Region (us-west-2) is on standby, also monitored by health checks but receiving no traffic. When the primary region fails (bottom), Route 53 detects the failure after missing consecutive health checks (configurable, typically 3 failures = 90 seconds). Route 53 automatically updates DNS responses to return the secondary record's IP address, directing all traffic to us-west-2. Users experience a brief interruption (DNS TTL duration, typically 60 seconds) as their DNS caches expire and refresh with the new IP. The failover is automatic - no manual intervention required. Route 53 continues monitoring both regions. When the primary region recovers and passes health checks, Route 53 can automatically fail back (if configured) or wait for manual failback. Health checks can monitor HTTP/HTTPS endpoints, TCP connections, or CloudWatch alarms, providing flexible failure detection. Route 53's global network of DNS servers ensures health check results are consistent worldwide, preventing split-brain scenarios where some users see the primary as healthy while others see it as failed.

Detailed Example 1: Web Application Failover
A media streaming company runs its application in us-east-1 (primary) and us-west-2 (secondary). Route 53 is configured with failover routing: Primary record points to us-east-1 ALB (priority 1), Secondary record points to us-west-2 ALB (priority 2). Health checks monitor the /health endpoint on both ALBs every 30 seconds. During normal operation, all 1 million users are routed to us-east-1. At 2 AM, a network issue causes us-east-1 to become unreachable. Route 53 health checks fail three consecutive times (90 seconds). Route 53 automatically updates DNS responses to return the us-west-2 ALB IP address. Users with expired DNS caches (TTL 60 seconds) immediately get the new IP and connect to us-west-2. Users with cached DNS entries experience errors for up to 60 seconds until their cache expires. Within 3 minutes, all users are successfully streaming from us-west-2. The company's monitoring team receives a CloudWatch alarm about the failover and investigates us-east-1. After fixing the network issue, they manually fail back to us-east-1 during a maintenance window to avoid another brief interruption.

⭐ Must Know (Critical Facts):

  • Failover routing: Automatically routes traffic to secondary when primary fails (active-passive DR)
  • Health check interval: 30 seconds (standard) or 10 seconds (fast), configurable
  • Failure threshold: Typically 3 consecutive failures before marking unhealthy (90 seconds with 30s interval)
  • DNS TTL impact: Users experience interruption equal to TTL duration (recommend 60 seconds for DR)
  • Health check types: HTTP/HTTPS endpoint, TCP connection, CloudWatch alarm, calculated health check
  • Automatic failback: Can be configured to automatically fail back when primary recovers (or manual)
  • Multi-region failover: Can chain multiple failover records (primary → secondary → tertiary)

Chapter Summary

What We Covered

  • āœ… High Availability Fundamentals: Multi-AZ deployments, Availability Zones, fault tolerance
  • āœ… Auto Scaling: Dynamic, predictive, and scheduled scaling policies for elastic capacity
  • āœ… Load Balancing: ALB, NLB, GWLB - when to use each type and their features
  • āœ… Decoupling Patterns: SQS, SNS, EventBridge for building loosely coupled architectures
  • āœ… Serverless Architectures: Lambda, Fargate, API Gateway for event-driven systems
  • āœ… Container Orchestration: ECS and EKS for managing containerized applications
  • āœ… Disaster Recovery: Four DR strategies (backup/restore, pilot light, warm standby, active-active)
  • āœ… RTO/RPO: Understanding recovery objectives and selecting appropriate DR strategies
  • āœ… Multi-Region Architectures: Global databases, cross-region replication, Route 53 failover
  • āœ… Monitoring & Observability: CloudWatch, X-Ray, Health Dashboard for system visibility

Critical Takeaways

  1. Multi-AZ is for HA, Read Replicas are for performance: Don't confuse these two concepts
  2. Auto Scaling requires proper health checks: ELB health checks can trigger instance replacement
  3. ALB for HTTP/HTTPS, NLB for TCP/UDP: Choose based on protocol and performance needs
  4. SQS for decoupling, SNS for fan-out, EventBridge for routing: Each has specific use cases
  5. Lambda scales automatically: No need to manage servers or capacity
  6. ECS for AWS-native, EKS for Kubernetes: Choose based on team expertise and requirements
  7. DR strategy depends on RTO/RPO: Lower RTO/RPO = higher cost
  8. Aurora Global Database for active-active: < 1 second replication lag across regions
  9. Route 53 failover for automatic DR: Health checks trigger automatic failover
  10. Monitoring is essential: Can't improve what you don't measure

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between Multi-AZ and Read Replicas
  • I can design Auto Scaling policies for different workload patterns
  • I can choose the appropriate load balancer type for a given scenario
  • I can design decoupled architectures using SQS, SNS, and EventBridge
  • I understand when to use Lambda vs Fargate vs EC2
  • I can explain the four DR strategies and their RTO/RPO characteristics
  • I can calculate appropriate RTO/RPO for business requirements
  • I can design multi-region architectures with automatic failover
  • I can implement monitoring and observability for distributed systems
  • I can troubleshoot common resilience issues (scaling, failover, health checks)

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-50 (resilience focus)
  • Domain 2 Bundle 2: Questions 1-50 (resilience focus)
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • Multi-AZ vs Read Replicas (commonly confused)
    • Auto Scaling policies and health checks
    • Load balancer selection (ALB vs NLB vs GWLB)
    • Decoupling patterns (SQS vs SNS vs EventBridge)
    • DR strategy selection based on RTO/RPO
    • Multi-region architecture design

Quick Reference Card

[One-page summary of chapter - copy to your notes]

Key Services:

  • Auto Scaling: Automatic capacity adjustment based on demand
  • ELB: Application Load Balancer (HTTP/HTTPS), Network Load Balancer (TCP/UDP), Gateway Load Balancer (Layer 3)
  • SQS: Message queue for decoupling, standard (best-effort ordering) vs FIFO (guaranteed ordering)
  • SNS: Pub/sub messaging for fan-out patterns, supports multiple protocols
  • EventBridge: Event bus for routing events between AWS services and applications
  • Lambda: Serverless compute, event-driven, automatic scaling
  • ECS: Container orchestration, AWS-native, Fargate or EC2 launch types
  • EKS: Managed Kubernetes, portable across clouds
  • Route 53: DNS with health checks and failover routing for DR

Key Concepts:

  • High Availability: System remains operational despite component failures (Multi-AZ)
  • Fault Tolerance: System continues operating without interruption during failures
  • Scalability: Ability to handle increased load (horizontal or vertical scaling)
  • Loose Coupling: Components can fail independently without cascading failures
  • RTO: Recovery Time Objective - how long to recover after disaster
  • RPO: Recovery Point Objective - how much data loss is acceptable

Decision Points:

  • Need HA? → Multi-AZ deployment + Auto Scaling + Load Balancer
  • Need performance? → Read Replicas + Caching (ElastiCache) + CloudFront
  • Need decoupling? → SQS (queue) or SNS (pub/sub) or EventBridge (routing)
  • Need serverless? → Lambda (compute) + API Gateway (API) + DynamoDB (database)
  • Need containers? → ECS (AWS-native) or EKS (Kubernetes)
  • Need DR? → Choose strategy based on RTO/RPO requirements and budget

Next Chapter: 04_domain3_high_performing_architectures - Design High-Performing Architectures


Chapter Summary

What We Covered

This chapter covered Domain 2: Design Resilient Architectures (26% of the exam), the second highest-weighted domain. We explored two major task areas:

  • āœ… Task 2.1 - Scalable and Loosely Coupled Architectures: SQS, SNS, EventBridge, Lambda, API Gateway, ECS, EKS, Step Functions, load balancing, caching, microservices patterns, event-driven architectures
  • āœ… Task 2.2 - Highly Available and Fault-Tolerant Architectures: Multi-AZ deployments, multi-Region strategies, Route 53 routing policies, disaster recovery (backup/restore, pilot light, warm standby, active-active), RDS Multi-AZ, Aurora Global Database, automated failover

Critical Takeaways

  1. Loose Coupling is Essential for Resilience: Decouple components using queues (SQS), topics (SNS), and event buses (EventBridge). When one component fails, others continue operating independently.

  2. Design for Failure: Assume everything fails. Use multiple Availability Zones for high availability, multiple Regions for disaster recovery, and implement automatic failover mechanisms.

  3. Horizontal Scaling Over Vertical: Scale out (add more instances) rather than scale up (bigger instances). Use Auto Scaling groups with load balancers to distribute traffic across multiple instances.

  4. Choose the Right DR Strategy: Match your disaster recovery strategy to your RPO/RTO requirements:

    • Backup/Restore: Hours (cheapest)
    • Pilot Light: 10s of minutes
    • Warm Standby: Minutes
    • Active-Active: Seconds (most expensive)
  5. Leverage Managed Services: Use managed services like RDS Multi-AZ, Aurora, DynamoDB, and ECS Fargate to reduce operational overhead and increase resilience.

  6. Event-Driven Architectures Scale Better: Use asynchronous communication patterns (SQS, SNS, EventBridge) instead of synchronous (direct API calls) for better scalability and fault tolerance.

  7. Load Balancers are Critical: ALB for HTTP/HTTPS traffic with advanced routing, NLB for TCP/UDP with ultra-low latency, GLB for third-party virtual appliances.

Self-Assessment Checklist

Test yourself before moving to Domain 3. You should be able to:

Scalable and Loosely Coupled Architectures:

  • Design a queue-based architecture using SQS for decoupling
  • Implement pub/sub pattern using SNS for fanout
  • Configure EventBridge rules for event-driven workflows
  • Choose between SQS Standard (best-effort ordering) and FIFO (guaranteed ordering)
  • Design Lambda functions with proper concurrency limits
  • Implement API Gateway with caching and throttling
  • Choose between ALB (Layer 7) and NLB (Layer 4) for different use cases
  • Design microservices architecture using ECS or EKS
  • Implement Step Functions for workflow orchestration
  • Use ElastiCache (Redis or Memcached) for caching strategies

Highly Available and Fault-Tolerant Architectures:

  • Design multi-AZ deployments for high availability
  • Implement multi-Region architectures for disaster recovery
  • Configure Route 53 health checks and failover routing
  • Choose appropriate disaster recovery strategy based on RPO/RTO
  • Set up RDS Multi-AZ for automatic failover
  • Configure Aurora Global Database for cross-region replication
  • Implement DynamoDB Global Tables for multi-region active-active
  • Design Auto Scaling policies (target tracking, step scaling, scheduled)
  • Configure S3 Cross-Region Replication for data durability
  • Use CloudWatch alarms for automated recovery actions

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-50 (scalability and loose coupling)
  • Domain 2 Bundle 2: Questions 1-50 (high availability and fault tolerance)
  • Integration Services Bundle: Questions 1-50 (SQS, SNS, EventBridge, Step Functions)
  • Compute Services Bundle: Questions 1-50 (Lambda, ECS, EKS, Auto Scaling)

Expected Score: 75%+ to proceed

If you scored below 75%:

  • Loose coupling weak: Review SQS vs. SNS, EventBridge patterns, API Gateway
  • High availability weak: Review multi-AZ deployments, Route 53 routing, disaster recovery strategies
  • Scaling weak: Review Auto Scaling policies, load balancer types, caching strategies
  • Revisit diagrams: SQS architecture, SNS fanout, DR strategies comparison, Auto Scaling lifecycle

Common Exam Traps

Watch out for these in Domain 2 questions:

  1. SQS Standard vs. FIFO: Standard has unlimited throughput but best-effort ordering; FIFO guarantees order but limited to 3,000 messages/sec
  2. ALB vs. NLB: ALB operates at Layer 7 (HTTP/HTTPS) with content-based routing; NLB operates at Layer 4 (TCP/UDP) with ultra-low latency
  3. RDS Multi-AZ vs. Read Replicas: Multi-AZ is for high availability (automatic failover); Read Replicas are for read scalability (manual promotion)
  4. Lambda Concurrency: Default limit is 1,000 concurrent executions per region; use reserved concurrency to guarantee capacity
  5. Auto Scaling Cooldown: Prevents Auto Scaling from launching/terminating instances too quickly; default is 300 seconds
  6. Route 53 Routing Policies: Failover (active-passive), Weighted (A/B testing), Latency (performance), Geolocation (compliance)
  7. DR Strategy Selection: Match RPO/RTO requirements to cost - don't over-engineer with active-active when warm standby suffices

Quick Reference Card

Decoupling Patterns:

  • SQS: Queue-based, pull model, at-least-once delivery (Standard) or exactly-once (FIFO)
  • SNS: Pub/sub, push model, fanout to multiple subscribers
  • EventBridge: Event bus, rule-based routing, integrates with 90+ AWS services
  • Step Functions: Workflow orchestration, visual workflows, error handling

Load Balancer Selection:

  • ALB: HTTP/HTTPS, Layer 7, path/host-based routing, WebSocket, Lambda targets
  • NLB: TCP/UDP/TLS, Layer 4, ultra-low latency, static IP, millions of requests/sec
  • GLB: Layer 3, third-party virtual appliances (firewalls, IDS/IPS)
  • CLB: Legacy, supports both Layer 4 and Layer 7 (use ALB or NLB instead)

Disaster Recovery Strategies (RPO/RTO):

  1. Backup and Restore: Hours / Hours (cheapest)
  2. Pilot Light: Minutes / 10s of minutes
  3. Warm Standby: Seconds / Minutes
  4. Active-Active: Real-time / Seconds (most expensive)

Auto Scaling Policies:

  • Target Tracking: Maintain metric at target value (e.g., 70% CPU)
  • Step Scaling: Scale based on CloudWatch alarm thresholds
  • Scheduled Scaling: Scale at specific times (predictable patterns)
  • Predictive Scaling: ML-based forecasting

High Availability Services:

  • RDS Multi-AZ: Synchronous replication, automatic failover (1-2 minutes)
  • Aurora: 6 copies across 3 AZs, automatic failover (<30 seconds)
  • DynamoDB: Multi-AZ by default, Global Tables for multi-region
  • S3: 99.999999999% durability, automatic replication across AZs
  • EFS: Multi-AZ by default, automatic replication

Decision Frameworks

When to use which messaging service:

  • SQS Standard: High throughput, order not critical, at-least-once delivery acceptable
  • SQS FIFO: Order matters, exactly-once processing required, up to 3,000 msg/sec
  • SNS: Fanout to multiple subscribers, push notifications, mobile/email alerts
  • EventBridge: Event-driven architecture, rule-based routing, AWS service integration
  • Kinesis: Real-time streaming data, ordered records, replay capability

When to use which compute service:

  • EC2: Full control, custom OS, long-running workloads
  • Lambda: Event-driven, short-duration (<15 min), serverless
  • ECS: Container orchestration, Docker, AWS-native
  • EKS: Kubernetes, multi-cloud portability, complex orchestration
  • Fargate: Serverless containers, no infrastructure management

When to use which DR strategy:

  • Backup/Restore: RPO hours, RTO hours, non-critical workloads, cost-sensitive
  • Pilot Light: RPO minutes, RTO 10s of minutes, core services only
  • Warm Standby: RPO seconds, RTO minutes, scaled-down replica running
  • Active-Active: RPO near-zero, RTO seconds, mission-critical, cost not primary concern

Integration with Other Domains

Resilience concepts from Domain 2 integrate with:

  • Domain 1 (Secure Architectures): Secure load balancers, encrypted queues, IAM roles for services
  • Domain 3 (High-Performing Architectures): Caching for performance, read replicas for scalability
  • Domain 4 (Cost-Optimized Architectures): Auto Scaling for cost efficiency, Spot Instances for fault-tolerant workloads

Key Metrics to Remember

SQS:

  • Visibility Timeout: 30 seconds (default), max 12 hours
  • Message Retention: 4 days (default), max 14 days
  • Message Size: Max 256 KB
  • FIFO Throughput: 3,000 messages/sec (300 with batching)

Lambda:

  • Timeout: Max 15 minutes
  • Memory: 128 MB to 10,240 MB
  • Concurrent Executions: 1,000 (default regional limit)
  • Deployment Package: 50 MB (zipped), 250 MB (unzipped)

Auto Scaling:

  • Cooldown Period: 300 seconds (default)
  • Health Check Grace Period: 300 seconds (default)
  • Scaling Adjustment: Min 1 instance, max depends on limits

RDS:

  • Multi-AZ Failover: 1-2 minutes
  • Read Replica Lag: Typically seconds, can be minutes
  • Backup Retention: 0-35 days (7 days default)

Next Steps

You're now ready for Domain 3: Design High-Performing Architectures (Chapter 4). This domain covers:

  • High-performing storage solutions (24% of exam weight)
  • Elastic compute solutions
  • High-performing databases
  • Network architectures
  • Data ingestion and transformation

Resilience principles from this chapter will be applied throughout Domain 3, especially in designing performant, scalable architectures.


Chapter 2 Complete āœ… | Next: Chapter 3 - Domain 3: High-Performing Architectures


Chapter Summary

What We Covered

  • āœ… Scalable and Loosely Coupled Architectures
    • Messaging: SQS, SNS, EventBridge
    • API Management: API Gateway
    • Serverless: Lambda, Fargate
    • Containers: ECS, EKS
    • Workflow Orchestration: Step Functions
    • Caching: ElastiCache, CloudFront
    • Load Balancing: ALB, NLB, GWLB
  • āœ… High Availability and Fault Tolerance
    • Multi-AZ deployments
    • Route 53 routing policies
    • RDS Multi-AZ and Aurora
    • Auto Scaling strategies
    • Disaster recovery patterns
    • Backup and restore strategies

Critical Takeaways

  1. Loose Coupling: Use SQS for asynchronous processing, SNS for pub/sub, EventBridge for event-driven architectures - decouple components to improve resilience
  2. Multi-AZ for HA: Deploy across multiple Availability Zones for fault tolerance - RDS Multi-AZ (1-2 min failover), Aurora (30 sec failover), ALB distributes traffic
  3. Disaster Recovery: Choose strategy based on RTO/RPO - Backup/Restore (cheapest, hours), Pilot Light (minutes), Warm Standby (seconds), Multi-Site (no downtime)
  4. Auto Scaling: Use dynamic scaling for variable workloads, predictive scaling for known patterns, scheduled scaling for predictable changes
  5. Serverless for Scalability: Lambda scales automatically (1000 concurrent default), Fargate removes server management, API Gateway handles millions of requests

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between SQS Standard and FIFO queues
  • I understand when to use SNS vs SQS vs EventBridge
  • I know how to design a loosely coupled architecture using queues
  • I can describe Multi-AZ deployment patterns for RDS and Aurora
  • I understand the four disaster recovery strategies and when to use each
  • I know how to configure Auto Scaling with different scaling policies
  • I can explain Route 53 routing policies (failover, weighted, latency, geolocation)
  • I understand Lambda concurrency and how to handle throttling
  • I know the difference between ALB, NLB, and GWLB
  • I can design a highly available, fault-tolerant architecture

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-25 (Scalable architectures)
  • Domain 2 Bundle 2: Questions 1-25 (High availability)
  • Integration Services Bundle: Questions 1-25
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: SQS/SNS patterns, Multi-AZ deployments, Disaster recovery strategies
  • Focus on: Understanding when to use each messaging service and how to design for HA

Quick Reference Card

Messaging Services:

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once delivery, strict ordering, 3000 msg/sec (batching)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, schema registry, 100+ AWS service integrations

Load Balancers:

  • ALB: Layer 7 (HTTP/HTTPS), path/host routing, WebSocket, Lambda targets
  • NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions of requests/sec
  • GWLB: Layer 3 (IP), third-party appliances, transparent network gateway

Serverless Compute:

  • Lambda: Event-driven, 15 min max, 10GB memory max, pay per invocation
  • Fargate: Serverless containers, no EC2 management, pay per vCPU/memory

High Availability:

  • RDS Multi-AZ: Synchronous replication, 1-2 min failover, same region
  • Aurora: 6 copies across 3 AZs, 30 sec failover, 15 read replicas
  • Route 53: Health checks, failover routing, multi-region support

Disaster Recovery:

Strategy RTO RPO Cost Use Case
Backup/Restore Hours Hours $ Non-critical, cost-sensitive
Pilot Light 10-30 min Minutes $$ Core systems only
Warm Standby Minutes Seconds $$$ Business-critical
Multi-Site Real-time None $$$$ Mission-critical

Auto Scaling Policies:

  • Target Tracking: Maintain metric at target (e.g., 70% CPU)
  • Step Scaling: Scale based on CloudWatch alarm thresholds
  • Scheduled: Scale at specific times (e.g., business hours)
  • Predictive: ML-based forecasting for known patterns

Decision Points:

  • Need message queue? → SQS Standard (high throughput) or FIFO (ordering)
  • Need pub/sub? → SNS
  • Need event routing? → EventBridge
  • Need API management? → API Gateway
  • Need serverless compute? → Lambda (functions) or Fargate (containers)
  • Need load balancing? → ALB (HTTP) or NLB (TCP) or GWLB (appliances)
  • Need high availability? → Multi-AZ deployment + Auto Scaling
  • Need disaster recovery? → Choose based on RTO/RPO requirements


Chapter Summary

What We Covered

This chapter covered Domain 2: Design Resilient Architectures (26% of the exam), the second most heavily weighted domain. We explored two major task areas:

āœ… Task 2.1: Design Scalable and Loosely Coupled Architectures

  • Microservices design principles and patterns
  • Event-driven architectures with SNS, SQS, EventBridge
  • API management with API Gateway
  • Serverless technologies: Lambda, Fargate, Step Functions
  • Container orchestration: ECS, EKS
  • Load balancing strategies: ALB, NLB, GWLB
  • Caching strategies for performance and decoupling
  • Storage types and when to use each

āœ… Task 2.2: Design Highly Available and Fault-Tolerant Architectures

  • Multi-AZ and multi-region architectures
  • Disaster recovery strategies: Backup/Restore, Pilot Light, Warm Standby, Multi-Site
  • Auto Scaling for elasticity and availability
  • Route 53 health checks and failover routing
  • Database high availability: RDS Multi-AZ, Aurora, DynamoDB global tables
  • Immutable infrastructure and blue/green deployments
  • Monitoring and observability with CloudWatch and X-Ray

Critical Takeaways

  1. Design for failure: Assume everything will fail. Use Multi-AZ deployments, Auto Scaling, and health checks to automatically recover from failures.

  2. Loose coupling is essential: Decouple components with SQS queues, SNS topics, and EventBridge. This allows independent scaling and failure isolation.

  3. Horizontal scaling over vertical: Add more instances (scale out) rather than bigger instances (scale up). Use Auto Scaling groups and load balancers.

  4. Choose the right DR strategy: Match RTO/RPO requirements to cost. Backup/Restore is cheapest but slowest. Multi-Site is fastest but most expensive.

  5. Stateless applications scale better: Store session state in ElastiCache or DynamoDB, not on EC2 instances. This enables unlimited horizontal scaling.

  6. Use managed services: RDS Multi-AZ, Aurora, DynamoDB, and Lambda handle availability automatically. Don't build what AWS already provides.

  7. Health checks are critical: Use Route 53 health checks, ALB target health checks, and Auto Scaling health checks to detect and replace failed components.

  8. Async communication for resilience: Use SQS queues between components to handle traffic spikes and component failures gracefully.

  9. Multi-region for disaster recovery: Use Route 53 failover routing, S3 cross-region replication, and DynamoDB global tables for geographic redundancy.

  10. Monitor everything: Use CloudWatch metrics, alarms, and dashboards. Use X-Ray for distributed tracing. Set up automated responses to failures.

Key Services Quick Reference

Compute & Scaling:

  • EC2 Auto Scaling: Automatically adjust capacity based on demand
  • Lambda: Serverless functions, automatic scaling, pay per invocation
  • Fargate: Serverless containers, no server management
  • ECS: Container orchestration on EC2 or Fargate
  • EKS: Managed Kubernetes for complex container workloads
  • Elastic Beanstalk: PaaS for web applications, handles infrastructure

Load Balancing:

  • ALB: Layer 7 (HTTP/HTTPS), path-based routing, host-based routing
  • NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions RPS
  • GWLB: Layer 3, for third-party appliances (firewalls, IDS/IPS)
  • CLB: Legacy, supports EC2-Classic (avoid for new applications)

Messaging & Integration:

  • SQS Standard: High throughput, at-least-once delivery, best-effort ordering
  • SQS FIFO: Exactly-once processing, strict ordering, 300 TPS (3,000 with batching)
  • SNS: Pub/sub messaging, fan-out to multiple subscribers
  • EventBridge: Event bus for application integration, rule-based routing
  • Step Functions: Workflow orchestration, visual workflows, error handling
  • API Gateway: RESTful APIs, WebSocket APIs, throttling, caching

Storage:

  • S3: Object storage, 11 9's durability, lifecycle policies, versioning
  • EBS: Block storage for EC2, snapshots, encryption, multiple volume types
  • EFS: Shared file storage for Linux, NFS protocol, automatic scaling
  • FSx: Managed file systems (Windows, Lustre, NetApp, OpenZFS)

Database High Availability:

  • RDS Multi-AZ: Synchronous replication, automatic failover (1-2 min)
  • Aurora: 6 copies across 3 AZs, 30-second failover, 15 read replicas
  • DynamoDB: Multi-AZ by default, global tables for multi-region
  • ElastiCache: Redis (replication, persistence) or Memcached (multi-threaded)

Networking & DNS:

  • Route 53: DNS with health checks, failover routing, geolocation routing
  • CloudFront: CDN with edge caching, DDoS protection, custom SSL
  • Global Accelerator: Static anycast IPs, health-based routing, TCP/UDP
  • VPC: Isolated network, subnets, route tables, internet/NAT gateways
  • Transit Gateway: Hub-and-spoke for multiple VPCs and on-premises

Monitoring & Observability:

  • CloudWatch: Metrics, logs, alarms, dashboards, automatic actions
  • X-Ray: Distributed tracing, service maps, performance analysis
  • CloudTrail: API call logging, compliance, security analysis
  • AWS Health Dashboard: Service health, scheduled maintenance, notifications

Decision Frameworks

Choosing Compute Services:

Need to run code?
ā”œā”€ Containers?
│  ā”œā”€ Kubernetes? → EKS
│  ā”œā”€ Simple containers? → ECS on Fargate
│  └─ Need EC2 control? → ECS on EC2
ā”œā”€ Short-lived functions? → Lambda
ā”œā”€ Long-running processes?
│  ā”œā”€ Need full control? → EC2 with Auto Scaling
│  └─ Want managed platform? → Elastic Beanstalk
└─ Batch processing? → AWS Batch

Choosing Messaging Services:

Need to send messages?
ā”œā”€ One-to-many (pub/sub)? → SNS
ā”œā”€ Queue for decoupling?
│  ā”œā”€ Need ordering? → SQS FIFO
│  └─ High throughput? → SQS Standard
ā”œā”€ Event routing with rules? → EventBridge
ā”œā”€ Workflow orchestration? → Step Functions
└─ Real-time bidirectional? → API Gateway WebSocket

Choosing Load Balancer:

Requirement Solution Use Case
HTTP/HTTPS, path routing ALB Web applications, microservices
TCP/UDP, ultra-low latency NLB Gaming, IoT, financial applications
Static IP required NLB Whitelisting, DNS with A records
Third-party appliances GWLB Firewalls, IDS/IPS, DPI

Choosing Disaster Recovery Strategy:

Strategy RTO RPO Cost Complexity Use Case
Backup/Restore Hours Hours $ Low Non-critical, cost-sensitive
Pilot Light 10-30 min Minutes $$ Medium Core systems, moderate criticality
Warm Standby Minutes Seconds $$$ Medium Business-critical, low RTO
Multi-Site Real-time None $$$$ High Mission-critical, zero downtime

Choosing Storage Type:

Type Service Use Case Performance
Object S3 Static content, backups, data lakes High throughput
Block EBS Boot volumes, databases, high IOPS Up to 64,000 IOPS
File (Linux) EFS Shared access, content management Scalable throughput
File (Windows) FSx Windows Windows apps, Active Directory Up to 2 GB/s
File (HPC) FSx Lustre Machine learning, HPC Up to 1 TB/s

Common Exam Patterns

Pattern 1: "Highly Available" Questions

  • Look for: Multi-AZ, Auto Scaling, load balancers, health checks
  • Eliminate: Single AZ, single instance, no failover
  • Choose: Automated recovery with redundancy across AZs

Pattern 2: "Loosely Coupled" Questions

  • Look for: SQS, SNS, EventBridge, API Gateway, Lambda
  • Eliminate: Tight coupling, synchronous calls, single points of failure
  • Choose: Async messaging with queues and event-driven patterns

Pattern 3: "Scalable Architecture" Questions

  • Look for: Auto Scaling, load balancers, stateless design, caching
  • Eliminate: Vertical scaling only, stateful instances, no caching
  • Choose: Horizontal scaling with distributed state management

Pattern 4: "Disaster Recovery" Questions

  • Look for: RTO/RPO requirements, cost constraints, criticality
  • Eliminate: Solutions that don't meet RTO/RPO or exceed budget
  • Choose: DR strategy that balances requirements with cost

Pattern 5: "Decoupling Components" Questions

  • Look for: SQS between tiers, SNS for fan-out, EventBridge for routing
  • Eliminate: Direct synchronous calls, no buffering, tight dependencies
  • Choose: Async messaging with proper error handling and retries

Self-Assessment Checklist

Test yourself before moving to the next chapter:

Scalability & Loose Coupling:

  • I can design a loosely coupled architecture with SQS and SNS
  • I understand when to use Lambda vs Fargate vs ECS vs EKS
  • I know how to implement event-driven architectures with EventBridge
  • I can choose the right load balancer (ALB vs NLB vs GWLB)
  • I understand stateless vs stateful design patterns

High Availability:

  • I can design Multi-AZ architectures for high availability
  • I understand RDS Multi-AZ vs Aurora vs DynamoDB availability
  • I know how to use Route 53 health checks and failover routing
  • I can implement Auto Scaling with proper health checks
  • I understand how to use CloudWatch alarms for automated responses

Disaster Recovery:

  • I can explain all four DR strategies and when to use each
  • I understand RTO and RPO and how they affect DR strategy choice
  • I know how to implement backup and restore with AWS Backup
  • I can design pilot light and warm standby architectures
  • I understand multi-region failover with Route 53

Messaging & Integration:

  • I know when to use SQS Standard vs FIFO
  • I understand SNS fan-out patterns
  • I can design workflows with Step Functions
  • I know how to use API Gateway for API management
  • I understand EventBridge event routing and rules

Monitoring & Troubleshooting:

  • I can set up CloudWatch metrics, alarms, and dashboards
  • I understand how to use X-Ray for distributed tracing
  • I know how to analyze CloudWatch Logs for troubleshooting
  • I can implement automated remediation with CloudWatch Events
  • I understand AWS Health Dashboard notifications

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-20 (Scalability and loose coupling)
  • Domain 2 Bundle 2: Questions 21-40 (High availability and fault tolerance)
  • Domain 2 Bundle 3: Questions 41-60 (Disaster recovery and monitoring)
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you missed questions
  • Below 60%: Re-read the entire chapter and take detailed notes
  • Focus on:
    • SQS Standard vs FIFO differences and use cases
    • DR strategy selection based on RTO/RPO requirements
    • Load balancer types and when to use each
    • Auto Scaling policies and health check types
    • Multi-AZ vs multi-region architectures

Quick Reference Card

Copy this to your notes for quick review:

Auto Scaling Policies:

  • Target Tracking: Maintain metric at target (e.g., 70% CPU) - SIMPLEST
  • Step Scaling: Scale based on CloudWatch alarm thresholds - MORE CONTROL
  • Scheduled: Scale at specific times (e.g., business hours) - PREDICTABLE
  • Predictive: ML-based forecasting for known patterns - ADVANCED

SQS Comparison:

Feature Standard FIFO
Throughput Unlimited 300 TPS (3,000 with batching)
Ordering Best-effort Strict FIFO
Delivery At-least-once Exactly-once
Use Case High throughput Order matters

Load Balancer Comparison:

Feature ALB NLB GWLB
Layer 7 (HTTP/HTTPS) 4 (TCP/UDP) 3 (IP)
Routing Path, host, header IP, port Flow hash
Latency ~ms ~100μs ~ms
Static IP No Yes Yes
Use Case Web apps Gaming, IoT Firewalls

DR Strategies:

  1. Backup/Restore: Cheapest, slowest (hours RTO/RPO)
  2. Pilot Light: Core systems running, scale up on failover (10-30 min RTO)
  3. Warm Standby: Scaled-down replica, scale up on failover (minutes RTO)
  4. Multi-Site: Full capacity in multiple regions (real-time RTO, zero RPO)

Database HA Options:

  • RDS Multi-AZ: Synchronous replication, 1-2 min failover, same region
  • Aurora: 6 copies across 3 AZs, 30 sec failover, 15 read replicas
  • DynamoDB: Multi-AZ by default, global tables for multi-region
  • ElastiCache Redis: Replication, persistence, automatic failover

Must Memorize:

  • SQS Standard: Unlimited throughput, at-least-once delivery
  • SQS FIFO: 300 TPS (3,000 with batching), exactly-once, strict ordering
  • SQS message retention: 1 minute to 14 days (default 4 days)
  • SQS visibility timeout: 0 seconds to 12 hours (default 30 seconds)
  • Lambda timeout: Maximum 15 minutes
  • Lambda concurrent executions: 1,000 per region (soft limit)
  • ALB: Layer 7, path/host routing, WebSocket support
  • NLB: Layer 4, ultra-low latency, static IP, millions RPS
  • RDS Multi-AZ failover: 1-2 minutes
  • Aurora failover: 30 seconds

Congratulations! You've completed Domain 2 (26% of exam). Combined with Domain 1, you've now covered 56% of the exam content.

Next Chapter: 04_domain3_high_performing_architectures - Design High-Performing Architectures (24% of exam)


Chapter Summary

What We Covered

This chapter covered Domain 2: Design Resilient Architectures (26% of exam), the second most heavily weighted domain. You learned:

  • āœ… Scalability Patterns: Horizontal vs vertical scaling, Auto Scaling, and elastic architectures
  • āœ… Loose Coupling: SQS, SNS, EventBridge, and decoupling strategies
  • āœ… Microservices: Container orchestration (ECS, EKS), serverless (Lambda), and service mesh
  • āœ… Load Balancing: ALB, NLB, GWLB, and traffic distribution strategies
  • āœ… Caching: CloudFront, ElastiCache, and application-level caching
  • āœ… High Availability: Multi-AZ deployments, health checks, and automatic failover
  • āœ… Fault Tolerance: RDS Multi-AZ, Aurora, DynamoDB global tables, and data replication
  • āœ… Disaster Recovery: Backup & Restore, Pilot Light, Warm Standby, Multi-Site strategies
  • āœ… Monitoring: CloudWatch, X-Ray, Health Dashboard, and observability
  • āœ… Automation: CloudFormation, Systems Manager, and infrastructure as code

Critical Takeaways

  1. Loose Coupling: Use queues (SQS) and topics (SNS) to decouple components and improve resilience
  2. Auto Scaling: Configure dynamic, target tracking, and scheduled policies based on workload patterns
  3. Load Balancer Selection: ALB for HTTP/HTTPS (Layer 7), NLB for TCP/UDP (Layer 4), GWLB for appliances
  4. SQS Queue Types: Standard for high throughput, FIFO for strict ordering and exactly-once delivery
  5. Lambda Best Practices: Use layers for shared code, destinations for async results, provisioned concurrency for consistent latency
  6. RDS Multi-AZ: Synchronous replication, 1-2 minute failover, automatic DNS update
  7. Aurora Advantages: 5x MySQL performance, 15 read replicas, 30-second failover, storage auto-scaling
  8. DR Strategy Selection: Choose based on RTO/RPO requirements and budget constraints
  9. Route 53 Routing: Failover for DR, latency for performance, geolocation for compliance, weighted for A/B testing
  10. Monitoring Strategy: Use CloudWatch for metrics, X-Ray for tracing, CloudTrail for audit logs

Self-Assessment Checklist

Test yourself before moving on. Can you:

Scalability & Loose Coupling:

  • Design a loosely coupled architecture using SQS and SNS?
  • Configure Auto Scaling policies (dynamic, target tracking, scheduled)?
  • Explain the difference between SQS Standard and FIFO queues?
  • Implement SNS fanout pattern for multiple subscribers?
  • Use EventBridge for event-driven architectures?
  • Design microservices using containers (ECS/EKS) or serverless (Lambda)?

Load Balancing & Traffic Management:

  • Choose the right load balancer (ALB, NLB, GWLB) for different scenarios?
  • Configure ALB path-based and host-based routing?
  • Implement health checks and automatic failover?
  • Use Route 53 routing policies (failover, latency, geolocation, weighted)?
  • Configure CloudFront for content delivery and caching?

High Availability & Fault Tolerance:

  • Design Multi-AZ deployments for high availability?
  • Configure RDS Multi-AZ for automatic failover?
  • Explain Aurora's high availability features (6 copies, 15 replicas)?
  • Implement DynamoDB global tables for multi-region replication?
  • Use ElastiCache Redis replication for cache high availability?

Disaster Recovery:

  • Choose the right DR strategy based on RTO/RPO requirements?
  • Implement Backup & Restore strategy using AWS Backup?
  • Design Pilot Light architecture for cost-effective DR?
  • Configure Warm Standby for faster recovery?
  • Implement Multi-Site active-active for zero downtime?
  • Calculate RTO and RPO for different DR strategies?

Monitoring & Automation:

  • Configure CloudWatch alarms and dashboards?
  • Use X-Ray for distributed tracing and performance analysis?
  • Implement CloudFormation for infrastructure as code?
  • Use Systems Manager for automation and patch management?
  • Set up automated remediation using EventBridge and Lambda?

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-50 (Expected score: 70%+ to proceed)
  • Domain 2 Bundle 2: Questions 51-100 (Expected score: 75%+ to proceed)

If you scored below 70%:

  • Review SQS queue types and use cases
  • Focus on Auto Scaling policies and configuration
  • Study load balancer selection criteria
  • Practice DR strategy selection based on RTO/RPO

If you scored 70-80%:

  • Review advanced topics: Lambda optimization, ECS/EKS orchestration
  • Study Route 53 routing policies in detail
  • Practice multi-region architecture design
  • Focus on monitoring and observability patterns

If you scored 80%+:

  • Excellent! You're ready to move to Domain 3
  • Continue practicing with full practice tests
  • Review any specific topics where you made mistakes

Progress Check: You've now completed 56% of the exam content (Domains 1 + 2). Keep up the great work!

Next Steps: Proceed to 04_domain3_high_performing_architectures to learn about designing high-performing architectures (24% of exam).


Chapter Summary

What We Covered

This chapter explored designing resilient architectures on AWS, representing 26% of the SAA-C03 exam. We covered two major task areas:

Task 2.1: Design Scalable and Loosely Coupled Architectures

  • āœ… Messaging services: SQS, SNS, EventBridge for decoupling
  • āœ… API Gateway for RESTful and WebSocket APIs
  • āœ… Serverless compute: Lambda, Fargate for event-driven architectures
  • āœ… Container orchestration: ECS and EKS for microservices
  • āœ… Load balancing: ALB, NLB, GLB for traffic distribution
  • āœ… Caching strategies: CloudFront, ElastiCache for performance
  • āœ… Step Functions for workflow orchestration
  • āœ… Storage solutions: S3, EBS, EFS with appropriate characteristics

Task 2.2: Design Highly Available and Fault-Tolerant Architectures

  • āœ… Multi-AZ deployments for high availability
  • āœ… Multi-region architectures for disaster recovery
  • āœ… Route 53 routing policies for failover and traffic management
  • āœ… RDS Multi-AZ and Aurora for database resilience
  • āœ… Disaster recovery strategies: backup/restore, pilot light, warm standby, active-active
  • āœ… Auto Scaling for elasticity and fault tolerance
  • āœ… CloudWatch for monitoring and automated responses
  • āœ… AWS Backup for centralized backup management

Critical Takeaways

Loose Coupling Principles:

  1. Use Queues for Asynchronous Processing: SQS decouples producers from consumers, handles traffic spikes
  2. Pub/Sub for Fan-Out: SNS distributes messages to multiple subscribers simultaneously
  3. Event-Driven Architecture: EventBridge routes events based on rules, enables reactive systems
  4. API Gateway as Front Door: Centralized entry point, throttling, caching, authentication
  5. Stateless Applications: Store session data externally (ElastiCache, DynamoDB) for horizontal scaling

High Availability Essentials:

  • Deploy across multiple Availability Zones (minimum 2, preferably 3)
  • Use load balancers with health checks for automatic failover
  • Enable RDS Multi-AZ for synchronous replication and automatic failover
  • Implement Auto Scaling to replace failed instances automatically
  • Use Route 53 health checks for DNS-level failover

Disaster Recovery Strategies (RTO/RPO trade-offs):

  • Backup and Restore: Lowest cost, highest RTO/RPO (hours to days)
  • Pilot Light: Minimal running resources, medium RTO/RPO (minutes to hours)
  • Warm Standby: Scaled-down replica running, low RTO/RPO (minutes)
  • Active-Active: Full capacity in multiple regions, lowest RTO/RPO (seconds)

Scalability Patterns:

  • Horizontal Scaling: Add more instances (preferred for cloud, use Auto Scaling)
  • Vertical Scaling: Increase instance size (limited by instance type, requires downtime)
  • Read Replicas: Offload read traffic from primary database
  • Caching: Reduce database load with ElastiCache or CloudFront
  • Asynchronous Processing: Use queues to handle variable workloads

Messaging Service Selection:

  • SQS Standard: High throughput, at-least-once delivery, best-effort ordering
  • SQS FIFO: Exactly-once processing, strict ordering, 300 TPS limit
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event routing with rules, schema registry, third-party integrations
  • Kinesis: Real-time streaming, ordered records, multiple consumers

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Loose Coupling and Microservices:

  • Design an event-driven architecture using SQS, SNS, and Lambda
  • Explain when to use SQS Standard vs FIFO queues
  • Implement SNS fan-out pattern for multiple subscribers
  • Configure EventBridge rules for event routing
  • Design a microservices architecture with ECS or EKS
  • Implement API Gateway with caching and throttling
  • Use Step Functions to orchestrate multi-step workflows
  • Choose appropriate storage (S3, EBS, EFS) based on requirements

High Availability and Fault Tolerance:

  • Design a multi-AZ architecture for high availability
  • Configure RDS Multi-AZ for automatic failover
  • Implement Aurora Global Database for multi-region replication
  • Set up Route 53 failover routing with health checks
  • Configure Auto Scaling with appropriate health checks
  • Design a load balancing strategy (ALB vs NLB vs GLB)
  • Implement CloudWatch alarms for automated responses
  • Use AWS Backup for centralized backup management

Disaster Recovery:

  • Calculate RTO and RPO for different DR strategies
  • Design a backup and restore strategy
  • Implement pilot light DR architecture
  • Configure warm standby for faster recovery
  • Design active-active multi-region architecture
  • Set up cross-region replication for S3 and DynamoDB
  • Test DR procedures regularly
  • Document and automate failover processes

Scalability and Performance:

  • Configure Auto Scaling policies (target tracking, step, scheduled)
  • Implement read replicas for database scaling
  • Use ElastiCache for application caching
  • Configure CloudFront for content delivery
  • Design stateless applications for horizontal scaling
  • Implement connection pooling with RDS Proxy
  • Use SQS for buffering and load leveling
  • Monitor performance with CloudWatch and X-Ray

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 2 Bundle 1: Questions 1-20 (SQS, SNS, Multi-AZ, Auto Scaling basics)
  • Integration Services Bundle: Questions 1-15 (messaging and orchestration)

Intermediate Level (Target: 70%+ correct):

  • Domain 2 Bundle 2: Questions 21-40 (microservices, DR strategies, advanced scaling)
  • Full Practice Test 1: Domain 2 questions (mixed difficulty)

Advanced Level (Target: 60%+ correct):

  • Full Practice Test 2: Domain 2 questions (complex architectures)
  • Full Practice Test 3: Domain 2 questions (multi-region scenarios)

If you scored below target:

  • Below 60%: Review messaging services, Multi-AZ concepts, and basic DR strategies
  • 60-70%: Focus on microservices patterns, advanced scaling, and DR implementation
  • 70-80%: Study complex multi-region architectures and event-driven patterns
  • Above 80%: Excellent! Move to next domain

Quick Reference Card

Copy this to your notes for quick review:

Messaging Services Comparison

Service Use Case Ordering Delivery Throughput
SQS Standard Decoupling, high throughput Best-effort At-least-once Unlimited
SQS FIFO Strict ordering required Guaranteed Exactly-once 300 TPS (3000 with batching)
SNS Fan-out, pub/sub No guarantee At-least-once High
EventBridge Event routing, integrations No guarantee At-least-once High
Kinesis Real-time streaming Per shard At-least-once 1 MB/s per shard

Load Balancer Comparison

Type Layer Use Case Features
ALB Layer 7 (HTTP/HTTPS) Web applications Path/host routing, WebSocket, Lambda targets
NLB Layer 4 (TCP/UDP) High performance, static IP Ultra-low latency, millions of requests/sec
GLB Layer 3 (IP) Third-party appliances Transparent network gateway, GENEVE protocol

Disaster Recovery Strategies

Strategy RTO RPO Cost Running Resources
Backup/Restore Hours-Days Hours Lowest None
Pilot Light Minutes-Hours Minutes Low Minimal (data layer)
Warm Standby Minutes Minutes Medium Scaled-down replica
Active-Active Seconds Seconds Highest Full capacity

High Availability Checklist

  • āœ… Deploy across multiple AZs (minimum 2, preferably 3)
  • āœ… Use load balancers with health checks
  • āœ… Enable RDS Multi-AZ or Aurora with replicas
  • āœ… Implement Auto Scaling with appropriate policies
  • āœ… Use Route 53 health checks for DNS failover
  • āœ… Store session data externally (ElastiCache, DynamoDB)
  • āœ… Design stateless applications
  • āœ… Monitor with CloudWatch and set up alarms

Common Exam Scenarios

  • Scenario: Decouple components → Solution: Use SQS between producer and consumer
  • Scenario: Fan-out to multiple subscribers → Solution: SNS topic with multiple subscriptions
  • Scenario: Strict message ordering → Solution: SQS FIFO queue with message group ID
  • Scenario: Database high availability → Solution: RDS Multi-AZ or Aurora with replicas
  • Scenario: Multi-region failover → Solution: Route 53 failover routing + cross-region replication
  • Scenario: Handle traffic spikes → Solution: Auto Scaling + SQS for buffering
  • Scenario: Minimize RTO/RPO → Solution: Active-active multi-region architecture
  • Scenario: Stateless application → Solution: Store sessions in ElastiCache or DynamoDB

Next Chapter: 04_domain3_high_performing_architectures - Design High-Performing Architectures (24% of exam)

Chapter Summary

What We Covered

This chapter covered Domain 2: Design Resilient Architectures (26% of the exam), focusing on two critical task areas:

āœ… Task 2.1: Design scalable and loosely coupled architectures

  • Decoupling with SQS, SNS, and EventBridge
  • Serverless architectures with Lambda and Fargate
  • Microservices design patterns
  • Container orchestration with ECS and EKS
  • API Gateway for API management
  • Caching strategies with CloudFront and ElastiCache
  • Load balancing with ALB, NLB, and GLB
  • Auto Scaling for elasticity
  • Event-driven architectures
  • Workflow orchestration with Step Functions

āœ… Task 2.2: Design highly available and/or fault-tolerant architectures

  • Multi-AZ deployments for high availability
  • Multi-region architectures for disaster recovery
  • Route 53 routing policies and health checks
  • RDS Multi-AZ and Aurora high availability
  • Disaster recovery strategies (backup/restore, pilot light, warm standby, active-active)
  • Failover automation and testing
  • Data replication and synchronization
  • Immutable infrastructure patterns
  • Monitoring and observability with CloudWatch and X-Ray

Critical Takeaways

Resilience is about designing for failure:

  • Assume everything fails: Design systems that continue operating when components fail
  • Eliminate single points of failure: Use redundancy across multiple AZs and regions
  • Automate recovery: Use Auto Scaling, health checks, and automated failover
  • Test failure scenarios: Use AWS Fault Injection Simulator to validate resilience

Key Resilience Principles:

  1. Loose Coupling: Components can fail independently without cascading failures
  2. Horizontal Scaling: Add more instances rather than bigger instances
  3. Stateless Design: Store state externally (ElastiCache, DynamoDB) for easy scaling
  4. Graceful Degradation: System continues with reduced functionality during failures
  5. Idempotency: Operations can be retried safely without side effects

Most Important Services to Master:

  • SQS: Decoupling with message queues (standard and FIFO)
  • SNS: Fan-out messaging to multiple subscribers
  • Lambda: Serverless compute for event-driven architectures
  • Auto Scaling: Automatic capacity adjustment based on demand
  • Route 53: DNS-based failover and traffic routing
  • RDS Multi-AZ: Automatic database failover
  • Aurora: High availability with up to 15 read replicas

Common Exam Patterns:

  • Questions about decoupling → Use SQS between components
  • Questions about fan-out → Use SNS with multiple subscriptions
  • Questions about message ordering → Use SQS FIFO queues
  • Questions about high availability → Multi-AZ deployment + load balancer
  • Questions about disaster recovery → Choose strategy based on RTO/RPO requirements
  • Questions about scaling → Auto Scaling + CloudWatch metrics
  • Questions about stateless apps → Store sessions in ElastiCache or DynamoDB

Self-Assessment Checklist

Test yourself before moving to the next chapter. You should be able to:

Loose Coupling and Decoupling

  • Explain when to use SQS vs SNS vs EventBridge
  • Configure SQS FIFO queues for message ordering
  • Implement SNS fan-out pattern with SQS subscriptions
  • Design event-driven architectures with EventBridge
  • Use dead letter queues for failed message handling
  • Configure SQS visibility timeout and long polling
  • Implement message filtering with SNS

Serverless and Containers

  • Design Lambda functions with appropriate triggers
  • Configure Lambda concurrency and provisioned concurrency
  • Choose between ECS and EKS for container orchestration
  • Decide when to use Fargate vs EC2 launch type
  • Implement Step Functions for workflow orchestration
  • Design API Gateway with caching and throttling
  • Use Lambda layers for code reuse

Load Balancing and Auto Scaling

  • Choose between ALB, NLB, and GLB for different use cases
  • Configure ALB target groups and health checks
  • Implement path-based and host-based routing with ALB
  • Design Auto Scaling policies (target tracking, step, scheduled)
  • Configure Auto Scaling lifecycle hooks
  • Use Auto Scaling warm pools for faster scaling

High Availability

  • Design multi-AZ architectures for high availability
  • Configure RDS Multi-AZ for automatic failover
  • Implement Aurora with read replicas across AZs
  • Use Route 53 health checks for DNS failover
  • Configure ELB cross-zone load balancing
  • Design stateless applications with external session storage
  • Implement immutable infrastructure patterns

Disaster Recovery

  • Choose appropriate DR strategy based on RTO/RPO
  • Implement backup and restore strategy
  • Configure pilot light architecture
  • Design warm standby environment
  • Implement active-active multi-region architecture
  • Set up cross-region replication for S3 and RDS
  • Configure Aurora Global Database for multi-region
  • Test failover procedures regularly

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-30 (Loose coupling and scalability)
  • Domain 2 Bundle 2: Questions 31-65 (High availability and disaster recovery)
  • Integration Services Bundle: All questions (SQS, SNS, EventBridge, Step Functions)
  • Compute Services Bundle: Questions on Lambda, ECS, EKS, Fargate

Expected Score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you struggled, then retry
  • Below 60%: Re-read this entire chapter, focusing on diagrams and examples
  • Focus on understanding trade-offs between different approaches

Quick Reference Card

Copy this to your notes for quick review:

Messaging Quick Facts

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once delivery, strict ordering, 3,000 msg/s (batching: 30,000)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, rule-based routing, 100+ AWS service integrations
  • Dead Letter Queue: Capture failed messages for analysis

Serverless Quick Facts

  • Lambda: Event-driven, 15-minute timeout, 10GB memory max, pay per invocation
  • Fargate: Serverless containers, no EC2 management, pay per vCPU/memory
  • Step Functions: Workflow orchestration, visual workflows, error handling
  • API Gateway: REST/WebSocket APIs, caching, throttling, authorization

Load Balancing Quick Facts

  • ALB: Layer 7, HTTP/HTTPS, path/host routing, WebSocket support
  • NLB: Layer 4, TCP/UDP, static IP, ultra-low latency, millions of requests/s
  • GLB: Layer 3, third-party appliances, transparent network gateway
  • Cross-Zone: Distribute traffic evenly across all AZs (enabled by default for ALB)

Auto Scaling Quick Facts

  • Target Tracking: Maintain metric at target value (e.g., 70% CPU)
  • Step Scaling: Scale based on CloudWatch alarm thresholds
  • Scheduled: Scale at specific times (e.g., business hours)
  • Predictive: ML-based forecasting for proactive scaling
  • Warm Pools: Pre-initialized instances for faster scaling

High Availability Quick Facts

  • Multi-AZ: Deploy across 2+ AZs (preferably 3)
  • RDS Multi-AZ: Synchronous replication, automatic failover (1-2 min)
  • Aurora: 6 copies across 3 AZs, up to 15 read replicas, <30s failover
  • Route 53: Health checks, failover routing, latency-based routing
  • Stateless: Store sessions in ElastiCache or DynamoDB

Disaster Recovery Quick Facts

  • Backup/Restore: Lowest cost, hours-days RTO, hours RPO
  • Pilot Light: Minimal running resources, minutes-hours RTO, minutes RPO
  • Warm Standby: Scaled-down replica, minutes RTO, minutes RPO
  • Active-Active: Full capacity in multiple regions, seconds RTO, seconds RPO
  • RTO: Recovery Time Objective (how long to recover)
  • RPO: Recovery Point Objective (how much data loss acceptable)

Decision Points

  • Decouple components → Use SQS between producer and consumer
  • Fan-out to multiple subscribers → SNS topic with multiple subscriptions
  • Strict message ordering → SQS FIFO queue with message group ID
  • Database high availability → RDS Multi-AZ or Aurora with replicas
  • Multi-region failover → Route 53 failover routing + cross-region replication
  • Handle traffic spikes → Auto Scaling + SQS for buffering
  • Minimize RTO/RPO → Active-active multi-region architecture
  • Stateless application → Store sessions in ElastiCache or DynamoDB

Congratulations! You've completed Domain 2: Design Resilient Architectures. This is the second-largest domain (26% of the exam), and mastering resilience patterns is essential for real-world AWS architectures.

Next Chapter: 04_domain3_high_performing_architectures - Design High-Performing Architectures (24% of exam)


Chapter Summary

What We Covered

This chapter covered the two major task areas of Domain 2: Design Resilient Architectures (26% of exam):

Task 2.1: Design Scalable and Loosely Coupled Architectures

  • āœ… Messaging services (SQS, SNS, EventBridge)
  • āœ… Serverless architectures (Lambda, Fargate)
  • āœ… Container orchestration (ECS, EKS)
  • āœ… API Gateway for API management
  • āœ… Load balancing strategies (ALB, NLB, GLB)
  • āœ… Auto Scaling for elastic compute
  • āœ… Caching strategies (CloudFront, ElastiCache)
  • āœ… Microservices and event-driven patterns
  • āœ… Step Functions for workflow orchestration

Task 2.2: Design Highly Available and/or Fault-Tolerant Architectures

  • āœ… Multi-AZ deployments for high availability
  • āœ… Multi-region architectures for disaster recovery
  • āœ… Route 53 routing policies and health checks
  • āœ… RDS Multi-AZ and Aurora high availability
  • āœ… S3 cross-region replication
  • āœ… DynamoDB global tables
  • āœ… Disaster recovery strategies (backup/restore, pilot light, warm standby, active-active)
  • āœ… RTO and RPO considerations
  • āœ… Automated failover and recovery

Critical Takeaways

  1. Decouple Everything: Use SQS, SNS, and EventBridge to decouple components. This allows independent scaling and prevents cascading failures.

  2. Design for Failure: Assume everything will fail. Use Multi-AZ deployments, health checks, and automatic failover to handle failures gracefully.

  3. Stateless Applications: Store session data in ElastiCache or DynamoDB, not on EC2 instances. This enables horizontal scaling and instance replacement.

  4. Choose the Right DR Strategy: Match your DR strategy to your RTO/RPO requirements. Active-active costs more but provides seconds of downtime.

  5. Use Managed Services: Services like RDS Multi-AZ, Aurora, and DynamoDB handle replication and failover automatically, reducing operational burden.

  6. Health Checks Everywhere: Implement health checks at every layer (Route 53, ELB, Auto Scaling) to detect and route around failures.

  7. Async Communication: Use message queues (SQS) for asynchronous processing to handle traffic spikes and prevent system overload.

  8. Multi-Region for Critical Workloads: For mission-critical applications, deploy across multiple regions with Route 53 failover routing.

Self-Assessment Checklist

Test yourself before moving on. Can you:

Scalability and Loose Coupling

  • Explain when to use SQS vs SNS vs EventBridge?
  • Design a decoupled architecture using message queues?
  • Implement fan-out pattern with SNS and SQS?
  • Choose between SQS Standard and FIFO queues?
  • Configure Lambda with appropriate event sources?
  • Design a microservices architecture with containers?
  • Implement API Gateway with caching and throttling?
  • Choose between ALB, NLB, and GLB for different use cases?
  • Configure Auto Scaling with appropriate policies?
  • Use Step Functions to orchestrate complex workflows?

High Availability and Fault Tolerance

  • Design a Multi-AZ architecture for high availability?
  • Explain RDS Multi-AZ vs Read Replicas?
  • Configure Aurora for maximum availability?
  • Set up Route 53 failover routing with health checks?
  • Implement S3 cross-region replication?
  • Configure DynamoDB global tables for multi-region?
  • Choose the appropriate DR strategy for given RTO/RPO?
  • Calculate RTO and RPO for different scenarios?
  • Implement automated failover using Route 53?
  • Design an active-active multi-region architecture?

Resilience Patterns

  • Implement circuit breaker pattern for fault tolerance?
  • Use dead letter queues for failed message handling?
  • Configure retry logic with exponential backoff?
  • Implement health checks at multiple layers?
  • Design for graceful degradation during failures?

Practice Questions

Try these from your practice test bundles:

Beginner Level (Build Confidence):

  • Domain 2 Bundle 1: Questions 1-20
  • Integration Services Bundle: Questions 1-15
  • Expected score: 70%+ to proceed

Intermediate Level (Test Understanding):

  • Domain 2 Bundle 2: Questions 1-20
  • Full Practice Test 1: Domain 2 questions
  • Expected score: 75%+ to proceed

Advanced Level (Challenge Yourself):

  • Full Practice Test 2: Domain 2 questions
  • Expected score: 70%+ to proceed

If you scored below target:

  • Below 60%: Review messaging patterns and HA architectures
  • 60-70%: Focus on DR strategies and Multi-AZ concepts
  • 70-80%: Review quick facts and decision points
  • 80%+: Excellent! Move to next domain

Quick Reference Card

Copy this to your notes for quick review:

Messaging Services

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once processing, strict ordering, 3,000 msg/s (batching: 30,000)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, rule-based routing, 100+ AWS service integrations

Serverless Compute

  • Lambda: Event-driven, 15-minute max, 10 GB memory, 512 MB /tmp
  • Fargate: Serverless containers, no EC2 management, pay per vCPU/memory
  • Step Functions: Workflow orchestration, visual workflows, error handling

Container Orchestration

  • ECS: AWS-native, simpler, tight AWS integration
  • EKS: Kubernetes, portable, complex, larger ecosystem
  • Both: Support EC2 and Fargate launch types

Load Balancers

  • ALB: Layer 7, HTTP/HTTPS, path/host routing, WebSocket
  • NLB: Layer 4, TCP/UDP, ultra-low latency, static IP
  • GLB: Layer 3, third-party appliances, transparent proxy

High Availability

  • Multi-AZ: 2+ Availability Zones, automatic failover
  • RDS Multi-AZ: Synchronous replication, 1-2 min failover
  • Aurora: 6 copies across 3 AZs, <30s failover, 15 read replicas
  • Route 53: Health checks, failover routing, latency-based routing

Disaster Recovery

Strategy RTO RPO Cost Use Case
Backup/Restore Hours-Days Hours Lowest Non-critical, cost-sensitive
Pilot Light Minutes-Hours Minutes Low Important, moderate budget
Warm Standby Minutes Minutes Medium Business-critical, quick recovery
Active-Active Seconds Seconds Highest Mission-critical, zero downtime

Key Decision Points

Scenario Solution
Decouple components SQS queue between producer/consumer
Fan-out to multiple subscribers SNS topic with multiple subscriptions
Strict message ordering SQS FIFO with message group ID
Database high availability RDS Multi-AZ or Aurora with replicas
Multi-region failover Route 53 failover + cross-region replication
Handle traffic spikes Auto Scaling + SQS buffering
Minimize RTO/RPO Active-active multi-region
Stateless application Store sessions in ElastiCache/DynamoDB

Chapter Summary

What We Covered

This chapter explored Design Resilient Architectures (26% of the exam), covering two major task areas:

āœ… Task 2.1: Design scalable and loosely coupled architectures

  • Decoupling with SQS, SNS, and EventBridge
  • Serverless architectures with Lambda and Fargate
  • Container orchestration with ECS and EKS
  • API Gateway for RESTful and WebSocket APIs
  • Load balancing with ALB, NLB, and GLB
  • Caching strategies with CloudFront and ElastiCache
  • Microservices patterns and event-driven design
  • Step Functions for workflow orchestration

āœ… Task 2.2: Design highly available and/or fault-tolerant architectures

  • Multi-AZ deployments for high availability
  • Multi-region architectures for disaster recovery
  • Route 53 routing policies and health checks
  • RDS Multi-AZ and Aurora high availability
  • Disaster recovery strategies (backup/restore, pilot light, warm standby, active-active)
  • Auto Scaling for elasticity and fault tolerance
  • Backup strategies with AWS Backup
  • Monitoring with CloudWatch and X-Ray

Critical Takeaways

  1. Loose Coupling: Decouple components with queues (SQS) and topics (SNS) to improve resilience and scalability.

  2. Stateless Design: Store session state externally (ElastiCache, DynamoDB) to enable horizontal scaling and fault tolerance.

  3. Multi-AZ by Default: Always deploy across multiple Availability Zones for production workloads (RDS Multi-AZ, Auto Scaling groups).

  4. Disaster Recovery Planning: Choose DR strategy based on RTO/RPO requirements - backup/restore (hours), pilot light (minutes), warm standby (minutes), active-active (seconds).

  5. Auto Scaling: Use Auto Scaling for elasticity - target tracking for predictable patterns, step scaling for rapid changes, scheduled for known peaks.

  6. Load Balancing: ALB for HTTP/HTTPS with path/host routing, NLB for TCP/UDP with ultra-low latency, GLB for third-party appliances.

  7. Serverless First: Consider Lambda and Fargate for event-driven workloads to eliminate server management and improve scalability.

  8. Health Checks: Implement health checks at multiple layers (Route 53, ELB, Auto Scaling) for automatic failover.

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between SQS Standard and FIFO queues
  • I understand when to use SNS vs SQS vs EventBridge
  • I can design a loosely coupled microservices architecture
  • I know how to implement Multi-AZ high availability
  • I understand the four disaster recovery strategies and when to use each
  • I can configure Auto Scaling policies for different scenarios
  • I know the differences between ALB, NLB, and GLB
  • I understand Lambda concurrency and scaling behavior
  • I can design a multi-region failover architecture
  • I know how to implement caching at multiple layers

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-50 (Expected score: 70%+)
  • Domain 2 Bundle 2: Questions 1-50 (Expected score: 70%+)
  • Integration Services Bundle: Questions 1-50 (Expected score: 70%+)
  • Full Practice Test 1: Domain 2 questions (Expected score: 75%+)

If you scored below 70%:

  • Review SQS/SNS patterns and when to use each
  • Focus on disaster recovery strategy selection
  • Study Auto Scaling policies and triggers
  • Practice designing multi-tier architectures

Quick Reference Card

Decoupling Services:

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once delivery, strict ordering, 3,000 msg/s (batching)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, schema registry, 100+ AWS sources

Serverless Compute:

  • Lambda: Event-driven, 15-min max, 10 GB memory, 1,000 concurrent executions default
  • Fargate: Serverless containers, pay per vCPU/memory, no server management
  • ECS: Container orchestration, EC2 or Fargate launch types
  • EKS: Managed Kubernetes, multi-AZ control plane

Load Balancers:

  • ALB: Layer 7, HTTP/HTTPS, path/host routing, WebSocket, $0.0225/hour
  • NLB: Layer 4, TCP/UDP, ultra-low latency, static IP, $0.0225/hour
  • GLB: Layer 3, third-party appliances, transparent proxy

High Availability:

  • Multi-AZ: 2+ Availability Zones, automatic failover
  • RDS Multi-AZ: Synchronous replication, 1-2 min failover
  • Aurora: 6 copies across 3 AZs, <30s failover, 15 read replicas
  • Route 53: Health checks, failover routing, latency-based routing

Disaster Recovery:

  • Backup/Restore: Hours-Days RTO, Hours RPO, Lowest cost
  • Pilot Light: Minutes-Hours RTO, Minutes RPO, Low cost
  • Warm Standby: Minutes RTO, Minutes RPO, Medium cost
  • Active-Active: Seconds RTO, Seconds RPO, Highest cost

Decision Points:

  • Need to decouple? → SQS queue between components
  • Need fan-out? → SNS topic with multiple subscriptions
  • Need strict ordering? → SQS FIFO with message group ID
  • Need high availability? → Multi-AZ deployment
  • Need disaster recovery? → Choose strategy based on RTO/RPO
  • Need to handle spikes? → Auto Scaling + SQS buffering
  • Need stateless app? → Store sessions in ElastiCache/DynamoDB

Next Chapter: Proceed to 04_domain3_high_performing_architectures to learn about designing high-performing architectures.

Chapter Summary

What We Covered

This chapter covered resilient architecture design, representing 26% of the exam content. You learned:

  • āœ… Loose Coupling: SQS, SNS, EventBridge, and decoupling patterns
  • āœ… Microservices: Containers (ECS/EKS), serverless (Lambda), and orchestration (Step Functions)
  • āœ… Scalability: Auto Scaling, load balancing, and horizontal/vertical scaling strategies
  • āœ… High Availability: Multi-AZ deployments, RDS Multi-AZ, Aurora, and failover mechanisms
  • āœ… Disaster Recovery: Backup/restore, pilot light, warm standby, and active-active strategies
  • āœ… Fault Tolerance: Health checks, automatic recovery, and immutable infrastructure

Critical Takeaways

  1. Decouple Everything: Use queues (SQS) and topics (SNS) to break dependencies between components, enabling independent scaling and failure isolation
  2. Design for Failure: Assume everything fails - use Multi-AZ, health checks, auto-recovery, and graceful degradation
  3. Scale Horizontally: Add more instances rather than bigger instances for better fault tolerance and cost efficiency
  4. Choose the Right DR Strategy: Match RTO/RPO requirements to cost - backup/restore for non-critical, active-active for mission-critical
  5. Leverage Managed Services: Use RDS Multi-AZ, Aurora, and managed load balancers to get built-in resilience without operational overhead
  6. Automate Recovery: Use Auto Scaling, Route 53 health checks, and CloudWatch alarms to detect and recover from failures automatically

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Loose Coupling & Messaging:

  • Design event-driven architectures using SQS, SNS, and EventBridge
  • Choose between SQS Standard and FIFO based on requirements
  • Implement dead letter queues for failed message handling
  • Configure SNS fan-out patterns for multiple subscribers
  • Use EventBridge for complex event routing and filtering

Microservices & Containers:

  • Compare ECS vs EKS and choose based on requirements
  • Design serverless architectures with Lambda and API Gateway
  • Implement Step Functions for workflow orchestration
  • Configure Fargate for serverless container execution
  • Use service discovery for microservices communication

Scalability & Load Balancing:

  • Configure Auto Scaling with dynamic, scheduled, and predictive policies
  • Choose between ALB, NLB, and GLB based on use case
  • Implement path-based and host-based routing with ALB
  • Design caching strategies with CloudFront and ElastiCache
  • Optimize application for horizontal scaling

High Availability:

  • Design Multi-AZ architectures for critical workloads
  • Configure RDS Multi-AZ with automatic failover
  • Implement Aurora Global Database for multi-region HA
  • Set up Route 53 health checks and failover routing
  • Use DynamoDB Global Tables for multi-region replication

Disaster Recovery:

  • Calculate RTO and RPO for business requirements
  • Choose appropriate DR strategy (backup/restore, pilot light, warm standby, active-active)
  • Implement cross-region backup and replication
  • Design and test failover procedures
  • Use AWS Backup for centralized backup management

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-25 (Loose coupling and microservices)
  • Domain 2 Bundle 2: Questions 26-50 (High availability and disaster recovery)
  • Integration Services Bundle: All questions (SQS, SNS, EventBridge, Step Functions)
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review SQS FIFO vs Standard and when to use each
  • Practice designing Multi-AZ architectures with proper failover
  • Focus on understanding DR strategies and RTO/RPO tradeoffs
  • Revisit Auto Scaling policies and when to use each type

Quick Reference Card

Messaging Services:

  • SQS Standard: At-least-once, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once, strict ordering, 3,000 msg/s (300 msg/s per group)
  • SNS: Pub/sub, fan-out, push-based, 100,000 topics/account
  • EventBridge: Event bus, rules, targets, schema registry

Container Services:

  • ECS: AWS-native, simpler, tight AWS integration
  • EKS: Kubernetes, portable, complex, larger ecosystem
  • Fargate: Serverless containers, no EC2 management
  • ECR: Container registry, integrated with ECS/EKS

Load Balancers:

  • ALB: Layer 7, HTTP/HTTPS, path/host routing, WebSocket
  • NLB: Layer 4, TCP/UDP, ultra-low latency, static IP
  • GLB: Layer 3, third-party appliances, transparent proxy

High Availability Patterns:

  • Multi-AZ: 2+ Availability Zones, automatic failover
  • RDS Multi-AZ: Synchronous replication, 1-2 min failover
  • Aurora: 6 copies across 3 AZs, <30s failover
  • Route 53: Health checks, failover routing

Disaster Recovery Strategies:

Strategy RTO RPO Cost Use Case
Backup/Restore Hours-Days Hours Lowest Non-critical
Pilot Light Minutes-Hours Minutes Low Important
Warm Standby Minutes Minutes Medium Business-critical
Active-Active Seconds Seconds Highest Mission-critical

Auto Scaling Policies:

  • Target Tracking: Maintain metric at target (e.g., 70% CPU)
  • Step Scaling: Scale based on CloudWatch alarm thresholds
  • Scheduled: Scale at specific times (e.g., business hours)
  • Predictive: ML-based forecasting for future demand

Common Exam Scenarios:

  • Need to decouple? → SQS queue between components
  • Need fan-out? → SNS topic with multiple subscriptions
  • Need strict ordering? → SQS FIFO with message group ID
  • Need high availability? → Multi-AZ deployment
  • Need disaster recovery? → Choose strategy based on RTO/RPO
  • Need to handle spikes? → Auto Scaling + SQS buffering
  • Need stateless app? → Store sessions in ElastiCache/DynamoDB
  • Need workflow orchestration? → Step Functions
  • Need serverless containers? → Fargate

You're ready to proceed when you can:

  • Design loosely coupled architectures with proper messaging
  • Choose the right load balancer and Auto Scaling strategy
  • Implement Multi-AZ deployments with automatic failover
  • Select appropriate DR strategy based on RTO/RPO requirements
  • Troubleshoot scaling and availability issues

Next: Move to Chapter 3: High-Performing Architectures to learn about performance optimization.


Chapter Summary

What We Covered

This chapter covered the essential concepts for designing resilient architectures on AWS, which accounts for 26% of the SAA-C03 exam. We explored two major task areas:

Task 2.1: Scalable and Loosely Coupled Architectures

  • āœ… Messaging services (SQS, SNS, EventBridge) for decoupling components
  • āœ… Serverless compute (Lambda, Fargate) for elastic scaling
  • āœ… Container orchestration (ECS, EKS) for microservices
  • āœ… API Gateway for RESTful and WebSocket APIs
  • āœ… Load balancing strategies (ALB, NLB, GLB)
  • āœ… Auto Scaling policies and lifecycle management
  • āœ… Caching strategies (CloudFront, ElastiCache)
  • āœ… Step Functions for workflow orchestration
  • āœ… Event-driven architecture patterns

Task 2.2: Highly Available and Fault-Tolerant Architectures

  • āœ… Multi-AZ deployments for high availability
  • āœ… Multi-region architectures for disaster recovery
  • āœ… Route 53 routing policies and health checks
  • āœ… RDS Multi-AZ and Aurora Global Database
  • āœ… DynamoDB Global Tables for multi-region replication
  • āœ… S3 Cross-Region Replication for data durability
  • āœ… Disaster recovery strategies (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
  • āœ… Backup and restore strategies using AWS Backup
  • āœ… Monitoring and observability with CloudWatch and X-Ray

Critical Takeaways

  1. Loose Coupling: Always decouple components using SQS queues, SNS topics, or EventBridge to prevent cascading failures and enable independent scaling.

  2. Message Ordering: Use SQS FIFO queues when strict ordering is required; use Standard queues for maximum throughput when order doesn't matter.

  3. Fan-Out Pattern: SNS + SQS fan-out enables one message to trigger multiple independent processing workflows without tight coupling.

  4. Multi-AZ vs Multi-Region: Multi-AZ protects against AZ failures (automatic failover in minutes); Multi-Region protects against region failures (requires manual or automated failover).

  5. RTO and RPO: Recovery Time Objective (how long to recover) and Recovery Point Objective (how much data loss acceptable) determine your DR strategy choice.

  6. Auto Scaling Policies: Target Tracking for steady-state metrics, Step Scaling for threshold-based scaling, Scheduled for predictable patterns, Predictive for ML-based forecasting.

  7. Load Balancer Selection: ALB for HTTP/HTTPS with advanced routing, NLB for TCP/UDP with ultra-low latency, GLB for third-party appliances.

  8. Serverless Benefits: Lambda and Fargate eliminate server management, scale automatically, and charge only for actual usage (no idle costs).

  9. State Management: Store session state in ElastiCache or DynamoDB (not on EC2 instances) to enable stateless application design and horizontal scaling.

  10. Health Checks: Implement health checks at multiple layers (Route 53, ELB, Auto Scaling) to detect and route around failures automatically.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Messaging and Decoupling:

  • Explain the difference between SQS Standard and FIFO queues
  • Describe when to use SQS vs SNS vs EventBridge
  • Design an SNS + SQS fan-out architecture
  • Configure SQS visibility timeout and dead-letter queues
  • Implement long polling to reduce costs

Serverless and Containers:

  • Explain when to use Lambda vs Fargate vs ECS on EC2
  • Configure Lambda concurrency limits and reserved concurrency
  • Design Step Functions workflows with error handling
  • Choose between ECS and EKS for container orchestration
  • Implement API Gateway with Lambda integration

Load Balancing and Auto Scaling:

  • Select the appropriate load balancer type (ALB vs NLB vs GLB)
  • Configure ALB path-based and host-based routing
  • Design Auto Scaling policies for different workload patterns
  • Implement lifecycle hooks for graceful instance termination
  • Configure cross-zone load balancing

High Availability:

  • Design Multi-AZ deployments for RDS, EFS, and ALB
  • Explain RDS Multi-AZ automatic failover process
  • Configure Aurora read replicas for read scaling
  • Implement Route 53 health checks and failover routing
  • Design stateless applications with external session storage

Disaster Recovery:

  • Calculate RTO and RPO for different DR strategies
  • Choose appropriate DR strategy based on business requirements
  • Design Backup and Restore strategy with AWS Backup
  • Implement Pilot Light architecture for critical systems
  • Configure Aurora Global Database for multi-region DR
  • Set up DynamoDB Global Tables for active-active replication
  • Design S3 Cross-Region Replication for data durability

Monitoring and Troubleshooting:

  • Configure CloudWatch alarms for Auto Scaling triggers
  • Use X-Ray for distributed tracing and bottleneck identification
  • Implement CloudWatch Logs for centralized logging
  • Monitor service quotas and request limit increases
  • Design retry strategies with exponential backoff

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-25 (Focus: Messaging and decoupling)
  • Domain 2 Bundle 2: Questions 26-50 (Focus: High availability and DR)
  • Full Practice Test 1: Domain 2 questions (Mixed difficulty)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review sections on messaging patterns (SQS, SNS, EventBridge)
  • Focus on Multi-AZ vs Multi-Region differences
  • Study DR strategy selection based on RTO/RPO
  • Practice Auto Scaling policy configuration
  • Review load balancer type selection criteria

Quick Reference Card

Copy this to your notes for quick review:

Messaging Services:

  • SQS Standard: Best-effort ordering, unlimited throughput, at-least-once delivery
  • SQS FIFO: Strict ordering, 3,000 msg/s (batched), exactly-once processing
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, schema registry, 100+ AWS service integrations

Serverless Compute:

  • Lambda: Event-driven, 15-min max, 10 GB memory, pay per invocation
  • Fargate: Serverless containers, no EC2 management, pay per vCPU/memory
  • Step Functions: Workflow orchestration, visual workflows, error handling

Load Balancers:

  • ALB: Layer 7, HTTP/HTTPS, path/host routing, WebSocket, $0.0225/hour
  • NLB: Layer 4, TCP/UDP, static IP, ultra-low latency, $0.0225/hour
  • GLB: Layer 3, third-party appliances, transparent proxy

High Availability:

  • Multi-AZ: Same region, different AZs, automatic failover (1-2 min)
  • Multi-Region: Different regions, manual/automated failover, global reach
  • RDS Multi-AZ: Synchronous replication, automatic failover, zero data loss
  • Aurora: 6 copies across 3 AZs, 15 read replicas, <30s failover

Disaster Recovery:

  • Backup/Restore: Lowest cost, hours RTO/RPO, use AWS Backup
  • Pilot Light: Core systems running, minutes RTO, moderate cost
  • Warm Standby: Scaled-down replica, minutes RTO, higher cost
  • Active-Active: Full capacity both regions, seconds RTO, highest cost

Auto Scaling:

  • Target Tracking: Maintain metric at target (e.g., 70% CPU)
  • Step Scaling: Scale based on alarm thresholds
  • Scheduled: Scale at specific times
  • Predictive: ML-based forecasting

Common Patterns:

  • Decouple → SQS queue between components
  • Fan-out → SNS + multiple SQS subscriptions
  • Ordering → SQS FIFO with message group ID
  • Workflow → Step Functions state machine
  • Stateless → Store sessions in ElastiCache/DynamoDB
  • Global → Route 53 + CloudFront + Multi-Region

Congratulations! You've completed Chapter 2: Design Resilient Architectures. You now understand how to build scalable, loosely coupled, highly available, and fault-tolerant systems on AWS.

Next Steps:

  1. Complete the self-assessment checklist above
  2. Practice with Domain 2 test bundles
  3. Review any weak areas identified
  4. When ready, proceed to Chapter 3: High-Performing Architectures

Chapter Summary

What We Covered

Task 2.1: Design Scalable and Loosely Coupled Architectures

  • āœ… Messaging services (SQS, SNS, EventBridge)
  • āœ… API Gateway for RESTful and WebSocket APIs
  • āœ… Serverless compute (Lambda, Fargate)
  • āœ… Container orchestration (ECS, EKS)
  • āœ… Load balancing (ALB, NLB, GLB)
  • āœ… Caching strategies (CloudFront, ElastiCache, DAX)
  • āœ… Workflow orchestration (Step Functions)
  • āœ… Microservices and event-driven architectures

Task 2.2: Design Highly Available and Fault-Tolerant Architectures

  • āœ… Multi-AZ and multi-region deployments
  • āœ… Route 53 routing policies and health checks
  • āœ… Disaster recovery strategies (backup/restore, pilot light, warm standby, active-active)
  • āœ… RDS Multi-AZ and Aurora Global Database
  • āœ… Auto Scaling and lifecycle hooks
  • āœ… Backup strategies with AWS Backup
  • āœ… Monitoring and observability (CloudWatch, X-Ray)
  • āœ… Chaos engineering with Fault Injection Simulator

Critical Takeaways

  1. Loose Coupling: Use queues (SQS) and pub/sub (SNS) to decouple components
  2. Stateless Design: Store session data externally (ElastiCache, DynamoDB)
  3. Horizontal Scaling: Scale out with Auto Scaling, not up with larger instances
  4. Multi-AZ by Default: Always deploy across multiple Availability Zones
  5. Caching Layers: Implement caching at multiple levels (CloudFront, API Gateway, ElastiCache)
  6. Async Processing: Use queues for background tasks, Step Functions for workflows
  7. Health Checks: Implement health checks at load balancer and Route 53 levels
  8. DR Planning: Choose DR strategy based on RTO/RPO requirements and budget

Self-Assessment Checklist

Test yourself before moving on:

Scalability & Loose Coupling

  • I can explain when to use SQS vs SNS vs EventBridge
  • I understand the difference between SQS Standard and FIFO queues
  • I know how to implement fan-out patterns with SNS and SQS
  • I can design event-driven architectures with EventBridge
  • I understand when to use Lambda vs Fargate vs ECS on EC2
  • I know how to implement API Gateway caching and throttling
  • I can explain the benefits of Step Functions for workflows

High Availability & Fault Tolerance

  • I can design multi-AZ architectures for high availability
  • I understand Route 53 routing policies (failover, weighted, latency)
  • I know the difference between RDS Multi-AZ and read replicas
  • I can explain Aurora Global Database benefits
  • I understand the four DR strategies and when to use each
  • I know how to implement Auto Scaling with proper health checks
  • I can design cross-region failover architectures

Load Balancing & Caching

  • I understand when to use ALB vs NLB vs GLB
  • I know how to configure ALB target groups and health checks
  • I can explain CloudFront caching behaviors and TTLs
  • I understand ElastiCache Redis vs Memcached use cases
  • I know when to use DynamoDB DAX for caching

Monitoring & Resilience

  • I can design CloudWatch alarms for critical metrics
  • I understand how to use X-Ray for distributed tracing
  • I know how to implement automated remediation with EventBridge
  • I can explain chaos engineering principles with FIS

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-50 (scalability and loose coupling)
  • Domain 2 Bundle 2: Questions 51-100 (high availability and fault tolerance)
  • Integration Services Bundle: 50 questions on SQS, SNS, EventBridge, Step Functions, API Gateway
  • Compute Services Bundle: 50 questions on Lambda, ECS, EKS, Fargate

Expected Score: 70%+ to proceed confidently

If you scored below 70%:

  • Review messaging patterns (SQS, SNS, EventBridge)
  • Practice designing multi-AZ and multi-region architectures
  • Focus on understanding DR strategies and RTO/RPO
  • Revisit load balancing and caching strategies

Quick Reference Card

Copy this to your notes for quick review:

Messaging Patterns:

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once processing, strict ordering, 3,000 msg/sec (batching)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, rule-based routing, 100+ AWS service integrations

Serverless Compute:

  • Lambda: Event-driven, 15-min max, pay per invocation
  • Fargate: Serverless containers, no EC2 management, pay per vCPU/memory
  • ECS: Container orchestration, EC2 or Fargate launch types
  • EKS: Managed Kubernetes, multi-cloud portability

Load Balancers:

  • ALB: Layer 7 (HTTP/HTTPS), path/host routing, WebSocket support
  • NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions of requests/sec
  • GLB: Layer 3 (IP), third-party appliances, transparent network gateway

Disaster Recovery:

  • Backup/Restore: Lowest cost, hours RTO/RPO, use AWS Backup
  • Pilot Light: Core systems running, minutes RTO, moderate cost
  • Warm Standby: Scaled-down replica, minutes RTO, higher cost
  • Active-Active: Full capacity both regions, seconds RTO, highest cost

Auto Scaling:

  • Target Tracking: Maintain metric at target (e.g., 70% CPU)
  • Step Scaling: Scale based on alarm thresholds
  • Scheduled: Scale at specific times
  • Predictive: ML-based forecasting

Common Patterns:

  • Decouple → SQS queue between components
  • Fan-out → SNS + multiple SQS subscriptions
  • Ordering → SQS FIFO with message group ID
  • Workflow → Step Functions state machine
  • Stateless → Store sessions in ElastiCache/DynamoDB
  • Global → Route 53 + CloudFront + Multi-Region

Chapter Summary

What We Covered

This chapter covered the two critical task areas for designing resilient architectures on AWS:

āœ… Task 2.1: Scalable and Loosely Coupled Architectures

  • Decoupling patterns with SQS, SNS, and EventBridge
  • Serverless architectures with Lambda and Fargate
  • Container orchestration with ECS and EKS
  • API Gateway for RESTful and WebSocket APIs
  • Load balancing with ALB, NLB, and GLB
  • Caching strategies with CloudFront and ElastiCache
  • Microservices design patterns
  • Event-driven architectures
  • Auto Scaling for elastic compute
  • Step Functions for workflow orchestration

āœ… Task 2.2: Highly Available and Fault-Tolerant Architectures

  • Multi-AZ deployments for high availability
  • Multi-region architectures for disaster recovery
  • Route 53 routing policies for failover and load distribution
  • RDS Multi-AZ and Aurora Global Database
  • DynamoDB Global Tables for multi-region replication
  • S3 Cross-Region Replication (CRR)
  • Disaster recovery strategies (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
  • Health checks and automated failover
  • Backup strategies with AWS Backup
  • Monitoring and observability with CloudWatch and X-Ray

Critical Takeaways

  1. Decouple Everything: Use queues (SQS) and topics (SNS) to decouple components. This prevents cascading failures and enables independent scaling.

  2. Design for Failure: Assume everything will fail. Implement health checks, automatic failover, and retry logic. Use multiple Availability Zones.

  3. Scale Horizontally: Add more instances rather than bigger instances. Use Auto Scaling groups with target tracking policies.

  4. Choose the Right DR Strategy: Match your RTO/RPO requirements to cost. Backup/Restore is cheapest but slowest. Active-Active is fastest but most expensive.

  5. Use Managed Services: Let AWS handle the heavy lifting. RDS Multi-AZ, Aurora, DynamoDB, and S3 provide built-in high availability.

  6. Implement Caching: Cache at every layer - CloudFront for edge, ElastiCache for application, DAX for DynamoDB, RDS read replicas for databases.

  7. Stateless Applications: Store session state externally (ElastiCache, DynamoDB). This enables easy horizontal scaling and failover.

  8. Monitor Everything: Use CloudWatch for metrics and alarms. Use X-Ray for distributed tracing. Set up composite alarms for complex failure scenarios.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Decoupling and Messaging:

  • Explain when to use SQS Standard vs FIFO queues
  • Design a fan-out pattern with SNS and SQS
  • Configure SQS visibility timeout and dead-letter queues
  • Implement event-driven architecture with EventBridge
  • Use Step Functions to orchestrate complex workflows
  • Design asynchronous processing with Lambda and SQS
  • Implement message filtering with SNS
  • Handle ordering requirements with SQS FIFO

Serverless and Containers:

  • Design serverless applications with Lambda and API Gateway
  • Configure Lambda concurrency limits and reserved capacity
  • Choose between ECS and EKS for container orchestration
  • Decide when to use Fargate vs EC2 launch type
  • Implement service discovery in ECS
  • Configure Lambda event source mappings
  • Use Lambda layers for code reuse
  • Design Lambda destinations for success/failure handling

Load Balancing and Auto Scaling:

  • Choose between ALB, NLB, and GLB for different use cases
  • Configure ALB path-based and host-based routing
  • Set up health checks for load balancers
  • Design Auto Scaling policies (target tracking, step, scheduled)
  • Implement lifecycle hooks for graceful shutdown
  • Configure cross-zone load balancing
  • Use NLB for ultra-low latency requirements
  • Implement sticky sessions with ALB

High Availability:

  • Design multi-AZ architectures for high availability
  • Configure RDS Multi-AZ for automatic failover
  • Implement Aurora Global Database for multi-region
  • Set up DynamoDB Global Tables
  • Configure S3 Cross-Region Replication
  • Use Route 53 health checks and failover routing
  • Implement EFS for shared file storage across AZs
  • Design for no single points of failure

Disaster Recovery:

  • Calculate RTO and RPO for business requirements
  • Choose appropriate DR strategy (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
  • Implement automated backups with AWS Backup
  • Configure cross-region backup replication
  • Design pilot light architecture with minimal running resources
  • Implement warm standby with scaled-down replica
  • Design active-active multi-region architecture
  • Test DR procedures regularly

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 2 Bundle 1: Questions 1-20 (SQS/SNS basics, Multi-AZ, Auto Scaling fundamentals)
  • Integration Services Bundle: Questions 1-15 (Messaging patterns, event-driven basics)

Intermediate Level (Target: 70%+ correct):

  • Domain 2 Bundle 2: Questions 21-40 (Advanced decoupling, DR strategies, container orchestration)
  • Full Practice Test 1: Domain 2 questions (Mixed difficulty, realistic scenarios)

Advanced Level (Target: 60%+ correct):

  • Full Practice Test 2: Domain 2 questions (Complex architectures, multi-region patterns)
  • Full Practice Test 3: Domain 2 questions (Advanced resilience patterns)

If You Scored Below Target

Below 60% on Beginner Questions:

  • Review sections: SQS/SNS Basics, Multi-AZ Deployments, Auto Scaling Fundamentals
  • Focus on: Queue types, pub/sub patterns, AZ concepts, basic scaling policies
  • Practice: Create SQS queues, configure SNS topics, set up Auto Scaling groups

Below 60% on Intermediate Questions:

  • Review sections: Event-Driven Architectures, DR Strategies, Container Orchestration
  • Focus on: EventBridge rules, RTO/RPO calculations, ECS vs EKS, Lambda patterns
  • Practice: Design fan-out patterns, calculate DR costs, deploy containers to ECS

Below 50% on Advanced Questions:

  • Review sections: Multi-Region Architectures, Complex Workflows, Microservices Patterns
  • Focus on: Active-active failover, Step Functions, saga pattern, circuit breaker
  • Practice: Design multi-region architecture, implement complex workflows, optimize for resilience

Quick Reference Card

Copy this to your notes for quick review

Messaging Services

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once processing, strict ordering, 300 TPS (3000 with batching)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, rule-based routing, 100+ AWS service integrations
  • Kinesis: Real-time streaming, ordered records, replay capability

Serverless Compute

  • Lambda: Event-driven, 15-minute max execution, pay per invocation
  • Fargate: Serverless containers, no EC2 management, pay per vCPU/memory
  • API Gateway: RESTful APIs, WebSocket APIs, throttling, caching

Container Orchestration

  • ECS: AWS-native, simpler, tight AWS integration
  • EKS: Kubernetes, multi-cloud portability, complex but powerful
  • Fargate: Serverless launch type for ECS/EKS
  • EC2 Launch Type: More control, lower cost, manage instances

Load Balancers

  • ALB: Layer 7 (HTTP/HTTPS), path/host routing, WebSocket, Lambda targets
  • NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions req/sec
  • GLB: Layer 3 (IP), third-party appliances, transparent gateway

Auto Scaling

  • Target Tracking: Maintain metric at target (e.g., 70% CPU)
  • Step Scaling: Scale based on alarm thresholds
  • Scheduled: Scale at specific times (predictable patterns)
  • Predictive: ML-based forecasting

High Availability

  • Multi-AZ: Deploy across multiple Availability Zones in same region
  • RDS Multi-AZ: Synchronous replication, automatic failover (1-2 min)
  • Aurora: Up to 15 read replicas, 6 copies across 3 AZs
  • DynamoDB: Automatically replicated across 3 AZs
  • S3: 99.999999999% durability, automatically replicated

Disaster Recovery

Strategy RTO RPO Cost Use Case
Backup/Restore Hours Hours $ Non-critical, cost-sensitive
Pilot Light 10s of minutes Minutes $$ Core systems only
Warm Standby Minutes Seconds $$$ Business-critical
Active-Active Seconds None $$$$ Mission-critical, zero downtime

Caching Strategies

  1. CloudFront: Edge caching (global), static/dynamic content
  2. API Gateway: Response caching (regional), API responses
  3. ElastiCache: Application caching (AZ), session state, database queries
  4. DAX: DynamoDB caching, microsecond latency
  5. RDS Read Replicas: Read scaling, up to 15 replicas

Decision Points

Scenario Solution
Need to decouple components SQS queue
Need to fan-out to multiple targets SNS topic
Need strict message ordering SQS FIFO
Need event-driven architecture EventBridge
Need to orchestrate workflows Step Functions
Need serverless compute Lambda
Need serverless containers Fargate
Need ultra-low latency LB NLB
Need path-based routing ALB
Need automatic database failover RDS Multi-AZ
Need multi-region database Aurora Global or DynamoDB Global Tables
Need to replicate S3 data Cross-Region Replication

Common Exam Traps

  • āŒ Tight coupling between components → āœ… Use queues/topics to decouple
  • āŒ Single AZ deployment → āœ… Deploy across multiple AZs
  • āŒ No health checks → āœ… Implement health checks and automatic failover
  • āŒ Stateful applications → āœ… Store state externally (ElastiCache, DynamoDB)
  • āŒ No caching → āœ… Implement caching at multiple layers
  • āŒ Manual scaling → āœ… Use Auto Scaling with appropriate policies
  • āŒ No DR plan → āœ… Implement appropriate DR strategy for RTO/RPO
  • āŒ Not testing failover → āœ… Regularly test DR procedures

Next Chapter: 04_domain3_high_performing_architectures - Learn how to design high-performing and scalable solutions.


Chapter Summary

What We Covered

This chapter covered the two critical task areas for designing resilient architectures on AWS:

āœ… Task 2.1: Scalable and Loosely Coupled Architectures

  • Microservices vs monolithic architectures
  • Event-driven architectures with EventBridge
  • Message queuing with SQS (Standard and FIFO)
  • Pub/sub messaging with SNS
  • API Gateway for RESTful and WebSocket APIs
  • Serverless compute with Lambda
  • Container orchestration with ECS and EKS
  • Workflow orchestration with Step Functions
  • Caching strategies with CloudFront and ElastiCache
  • Load balancing with ALB, NLB, and GLB
  • Auto Scaling for elastic capacity

āœ… Task 2.2: Highly Available and Fault-Tolerant Architectures

  • Multi-AZ deployments for high availability
  • Multi-region architectures for disaster recovery
  • Route 53 routing policies for failover and load distribution
  • RDS Multi-AZ and Aurora for database resilience
  • DynamoDB Global Tables for multi-region replication
  • S3 Cross-Region Replication for data durability
  • Disaster recovery strategies (Backup/Restore, Pilot Light, Warm Standby, Active-Active)
  • RTO and RPO considerations
  • Health checks and monitoring with CloudWatch
  • Automated failover and recovery

Critical Takeaways

  1. Loose Coupling is Key: Decouple components using queues (SQS), topics (SNS), and event buses (EventBridge). This allows independent scaling and failure isolation.

  2. Stateless Design: Design applications to be stateless. Store session state in ElastiCache or DynamoDB, not on EC2 instances. This enables horizontal scaling.

  3. Multi-AZ by Default: Always deploy across multiple Availability Zones. Use RDS Multi-AZ, Aurora with multiple replicas, ALB across AZs, and Auto Scaling groups spanning AZs.

  4. Choose the Right DR Strategy: Match your DR strategy to your RTO/RPO requirements. Backup/Restore is cheapest but slowest. Active-Active is fastest but most expensive.

  5. Automate Everything: Use Auto Scaling, health checks, and automated failover. Don't rely on manual intervention during failures.

  6. Cache Aggressively: Use CloudFront for edge caching, ElastiCache for application caching, and DAX for DynamoDB. Caching reduces load and improves performance.

  7. Message Ordering Matters: Use SQS FIFO when order matters (e.g., financial transactions). Use Standard SQS when order doesn't matter and you need maximum throughput.

  8. Serverless for Scalability: Lambda and Fargate automatically scale to handle load. No need to provision capacity in advance.

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between SQS Standard and FIFO
  • I understand when to use SNS vs SQS vs EventBridge
  • I can design a microservices architecture with loose coupling
  • I know how to implement event-driven patterns
  • I understand Lambda concurrency and scaling limits
  • I can design a multi-AZ architecture for high availability
  • I know the four disaster recovery strategies and when to use each
  • I understand RTO and RPO and how to calculate them
  • I can configure Route 53 for failover routing
  • I know the difference between RDS Multi-AZ and read replicas
  • I understand Aurora's high availability features
  • I can design a caching strategy for different use cases

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-25 (Scalability and loose coupling)
  • Domain 2 Bundle 2: Questions 1-25 (High availability and fault tolerance)
  • Integration Services Bundle: Questions 1-30 (SQS, SNS, EventBridge, Step Functions)
  • Compute Services Bundle: Questions 20-40 (Lambda, ECS, EKS)

Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review SQS visibility timeout and dead letter queues
  • Focus on understanding disaster recovery strategies and RTO/RPO
  • Study Lambda concurrency limits and provisioned concurrency
  • Practice designing multi-AZ architectures

Quick Reference Card

Messaging Services:

  • SQS Standard: At-least-once delivery, best-effort ordering, unlimited throughput
  • SQS FIFO: Exactly-once delivery, strict ordering, 3,000 msg/sec (batching: 30,000)
  • SNS: Pub/sub, fan-out to multiple subscribers, push-based
  • EventBridge: Event bus, rule-based routing, 100+ AWS service integrations

Compute Options:

  • Lambda: Serverless, event-driven, 15-minute max execution, auto-scaling
  • Fargate: Serverless containers, no EC2 management, pay per task
  • ECS: Container orchestration, EC2 or Fargate launch types
  • EKS: Managed Kubernetes, multi-cloud portability

High Availability:

  • Multi-AZ: Deploy across 2+ AZs in same region
  • RDS Multi-AZ: Synchronous replication, automatic failover (1-2 min)
  • Aurora: 6 copies across 3 AZs, up to 15 read replicas
  • ALB: Automatically distributes across AZs, health checks

Disaster Recovery:

Strategy RTO RPO Cost Use Case
Backup/Restore Hours Hours $ Non-critical
Pilot Light 10s of min Minutes $$ Core systems
Warm Standby Minutes Seconds $$$ Business-critical
Active-Active Seconds None $$$$ Mission-critical

Caching Layers:

  • CloudFront: Edge caching (global), static/dynamic content
  • API Gateway: Response caching (regional), API responses
  • ElastiCache: Application caching (AZ), session state, database queries
  • DAX: DynamoDB caching, microsecond latency

Key Decision Points:

  • Need to decouple components → SQS queue
  • Need to fan-out to multiple targets → SNS topic
  • Need strict message ordering → SQS FIFO
  • Need event-driven architecture → EventBridge
  • Need to orchestrate workflows → Step Functions
  • Need serverless compute → Lambda
  • Need serverless containers → Fargate
  • Need high availability → Multi-AZ deployment
  • Need disaster recovery → Choose strategy based on RTO/RPO

Next Chapter: 04_domain3_high_performing_architectures - Learn how to design high-performing and scalable solutions.


Chapter 3: Design High-Performing Architectures (24% of exam)

Chapter Overview

What you'll learn:

  • High-performing storage solutions (S3, EBS, EFS, FSx)
  • Elastic compute solutions (EC2, Lambda, containers)
  • High-performing database solutions (RDS, DynamoDB, caching)
  • Scalable network architectures (CloudFront, Global Accelerator)
  • Data ingestion and transformation (Kinesis, Glue, Athena)

Time to complete: 10-12 hours

Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (Secure Architectures), Chapter 2 (Resilient Architectures)

Exam Weight: 24% of exam questions (approximately 16 out of 65 questions)


Section 1: High-Performing Storage Solutions

Introduction

The problem: Different workloads have vastly different storage requirements. A database needs low-latency block storage with high IOPS. A data lake needs cost-effective object storage for petabytes of data. A shared file system needs concurrent access from multiple servers. Using the wrong storage type results in poor performance, high costs, or both.

The solution: AWS provides multiple storage services optimized for different use cases. Understanding the characteristics of each service (performance, durability, cost, access patterns) enables you to choose the right storage for each workload.

Why it's tested: Storage performance directly impacts application performance. This domain represents 24% of the exam and tests your ability to select and configure storage services for optimal performance and cost.

Core Concepts

Amazon S3 Performance Optimization

What it is: Amazon S3 is object storage built to store and retrieve any amount of data from anywhere. S3 automatically scales to handle high request rates and provides 99.999999999% (11 9's) durability.

Why it exists: Traditional file systems don't scale to petabytes of data or millions of requests per second. S3 provides virtually unlimited scalability with built-in redundancy, versioning, and lifecycle management.

Real-world analogy: S3 is like a massive warehouse with infinite capacity. You can store anything (objects), organize with labels (metadata and tags), and retrieve items instantly. The warehouse automatically expands as you add more items, and items are replicated to multiple locations for safety.

S3 Performance Characteristics:

Request Rate Limits (per prefix):

  • GET/HEAD: 5,500 requests per second per prefix
  • PUT/COPY/POST/DELETE: 3,500 requests per second per prefix
  • Prefix: Any string between bucket name and object name
    • Example: s3://my-bucket/folder1/subfolder/object.jpg
    • Prefix: folder1/subfolder/

Throughput:

  • Single PUT: Up to 5 GB per object
  • Multipart Upload: Up to 5 TB per object (recommended for >100 MB)
  • Transfer Acceleration: Up to 50-500% faster uploads over long distances

Latency:

  • First Byte: 100-200ms (typical)
  • Subsequent Bytes: Limited by network bandwidth

How to Optimize S3 Performance:

1. Use Multiple Prefixes for High Request Rates:

If you need more than 5,500 GET requests per second, distribute objects across multiple prefixes.

Example:

  • Bad: All objects in single prefix

    • s3://my-bucket/images/img001.jpg
    • s3://my-bucket/images/img002.jpg
    • Limit: 5,500 GET/sec
  • Good: Objects distributed across multiple prefixes

    • s3://my-bucket/images/2024/01/15/img001.jpg
    • s3://my-bucket/images/2024/01/16/img002.jpg
    • Each date prefix: 5,500 GET/sec
    • 10 date prefixes: 55,000 GET/sec total

2. Use Multipart Upload for Large Objects:

For objects >100 MB, use multipart upload to:

  • Upload parts in parallel (faster)
  • Resume failed uploads (reliability)
  • Upload while creating object (streaming)

Example:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Configure multipart upload
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # 100 MB
    max_concurrency=10,  # 10 parallel uploads
    multipart_chunksize=10 * 1024 * 1024,  # 10 MB per part
)

# Upload large file
s3.upload_file(
    'large-file.zip',  # 5 GB file
    'my-bucket',
    'uploads/large-file.zip',
    Config=config
)

Performance:

  • Without multipart: 5 GB / 100 Mbps = 400 seconds (6.7 minutes)
  • With multipart (10 parallel): 5 GB / 1 Gbps = 40 seconds
  • Speedup: 10x faster

3. Use S3 Transfer Acceleration for Long-Distance Uploads:

Transfer Acceleration uses CloudFront edge locations to accelerate uploads. Data is routed over AWS's optimized network instead of public internet.

How it works:

  1. Enable Transfer Acceleration on bucket
  2. Use accelerated endpoint: my-bucket.s3-accelerate.amazonaws.com
  3. Upload to nearest edge location
  4. AWS routes data to S3 bucket over optimized network

Performance Improvement:

  • US to US: 0-20% faster (already close)
  • US to Asia: 50-200% faster
  • Asia to US: 100-500% faster

Example:

s3 = boto3.client('s3', endpoint_url='https://s3-accelerate.amazonaws.com')
s3.upload_file('file.zip', 'my-bucket', 'uploads/file.zip')

Cost: $0.04 per GB transferred (in addition to standard transfer costs)

4. Use S3 Select to Retrieve Subset of Data:

S3 Select allows you to retrieve only the data you need from an object using SQL expressions, reducing data transfer and improving performance.

Example:

  • Without S3 Select: Download entire 1 GB CSV file, filter locally

    • Data transferred: 1 GB
    • Time: 80 seconds (at 100 Mbps)
    • Cost: $0.09 (data transfer out)
  • With S3 Select: Filter on S3, download only matching rows (10 MB)

    • Data transferred: 10 MB
    • Time: 1 second
    • Cost: $0.002 (S3 Select) + $0.0009 (data transfer) = $0.003
    • Savings: 97% cost reduction, 80x faster

Example Query:

response = s3.select_object_content(
    Bucket='my-bucket',
    Key='data/sales.csv',
    Expression='SELECT * FROM S3Object WHERE amount > 1000',
    ExpressionType='SQL',
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization={'CSV': {}}
)

5. Use CloudFront for Frequently Accessed Objects:

CloudFront caches objects at edge locations worldwide, reducing latency and S3 request costs.

Performance:

  • Direct S3: 100-200ms latency (from distant region)
  • CloudFront: 10-50ms latency (from edge location)
  • Improvement: 2-10x faster

Cost Savings:

  • S3 GET: $0.0004 per 1,000 requests
  • CloudFront: $0.0075 per 10,000 requests (after first 10 TB)
  • Savings: 81% reduction in request costs

Detailed Example 1: High-Performance Image Serving

Scenario: You're building a photo sharing app with 10 million users. Users upload and view photos. Requirements:

  • Handle 100,000 uploads per hour (28 uploads/sec)
  • Handle 1 million views per hour (278 views/sec)
  • Low latency worldwide
  • Cost-effective

Architecture:

  1. S3 Bucket: Store original images
  2. Lambda: Resize images on upload
  3. CloudFront: Cache and serve images
  4. Transfer Acceleration: Fast uploads from anywhere

Implementation:

Step 1: Configure S3 Bucket:

# Create bucket
aws s3 mb s3://photo-app-images

# Enable Transfer Acceleration
aws s3api put-bucket-accelerate-configuration \
  --bucket photo-app-images \
  --accelerate-configuration Status=Enabled

# Enable versioning (for accidental deletes)
aws s3api put-bucket-versioning \
  --bucket photo-app-images \
  --versioning-configuration Status=Enabled

Step 2: Organize with Prefixes:

s3://photo-app-images/
  uploads/
    2024/01/15/user123/photo1.jpg
    2024/01/15/user456/photo2.jpg
  thumbnails/
    2024/01/15/user123/photo1.jpg
  medium/
    2024/01/15/user123/photo1.jpg
  large/
    2024/01/15/user123/photo1.jpg

Benefits:

  • Date-based prefixes distribute load (365 prefixes per year)
  • Each prefix handles 3,500 PUT/sec = 1.2M uploads/hour capacity
  • User ID in path enables easy querying

Step 3: Upload with Transfer Acceleration:

# Mobile app upload code
import boto3

s3 = boto3.client('s3', 
    endpoint_url='https://s3-accelerate.amazonaws.com')

def upload_photo(user_id, photo_data):
    from datetime import datetime
    date_prefix = datetime.now().strftime('%Y/%m/%d')
    key = f'uploads/{date_prefix}/{user_id}/{photo_id}.jpg'
    
    s3.upload_fileobj(
        photo_data,
        'photo-app-images',
        key,
        ExtraArgs={
            'ContentType': 'image/jpeg',
            'Metadata': {
                'user-id': user_id,
                'upload-time': datetime.now().isoformat()
            }
        }
    )

Performance:

  • User in Tokyo uploads to us-east-1 bucket
  • Without acceleration: 2-3 seconds (slow internet path)
  • With acceleration: 0.5-1 second (optimized AWS network)
  • Improvement: 2-3x faster

Step 4: Automatic Resizing with Lambda:

# Lambda function triggered by S3 upload
import boto3
from PIL import Image
import io

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Download original
    response = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(response['Body'])
    
    # Create sizes
    sizes = {
        'thumbnails': (200, 200),
        'medium': (800, 800),
        'large': (1600, 1600)
    }
    
    for size_name, dimensions in sizes.items():
        # Resize
        resized = image.copy()
        resized.thumbnail(dimensions)
        
        # Upload
        buffer = io.BytesIO()
        resized.save(buffer, format='JPEG', quality=85)
        buffer.seek(0)
        
        new_key = key.replace('uploads/', f'{size_name}/')
        s3.upload_fileobj(buffer, bucket, new_key)

Step 5: Serve with CloudFront:

# Create CloudFront distribution
aws cloudfront create-distribution \
  --origin-domain-name photo-app-images.s3.amazonaws.com \
  --default-cache-behavior '{
    "TargetOriginId": "S3-photo-app-images",
    "ViewerProtocolPolicy": "redirect-to-https",
    "AllowedMethods": ["GET", "HEAD"],
    "CachedMethods": ["GET", "HEAD"],
    "Compress": true,
    "DefaultTTL": 86400
  }'

Step 6: Application Serves Images:

# Web app code
CLOUDFRONT_DOMAIN = 'd123456.cloudfront.net'

def get_image_url(photo_id, size='medium'):
    # Construct CloudFront URL
    date_prefix = get_date_prefix(photo_id)
    user_id = get_user_id(photo_id)
    return f'https://{CLOUDFRONT_DOMAIN}/{size}/{date_prefix}/{user_id}/{photo_id}.jpg'

Performance Results:

Uploads:

  • Capacity: 3,500 PUT/sec per date prefix Ɨ 365 dates = 1.2M uploads/hour
  • Actual load: 28 uploads/sec = 100K uploads/hour
  • Headroom: 12x capacity

Views:

  • Without CloudFront: 278 GET/sec Ɨ $0.0004/1K = $0.11/sec = $396/hour
  • With CloudFront: 90% cache hit rate
    • CloudFront: 250 GET/sec (cached, no S3 cost)
    • S3: 28 GET/sec (cache misses) Ɨ $0.0004/1K = $0.01/hour
    • CloudFront cost: $0.0075/10K requests = $0.07/hour
    • Total: $0.08/hour vs $396/hour
    • Savings: 99.98% cost reduction

Latency:

  • Direct S3 (Tokyo → us-east-1): 150ms
  • CloudFront (Tokyo → Tokyo edge): 20ms
  • Improvement: 7.5x faster

Amazon EBS Performance Optimization

What it is: Amazon Elastic Block Store (EBS) provides block-level storage volumes for EC2 instances. EBS volumes are network-attached storage that persist independently of instance lifetime.

Why it exists: Instance store (ephemeral storage) is lost when instance stops. Applications need persistent storage that survives instance failures, can be backed up (snapshots), and can be attached to different instances.

Real-world analogy: EBS is like an external hard drive that you can plug into different computers. The drive retains data even when unplugged. You can make copies (snapshots) and create new drives from those copies.

EBS Volume Types:

General Purpose SSD (gp3) - Balanced price/performance:

  • Baseline: 3,000 IOPS, 125 MB/s throughput
  • Configurable: Up to 16,000 IOPS, 1,000 MB/s
  • Size: 1 GB - 16 TB
  • Cost: $0.08/GB-month + $0.005/provisioned IOPS (above 3,000) + $0.04/MB/s (above 125)
  • Use Case: Boot volumes, dev/test, low-latency apps

General Purpose SSD (gp2) - Previous generation:

  • Performance: 3 IOPS per GB (min 100, max 16,000)
  • Burst: Up to 3,000 IOPS using burst credits
  • Size: 1 GB - 16 TB
  • Cost: $0.10/GB-month
  • Use Case: Legacy workloads (gp3 is better for new workloads)

Provisioned IOPS SSD (io2 Block Express) - Highest performance:

  • IOPS: Up to 256,000 IOPS per volume
  • Throughput: Up to 4,000 MB/s
  • Size: 4 GB - 64 TB
  • Latency: Sub-millisecond
  • Durability: 99.999% (5 9's)
  • Cost: $0.125/GB-month + $0.065/provisioned IOPS
  • Use Case: Mission-critical databases, high-performance workloads

Provisioned IOPS SSD (io2) - High performance:

  • IOPS: Up to 64,000 IOPS per volume (256,000 per instance)
  • Throughput: Up to 1,000 MB/s
  • Size: 4 GB - 16 TB
  • Durability: 99.999% (5 9's)
  • Cost: $0.125/GB-month + $0.065/provisioned IOPS
  • Use Case: I/O-intensive databases, critical applications

Throughput Optimized HDD (st1) - Low-cost HDD:

  • Throughput: Up to 500 MB/s
  • IOPS: Up to 500 IOPS
  • Size: 125 GB - 16 TB
  • Cost: $0.045/GB-month
  • Use Case: Big data, data warehouses, log processing

Cold HDD (sc1) - Lowest cost:

  • Throughput: Up to 250 MB/s
  • IOPS: Up to 250 IOPS
  • Size: 125 GB - 16 TB
  • Cost: $0.015/GB-month
  • Use Case: Infrequently accessed data, archives

EBS Performance Factors:

1. Instance Type Limits:

  • Each instance type has maximum EBS bandwidth
  • Example: t3.medium = 2,085 Mbps (260 MB/s)
  • Example: m5.4xlarge = 4,750 Mbps (593 MB/s)
  • Volume performance limited by instance bandwidth

2. Volume Size and IOPS:

  • gp3: Configurable IOPS independent of size
  • gp2: IOPS = size Ɨ 3 (larger volume = more IOPS)
  • io2: Provision exact IOPS needed

3. I/O Size:

  • IOPS measured in 16 KB chunks
  • 256 KB write = 16 IOPS (256 / 16)
  • Larger I/O sizes consume more IOPS

4. Queue Depth:

  • Number of pending I/O requests
  • Higher queue depth = better throughput (up to a point)
  • Optimal: 4-32 for most workloads

Detailed Example 2: Database Performance Tuning with EBS

Scenario: You're running a PostgreSQL database on EC2. Current performance:

  • Instance: m5.xlarge (4 vCPU, 16 GB RAM)
  • Volume: gp2 500 GB (1,500 IOPS baseline)
  • Workload: 3,000 IOPS, 200 MB/s throughput
  • Problem: Database slow during peak hours (IOPS throttling)

Analysis:

Current Configuration:

  • gp2 500 GB = 1,500 IOPS baseline
  • Burst credits: 5.4 million (can burst to 3,000 IOPS for 30 minutes)
  • After burst credits exhausted: Throttled to 1,500 IOPS
  • Problem: Workload needs sustained 3,000 IOPS, but volume only provides 1,500

Solution Options:

Option 1: Increase gp2 Volume Size:

  • Need 3,000 IOPS = 1,000 GB volume (3 IOPS per GB)
  • Cost: 1,000 GB Ɨ $0.10 = $100/month
  • Downside: Paying for storage you don't need

Option 2: Switch to gp3:

  • gp3 500 GB with 3,000 IOPS provisioned
  • Cost: (500 Ɨ $0.08) + (0 Ɨ $0.005) = $40/month
  • Savings: $60/month (60% reduction)
  • Performance: Sustained 3,000 IOPS, no burst credits needed

Option 3: Switch to io2 (if need more performance):

  • io2 500 GB with 10,000 IOPS provisioned
  • Cost: (500 Ɨ $0.125) + (10,000 Ɨ $0.065) = $62.50 + $650 = $712.50/month
  • Use Case: Only if need >16,000 IOPS or sub-millisecond latency

Recommendation: Switch to gp3 (Option 2)

Implementation:

# Create snapshot of current volume
aws ec2 create-snapshot \
  --volume-id vol-1234567890abcdef0 \
  --description "Before gp3 migration"

# Create new gp3 volume from snapshot
aws ec2 create-volume \
  --snapshot-id snap-0987654321fedcba0 \
  --availability-zone us-east-1a \
  --volume-type gp3 \
  --size 500 \
  --iops 3000 \
  --throughput 125

# Stop database
sudo systemctl stop postgresql

# Detach old volume
aws ec2 detach-volume --volume-id vol-1234567890abcdef0

# Attach new volume
aws ec2 attach-volume \
  --volume-id vol-new123456789abcdef \
  --instance-id i-1234567890abcdef0 \
  --device /dev/sdf

# Start database
sudo systemctl start postgresql

Performance Results:

  • Before: 1,500 IOPS sustained, 3,000 IOPS burst (30 minutes)
  • After: 3,000 IOPS sustained, no throttling
  • Latency: Reduced from 50ms (throttled) to 5ms (normal)
  • Cost: $100/month → $40/month (60% savings)

Additional Optimizations:

1. Use EBS-Optimized Instances:

  • Dedicated bandwidth for EBS traffic
  • Prevents network traffic from affecting storage performance
  • Most modern instance types are EBS-optimized by default

2. Use Multiple Volumes for Parallel I/O:

  • Stripe multiple volumes using RAID 0
  • Example: 4 Ɨ gp3 volumes (3,000 IOPS each) = 12,000 IOPS total
  • Use case: Databases with high I/O requirements

3. Enable EBS Fast Snapshot Restore:

  • Snapshots normally have performance penalty on first access (lazy loading)
  • Fast Snapshot Restore eliminates this penalty
  • Cost: $0.75 per snapshot per AZ per hour
  • Use case: Disaster recovery, quick instance launches

Amazon EFS Performance

What it is: Amazon Elastic File System (EFS) is a fully managed, elastic, shared file system for Linux workloads. Multiple EC2 instances can access the same EFS file system simultaneously.

Why it exists: EBS volumes can only be attached to one instance at a time. Applications that need shared file access (web servers serving same content, data processing pipelines, content management systems) require a shared file system.

Real-world analogy: EFS is like a shared network drive in an office. Multiple employees (EC2 instances) can access the same files simultaneously. When one person updates a file, others see the changes immediately. The drive automatically expands as you add more files.

EFS Performance Modes:

General Purpose (default):

  • Latency: Low latency (single-digit milliseconds)
  • Throughput: Up to 7,000 file operations per second
  • Use Case: Web serving, content management, development

Max I/O:

  • Latency: Higher latency (tens of milliseconds)
  • Throughput: >7,000 file operations per second
  • Use Case: Big data, media processing, high parallelism

EFS Throughput Modes:

Bursting (default):

  • Baseline: 50 MB/s per TB of storage
  • Burst: Up to 100 MB/s (using burst credits)
  • Burst Credits: Accumulate when below baseline
  • Use Case: Variable workloads, cost-sensitive

Provisioned:

  • Throughput: Configure exact throughput (1-1,024 MB/s)
  • Independent: Throughput independent of storage size
  • Cost: $6/MB/s-month
  • Use Case: Consistent high throughput needed

Elastic (recommended):

  • Automatic: Scales throughput automatically based on workload
  • Up to: 3 GB/s reads, 1 GB/s writes
  • Cost: Pay for throughput used (no provisioning)
  • Use Case: Unpredictable workloads, simplicity

Detailed Example 3: Shared Web Content with EFS

Scenario: You're running a WordPress site on multiple EC2 instances behind an ALB. All instances need access to the same uploaded media files (images, videos). Requirements:

  • Shared access from all web servers
  • Automatic scaling (don't want to manage storage)
  • Cost-effective

Architecture:

  1. ALB: Distributes traffic to web servers
  2. Auto Scaling Group: 2-10 EC2 instances
  3. EFS: Shared file system for WordPress uploads
  4. RDS: Database (separate from file storage)

Implementation:

Step 1: Create EFS File System:

# Create EFS file system
aws efs create-file-system \
  --performance-mode generalPurpose \
  --throughput-mode elastic \
  --encrypted \
  --tags Key=Name,Value=wordpress-media

# Create mount targets in each AZ
aws efs create-mount-target \
  --file-system-id fs-12345678 \
  --subnet-id subnet-1a \
  --security-groups sg-efs

aws efs create-mount-target \
  --file-system-id fs-12345678 \
  --subnet-id subnet-1b \
  --security-groups sg-efs

Step 2: Configure Security Group:

# Allow NFS traffic from web servers
aws ec2 authorize-security-group-ingress \
  --group-id sg-efs \
  --protocol tcp \
  --port 2049 \
  --source-group sg-web-servers

Step 3: Mount EFS on EC2 Instances:

# Install EFS mount helper
sudo yum install -y amazon-efs-utils

# Create mount point
sudo mkdir -p /var/www/html/wp-content/uploads

# Mount EFS
sudo mount -t efs -o tls fs-12345678:/ /var/www/html/wp-content/uploads

# Add to /etc/fstab for automatic mount on boot
echo "fs-12345678:/ /var/www/html/wp-content/uploads efs _netdev,tls 0 0" | sudo tee -a /etc/fstab

Step 4: Configure WordPress:

// wp-config.php
define('UPLOADS', 'wp-content/uploads');

Traffic Flow:

  1. User uploads image to WordPress
  2. WordPress saves to /var/www/html/wp-content/uploads/2024/01/image.jpg
  3. File written to EFS (accessible from all instances)
  4. User requests image
  5. ALB routes to any web server
  6. Web server reads from EFS and serves image

Performance:

  • Storage: 100 GB of media files
  • Baseline Throughput: 100 GB Ɨ 50 MB/s per TB = 5 MB/s
  • Burst Throughput: Up to 100 MB/s (for short periods)
  • Actual Usage: 10 MB/s average (well within limits)

Scaling:

  • Auto Scaling adds new instance
  • Instance automatically mounts EFS
  • Instance immediately has access to all media files
  • No manual file synchronization needed

Cost:

  • EFS Standard: 100 GB Ɨ $0.30 = $30/month
  • EFS Infrequent Access: 50 GB Ɨ $0.025 = $1.25/month (for old files)
  • Total: $31.25/month

Compared to EBS:

  • EBS: Would need to sync files between instances (complex, error-prone)
  • EBS: Each instance needs separate volume (100 GB Ɨ 10 instances = 1 TB)
  • EBS Cost: 1 TB Ɨ $0.10 = $100/month
  • EFS Savings: $68.75/month (69% reduction)

Section 2: High-Performing Compute Solutions

Introduction

The problem: Different workloads have different compute requirements. A web server needs consistent CPU. A batch job needs high CPU for short bursts. A machine learning model needs GPU acceleration. Using the wrong compute type results in poor performance or wasted money.

The solution: AWS provides multiple compute options optimized for different workloads. Understanding instance families, sizing, and pricing models enables you to choose the right compute for each workload.

Why it's tested: Compute is the foundation of most applications. This section tests your ability to select appropriate instance types, configure auto scaling, and optimize compute costs while maintaining performance.

Core Concepts

EC2 Instance Types and Families

What they are: EC2 instance types are combinations of CPU, memory, storage, and networking capacity. Instance families are groups of instance types optimized for specific workloads.

Why they exist: One size doesn't fit all. A database needs lots of memory. A video encoder needs powerful CPU. A machine learning model needs GPU. Instance families provide optimized hardware for each use case.

Real-world analogy: Instance types are like vehicles. A sports car (compute-optimized) is fast but has little cargo space. A truck (memory-optimized) carries heavy loads but isn't fast. An SUV (general purpose) balances both. You choose based on your needs.

Instance Families:

General Purpose (T, M, A):

  • Balance: CPU, memory, networking
  • T3/T3a: Burstable CPU (baseline + burst credits)
    • Use case: Web servers, dev/test, small databases
    • Cost: $0.0104/hour (t3.medium)
  • M5/M5a: Consistent performance
    • Use case: Application servers, medium databases
    • Cost: $0.096/hour (m5.xlarge)
  • M6i: Latest generation (Intel Ice Lake)
    • Use case: General workloads, best price/performance
    • Cost: $0.192/hour (m6i.xlarge)

Compute Optimized (C):

  • High CPU: High CPU-to-memory ratio
  • C5/C5a: Intel/AMD processors
    • Use case: Batch processing, media transcoding, gaming servers
    • Cost: $0.085/hour (c5.xlarge)
  • C6i: Latest generation
    • Use case: High-performance computing, scientific modeling
    • Cost: $0.17/hour (c6i.xlarge)

Memory Optimized (R, X, Z):

  • High Memory: High memory-to-CPU ratio
  • R5/R5a: General memory-intensive
    • Use case: In-memory databases (Redis, Memcached), big data
    • Cost: $0.252/hour (r5.xlarge)
  • X1e: Extreme memory (up to 3,904 GB)
    • Use case: SAP HANA, in-memory databases
    • Cost: $26.688/hour (x1e.32xlarge)
  • Z1d: High frequency + memory
    • Use case: Electronic design automation, gaming
    • Cost: $0.744/hour (z1d.xlarge)

Storage Optimized (I, D, H):

  • High I/O: NVMe SSD instance store
  • I3/I3en: High IOPS, low latency
    • Use case: NoSQL databases, data warehousing
    • Cost: $0.312/hour (i3.xlarge)
  • D2: Dense HDD storage
    • Use case: MapReduce, Hadoop, log processing
    • Cost: $0.69/hour (d2.xlarge)

Accelerated Computing (P, G, F):

  • GPU/FPGA: Specialized processors
  • P3: NVIDIA V100 GPUs
    • Use case: Machine learning training, HPC
    • Cost: $3.06/hour (p3.2xlarge)
  • G4: NVIDIA T4 GPUs
    • Use case: ML inference, graphics workstations
    • Cost: $1.20/hour (g4dn.xlarge)
  • F1: FPGA
    • Use case: Genomics, financial analytics
    • Cost: $1.65/hour (f1.2xlarge)

Instance Sizing:

  • nano: 0.5 vCPU, 0.5 GB RAM
  • micro: 1 vCPU, 1 GB RAM
  • small: 1 vCPU, 2 GB RAM
  • medium: 2 vCPU, 4 GB RAM
  • large: 2 vCPU, 8 GB RAM
  • xlarge: 4 vCPU, 16 GB RAM
  • 2xlarge: 8 vCPU, 32 GB RAM
  • 4xlarge: 16 vCPU, 64 GB RAM
  • (continues to 96xlarge for some families)

Detailed Example 4: Right-Sizing EC2 Instances

Scenario: You're running a web application on m5.2xlarge instances (8 vCPU, 32 GB RAM). CloudWatch shows:

  • Average CPU: 15%
  • Average Memory: 8 GB (25%)
  • Network: 100 Mbps
  • Cost: $0.384/hour Ɨ 10 instances Ɨ 730 hours = $2,803/month

Analysis: Significantly over-provisioned. Let's right-size.

Option 1: Downsize to m5.large:

  • Specs: 2 vCPU, 8 GB RAM
  • CPU: 15% Ɨ 8 vCPU = 1.2 vCPU used → 60% on m5.large (acceptable)
  • Memory: 8 GB (100% of m5.large) → Tight but acceptable
  • Cost: $0.096/hour Ɨ 10 instances Ɨ 730 hours = $701/month
  • Savings: $2,102/month (75% reduction)

Option 2: Switch to t3.large (burstable):

  • Specs: 2 vCPU, 8 GB RAM, 30% baseline CPU
  • CPU: 15% average < 30% baseline → No burst credits needed
  • Memory: 8 GB (100% of t3.large)
  • Cost: $0.0832/hour Ɨ 10 instances Ɨ 730 hours = $607/month
  • Savings: $2,196/month (78% reduction)

Option 3: Reduce instance count + upsize:

  • Current: 10 Ɨ m5.2xlarge (80 vCPU total, 15% used = 12 vCPU)
  • New: 4 Ɨ m5.xlarge (16 vCPU total, 75% used = 12 vCPU)
  • Memory: 4 Ɨ 16 GB = 64 GB total (8 GB used per instance)
  • Cost: $0.192/hour Ɨ 4 instances Ɨ 730 hours = $561/month
  • Savings: $2,242/month (80% reduction)
  • Benefit: Fewer instances to manage

Recommendation: Option 3 (4 Ɨ m5.xlarge)

Implementation:

# Update Auto Scaling launch template
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name web-app-asg \
  --launch-template LaunchTemplateName=web-app-template,Version=2 \
  --min-size 4 \
  --max-size 12 \
  --desired-capacity 4

# Launch template specifies m5.xlarge

Monitoring After Change:

  • Week 1: CPU 75%, Memory 50% → Good utilization
  • Week 2: Traffic spike, Auto Scaling adds 4 more instances → Handles load
  • Week 3: Traffic normal, scales back to 4 instances → Cost optimized

Result:

  • Performance: Same (adequate CPU/memory)
  • Cost: $2,803 → $561/month (80% savings)
  • Scalability: Still scales to 12 instances during peaks

Section 3: High-Performing Database Solutions

Introduction

The problem: Databases are often the performance bottleneck in applications. Slow queries, insufficient IOPS, connection limits, and lack of caching can degrade application performance. Choosing the wrong database type or configuration results in poor performance and high costs.

The solution: AWS provides multiple database services optimized for different data models and access patterns. Understanding database types, performance tuning, caching strategies, and read scaling enables you to build high-performing data layers.

Why it's tested: Database performance is critical for most applications. This section tests your ability to select appropriate database services, configure for performance, and implement caching strategies.

Core Concepts

Amazon RDS Performance Optimization

What it is: Amazon RDS is a managed relational database service supporting MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server. RDS handles provisioning, patching, backup, and recovery.

Why it exists: Managing database servers is complex - patching, backups, replication, failover. RDS automates these tasks, allowing you to focus on application development and performance tuning.

RDS Performance Factors:

1. Instance Type:

  • db.t3: Burstable CPU (dev/test, small workloads)
  • db.m5: General purpose (balanced CPU/memory)
  • db.r5: Memory optimized (large datasets, caching)
  • db.x1e: Extreme memory (SAP HANA, in-memory)

2. Storage Type:

  • General Purpose SSD (gp3): 3,000-16,000 IOPS, 125-1,000 MB/s
  • Provisioned IOPS SSD (io1): Up to 64,000 IOPS, 1,000 MB/s
  • Magnetic: Legacy, not recommended

3. Read Replicas:

  • Asynchronous replication from primary
  • Offload read traffic from primary
  • Up to 15 read replicas per primary
  • Can be in different regions

4. Connection Pooling:

  • RDS Proxy manages connection pool
  • Reduces connection overhead
  • Improves scalability

Detailed Example 5: Database Performance Tuning

Scenario: You're running a MySQL database on RDS. Performance issues:

  • Instance: db.m5.large (2 vCPU, 8 GB RAM)
  • Storage: gp2 100 GB (300 IOPS baseline)
  • Workload: 1,000 queries/sec (70% reads, 30% writes)
  • Problem: Slow queries during peak hours, CPU 90%

Analysis:

Issue 1: IOPS Bottleneck:

  • gp2 100 GB = 300 IOPS baseline
  • Workload needs ~500 IOPS
  • Solution: Upgrade to gp3 with 3,000 IOPS

Issue 2: CPU Bottleneck:

  • 90% CPU indicates compute bottleneck
  • Solution: Offload reads to read replicas

Issue 3: Connection Overhead:

  • 1,000 queries/sec = many connections
  • Each connection consumes memory
  • Solution: Use RDS Proxy for connection pooling

Implementation:

Step 1: Upgrade Storage to gp3:

aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --storage-type gp3 \
  --iops 3000 \
  --apply-immediately

Step 2: Create Read Replicas:

# Create 2 read replicas
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-1 \
  --source-db-instance-identifier mydb \
  --db-instance-class db.m5.large

aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-2 \
  --source-db-instance-identifier mydb \
  --db-instance-class db.m5.large

Step 3: Configure RDS Proxy:

aws rds create-db-proxy \
  --db-proxy-name mydb-proxy \
  --engine-family MYSQL \
  --auth '{
    "AuthScheme": "SECRETS",
    "SecretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:mydb-secret"
  }' \
  --role-arn arn:aws:iam::123456789012:role/RDSProxyRole \
  --vpc-subnet-ids subnet-1a subnet-1b

# Register read replicas with proxy
aws rds register-db-proxy-targets \
  --db-proxy-name mydb-proxy \
  --db-instance-identifiers mydb mydb-replica-1 mydb-replica-2

Step 4: Update Application:

# Before: Direct connection to RDS
import pymysql

# Write connection (primary)
write_conn = pymysql.connect(
    host='mydb.abc123.us-east-1.rds.amazonaws.com',
    user='admin',
    password='password',
    database='myapp'
)

# After: Connection through RDS Proxy
write_conn = pymysql.connect(
    host='mydb-proxy.proxy-abc123.us-east-1.rds.amazonaws.com',
    user='admin',
    password='password',
    database='myapp'
)

# Read connection (proxy distributes to replicas)
read_conn = pymysql.connect(
    host='mydb-proxy.proxy-abc123.us-east-1.rds.amazonaws.com',
    user='admin',
    password='password',
    database='myapp'
)

# Application logic
def get_user(user_id):
    cursor = read_conn.cursor()  # Use read connection
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    return cursor.fetchone()

def update_user(user_id, name):
    cursor = write_conn.cursor()  # Use write connection
    cursor.execute("UPDATE users SET name = %s WHERE id = %s", (name, user_id))
    write_conn.commit()

Performance Results:

Before:

  • Primary CPU: 90%
  • IOPS: 300 (throttled)
  • Query latency: 500ms (slow)
  • Connections: 500 (high overhead)

After:

  • Primary CPU: 30% (writes only)
  • Replica 1 CPU: 35% (reads)
  • Replica 2 CPU: 35% (reads)
  • IOPS: 3,000 (no throttling)
  • Query latency: 50ms (10x faster)
  • Connections: 50 (pooled by RDS Proxy)

Cost:

  • Storage upgrade: $10 → $40/month (+$30)
  • Read replicas: 2 Ɨ $146/month (+$292)
  • RDS Proxy: $0.015/hour Ɨ 730 = $11/month (+$11)
  • Total increase: $333/month
  • Value: 10x performance improvement, handles 3x more traffic

Amazon DynamoDB Performance

What it is: Amazon DynamoDB is a fully managed NoSQL database that provides single-digit millisecond performance at any scale. DynamoDB automatically scales throughput and storage.

Why it exists: Relational databases struggle with massive scale (millions of requests per second, petabytes of data). DynamoDB provides consistent performance at any scale without manual sharding or capacity planning.

Real-world analogy: DynamoDB is like a massive library with instant retrieval. No matter how many books (items) or how many people (requests), you always get your book in the same time (single-digit milliseconds). The library automatically expands as you add more books.

DynamoDB Performance Characteristics:

Capacity Modes:

On-Demand:

  • Throughput: Unlimited (scales automatically)
  • Pricing: $1.25 per million write requests, $0.25 per million read requests
  • Use Case: Unpredictable workloads, new applications

Provisioned:

  • Throughput: Specify read/write capacity units (RCU/WCU)
  • Pricing: $0.00065 per WCU-hour, $0.00013 per RCU-hour
  • Auto Scaling: Automatically adjusts capacity based on load
  • Use Case: Predictable workloads, cost optimization

Performance Metrics:

  • Latency: Single-digit milliseconds (typically 1-5ms)
  • Throughput: Millions of requests per second
  • Storage: Unlimited (automatically scales)
  • Item Size: Up to 400 KB per item

DynamoDB Performance Optimization:

1. Partition Key Design:

  • DynamoDB distributes data across partitions based on partition key
  • Poor partition key → Hot partitions (uneven load)
  • Good partition key → Even distribution

Example:

  • Bad: partition_key = "status" (only 3 values: active, inactive, pending)
    • All "active" items on same partition → Hot partition
  • Good: partition_key = "user_id" (millions of unique values)
    • Items evenly distributed across partitions

2. Use Global Secondary Indexes (GSI):

  • Query on non-key attributes
  • Each GSI has own throughput capacity
  • Up to 20 GSIs per table

3. Use DynamoDB Accelerator (DAX):

  • In-memory cache for DynamoDB
  • Microsecond latency (vs milliseconds)
  • Reduces DynamoDB read costs

4. Use Batch Operations:

  • BatchGetItem: Retrieve up to 100 items in single request
  • BatchWriteItem: Write up to 25 items in single request
  • Reduces request count and cost

Detailed Example 6: DynamoDB with DAX Caching

Scenario: You're building a gaming leaderboard. Requirements:

  • 1 million active players
  • 10,000 leaderboard queries per second
  • Sub-millisecond latency
  • Real-time updates

Architecture:

  1. DynamoDB Table: Store player scores
  2. DAX Cluster: Cache frequent queries
  3. Lambda: Update scores
  4. API Gateway: Serve leaderboard API

DynamoDB Table Design:

Table: GameLeaderboard
Partition Key: game_id (string)
Sort Key: score#player_id (string)  # Composite for sorting

Item Example:
{
  "game_id": "game123",
  "score#player_id": "9999999#player456",  # High score first
  "player_name": "ProGamer",
  "score": 9999999,
  "timestamp": "2024-01-15T10:30:00Z"
}

Query Pattern:

# Get top 10 players for game
response = dynamodb.query(
    TableName='GameLeaderboard',
    KeyConditionExpression='game_id = :game_id',
    ExpressionAttributeValues={':game_id': 'game123'},
    ScanIndexForward=False,  # Descending order (highest score first)
    Limit=10
)

Without DAX:

  • 10,000 queries/sec Ɨ $0.25 per million = $2.50/sec = $6,480/day
  • Latency: 5ms (DynamoDB)

With DAX:

import amazondax

# Create DAX client
dax = amazondax.AmazonDaxClient()

# Query through DAX (same API as DynamoDB)
response = dax.query(
    TableName='GameLeaderboard',
    KeyConditionExpression='game_id = :game_id',
    ExpressionAttributeValues={':game_id': 'game123'},
    Limit=10
)

DAX Configuration:

# Create DAX cluster
aws dax create-cluster \
  --cluster-name game-leaderboard-cache \
  --node-type dax.r5.large \
  --replication-factor 3 \
  --iam-role-arn arn:aws:iam::123456789012:role/DAXRole \
  --subnet-group game-subnet-group

Performance with DAX:

  • Cache hit rate: 95% (leaderboard queries are repetitive)
  • Cached queries: 9,500/sec (no DynamoDB cost)
  • DynamoDB queries: 500/sec (cache misses)
  • Cost: 500 Ɨ $0.25 per million = $0.125/sec = $324/day
  • Savings: $6,156/day (95% reduction)
  • Latency: 0.5ms (10x faster)

DAX Cost:

  • dax.r5.large: $0.40/hour Ɨ 3 nodes Ɨ 24 hours = $28.80/day
  • Net Savings: $6,156 - $28.80 = $6,127/day

Write Performance:

  • Writes go directly to DynamoDB (DAX write-through)
  • DAX automatically invalidates cached items on write
  • Write latency: 5ms (same as without DAX)

Chapter Summary

What We Covered

This chapter covered the "Design High-Performing Architectures" domain, which represents 24% of the SAA-C03 exam. We explored three major areas:

āœ… Section 1: High-Performing Storage Solutions

  • S3 performance optimization (prefixes, multipart upload, Transfer Acceleration)
  • EBS volume types and performance tuning (gp3, io2, throughput)
  • EFS performance modes and throughput configuration
  • Storage selection based on access patterns

āœ… Section 2: High-Performing Compute Solutions

  • EC2 instance families and types (general purpose, compute, memory, storage, accelerated)
  • Instance sizing and right-sizing strategies
  • Burstable instances (T3) vs consistent performance (M5, C5, R5)
  • Cost optimization through proper instance selection

āœ… Section 3: High-Performing Database Solutions

  • RDS performance optimization (instance types, storage, read replicas)
  • RDS Proxy for connection pooling
  • DynamoDB capacity modes (on-demand vs provisioned)
  • DynamoDB Accelerator (DAX) for caching
  • Partition key design for even distribution

Critical Takeaways

  1. S3 Performance: Use multiple prefixes for high request rates (5,500 GET/sec per prefix). Use multipart upload for large files. Use Transfer Acceleration for long-distance uploads. Use CloudFront for frequently accessed objects.

  2. EBS Selection: Use gp3 for most workloads (better price/performance than gp2). Use io2 for high-IOPS databases. Use st1 for throughput-intensive workloads. Use sc1 for infrequently accessed data.

  3. EFS vs EBS: Use EFS for shared file access across multiple instances. Use EBS for single-instance block storage. EFS automatically scales; EBS requires manual resizing.

  4. Instance Selection: Match instance family to workload (compute-optimized for CPU, memory-optimized for RAM, storage-optimized for I/O). Use burstable instances (T3) for variable workloads. Right-size based on actual utilization.

  5. Database Performance: Use read replicas to offload read traffic. Use RDS Proxy for connection pooling. Upgrade storage to gp3 for better IOPS. Use appropriate instance type for workload.

  6. DynamoDB Optimization: Design partition keys for even distribution. Use DAX for read-heavy workloads (95%+ cost reduction). Use batch operations to reduce request count. Choose on-demand for unpredictable workloads, provisioned for predictable.

  7. Caching Strategy: Use CloudFront for static content. Use DAX for DynamoDB. Use ElastiCache for application caching. Caching reduces latency and costs.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand S3 performance limits (requests per prefix)
  • I know when to use multipart upload
  • I can explain the difference between gp3 and io2 EBS volumes
  • I understand when to use EFS vs EBS
  • I know the different EC2 instance families and their use cases
  • I can right-size EC2 instances based on utilization
  • I understand how RDS read replicas improve performance
  • I know when to use RDS Proxy
  • I understand DynamoDB capacity modes (on-demand vs provisioned)
  • I can explain how DAX improves DynamoDB performance
  • I know how to design DynamoDB partition keys
  • I understand caching strategies (CloudFront, DAX, ElastiCache)

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-25 (Storage and compute)
  • Domain 3 Bundle 2: Questions 26-50 (Database and caching)
  • Full Practice Test 1: Questions 38-53 (Domain 3 questions)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • S3 performance optimization techniques
    • EBS volume type selection
    • EC2 instance family characteristics
    • RDS read replica use cases
    • DynamoDB partition key design

Quick Reference Card

Storage Services:

  • S3: Object storage, unlimited scale, 5,500 GET/sec per prefix
  • EBS gp3: General purpose SSD, 3,000-16,000 IOPS, $0.08/GB-month
  • EBS io2: High-performance SSD, up to 64,000 IOPS, $0.125/GB-month
  • EFS: Shared file system, automatic scaling, $0.30/GB-month

EC2 Instance Families:

  • T3: Burstable CPU, cost-effective for variable workloads
  • M5: General purpose, balanced CPU/memory
  • C5: Compute optimized, high CPU-to-memory ratio
  • R5: Memory optimized, high memory-to-CPU ratio
  • I3: Storage optimized, high IOPS NVMe SSD

Database Services:

  • RDS: Managed relational database, Multi-AZ, read replicas
  • DynamoDB: NoSQL, single-digit millisecond latency, unlimited scale
  • DAX: DynamoDB cache, microsecond latency, 95% cost reduction
  • RDS Proxy: Connection pooling, improves scalability

Decision Points:

  • High request rate → Use multiple S3 prefixes
  • Large file upload → Use S3 multipart upload
  • Shared file access → Use EFS (not EBS)
  • High IOPS database → Use io2 EBS or Provisioned IOPS RDS
  • Variable CPU workload → Use T3 burstable instances
  • Read-heavy database → Use RDS read replicas
  • DynamoDB read-heavy → Use DAX caching
  • Many database connections → Use RDS Proxy

Next Chapter: 05_domain4_cost_optimized_architectures - Design Cost-Optimized Architectures (20% of exam)

5. Use CloudFront for Frequently Accessed Content:

CloudFront is a CDN that caches content at edge locations worldwide, reducing latency and S3 request costs.

Performance Benefits:

  • Latency: 10-50ms (edge) vs 100-200ms (S3 direct)
  • Throughput: Higher (edge locations closer to users)
  • Cost: Reduces S3 GET requests (cached at edge)

Example Scenario:

  • Website with 1 million image requests per day
  • Without CloudFront:
    • All requests hit S3: 1M requests Ɨ $0.0004/1K = $400/day
    • Latency: 100-200ms per request
  • With CloudFront (90% cache hit rate):
    • S3 requests: 100K Ɨ $0.0004/1K = $40/day
    • CloudFront requests: 1M Ɨ $0.0075/10K = $0.75/day
    • Total: $40.75/day (90% savings)
    • Latency: 10-50ms per request (5-10x faster)

šŸ“Š S3 Performance Optimization Diagram:

graph TB
    subgraph "S3 Performance Strategies"
        A[Application] --> B{Request Rate?}
        B -->|< 5,500/sec| C[Single Prefix OK]
        B -->|> 5,500/sec| D[Multiple Prefixes]
        
        A --> E{Object Size?}
        E -->|< 100 MB| F[Standard PUT]
        E -->|> 100 MB| G[Multipart Upload]
        
        A --> H{User Location?}
        H -->|Same Region| I[Direct S3]
        H -->|Far Away| J[Transfer Acceleration]
        
        A --> K{Access Pattern?}
        K -->|Frequent Reads| L[CloudFront CDN]
        K -->|Selective Data| M[S3 Select]
    end
    
    style D fill:#c8e6c9
    style G fill:#c8e6c9
    style J fill:#c8e6c9
    style L fill:#c8e6c9
    style M fill:#c8e6c9

See: diagrams/04_domain3_s3_performance_optimization.mmd

Diagram Explanation:
This decision tree shows how to optimize S3 performance based on different requirements. For high request rates (>5,500 GET/sec), distribute objects across multiple prefixes to scale beyond single-prefix limits. For large objects (>100 MB), use multipart upload to parallelize uploads and improve reliability. For users far from the S3 region, enable Transfer Acceleration to route data over AWS's optimized network. For frequently accessed content, use CloudFront to cache at edge locations and reduce latency. For selective data retrieval, use S3 Select to filter data server-side and reduce data transfer.

⭐ Must Know (S3 Performance):

  • S3 supports 5,500 GET/sec and 3,500 PUT/sec per prefix (not per bucket)
  • Use multiple prefixes to scale beyond these limits (e.g., date-based prefixes)
  • Multipart upload is recommended for objects >100 MB and required for >5 GB
  • Transfer Acceleration can improve upload speeds by 50-500% for long distances
  • S3 Select reduces data transfer by filtering data server-side
  • CloudFront caching reduces S3 costs and improves latency for end users

When to use S3 Performance Features:

  • āœ… Use multiple prefixes when: Request rate exceeds 5,500 GET/sec or 3,500 PUT/sec
  • āœ… Use multipart upload when: Objects are >100 MB or upload reliability is critical
  • āœ… Use Transfer Acceleration when: Users are >1,000 miles from S3 region
  • āœ… Use S3 Select when: You need only a subset of data from large objects
  • āœ… Use CloudFront when: Content is accessed frequently from multiple locations
  • āŒ Don't use Transfer Acceleration when: Users are in same region as bucket (no benefit)
  • āŒ Don't use S3 Select when: You need the entire object (adds processing cost)

Amazon EBS Performance Optimization

What it is: Amazon Elastic Block Store (EBS) provides block-level storage volumes for EC2 instances. EBS volumes are network-attached storage that persist independently of instance lifetime.

Why it exists: EC2 instances need persistent storage that survives instance termination. Instance store (ephemeral storage) is lost when instance stops. EBS provides durable, high-performance block storage with snapshots, encryption, and multiple volume types optimized for different workloads.

Real-world analogy: EBS is like an external hard drive that you can attach to your computer (EC2 instance). You can detach it, attach it to a different computer, take snapshots (backups), and choose different drive types (SSD vs HDD) based on your needs.

EBS Volume Types and Performance:

Volume Type Use Case IOPS Throughput Latency Cost
gp3 (General Purpose SSD) Most workloads 3,000-16,000 125-1,000 MB/s Single-digit ms $0.08/GB-month
gp2 (General Purpose SSD) Legacy, variable performance 100-16,000 (burst) 128-250 MB/s Single-digit ms $0.10/GB-month
io2 (Provisioned IOPS SSD) High-performance databases 100-64,000 256-4,000 MB/s Sub-millisecond $0.125/GB-month + $0.065/IOPS
io2 Block Express Highest performance 256,000 IOPS 4,000 MB/s Sub-millisecond $0.125/GB-month + $0.065/IOPS
st1 (Throughput Optimized HDD) Big data, data warehouses 500 (max) 500 MB/s Low ms $0.045/GB-month
sc1 (Cold HDD) Infrequent access 250 (max) 250 MB/s Low ms $0.015/GB-month

How EBS Performance Works:

1. IOPS (Input/Output Operations Per Second):

  • Measures number of read/write operations per second
  • gp3: Baseline 3,000 IOPS (regardless of size), can provision up to 16,000
  • gp2: 3 IOPS per GB (100 GB = 300 IOPS, 5,334 GB = 16,000 IOPS max)
  • io2: Provision exactly what you need (100-64,000 IOPS)

2. Throughput (MB/s):

  • Measures amount of data transferred per second
  • gp3: Baseline 125 MB/s, can provision up to 1,000 MB/s
  • gp2: Scales with IOPS (250 MB/s max)
  • st1: 500 MB/s max (optimized for sequential reads)

3. Burst Performance (gp2 only):

  • gp2 volumes accumulate I/O credits when idle
  • Can burst to 3,000 IOPS for short periods
  • Credit balance: 5.4 million I/O credits (30 minutes at 3,000 IOPS)
  • Problem: Credits deplete quickly under sustained load

Detailed Example 1: Database Server (High IOPS)

Scenario: You're running a PostgreSQL database with 500 transactions per second. Each transaction requires 10 IOPS (reads + writes). You need 5,000 IOPS sustained.

Option 1: gp2 (Legacy):

  • Need 5,000 IOPS Ć· 3 IOPS/GB = 1,667 GB volume
  • Cost: 1,667 GB Ɨ $0.10 = $166.70/month
  • Problem: Paying for storage you don't need just to get IOPS

Option 2: gp3 (Recommended):

  • Baseline: 3,000 IOPS (not enough)
  • Provision additional: 5,000 - 3,000 = 2,000 IOPS
  • Storage: 500 GB (actual need)
  • Cost: (500 GB Ɨ $0.08) + (2,000 IOPS Ɨ $0.005) = $40 + $10 = $50/month
  • Savings: $116.70/month (70% cheaper)

Option 3: io2 (Overkill for this scenario):

  • Provision 5,000 IOPS
  • Storage: 500 GB
  • Cost: (500 GB Ɨ $0.125) + (5,000 IOPS Ɨ $0.065) = $62.50 + $325 = $387.50/month
  • When to use: Need >16,000 IOPS or sub-millisecond latency

Detailed Example 2: Big Data Processing (High Throughput)

Scenario: You're running Apache Spark processing 10 TB of data. You need high sequential read throughput (500 MB/s) but don't need high IOPS.

Option 1: gp3:

  • Provision 1,000 MB/s throughput
  • Storage: 10,000 GB (10 TB)
  • Cost: (10,000 GB Ɨ $0.08) + (875 MB/s Ɨ $0.04) = $800 + $35 = $835/month
  • Problem: Expensive for throughput-optimized workload

Option 2: st1 (Recommended):

  • Throughput: 500 MB/s (max)
  • Storage: 10,000 GB (10 TB)
  • Cost: 10,000 GB Ɨ $0.045 = $450/month
  • Savings: $385/month (46% cheaper)
  • Trade-off: Lower IOPS (500 max), but not needed for sequential reads

Detailed Example 3: Log Archive Storage (Infrequent Access)

Scenario: You need to store 50 TB of application logs for compliance. Logs are accessed once per month for audits.

Option 1: gp3:

  • Storage: 50,000 GB
  • Cost: 50,000 GB Ɨ $0.08 = $4,000/month
  • Problem: Paying for performance you don't need

Option 2: sc1 (Recommended):

  • Storage: 50,000 GB
  • Cost: 50,000 GB Ɨ $0.015 = $750/month
  • Savings: $3,250/month (81% cheaper)
  • Trade-off: Lower throughput (250 MB/s), but acceptable for infrequent access

šŸ“Š EBS Volume Type Selection Diagram:

graph TD
    A[Select EBS Volume Type] --> B{Workload Type?}
    
    B -->|Transactional| C{IOPS Requirement?}
    C -->|< 16,000 IOPS| D[gp3 General Purpose SSD]
    C -->|> 16,000 IOPS| E[io2 Provisioned IOPS SSD]
    C -->|> 64,000 IOPS| F[io2 Block Express]
    
    B -->|Throughput-Intensive| G{Access Pattern?}
    G -->|Frequent Access| H[st1 Throughput Optimized HDD]
    G -->|Infrequent Access| I[sc1 Cold HDD]
    
    B -->|Boot Volume| J[gp3 or gp2]
    
    style D fill:#c8e6c9
    style E fill:#fff3e0
    style F fill:#ffebee
    style H fill:#c8e6c9
    style I fill:#e1f5fe
    style J fill:#c8e6c9

See: diagrams/04_domain3_ebs_volume_selection.mmd

Diagram Explanation:
This decision tree helps select the appropriate EBS volume type based on workload characteristics. For transactional workloads (databases, applications), choose based on IOPS requirements: gp3 for most workloads (<16,000 IOPS), io2 for high-performance databases (16,000-64,000 IOPS), or io2 Block Express for extreme performance (>64,000 IOPS). For throughput-intensive workloads (big data, data warehouses), choose st1 for frequently accessed data or sc1 for infrequently accessed data. For boot volumes, gp3 or gp2 are appropriate choices.

⭐ Must Know (EBS Performance):

  • gp3 is the default choice for most workloads (better price/performance than gp2)
  • gp3 provides 3,000 IOPS and 125 MB/s baseline regardless of volume size
  • gp2 performance scales with size (3 IOPS per GB), making it expensive for high IOPS
  • io2 is for high-performance databases requiring >16,000 IOPS or sub-millisecond latency
  • st1 is for throughput-intensive workloads (big data, data warehouses)
  • sc1 is for infrequently accessed data (lowest cost per GB)
  • EBS volumes are AZ-specific (cannot attach to instance in different AZ)
  • Use EBS snapshots for backups (stored in S3, incremental)

EBS Performance Optimization Techniques:

1. Use EBS-Optimized Instances:

  • Provides dedicated bandwidth for EBS traffic
  • Prevents network contention between EBS and application traffic
  • Most modern instance types are EBS-optimized by default
  • Performance Impact: Up to 2x better EBS performance

2. Use RAID 0 for Higher Performance:

  • Stripe data across multiple EBS volumes
  • Increases aggregate IOPS and throughput
  • Example: 4 Ɨ gp3 volumes (3,000 IOPS each) = 12,000 IOPS total
  • Trade-off: No redundancy (if one volume fails, all data lost)
  • Use case: Temporary data, high-performance computing

3. Pre-Warm EBS Volumes from Snapshots:

  • New volumes created from snapshots have lazy loading
  • First access to each block incurs latency penalty (50-100ms)
  • Solution: Read all blocks before production use
  • Command: sudo dd if=/dev/xvdf of=/dev/null bs=1M
  • Alternative: Use Fast Snapshot Restore (FSR) for instant performance

4. Use Fast Snapshot Restore (FSR):

  • Eliminates lazy loading penalty
  • Volumes created from FSR-enabled snapshots have full performance immediately
  • Cost: $0.75 per snapshot per AZ per month
  • Use case: Critical databases, time-sensitive restores

5. Monitor EBS Performance Metrics:

  • VolumeReadOps/VolumeWriteOps: IOPS usage
  • VolumeReadBytes/VolumeWriteBytes: Throughput usage
  • VolumeThroughputPercentage: Percentage of provisioned throughput used
  • VolumeQueueLength: Number of pending I/O requests (should be low)

Amazon EFS Performance Optimization

What it is: Amazon Elastic File System (EFS) is a fully managed, elastic, shared file system for Linux-based workloads. Multiple EC2 instances can access EFS concurrently.

Why it exists: EBS volumes can only be attached to one instance at a time. Applications that need shared file access (web servers, content management, development environments) require a shared file system. EFS provides NFS-compatible shared storage that automatically scales.

Real-world analogy: EFS is like a shared network drive in an office. Multiple employees (EC2 instances) can access the same files simultaneously. The drive automatically expands as you add more files, and you only pay for what you use.

EFS Performance Modes:

Performance Mode Throughput Latency Use Case Cost
General Purpose Up to 7,000 file ops/sec Low (single-digit ms) Most workloads $0.30/GB-month
Max I/O >7,000 file ops/sec Higher (double-digit ms) Big data, media processing $0.30/GB-month

EFS Throughput Modes:

Throughput Mode Throughput Scaling Cost
Bursting 50 MB/s per TB (baseline), burst to 100 MB/s Scales with size Included
Provisioned 1-1,024 MB/s (fixed) Independent of size $6/MB/s-month
Elastic Scales automatically Automatic $0.30/GB-month (read), $0.90/GB-month (write)

How EFS Performance Works:

Bursting Throughput Mode:

  • Baseline: 50 MB/s per TB of storage
  • Burst: 100 MB/s per TB (using burst credits)
  • Burst credits: Accumulate when below baseline, deplete when above
  • Example: 1 TB file system
    • Baseline: 50 MB/s
    • Burst: 100 MB/s (for limited time)
    • Minimum: 1 MB/s (even for small file systems)

Provisioned Throughput Mode:

  • Provision exact throughput needed (1-1,024 MB/s)
  • Independent of storage size
  • Use case: Small file system needing high throughput
  • Example: 100 GB file system needing 100 MB/s
    • Bursting mode: 50 MB/s Ɨ 0.1 TB = 5 MB/s (not enough)
    • Provisioned mode: 100 MB/s (exactly what you need)
    • Cost: (100 GB Ɨ $0.30) + (100 MB/s Ɨ $6) = $30 + $600 = $630/month

Elastic Throughput Mode (Recommended for most workloads):

  • Automatically scales throughput based on workload
  • No need to provision or manage throughput
  • Pay only for throughput used
  • Cost: $0.30/GB for reads, $0.90/GB for writes (data transferred)

Detailed Example 1: Web Server Content (Shared Access)

Scenario: You have 10 web servers serving static content (images, CSS, JavaScript). Content is 500 GB and accessed frequently.

Option 1: EBS (Won't Work):

  • EBS can only attach to one instance
  • Would need to replicate content to 10 EBS volumes
  • Synchronization complexity
  • Problem: Not designed for shared access

Option 2: S3 (Possible but Suboptimal):

  • Can serve content from S3
  • Need to modify application to use S3 API
  • Higher latency than local file system
  • Problem: Requires application changes

Option 3: EFS (Recommended):

  • Mount EFS on all 10 web servers
  • Shared access to same files
  • Automatic scaling
  • Performance: 50 MB/s Ɨ 0.5 TB = 25 MB/s baseline
  • Cost: 500 GB Ɨ $0.30 = $150/month
  • Benefits: No application changes, shared access, automatic scaling

Detailed Example 2: Development Environment (Many Small Files)

Scenario: You have 50 developers sharing a code repository (100 GB, 1 million files). High file operation rate (>10,000 ops/sec).

Performance Mode Selection:

  • General Purpose: Up to 7,000 file ops/sec (not enough)
  • Max I/O: >7,000 file ops/sec (sufficient)
  • Trade-off: Slightly higher latency (acceptable for development)

Throughput Mode Selection:

  • Bursting: 50 MB/s Ɨ 0.1 TB = 5 MB/s baseline (sufficient for code)
  • Cost: 100 GB Ɨ $0.30 = $30/month

Configuration:

  • Performance Mode: Max I/O
  • Throughput Mode: Bursting
  • Storage Class: Standard (frequent access)

Detailed Example 3: Machine Learning Training (Large Dataset)

Scenario: You're training ML models on a 10 TB dataset. Need 500 MB/s throughput for data loading.

Throughput Mode Selection:

  • Bursting: 50 MB/s Ɨ 10 TB = 500 MB/s baseline (exactly what you need)
  • Cost: 10,000 GB Ɨ $0.30 = $3,000/month
  • Perfect fit: Storage size naturally provides needed throughput

Alternative (if dataset was smaller):

  • Scenario: 1 TB dataset, need 500 MB/s
  • Bursting: 50 MB/s Ɨ 1 TB = 50 MB/s (not enough)
  • Provisioned: 500 MB/s
  • Cost: (1,000 GB Ɨ $0.30) + (500 MB/s Ɨ $6) = $300 + $3,000 = $3,300/month
  • Consideration: Expensive for small dataset with high throughput needs

šŸ“Š EFS Performance Architecture Diagram:

graph TB
    subgraph "EFS Shared File System"
        EFS[EFS File System<br/>500 GB, 25 MB/s]
    end
    
    subgraph "Availability Zone 1"
        EC2_1[Web Server 1]
        EC2_2[Web Server 2]
        EC2_3[Web Server 3]
    end
    
    subgraph "Availability Zone 2"
        EC2_4[Web Server 4]
        EC2_5[Web Server 5]
    end
    
    EC2_1 -.NFS Mount.-> EFS
    EC2_2 -.NFS Mount.-> EFS
    EC2_3 -.NFS Mount.-> EFS
    EC2_4 -.NFS Mount.-> EFS
    EC2_5 -.NFS Mount.-> EFS
    
    EFS --> MT1[Mount Target AZ-1]
    EFS --> MT2[Mount Target AZ-2]
    
    style EFS fill:#c8e6c9
    style MT1 fill:#e1f5fe
    style MT2 fill:#e1f5fe

See: diagrams/04_domain3_efs_shared_access.mmd

Diagram Explanation:
This diagram shows how EFS provides shared file system access across multiple EC2 instances in different Availability Zones. The EFS file system is accessed through mount targets in each AZ. All instances mount the same file system using NFS protocol, enabling shared access to the same files. This architecture is ideal for web servers serving static content, development environments, or any application requiring shared file access.

⭐ Must Know (EFS Performance):

  • EFS provides shared file system access (multiple instances can mount simultaneously)
  • Performance scales with storage size in Bursting mode (50 MB/s per TB baseline)
  • Use Provisioned Throughput when small file system needs high throughput
  • Use Elastic Throughput for variable workloads (automatic scaling)
  • General Purpose mode: Up to 7,000 file ops/sec (most workloads)
  • Max I/O mode: >7,000 file ops/sec (big data, many small files)
  • EFS is more expensive than EBS ($0.30/GB vs $0.08/GB for gp3)
  • Use EFS Infrequent Access (IA) for files not accessed frequently (90% cost savings)

When to use EFS vs EBS:

  • āœ… Use EFS when: Multiple instances need shared access to same files
  • āœ… Use EFS when: File system needs to scale automatically
  • āœ… Use EFS when: Application uses standard file system operations (POSIX)
  • āœ… Use EBS when: Single instance needs block storage
  • āœ… Use EBS when: Need highest IOPS (>16,000) or lowest latency
  • āœ… Use EBS when: Cost is primary concern (EBS is cheaper)
  • āŒ Don't use EFS when: Only one instance needs access (use EBS instead)
  • āŒ Don't use EFS when: Need Windows file system (use FSx for Windows instead)

Amazon FSx Performance Optimization

What it is: Amazon FSx provides fully managed third-party file systems optimized for specific workloads. FSx offers Windows File Server, Lustre (HPC), NetApp ONTAP, and OpenZFS.

Why it exists: Some applications require specific file system features not available in EFS. Windows applications need SMB protocol and Active Directory integration. High-performance computing needs parallel file systems like Lustre. FSx provides these specialized file systems as managed services.

FSx for Windows File Server:

  • Use case: Windows applications, Active Directory integration, SMB protocol
  • Performance: Up to 2 GB/s throughput, millions of IOPS
  • Features: Deduplication, compression, shadow copies, DFS namespaces
  • Cost: $0.013-0.65/GB-month (depends on storage type and throughput)

FSx for Lustre (High-Performance Computing):

  • Use case: Machine learning, video processing, financial modeling, genomics
  • Performance: Up to 1 TB/s throughput, millions of IOPS
  • Features: S3 integration, parallel file system, sub-millisecond latencies
  • Cost: $0.145-1.20/GB-month (depends on deployment type)

Detailed Example: Video Rendering (FSx for Lustre)

Scenario: You're rendering 4K video files (100 GB each). Rendering requires reading entire file, processing, and writing output. Need 10 GB/s aggregate throughput for 100 parallel render nodes.

Option 1: EFS:

  • Throughput: 50 MB/s per TB
  • Need: 10 GB/s = 10,000 MB/s
  • Storage required: 10,000 MB/s Ć· 50 MB/s per TB = 200 TB
  • Cost: 200,000 GB Ɨ $0.30 = $60,000/month
  • Problem: Paying for storage you don't need just to get throughput

Option 2: FSx for Lustre (Recommended):

  • Throughput: 200 MB/s per TB (Persistent SSD)
  • Need: 10 GB/s = 10,000 MB/s
  • Storage required: 10,000 MB/s Ć· 200 MB/s per TB = 50 TB
  • Cost: 50,000 GB Ɨ $0.145 = $7,250/month
  • Savings: $52,750/month (88% cheaper)
  • Additional benefits: Sub-millisecond latency, S3 integration

FSx for Lustre Deployment Types:

Deployment Type Throughput Latency Durability Use Case Cost
Scratch 200 MB/s per TB Sub-ms No replication Temporary data, cost-sensitive $0.145/GB-month
Persistent SSD 200 MB/s per TB Sub-ms Replicated within AZ Production workloads $0.290/GB-month
Persistent HDD 40 MB/s per TB Low ms Replicated within AZ Throughput-intensive, cost-sensitive $0.140/GB-month

⭐ Must Know (FSx):

  • FSx for Windows: Use for Windows applications needing SMB protocol and AD integration
  • FSx for Lustre: Use for HPC workloads needing extreme performance (ML, video, genomics)
  • FSx for NetApp ONTAP: Use for multi-protocol access (NFS, SMB, iSCSI) and advanced data management
  • FSx for OpenZFS: Use for Linux workloads needing ZFS features (snapshots, compression)
  • FSx for Lustre integrates with S3 (can use S3 as data repository)
  • FSx for Lustre Scratch: Temporary data, no replication, lowest cost
  • FSx for Lustre Persistent: Production data, replicated, higher cost

Section 2: High-Performing Compute Solutions

Introduction

The problem: Different workloads have vastly different compute requirements. A web server needs consistent CPU for handling requests. A batch job needs massive parallel processing. A microservice needs to scale from zero to thousands of instances instantly. Using the wrong compute service results in poor performance, high costs, or operational complexity.

The solution: AWS provides multiple compute services optimized for different use cases. Understanding the characteristics of each service (performance, scalability, cost, operational overhead) enables you to choose the right compute for each workload.

Why it's tested: Compute is the foundation of every application. This section tests your ability to select and configure compute services for optimal performance, scalability, and cost.

Core Concepts

EC2 Instance Types and Families

What it is: Amazon EC2 provides virtual servers (instances) in the cloud. EC2 offers hundreds of instance types optimized for different workloads, organized into instance families.

Why it exists: Different applications have different resource requirements. A database needs lots of memory. A video encoder needs powerful CPUs. A machine learning model needs GPUs. EC2 provides specialized instance types optimized for each workload.

Real-world analogy: EC2 instance types are like different types of vehicles. A sports car (compute-optimized) is fast but has limited cargo space. A truck (memory-optimized) can carry heavy loads but isn't as fast. A van (general purpose) balances both. You choose the vehicle based on your needs.

EC2 Instance Families:

Family Optimized For vCPU:Memory Ratio Use Cases Example Types
T3/T3a Burstable CPU 1:2 Variable workloads, dev/test t3.micro, t3.medium
M5/M6i General Purpose 1:4 Balanced workloads, web servers m5.large, m6i.xlarge
C5/C6i Compute Optimized 1:2 CPU-intensive, batch processing c5.2xlarge, c6i.4xlarge
R5/R6i Memory Optimized 1:8 In-memory databases, caching r5.xlarge, r6i.2xlarge
I3/I3en Storage Optimized 1:8 + NVMe SSD NoSQL databases, data warehouses i3.2xlarge, i3en.6xlarge
P3/P4 GPU Accelerated GPUs Machine learning, video encoding p3.2xlarge, p4d.24xlarge
G4 Graphics Accelerated GPUs Graphics workloads, game streaming g4dn.xlarge

Instance Size Naming Convention:

  • Format: {family}{generation}.{size}
  • Example: m5.2xlarge
    • m: General purpose family
    • 5: 5th generation
    • 2xlarge: Size (8 vCPUs, 32 GB RAM)

Instance Sizes (using M5 as example):

  • m5.large: 2 vCPUs, 8 GB RAM
  • m5.xlarge: 4 vCPUs, 16 GB RAM
  • m5.2xlarge: 8 vCPUs, 32 GB RAM
  • m5.4xlarge: 16 vCPUs, 64 GB RAM
  • m5.8xlarge: 32 vCPUs, 128 GB RAM
  • m5.12xlarge: 48 vCPUs, 192 GB RAM
  • m5.16xlarge: 64 vCPUs, 256 GB RAM
  • m5.24xlarge: 96 vCPUs, 384 GB RAM

Detailed Example 1: Web Application Server

Scenario: You're running a web application with moderate traffic (100 requests/sec). CPU usage varies between 20-60% throughout the day.

Option 1: T3 Burstable Instance (Recommended):

  • Instance: t3.medium (2 vCPUs, 4 GB RAM)
  • Baseline: 20% CPU utilization
  • Burst: Up to 100% CPU when needed
  • CPU Credits: Accumulate when below baseline, spend when above
  • Cost: $0.0416/hour = $30/month
  • Benefits: Cost-effective for variable workloads

Option 2: M5 General Purpose Instance:

  • Instance: m5.large (2 vCPUs, 8 GB RAM)
  • Performance: Consistent 100% CPU available
  • Cost: $0.096/hour = $70/month
  • When to use: Sustained high CPU usage (>40% average)

How T3 CPU Credits Work:

  • Baseline: t3.medium earns 24 CPU credits/hour (20% of 2 vCPUs)
  • Burst: Spending 100% CPU consumes 120 CPU credits/hour (2 vCPUs Ɨ 60 min)
  • Credit Balance: Maximum 288 credits (24 hours of baseline)
  • Example:
    • Hour 1-8 (night, low traffic): 20% CPU, earn 24 credits/hour = +192 credits
    • Hour 9-10 (morning spike): 80% CPU, spend 96 credits/hour = -192 credits
    • Result: Credits balance out, no additional cost

Detailed Example 2: Database Server (Memory-Intensive)

Scenario: You're running PostgreSQL with a 100 GB working set (data that must fit in memory for good performance). Need 128 GB RAM.

Option 1: M5 General Purpose:

  • Instance: m5.8xlarge (32 vCPUs, 128 GB RAM)
  • Cost: $1.536/hour = $1,121/month
  • Problem: Paying for 32 vCPUs when you only need 8

Option 2: R5 Memory Optimized (Recommended):

  • Instance: r5.4xlarge (16 vCPUs, 128 GB RAM)
  • Cost: $1.008/hour = $736/month
  • Savings: $385/month (34% cheaper)
  • Benefits: Same memory, fewer vCPUs (better ratio for database)

Detailed Example 3: Batch Processing (CPU-Intensive)

Scenario: You're running video encoding jobs that max out CPU for hours. Need to process 1,000 videos per day.

Option 1: M5 General Purpose:

  • Instance: m5.4xlarge (16 vCPUs, 64 GB RAM)
  • Cost: $0.768/hour
  • Processing: 10 videos/hour
  • Time: 100 hours/day
  • Daily cost: 100 hours Ɨ $0.768 = $76.80

Option 2: C5 Compute Optimized (Recommended):

  • Instance: c5.4xlarge (16 vCPUs, 32 GB RAM)
  • Cost: $0.68/hour
  • Processing: 12 videos/hour (better CPU performance)
  • Time: 83 hours/day
  • Daily cost: 83 hours Ɨ $0.68 = $56.44
  • Savings: $20.36/day (27% cheaper)

šŸ“Š EC2 Instance Family Selection Diagram:

graph TD
    A[Select EC2 Instance Type] --> B{Workload Characteristics?}
    
    B -->|Variable CPU| C[T3/T3a Burstable]
    B -->|Balanced| D[M5/M6i General Purpose]
    B -->|CPU-Intensive| E[C5/C6i Compute Optimized]
    B -->|Memory-Intensive| F[R5/R6i Memory Optimized]
    B -->|Storage-Intensive| G[I3/I3en Storage Optimized]
    B -->|GPU Workload| H{GPU Type?}
    
    H -->|ML Training| I[P3/P4 GPU Instances]
    H -->|Graphics| J[G4 Graphics Instances]
    
    C --> K[Web servers, dev/test]
    D --> L[Application servers, microservices]
    E --> M[Batch processing, HPC]
    F --> N[Databases, caching]
    G --> O[NoSQL, data warehouses]
    
    style C fill:#e1f5fe
    style D fill:#c8e6c9
    style E fill:#fff3e0
    style F fill:#f3e5f5
    style G fill:#ffebee
    style I fill:#ffe0b2
    style J fill:#ffe0b2

See: diagrams/04_domain3_ec2_instance_selection.mmd

Diagram Explanation:
This decision tree helps select the appropriate EC2 instance family based on workload characteristics. For variable CPU workloads, use T3/T3a burstable instances. For balanced workloads, use M5/M6i general purpose. For CPU-intensive workloads, use C5/C6i compute optimized. For memory-intensive workloads, use R5/R6i memory optimized. For storage-intensive workloads, use I3/I3en storage optimized. For GPU workloads, choose P3/P4 for ML training or G4 for graphics.

⭐ Must Know (EC2 Instance Types):

  • T3 burstable instances are cost-effective for variable workloads (accumulate CPU credits)
  • M5 general purpose instances provide balanced CPU/memory (1:4 ratio)
  • C5 compute optimized instances provide high CPU-to-memory ratio (1:2 ratio)
  • R5 memory optimized instances provide high memory-to-CPU ratio (1:8 ratio)
  • I3 storage optimized instances provide NVMe SSD for high IOPS
  • Instance size doubles resources with each step (large → xlarge → 2xlarge)
  • Use Compute Optimizer to get right-sizing recommendations
  • Newer generations (M6i vs M5) provide better price/performance

EC2 Performance Optimization Techniques:

1. Use Placement Groups for Low Latency:

  • Cluster: Instances in same AZ, low-latency network (10 Gbps)
  • Spread: Instances on different hardware (max 7 per AZ)
  • Partition: Instances in different partitions (for distributed systems)
  • Use case: HPC, distributed databases, big data

2. Use Enhanced Networking:

  • Provides higher bandwidth, higher PPS, lower latency
  • SR-IOV: Single Root I/O Virtualization
  • ENA: Elastic Network Adapter (up to 100 Gbps)
  • EFA: Elastic Fabric Adapter (for HPC, MPI)
  • Enabled by default on most modern instance types

3. Right-Size Instances:

  • Monitor CPU, memory, network, disk utilization
  • Use CloudWatch metrics and Compute Optimizer
  • Over-provisioned: Wasting money on unused resources
  • Under-provisioned: Poor performance, user complaints
  • Target: 40-60% average utilization (allows for spikes)

4. Use Auto Scaling for Variable Workloads:

  • Automatically add/remove instances based on demand
  • Target Tracking: Maintain target metric (e.g., 50% CPU)
  • Step Scaling: Add/remove instances in steps
  • Scheduled Scaling: Scale based on time (e.g., business hours)

AWS Lambda Performance Optimization

What it is: AWS Lambda is a serverless compute service that runs code in response to events. You don't manage servers; AWS automatically scales and manages infrastructure.

Why it exists: Managing servers is complex and expensive. You pay for idle capacity, handle scaling, patch operating systems, and monitor infrastructure. Lambda eliminates this operational overhead by running code only when needed and automatically scaling.

Real-world analogy: Lambda is like hiring a contractor for specific tasks instead of a full-time employee. You only pay when they're working (per request), they bring their own tools (runtime), and you don't manage their schedule (automatic scaling).

Lambda Performance Characteristics:

Execution Limits:

  • Memory: 128 MB - 10,240 MB (10 GB)
  • Timeout: 1 second - 15 minutes (900 seconds)
  • Ephemeral Storage (/tmp): 512 MB - 10,240 MB (10 GB)
  • Concurrent Executions: 1,000 (default, can request increase)
  • Deployment Package: 50 MB (zipped), 250 MB (unzipped)

Performance Scaling:

  • CPU: Scales linearly with memory (1,769 MB = 1 vCPU)
  • Network: Scales with memory (higher memory = more bandwidth)
  • Cold Start: 100-1,000ms (first invocation or after idle period)
  • Warm Start: 1-10ms (subsequent invocations)

How Lambda Memory Affects Performance:

Lambda allocates CPU power proportional to memory:

  • 128 MB: 0.07 vCPU (very slow)
  • 512 MB: 0.29 vCPU
  • 1,024 MB: 0.58 vCPU
  • 1,769 MB: 1.0 vCPU (full vCPU)
  • 3,008 MB: 1.7 vCPUs
  • 10,240 MB: 6 vCPUs

Detailed Example 1: Image Processing (CPU-Intensive)

Scenario: You're resizing images (1 MB each). Processing takes 5 seconds at 128 MB memory.

Option 1: 128 MB Memory:

  • Execution time: 5 seconds
  • Cost per invocation: 5 sec Ɨ 128 MB Ɨ $0.0000000167/GB-sec = $0.0000107
  • CPU: 0.07 vCPU (very slow)

Option 2: 1,024 MB Memory (Recommended):

  • Execution time: 0.625 seconds (8x faster due to 8x more CPU)
  • Cost per invocation: 0.625 sec Ɨ 1,024 MB Ɨ $0.0000000167/GB-sec = $0.0000107
  • CPU: 0.58 vCPU
  • Result: Same cost, 8x faster!

Option 3: 1,769 MB Memory (Full vCPU):

  • Execution time: 0.36 seconds (14x faster)
  • Cost per invocation: 0.36 sec Ɨ 1,769 MB Ɨ $0.0000000167/GB-sec = $0.0000107
  • CPU: 1.0 vCPU
  • Result: Same cost, 14x faster!

Key Insight: For CPU-intensive workloads, increasing memory often reduces execution time proportionally, resulting in same cost but better performance.

Detailed Example 2: API Backend (Low Latency)

Scenario: You're building an API that queries DynamoDB and returns results. Need <100ms response time.

Cold Start Problem:

  • Cold start: 500ms (Lambda initialization)
  • Warm start: 10ms (Lambda already initialized)
  • Problem: First request after idle period is slow

Solution 1: Provisioned Concurrency:

  • Pre-initializes Lambda functions
  • Eliminates cold starts
  • Cost: $0.000004167 per GB-second (in addition to execution cost)
  • Example: 10 provisioned instances Ɨ 1,024 MB Ɨ 24 hours = $10.32/day
  • When to use: Latency-sensitive applications, predictable traffic

Solution 2: Keep Functions Warm:

  • Invoke function every 5 minutes (before idle timeout)
  • Cost: Minimal (just invocation cost)
  • Limitation: Not guaranteed (AWS may still cold start)

Solution 3: Increase Memory:

  • Higher memory = faster cold start (more CPU for initialization)
  • 128 MB: 1,000ms cold start
  • 1,024 MB: 500ms cold start
  • 3,008 MB: 200ms cold start

Detailed Example 3: Batch Processing (High Throughput)

Scenario: You need to process 1 million records from S3. Each record takes 100ms to process.

Option 1: Sequential Processing:

  • Time: 1,000,000 records Ɨ 100ms = 100,000 seconds (27.8 hours)
  • Problem: Too slow

Option 2: Parallel Lambda Invocations (Recommended):

  • Concurrency: 1,000 Lambda functions running in parallel
  • Time: 1,000,000 records Ć· 1,000 = 1,000 records per function
  • Time per function: 1,000 Ɨ 100ms = 100 seconds
  • Total time: 100 seconds (1.7 minutes)
  • Speedup: 1,000x faster

How to Achieve Parallelism:

  • Use S3 event notifications (one Lambda per object)
  • Use SQS with batch size (Lambda polls queue)
  • Use Step Functions Map state (parallel execution)
  • Use Kinesis Data Streams (one Lambda per shard)

šŸ“Š Lambda Performance Optimization Diagram:

graph TB
    A[Lambda Performance Optimization] --> B{Optimization Goal?}
    
    B -->|Reduce Cost| C{Workload Type?}
    C -->|CPU-Intensive| D[Increase Memory<br/>Faster = Same Cost]
    C -->|I/O-Intensive| E[Minimize Memory<br/>Waiting ≠ CPU]
    
    B -->|Reduce Latency| F{Cold Start Issue?}
    F -->|Yes| G[Provisioned Concurrency]
    F -->|No| H[Optimize Code]
    
    B -->|Increase Throughput| I[Parallel Invocations]
    I --> J[S3 Events]
    I --> K[SQS Batching]
    I --> L[Kinesis Shards]
    
    style D fill:#c8e6c9
    style E fill:#c8e6c9
    style G fill:#fff3e0
    style I fill:#e1f5fe

See: diagrams/04_domain3_lambda_optimization.mmd

Diagram Explanation:
This decision tree shows Lambda performance optimization strategies based on goals. To reduce cost for CPU-intensive workloads, increase memory (faster execution = same cost). For I/O-intensive workloads, minimize memory (waiting doesn't use CPU). To reduce latency with cold start issues, use Provisioned Concurrency. To increase throughput, use parallel invocations via S3 events, SQS batching, or Kinesis shards.

⭐ Must Know (Lambda Performance):

  • Lambda allocates CPU proportional to memory (1,769 MB = 1 vCPU)
  • For CPU-intensive workloads, increasing memory reduces execution time proportionally
  • Cold starts occur on first invocation or after idle period (100-1,000ms)
  • Provisioned Concurrency eliminates cold starts but costs more
  • Lambda scales automatically up to concurrency limit (1,000 default)
  • Use parallel invocations for high throughput (S3 events, SQS, Kinesis)
  • Lambda timeout maximum is 15 minutes (use Step Functions for longer workflows)
  • Ephemeral storage (/tmp) is 512 MB default, can increase to 10 GB

Section 3: High-Performing Database Solutions

Introduction

The problem: Databases are often the performance bottleneck in applications. Slow queries, connection limits, insufficient IOPS, and poor caching strategies result in slow response times and poor user experience.

The solution: AWS provides multiple database services optimized for different data models and access patterns. Understanding database performance characteristics (IOPS, throughput, latency, connection pooling, caching) enables you to design high-performing data layers.

Why it's tested: Database performance directly impacts application performance. This section tests your ability to select and configure database services for optimal performance.

Core Concepts

Amazon RDS Performance Optimization

What it is: Amazon RDS is a managed relational database service supporting MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server. RDS handles backups, patching, and replication.

Why it exists: Managing database servers is complex. You must handle backups, replication, failover, patching, and monitoring. RDS automates these operational tasks, allowing you to focus on application development.

Real-world analogy: RDS is like hiring a database administrator who handles all maintenance tasks. You focus on your application while RDS handles backups, updates, and keeping the database running.

RDS Performance Factors:

1. Instance Type:

  • db.t3: Burstable CPU, cost-effective for variable workloads
  • db.m5: General purpose, balanced CPU/memory
  • db.r5: Memory optimized, high memory for large working sets
  • db.x1e: Extreme memory, up to 3,904 GB RAM

2. Storage Type:

  • gp3: General purpose SSD, 3,000-16,000 IOPS, 125-1,000 MB/s
  • gp2: Legacy SSD, 100-16,000 IOPS (burst), 128-250 MB/s
  • io1: Provisioned IOPS SSD, 100-64,000 IOPS, 256-4,000 MB/s

3. Read Replicas:

  • Asynchronous replication from primary
  • Offload read traffic (reports, analytics)
  • Up to 5 read replicas per primary
  • Can be in different regions (cross-region read replicas)

4. RDS Proxy:

  • Connection pooling and management
  • Reduces database connections
  • Improves scalability for serverless applications
  • Automatic failover (faster than DNS-based failover)

Detailed Example 1: E-Commerce Database (High Read Traffic)

Scenario: You have an e-commerce site with 10,000 product page views per minute. Each page view requires 5 database queries. Database CPU is at 80% due to read queries.

Problem Analysis:

  • Read queries: 10,000 views/min Ɨ 5 queries = 50,000 queries/min
  • Write queries: 100 orders/min Ɨ 10 queries = 1,000 queries/min
  • Read:Write ratio: 50:1 (read-heavy workload)

Solution: Read Replicas:

  • Primary: Handle all writes (1,000 queries/min)
  • Read Replica 1: Handle 25,000 read queries/min
  • Read Replica 2: Handle 25,000 read queries/min
  • Result: Primary CPU drops to 20%, read replicas at 40% each

Implementation:

# Application code with read/write splitting
import pymysql

# Primary endpoint (writes)
primary_conn = pymysql.connect(
    host='mydb.abc123.us-east-1.rds.amazonaws.com',
    user='admin',
    password='password',
    database='ecommerce'
)

# Read replica endpoint (reads)
replica_conn = pymysql.connect(
    host='mydb-replica.abc123.us-east-1.rds.amazonaws.com',
    user='admin',
    password='password',
    database='ecommerce'
)

# Write operation (use primary)
def create_order(order_data):
    cursor = primary_conn.cursor()
    cursor.execute("INSERT INTO orders ...")
    primary_conn.commit()

# Read operation (use replica)
def get_product(product_id):
    cursor = replica_conn.cursor()
    cursor.execute("SELECT * FROM products WHERE id = %s", (product_id,))
    return cursor.fetchone()

Cost Analysis:

  • Before: db.r5.2xlarge (8 vCPUs, 64 GB) at 80% CPU = $1.008/hour
  • After:
    • Primary: db.r5.large (2 vCPUs, 16 GB) at 20% CPU = $0.252/hour
    • Replica 1: db.r5.large at 40% CPU = $0.252/hour
    • Replica 2: db.r5.large at 40% CPU = $0.252/hour
    • Total: $0.756/hour
  • Savings: $0.252/hour (25% cheaper) + better performance

Detailed Example 2: Serverless Application (Connection Pooling)

Scenario: You have a Lambda function that queries RDS. Each Lambda invocation creates a new database connection. With 1,000 concurrent Lambda executions, you hit the database connection limit (100 connections).

Problem:

  • Lambda concurrency: 1,000 functions
  • Connections per Lambda: 1 connection
  • Total connections: 1,000 connections
  • Database limit: 100 connections (db.t3.medium)
  • Result: Connection errors, failed requests

Solution: RDS Proxy:

  • RDS Proxy: Pools connections, reuses existing connections
  • Lambda connections: 1,000 functions
  • RDS Proxy connections: 10 connections to database
  • Result: No connection errors, 100x reduction in database connections

Implementation:

import pymysql

# Without RDS Proxy (creates new connection each time)
def lambda_handler_without_proxy(event, context):
    conn = pymysql.connect(
        host='mydb.abc123.us-east-1.rds.amazonaws.com',
        user='admin',
        password='password'
    )
    # Execute query
    conn.close()  # Connection closed, wasted

# With RDS Proxy (reuses connections)
def lambda_handler_with_proxy(event, context):
    conn = pymysql.connect(
        host='mydb-proxy.proxy-abc123.us-east-1.rds.amazonaws.com',
        user='admin',
        password='password'
    )
    # Execute query
    conn.close()  # Connection returned to pool, reused

Performance Benefits:

  • Connection time: 100ms (without proxy) → 10ms (with proxy)
  • Database CPU: 60% (without proxy) → 20% (with proxy)
  • Failed requests: 90% (without proxy) → 0% (with proxy)

Cost:

  • RDS Proxy: $0.015/hour per vCPU = $0.03/hour (2 vCPUs)
  • Benefit: Prevents need to upgrade database instance ($0.096/hour savings)

Detailed Example 3: Analytics Workload (Storage Performance)

Scenario: You're running analytics queries on a 1 TB database. Queries scan large tables and require high IOPS (10,000 IOPS sustained).

Option 1: gp2 Storage:

  • IOPS: 3 IOPS per GB
  • Storage needed: 10,000 IOPS Ć· 3 = 3,334 GB
  • Cost: 3,334 GB Ɨ $0.115 = $383/month
  • Problem: Paying for storage you don't need

Option 2: gp3 Storage (Recommended):

  • Baseline: 3,000 IOPS
  • Additional: 7,000 IOPS
  • Storage: 1,000 GB (actual need)
  • Cost: (1,000 GB Ɨ $0.115) + (7,000 IOPS Ɨ $0.02) = $115 + $140 = $255/month
  • Savings: $128/month (33% cheaper)

Option 3: io1 Storage:

  • IOPS: 10,000 provisioned
  • Storage: 1,000 GB
  • Cost: (1,000 GB Ɨ $0.125) + (10,000 IOPS Ɨ $0.10) = $125 + $1,000 = $1,125/month
  • When to use: Need >16,000 IOPS or sub-millisecond latency

⭐ Must Know (RDS Performance):

  • Use read replicas to offload read traffic from primary (up to 5 replicas)
  • Read replicas use asynchronous replication (eventual consistency)
  • Use RDS Proxy for connection pooling (reduces database connections)
  • RDS Proxy improves Lambda scalability (reuses connections)
  • gp3 storage provides better price/performance than gp2
  • Use Performance Insights to identify slow queries
  • Multi-AZ provides high availability but NOT performance improvement
  • Cross-region read replicas have higher replication lag (network latency)

Amazon Aurora Performance Optimization

What it is: Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora provides up to 5x performance of MySQL and 3x performance of PostgreSQL.

Why it exists: Traditional databases were designed for single servers with local storage. Cloud databases need to scale across multiple servers and storage nodes. Aurora was built from the ground up for cloud architecture, providing better performance, availability, and scalability.

Real-world analogy: Aurora is like a high-performance sports car designed specifically for racing, while RDS is like a regular car modified for racing. Both can race, but the purpose-built car performs better.

Aurora Performance Advantages:

1. Storage Architecture:

  • Traditional RDS: Single EBS volume (limited IOPS)
  • Aurora: Distributed storage across 6 copies in 3 AZs
  • Result: Higher throughput, lower latency, automatic scaling

2. Read Scaling:

  • RDS: Up to 5 read replicas
  • Aurora: Up to 15 read replicas
  • Aurora Replica Lag: <10ms (vs 100ms+ for RDS)

3. Failover:

  • RDS Multi-AZ: 1-2 minutes (DNS propagation)
  • Aurora: 30 seconds (promotes existing replica)

4. Backups:

  • RDS: Impacts performance during backup window
  • Aurora: Continuous backup, no performance impact

Detailed Example: High-Traffic Application

Scenario: You have a social media application with 100,000 users. Database handles 50,000 queries/sec (80% reads, 20% writes).

RDS MySQL Limitations:

  • Primary: Handles 10,000 writes/sec (at capacity)
  • Read Replicas: 5 replicas Ɨ 8,000 reads/sec = 40,000 reads/sec
  • Total reads: 40,000 reads/sec (need 40,000, at capacity)
  • Problem: Cannot scale further, high replication lag (200ms)

Aurora MySQL Solution:

  • Primary: Handles 10,000 writes/sec
  • Read Replicas: 15 replicas Ɨ 8,000 reads/sec = 120,000 reads/sec
  • Total reads: 120,000 reads/sec (3x capacity)
  • Replication lag: <10ms (20x better)
  • Failover: 30 seconds (2-4x faster)

Performance Comparison:

Metric RDS MySQL Aurora MySQL Improvement
Max Read Replicas 5 15 3x
Replication Lag 100-200ms <10ms 10-20x
Failover Time 60-120 sec 30 sec 2-4x
Backup Impact Performance hit No impact āˆž
Storage Scaling Manual Automatic Auto

Amazon DynamoDB Performance Optimization

What it is: Amazon DynamoDB is a fully managed NoSQL database that provides single-digit millisecond latency at any scale. DynamoDB automatically scales to handle millions of requests per second.

Why it exists: Relational databases struggle with massive scale and require complex sharding. NoSQL databases like DynamoDB are designed for horizontal scaling, providing consistent performance regardless of data size.

Real-world analogy: DynamoDB is like a massive filing system where you can instantly retrieve any document by its ID. The system automatically adds more filing cabinets as you add more documents, and retrieval time stays constant.

DynamoDB Performance Characteristics:

Capacity Modes:

  • On-Demand: Pay per request, automatic scaling, no capacity planning
  • Provisioned: Specify RCU/WCU, predictable cost, can use Auto Scaling

Read/Write Capacity Units:

  • RCU (Read Capacity Unit): 1 strongly consistent read/sec for items up to 4 KB
  • WCU (Write Capacity Unit): 1 write/sec for items up to 1 KB
  • Eventually consistent reads: 2 reads per RCU (half the cost)

Latency:

  • GetItem: Single-digit milliseconds (typically 1-5ms)
  • Query: Single-digit milliseconds (depends on result size)
  • Scan: Slow (reads entire table, avoid in production)

Detailed Example 1: Partition Key Design (Critical for Performance)

Scenario: You're building a user profile service. Each user has a profile with 10 attributes (2 KB total).

Bad Design (Hot Partition):

{
  "PK": "USER",
  "SK": "user123",
  "name": "John Doe",
  "email": "john@example.com"
}
  • Problem: All users have same partition key ("USER")
  • Result: All data in single partition (10 GB limit, 3,000 RCU/1,000 WCU limit)
  • Performance: Throttling when exceeding partition limits

Good Design (Distributed Partitions):

{
  "PK": "USER#user123",
  "SK": "PROFILE",
  "name": "John Doe",
  "email": "john@example.com"
}
  • Partition key: Unique per user ("USER#user123")
  • Result: Data distributed across many partitions
  • Performance: No throttling, scales to millions of users

Key Principle: Partition key should have high cardinality (many unique values) to distribute data evenly.

Detailed Example 2: DynamoDB Accelerator (DAX) for Caching

Scenario: You have a product catalog with 100,000 products. Each product page view requires reading product details. You have 10,000 page views per minute.

Without DAX:

  • Reads: 10,000 reads/min = 167 reads/sec
  • RCU needed: 167 RCU (strongly consistent)
  • Cost: 167 RCU Ɨ $0.00013/hour = $0.022/hour = $16/month
  • Latency: 5ms per read

With DAX (90% cache hit rate):

  • Cache hits: 9,000 reads/min (served from DAX, <1ms latency)
  • Cache misses: 1,000 reads/min = 17 reads/sec (from DynamoDB)
  • RCU needed: 17 RCU
  • DynamoDB cost: 17 RCU Ɨ $0.00013/hour = $0.002/hour = $1.50/month
  • DAX cost: $0.04/hour (dax.t3.small) = $29/month
  • Total cost: $30.50/month
  • Latency: <1ms for cache hits (5x faster)
  • Trade-off: Higher cost ($14.50 more) but much better performance

When DAX Makes Sense:

  • āœ… Read-heavy workloads (>90% reads)
  • āœ… Frequently accessed items (high cache hit rate)
  • āœ… Latency-sensitive applications (need <1ms response)
  • āŒ Write-heavy workloads (cache invalidation overhead)
  • āŒ Infrequently accessed items (low cache hit rate)

⭐ Must Know (DynamoDB Performance):

  • Partition key design is critical (use high-cardinality keys)
  • Hot partitions cause throttling (distribute data evenly)
  • Use DAX for read-heavy workloads (microsecond latency)
  • On-Demand mode: No capacity planning, pay per request
  • Provisioned mode: Predictable cost, can use Auto Scaling
  • Eventually consistent reads are half the cost of strongly consistent
  • Global Secondary Indexes (GSI) enable different query patterns
  • Avoid Scan operations in production (reads entire table)

Section 4: High-Performing Network Architectures

Introduction

The problem: Network latency and bandwidth limitations impact application performance. Users far from your servers experience slow load times. Inefficient routing increases costs. Poor network design creates bottlenecks.

The solution: AWS provides multiple networking services to optimize performance. CloudFront caches content at edge locations. Global Accelerator routes traffic over AWS's optimized network. VPC design and load balancing strategies improve throughput and reduce latency.

Why it's tested: Network performance affects user experience. This section tests your ability to design network architectures for optimal performance and cost.

Core Concepts

Amazon CloudFront Performance Optimization

What it is: Amazon CloudFront is a content delivery network (CDN) that caches content at edge locations worldwide. CloudFront reduces latency by serving content from the location closest to users.

Why it exists: Serving content from a single region results in high latency for distant users. A user in Australia accessing content in US-East-1 experiences 200-300ms latency. CloudFront caches content at 400+ edge locations, reducing latency to 10-50ms.

Real-world analogy: CloudFront is like having local warehouses in every city instead of one central warehouse. Customers get products faster because they're shipped from the nearest warehouse.

CloudFront Performance Characteristics:

Latency Reduction:

  • Direct to S3: 100-300ms (depends on distance)
  • Via CloudFront: 10-50ms (edge location nearby)
  • Improvement: 2-10x faster

Cache Hit Ratio:

  • High cache hit ratio (>80%): Most requests served from edge
  • Low cache hit ratio (<50%): Many requests go to origin (slower, more expensive)

Detailed Example: Global Website

Scenario: You have a website with users worldwide. Static assets (images, CSS, JavaScript) are 10 MB per page. You have 1 million page views per day.

Without CloudFront:

  • Data transfer: 1M views Ɨ 10 MB = 10 TB/day
  • S3 data transfer cost: 10 TB Ɨ $0.09/GB = $900/day
  • S3 requests: 1M Ɨ 50 objects/page = 50M requests
  • S3 request cost: 50M Ɨ $0.0004/1K = $20/day
  • Total cost: $920/day = $27,600/month
  • Latency: 100-300ms (varies by user location)

With CloudFront (80% cache hit rate):

  • Origin requests: 20% Ɨ 50M = 10M requests
  • S3 data transfer: 20% Ɨ 10 TB = 2 TB
  • S3 cost: (2 TB Ɨ $0.09/GB) + (10M Ɨ $0.0004/1K) = $180 + $4 = $184/day
  • CloudFront data transfer: 10 TB Ɨ $0.085/GB = $850/day
  • CloudFront requests: 50M Ɨ $0.0075/10K = $37.50/day
  • Total cost: $1,071.50/day = $32,145/month
  • Latency: 10-50ms (much faster)
  • Trade-off: 16% higher cost but 3-10x better performance

Optimization: Increase Cache Hit Ratio:

  • Cache-Control headers: Set appropriate TTL (Time To Live)
  • Query string handling: Don't forward unnecessary query strings
  • Cookie handling: Don't forward unnecessary cookies
  • Result: 80% → 95% cache hit ratio
  • New origin requests: 5% Ɨ 50M = 2.5M requests
  • New S3 cost: $46/day
  • New total cost: $933.50/day = $28,005/month
  • Savings: $4,140/month (13% cheaper than CloudFront with 80% hit ratio)

⭐ Must Know (CloudFront Performance):

  • CloudFront caches content at 400+ edge locations worldwide
  • Cache hit ratio is critical (aim for >80%)
  • Use Cache-Control headers to control TTL
  • CloudFront supports both static and dynamic content
  • Origin Shield adds additional caching layer (reduces origin load)
  • Use signed URLs/cookies for private content
  • CloudFront integrates with AWS WAF for security
  • Regional Edge Caches provide additional caching between edge and origin

AWS Global Accelerator

What it is: AWS Global Accelerator routes traffic over AWS's global network infrastructure instead of the public internet. It provides static IP addresses that route to optimal AWS endpoints.

Why it exists: Public internet routing is unpredictable and can be slow. Global Accelerator uses AWS's private network, which is faster and more reliable than public internet.

Real-world analogy: Global Accelerator is like taking a private highway instead of public roads. The private highway has less traffic, better maintenance, and faster speeds.

Performance Benefits:

  • Latency reduction: 10-60% faster than public internet
  • Consistent performance: AWS network is more reliable
  • Automatic failover: Routes to healthy endpoints
  • Static IPs: No DNS caching issues

When to use Global Accelerator vs CloudFront:

  • CloudFront: Static content, caching, HTTP/HTTPS
  • Global Accelerator: Dynamic content, TCP/UDP, non-HTTP protocols

Chapter Summary

What We Covered

āœ… Section 1: High-Performing Storage Solutions

  • S3 performance optimization (prefixes, multipart upload, Transfer Acceleration)
  • EBS volume types and performance characteristics (gp3, io2, st1, sc1)
  • EFS performance modes and throughput modes
  • FSx for specialized file systems (Windows, Lustre, ONTAP, OpenZFS)

āœ… Section 2: High-Performing Compute Solutions

  • EC2 instance families and types (T3, M5, C5, R5, I3, P3, G4)
  • Instance sizing and right-sizing strategies
  • Lambda performance optimization (memory, concurrency, cold starts)
  • Provisioned Concurrency for latency-sensitive applications

āœ… Section 3: High-Performing Database Solutions

  • RDS performance optimization (read replicas, RDS Proxy, storage types)
  • Aurora advantages (distributed storage, 15 read replicas, fast failover)
  • DynamoDB partition key design and DAX caching
  • Database selection based on workload characteristics

āœ… Section 4: High-Performing Network Architectures

  • CloudFront CDN for global content delivery
  • Cache hit ratio optimization
  • Global Accelerator for non-HTTP traffic
  • Network performance optimization strategies

Critical Takeaways

  1. S3 Performance: Use multiple prefixes for >5,500 GET/sec. Use multipart upload for >100 MB objects. Use Transfer Acceleration for long-distance uploads. Use CloudFront for frequently accessed content.

  2. EBS Selection: Use gp3 for most workloads (better price/performance than gp2). Use io2 for high-IOPS databases (>16,000 IOPS). Use st1 for throughput-intensive workloads. Use sc1 for infrequently accessed data.

  3. EFS vs EBS: Use EFS for shared file access across multiple instances. Use EBS for single-instance block storage. EFS automatically scales; EBS requires manual resizing.

  4. EC2 Instance Selection: Match instance family to workload (T3 for variable, M5 for balanced, C5 for CPU, R5 for memory, I3 for storage). Use Compute Optimizer for right-sizing recommendations.

  5. Lambda Optimization: For CPU-intensive workloads, increasing memory reduces execution time proportionally (same cost, better performance). Use Provisioned Concurrency to eliminate cold starts. Use parallel invocations for high throughput.

  6. RDS Performance: Use read replicas to offload read traffic (up to 5 replicas). Use RDS Proxy for connection pooling (critical for Lambda). Use gp3 storage for better price/performance. Use Performance Insights to identify slow queries.

  7. Aurora Advantages: Up to 15 read replicas (vs 5 for RDS). <10ms replication lag (vs 100ms+ for RDS). 30-second failover (vs 60-120 seconds for RDS). Continuous backup with no performance impact.

  8. DynamoDB Optimization: Design partition keys for even distribution (high cardinality). Use DAX for read-heavy workloads (microsecond latency). Use On-Demand mode for unpredictable workloads. Avoid Scan operations in production.

  9. CloudFront Performance: Caches content at 400+ edge locations. Aim for >80% cache hit ratio. Use Cache-Control headers to control TTL. Reduces latency by 2-10x for global users.

  10. Global Accelerator: Routes traffic over AWS network (10-60% faster than internet). Use for dynamic content and non-HTTP protocols. Provides static IPs and automatic failover.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand S3 performance limits (5,500 GET/sec per prefix)
  • I know when to use multipart upload and Transfer Acceleration
  • I can explain the difference between gp3 and io2 EBS volumes
  • I understand when to use EFS vs EBS
  • I know the different EC2 instance families and their use cases
  • I can right-size EC2 instances based on utilization
  • I understand how Lambda memory affects CPU and performance
  • I know when to use Provisioned Concurrency
  • I understand how RDS read replicas improve performance
  • I know when to use RDS Proxy
  • I can explain Aurora's performance advantages over RDS
  • I understand DynamoDB partition key design principles
  • I know when to use DAX for DynamoDB
  • I understand how CloudFront reduces latency
  • I can explain when to use Global Accelerator vs CloudFront

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-25 (Storage and compute)
  • Domain 3 Bundle 2: Questions 26-50 (Database and networking)
  • Full Practice Test 1: Questions 38-53 (Domain 3 questions)

Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • S3 performance optimization techniques
    • EBS volume type selection criteria
    • EC2 instance family characteristics
    • Lambda memory and concurrency
    • RDS read replica use cases
    • DynamoDB partition key design
    • CloudFront caching strategies

Quick Reference Card

Storage Services:

  • S3: Object storage, 5,500 GET/sec per prefix, unlimited scale
  • EBS gp3: General purpose SSD, 3,000-16,000 IOPS, $0.08/GB-month
  • EBS io2: High-performance SSD, up to 64,000 IOPS, $0.125/GB-month
  • EFS: Shared file system, 50 MB/s per TB, $0.30/GB-month
  • FSx Lustre: HPC file system, 200 MB/s per TB, $0.145/GB-month

Compute Services:

  • T3: Burstable CPU, cost-effective for variable workloads
  • M5: General purpose, balanced CPU/memory (1:4 ratio)
  • C5: Compute optimized, high CPU-to-memory ratio (1:2 ratio)
  • R5: Memory optimized, high memory-to-CPU ratio (1:8 ratio)
  • Lambda: Serverless, 1,769 MB = 1 vCPU, 15-minute timeout

Database Services:

  • RDS: Managed relational database, up to 5 read replicas
  • Aurora: Cloud-native database, up to 15 read replicas, <10ms lag
  • DynamoDB: NoSQL, single-digit millisecond latency, unlimited scale
  • DAX: DynamoDB cache, microsecond latency, 95% cost reduction
  • RDS Proxy: Connection pooling, improves Lambda scalability

Network Services:

  • CloudFront: CDN, 400+ edge locations, 10-50ms latency
  • Global Accelerator: AWS network routing, 10-60% faster than internet
  • VPC Endpoints: Private connectivity to AWS services, no internet gateway

Decision Points:

  • High request rate → Use multiple S3 prefixes
  • Large file upload → Use S3 multipart upload
  • Shared file access → Use EFS (not EBS)
  • High IOPS database → Use io2 EBS or Aurora
  • Variable CPU workload → Use T3 burstable instances
  • Read-heavy database → Use RDS read replicas or Aurora
  • DynamoDB read-heavy → Use DAX caching
  • Many database connections → Use RDS Proxy
  • Global users → Use CloudFront CDN
  • Non-HTTP traffic → Use Global Accelerator

Next Chapter: 05_domain4_cost_optimized_architectures - Design Cost-Optimized Architectures (20% of exam)


Chapter Summary

What We Covered

This chapter covered Domain 3: Design High-Performing Architectures (24% of the exam). We explored five major task areas:

  • āœ… Task 3.1 - High-Performing Storage Solutions: S3 performance optimization, EBS volume types, EFS throughput modes, FSx file systems, hybrid storage with Storage Gateway
  • āœ… Task 3.2 - High-Performing Compute Solutions: EC2 instance types and families, placement groups, Auto Scaling strategies, Lambda optimization, ECS/EKS capacity providers
  • āœ… Task 3.3 - High-Performing Database Solutions: RDS instance sizing, Aurora performance features, DynamoDB capacity modes, ElastiCache strategies, database connection pooling
  • āœ… Task 3.4 - High-Performing Network Architectures: CloudFront edge caching, Global Accelerator, VPC design for performance, Direct Connect, load balancer optimization
  • āœ… Task 3.5 - Data Ingestion and Transformation: Kinesis streaming, Glue ETL, Athena query optimization, EMR big data processing, data lake architectures

Critical Takeaways

  1. Match Storage to Workload: Use S3 for object storage with 11 9's durability, EBS for block storage with low latency, EFS for shared file systems, and FSx for specialized workloads (Windows, Lustre, NetApp).

  2. Choose the Right Compute: EC2 for full control, Lambda for event-driven serverless, Fargate for serverless containers, and ECS/EKS for container orchestration. Match instance types to workload characteristics.

  3. Database Performance is Multi-Faceted: Consider read/write patterns, use read replicas for read-heavy workloads, implement caching with ElastiCache, and choose between relational (RDS/Aurora) and NoSQL (DynamoDB) based on data structure.

  4. Edge Services Reduce Latency: Use CloudFront for content delivery, Global Accelerator for static IP and TCP/UDP optimization, and Route 53 latency-based routing for global applications.

  5. Caching is Critical: Implement caching at multiple layers - CloudFront for static content, ElastiCache for database queries, DAX for DynamoDB, API Gateway for API responses.

  6. Streaming vs. Batch Processing: Use Kinesis for real-time streaming data, Glue for batch ETL, and EMR for large-scale data processing. Choose based on latency requirements.

  7. Optimize Data Transfer: Use S3 Transfer Acceleration for long-distance uploads, multipart upload for large files, and VPC endpoints to avoid internet traffic.

Self-Assessment Checklist

Test yourself before moving to Domain 4. You should be able to:

High-Performing Storage:

  • Choose appropriate S3 storage class based on access patterns
  • Optimize S3 performance using prefixes and multipart upload
  • Select EBS volume type (gp3, io2, st1, sc1) based on IOPS/throughput needs
  • Configure EFS performance mode (General Purpose vs. Max I/O)
  • Choose FSx file system (Windows, Lustre, NetApp, OpenZFS) for specific workloads
  • Implement S3 Transfer Acceleration for global uploads
  • Use Storage Gateway for hybrid cloud storage

High-Performing Compute:

  • Select EC2 instance family (C, M, R, T, I, G, P) based on workload
  • Configure EC2 placement groups (Cluster, Spread, Partition)
  • Optimize Lambda function memory and timeout settings
  • Implement Lambda provisioned concurrency for consistent performance
  • Choose between ECS EC2 and ECS Fargate based on requirements
  • Configure Auto Scaling policies for optimal performance and cost
  • Use Compute Optimizer for right-sizing recommendations

High-Performing Databases:

  • Choose between RDS and Aurora based on performance needs
  • Configure RDS read replicas for read-heavy workloads
  • Select DynamoDB capacity mode (On-Demand vs. Provisioned)
  • Design DynamoDB partition keys for even distribution
  • Implement ElastiCache (Redis or Memcached) for caching
  • Use DynamoDB DAX for microsecond latency
  • Configure RDS Proxy for connection pooling

High-Performing Networks:

  • Configure CloudFront distributions with optimal caching policies
  • Use Global Accelerator for static IP and improved performance
  • Design VPC with appropriate subnet sizing and routing
  • Choose between ALB and NLB based on performance requirements
  • Implement Direct Connect for consistent network performance
  • Use VPC endpoints to reduce latency and data transfer costs
  • Configure Route 53 latency-based routing for global applications

Data Ingestion and Transformation:

  • Design Kinesis Data Streams for real-time data ingestion
  • Use Kinesis Data Firehose for data delivery to S3/Redshift
  • Configure Glue ETL jobs for data transformation
  • Optimize Athena queries with partitioning and columnar formats
  • Choose between EMR and Glue for big data processing
  • Implement data lake architecture with Lake Formation
  • Use QuickSight for data visualization

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-50 (storage and compute performance)
  • Domain 3 Bundle 2: Questions 1-50 (database and network performance)
  • Storage Services Bundle: Questions 1-50 (S3, EBS, EFS, FSx)
  • Database Services Bundle: Questions 1-50 (RDS, Aurora, DynamoDB, ElastiCache)
  • Compute Services Bundle: Questions 1-50 (EC2, Lambda, ECS, EKS)

Expected Score: 75%+ to proceed

If you scored below 75%:

  • Storage weak: Review S3 performance optimization, EBS volume types, EFS modes
  • Compute weak: Review EC2 instance types, Lambda optimization, Auto Scaling
  • Database weak: Review RDS vs. Aurora, DynamoDB design, caching strategies
  • Network weak: Review CloudFront, Global Accelerator, load balancer types
  • Revisit diagrams: S3 performance, EC2 instance selection, database architecture, CloudFront caching

Common Exam Traps

Watch out for these in Domain 3 questions:

  1. EBS Volume Types: gp3 is newer and more cost-effective than gp2; io2 Block Express for highest IOPS (256,000)
  2. S3 Performance: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
  3. Lambda Memory: More memory = more CPU; optimize for both performance and cost
  4. DynamoDB Partition Keys: Poor key design leads to hot partitions and throttling
  5. ElastiCache Redis vs. Memcached: Redis for persistence, replication, advanced data structures; Memcached for simple caching
  6. CloudFront vs. Global Accelerator: CloudFront for HTTP/HTTPS content; Global Accelerator for TCP/UDP with static IP
  7. RDS Read Replicas: Asynchronous replication, can have lag; not for high availability (use Multi-AZ)

Quick Reference Card

Storage Performance:

  • S3: 3,500 PUT/5,500 GET per second per prefix, use multipart for >100 MB
  • EBS gp3: 3,000-16,000 IOPS, 125-1,000 MB/s throughput (independent)
  • EBS io2: Up to 64,000 IOPS (256,000 with Block Express), 99.999% durability
  • EFS: General Purpose (default) or Max I/O (higher aggregate throughput)
  • FSx Lustre: HPC workloads, 100s GB/s throughput, sub-millisecond latency

Compute Instance Families:

  • C: Compute-optimized (batch processing, HPC, gaming)
  • M: General purpose (balanced compute, memory, network)
  • R: Memory-optimized (in-memory databases, big data)
  • T: Burstable (variable workloads, development)
  • I: Storage-optimized (NoSQL databases, data warehousing)
  • G: GPU (machine learning, graphics rendering)
  • P: GPU compute (deep learning training)

Database Performance:

  • RDS: Up to 64 TB storage, 80,000 IOPS with io2
  • Aurora: 5x MySQL, 3x PostgreSQL performance, 128 TB storage
  • DynamoDB: Single-digit millisecond latency, unlimited throughput (On-Demand)
  • DAX: Microsecond latency for DynamoDB, 10x performance improvement
  • ElastiCache Redis: Sub-millisecond latency, up to 500 nodes per cluster
  • ElastiCache Memcached: Sub-millisecond latency, up to 20 nodes per cluster

Network Performance:

  • CloudFront: 200+ edge locations, cache TTL 0-365 days
  • Global Accelerator: 2 static anycast IPs, 60% performance improvement
  • Direct Connect: 1 Gbps or 10 Gbps dedicated connection, consistent latency
  • ALB: 100,000s requests/sec, WebSocket support
  • NLB: Millions of requests/sec, <100 microsecond latency

Data Ingestion:

  • Kinesis Data Streams: Real-time, 1 MB/sec per shard, 1,000 records/sec per shard
  • Kinesis Data Firehose: Near real-time (60 sec buffer), automatic scaling
  • Glue: Serverless ETL, Apache Spark-based, pay per DPU-hour
  • EMR: Managed Hadoop/Spark, up to 1,000s of nodes
  • Athena: Serverless SQL, pay per TB scanned, query S3 directly

Decision Frameworks

When to use which storage:

  • S3: Object storage, static content, backups, data lakes
  • EBS: Block storage for EC2, databases, boot volumes
  • EFS: Shared file system, Linux workloads, content management
  • FSx Windows: Windows file shares, Active Directory integration
  • FSx Lustre: HPC, machine learning, high-throughput workloads

When to use which database:

  • RDS: Relational data, ACID transactions, existing SQL applications
  • Aurora: RDS with better performance, global databases, serverless option
  • DynamoDB: NoSQL, key-value, millisecond latency, unlimited scale
  • ElastiCache: In-memory caching, session storage, leaderboards
  • Redshift: Data warehousing, OLAP, petabyte-scale analytics

When to use which compute:

  • EC2: Full control, custom configurations, long-running workloads
  • Lambda: Event-driven, <15 min execution, serverless
  • Fargate: Serverless containers, no infrastructure management
  • ECS: Container orchestration, AWS-native, simpler than Kubernetes
  • EKS: Kubernetes, multi-cloud, complex orchestration needs

Integration with Other Domains

Performance concepts from Domain 3 integrate with:

  • Domain 1 (Secure Architectures): Encryption overhead, VPC endpoints for security and performance
  • Domain 2 (Resilient Architectures): Read replicas for both performance and availability
  • Domain 4 (Cost-Optimized Architectures): Right-sizing for cost-performance balance

Key Performance Metrics

Latency Targets:

  • S3: 100-200 ms first byte
  • EBS: Single-digit milliseconds
  • DynamoDB: Single-digit milliseconds (DAX: microseconds)
  • ElastiCache: Sub-millisecond
  • CloudFront: <50 ms (edge locations)

Throughput Targets:

  • S3: 3,500 PUT/5,500 GET per second per prefix
  • EBS gp3: Up to 1,000 MB/s
  • EBS io2: Up to 4,000 MB/s
  • EFS: 10+ GB/s aggregate throughput
  • FSx Lustre: 100s GB/s

Scaling Limits:

  • Lambda: 1,000 concurrent executions (default)
  • DynamoDB: 40,000 RCU/WCU per table (On-Demand: unlimited)
  • RDS: Up to 64 TB storage, 80,000 IOPS
  • Aurora: Up to 128 TB storage, 15 read replicas

Next Steps

You're now ready for Domain 4: Design Cost-Optimized Architectures (Chapter 5). This domain covers:

  • Cost-optimized storage solutions (20% of exam weight)
  • Cost-optimized compute solutions
  • Cost-optimized database solutions
  • Cost-optimized network architectures

Performance principles from this chapter will be balanced with cost considerations in Domain 4.


Chapter 3 Complete āœ… | Next: Chapter 4 - Domain 4: Cost-Optimized Architectures


Chapter Summary

What We Covered

  • āœ… High-Performing Storage Solutions
    • S3 performance optimization (prefixes, multipart upload, Transfer Acceleration)
    • EBS volume types (gp3, io2, st1, sc1)
    • EFS performance modes (General Purpose, Max I/O)
    • FSx file systems (Windows, Lustre, NetApp ONTAP, OpenZFS)
  • āœ… Elastic Compute Solutions
    • EC2 instance types and families
    • Auto Scaling strategies
    • Lambda optimization (memory, concurrency, provisioned concurrency)
    • Container orchestration (ECS, EKS, Fargate)
  • āœ… High-Performing Database Solutions
    • RDS instance types and storage
    • Aurora performance features (Parallel Query, Global Database)
    • DynamoDB capacity modes and DAX
    • ElastiCache (Redis vs Memcached)
  • āœ… Network Optimization
    • CloudFront edge caching
    • Global Accelerator
    • VPC endpoints (Gateway vs Interface)
    • Direct Connect and LAG
  • āœ… Data Ingestion and Analytics
    • Kinesis (Data Streams, Firehose, Analytics)
    • Glue ETL and Data Catalog
    • Athena query optimization
    • EMR for big data processing

Critical Takeaways

  1. Storage Performance: Use S3 prefixes for parallelization (3500 PUT/5500 GET per prefix), gp3 for cost-effective IOPS, io2 for mission-critical workloads, EFS for shared file access
  2. Compute Optimization: Choose instance types based on workload (compute-optimized for CPU, memory-optimized for RAM, storage-optimized for I/O), use Auto Scaling for elasticity, Lambda for event-driven
  3. Database Performance: Aurora for high-performance relational (15 read replicas), DynamoDB for single-digit millisecond NoSQL, DAX for microsecond caching, ElastiCache for sub-millisecond
  4. Network Acceleration: CloudFront for global content delivery (50+ edge locations), Global Accelerator for static IP and health-based routing, VPC endpoints to avoid internet gateway
  5. Data Analytics: Kinesis for real-time streaming, Glue for ETL, Athena for serverless SQL on S3, EMR for big data frameworks (Spark, Hadoop)

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain S3 performance optimization techniques (prefixes, multipart, Transfer Acceleration)
  • I understand EBS volume types and when to use each (gp3, io2, st1, sc1)
  • I know the difference between EFS performance modes
  • I can select appropriate EC2 instance types for different workloads
  • I understand Lambda memory and concurrency optimization
  • I know when to use RDS vs Aurora vs DynamoDB
  • I can explain DynamoDB capacity modes (On-Demand vs Provisioned)
  • I understand CloudFront caching strategies
  • I know when to use Global Accelerator vs CloudFront
  • I can design a high-performing data ingestion pipeline with Kinesis

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-25 (Storage and compute)
  • Domain 3 Bundle 2: Questions 1-25 (Database and network)
  • Storage Services Bundle: Questions 1-25
  • Database Services Bundle: Questions 1-25
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: S3 performance, EBS volume types, Database selection, CloudFront vs Global Accelerator
  • Focus on: Understanding performance characteristics and when to use each service

Quick Reference Card

Storage Performance:

  • S3: 3,500 PUT/5,500 GET per prefix, multipart for >100MB, Transfer Acceleration for global uploads
  • EBS gp3: 3,000 IOPS baseline, up to 16,000 IOPS, 125-1,000 MB/s
  • EBS io2: Up to 64,000 IOPS, 1,000 MB/s, 99.999% durability
  • EFS: 10+ GB/s aggregate, General Purpose (low latency) or Max I/O (high throughput)
  • FSx Lustre: 100s GB/s, sub-millisecond latency, HPC workloads

Compute Instance Families:

  • General Purpose (T, M): Balanced CPU/memory, web servers, dev/test
  • Compute Optimized (C): High CPU, batch processing, gaming, HPC
  • Memory Optimized (R, X): High RAM, databases, in-memory caches
  • Storage Optimized (I, D, H): High I/O, data warehouses, NoSQL
  • Accelerated Computing (P, G, F): GPU/FPGA, ML, graphics

Database Performance:

  • RDS: Up to 64 TB, 80,000 IOPS, 5 read replicas
  • Aurora: Up to 128 TB, 15 read replicas, Parallel Query, Global Database
  • DynamoDB: Single-digit milliseconds, unlimited throughput (On-Demand)
  • DAX: Microsecond latency, in-memory cache for DynamoDB
  • ElastiCache Redis: Sub-millisecond, persistence, replication
  • ElastiCache Memcached: Sub-millisecond, multi-threaded, no persistence

Network Performance:

  • CloudFront: 50+ edge locations, <50ms latency, caching, DDoS protection
  • Global Accelerator: Static anycast IPs, health-based routing, 60% performance improvement
  • VPC Endpoint Gateway: S3 and DynamoDB, no data transfer charges
  • VPC Endpoint Interface: Other AWS services, PrivateLink, $0.01/hour + data
  • Direct Connect: 1-100 Gbps, consistent latency, private connectivity

Data Ingestion:

  • Kinesis Data Streams: Real-time, 1MB/sec per shard, 24h-365d retention
  • Kinesis Firehose: Near real-time, auto-scaling, direct to S3/Redshift/ES
  • Kinesis Analytics: SQL on streaming data, real-time dashboards
  • Glue: Serverless ETL, Data Catalog, crawlers
  • Athena: Serverless SQL on S3, pay per query, partition for performance

Decision Points:

  • Need high IOPS? → EBS io2 (64,000 IOPS) or FSx Lustre (millions IOPS)
  • Need shared file storage? → EFS (Linux) or FSx (Windows/Lustre/NetApp)
  • Need fast database? → Aurora (relational) or DynamoDB (NoSQL) or ElastiCache (cache)
  • Need global content delivery? → CloudFront (caching) or Global Accelerator (TCP/UDP)
  • Need real-time analytics? → Kinesis Data Streams + Analytics
  • Need batch analytics? → Glue ETL + Athena


Chapter Summary

What We Covered

This chapter covered Domain 3: Design High-Performing Architectures (24% of the exam). We explored five major task areas:

āœ… Task 3.1: Determine High-Performing and/or Scalable Storage Solutions

  • S3 storage classes and performance optimization
  • EBS volume types and IOPS provisioning
  • EFS for shared file storage with performance modes
  • FSx for specialized file systems (Windows, Lustre, NetApp, OpenZFS)
  • Storage Gateway for hybrid cloud storage

āœ… Task 3.2: Design High-Performing and Elastic Compute Solutions

  • EC2 instance types and families (compute, memory, storage optimized)
  • Lambda optimization: memory, concurrency, layers, provisioned concurrency
  • Container orchestration with ECS and EKS
  • Auto Scaling strategies for elastic compute
  • Batch processing with AWS Batch and EMR

āœ… Task 3.3: Determine High-Performing Database Solutions

  • RDS vs Aurora performance characteristics
  • DynamoDB performance: on-demand vs provisioned, DAX caching
  • ElastiCache for sub-millisecond latency (Redis vs Memcached)
  • Database read replicas and replication strategies
  • Database connection pooling with RDS Proxy

āœ… Task 3.4: Determine High-Performing and/or Scalable Network Architectures

  • CloudFront for global content delivery and edge caching
  • Global Accelerator for TCP/UDP performance improvement
  • VPC networking: subnets, route tables, endpoints
  • Direct Connect for dedicated high-bandwidth connectivity
  • Load balancing strategies for optimal traffic distribution

āœ… Task 3.5: Determine High-Performing Data Ingestion and Transformation Solutions

  • Kinesis Data Streams for real-time data ingestion
  • Kinesis Firehose for near real-time delivery to data stores
  • Glue for serverless ETL and data cataloging
  • Athena for serverless SQL queries on S3
  • EMR for big data processing with Hadoop/Spark

Critical Takeaways

  1. Choose the right storage for the workload: S3 for objects, EBS for block, EFS for shared files. Match storage class to access patterns (Frequent → IA → Glacier).

  2. IOPS matter for databases: Use io2 Block Express for highest IOPS (256,000). Use gp3 for cost-effective performance. Provision IOPS for consistent performance.

  3. Right-size compute instances: Use Compute Optimizer recommendations. Match instance family to workload (c5 for compute, r5 for memory, i3 for storage).

  4. Lambda optimization is critical: More memory = more CPU. Use provisioned concurrency for consistent latency. Use layers for shared code. Optimize cold starts.

  5. Caching reduces latency and cost: Use CloudFront for static content, ElastiCache for database queries, DAX for DynamoDB, API Gateway caching for APIs.

  6. Database choice affects performance: Aurora for high-performance relational, DynamoDB for single-digit millisecond NoSQL, ElastiCache for sub-millisecond caching.

  7. Read replicas for read-heavy workloads: RDS supports up to 5 read replicas, Aurora supports up to 15. Use for reporting and analytics without impacting primary.

  8. Global performance requires edge services: CloudFront for content delivery, Global Accelerator for TCP/UDP, Route 53 latency-based routing for optimal endpoint selection.

  9. Real-time vs batch processing: Kinesis Data Streams for real-time (sub-second), Kinesis Firehose for near real-time (60 seconds), Glue/EMR for batch (minutes to hours).

  10. Partition data for performance: S3 prefixes for parallel requests, DynamoDB partition keys for even distribution, Athena partitions for faster queries.

Key Services Quick Reference

Storage Services:

  • S3: Object storage, 11 9's durability, 5,500 GET/3,500 PUT per prefix per second
  • S3 Intelligent-Tiering: Automatic cost optimization based on access patterns
  • EBS gp3: General purpose SSD, 3,000-16,000 IOPS, 125-1,000 MB/s
  • EBS io2: Provisioned IOPS SSD, up to 64,000 IOPS, 99.999% durability
  • EBS io2 Block Express: Up to 256,000 IOPS, 4,000 MB/s, sub-millisecond latency
  • EFS: Shared file storage, automatic scaling, bursting and provisioned throughput
  • FSx Lustre: HPC file system, up to 1 TB/s throughput, millions IOPS
  • FSx Windows: Windows file server, SMB protocol, Active Directory integration

Compute Services:

  • EC2: Virtual machines, 400+ instance types, multiple families
  • Lambda: Serverless functions, 128 MB - 10 GB memory, 15 min timeout
  • Fargate: Serverless containers, no server management, automatic scaling
  • ECS: Container orchestration, EC2 or Fargate launch types
  • EKS: Managed Kubernetes, complex container workloads
  • Batch: Managed batch processing, automatic scaling, job scheduling
  • EMR: Big data processing, Hadoop, Spark, Presto, Hive

Database Services:

  • RDS: Managed relational database, up to 5 read replicas, Multi-AZ
  • Aurora: High-performance MySQL/PostgreSQL, 5x faster, 15 read replicas
  • Aurora Serverless: Auto-scaling database, pay per second, pause when idle
  • DynamoDB: NoSQL, single-digit milliseconds, unlimited throughput (on-demand)
  • DAX: DynamoDB Accelerator, microsecond latency, in-memory cache
  • ElastiCache Redis: Sub-millisecond, persistence, replication, Lua scripts
  • ElastiCache Memcached: Sub-millisecond, multi-threaded, no persistence
  • RDS Proxy: Connection pooling, reduce database load, improve failover

Networking Services:

  • CloudFront: CDN, 50+ edge locations, <50ms latency, caching, DDoS protection
  • Global Accelerator: Static anycast IPs, health-based routing, 60% performance improvement
  • Direct Connect: 1-100 Gbps, consistent latency, private connectivity
  • VPC Endpoint Gateway: S3 and DynamoDB, no data transfer charges
  • VPC Endpoint Interface: Other AWS services, PrivateLink, $0.01/hour + data
  • Transit Gateway: Hub-and-spoke, up to 50 Gbps per VPN connection

Data Processing Services:

  • Kinesis Data Streams: Real-time, 1 MB/sec per shard, 24h-365d retention
  • Kinesis Firehose: Near real-time (60 sec), auto-scaling, direct to S3/Redshift
  • Kinesis Analytics: SQL on streaming data, real-time dashboards
  • Glue: Serverless ETL, Data Catalog, crawlers, job bookmarks
  • Athena: Serverless SQL on S3, pay per query ($5 per TB scanned)
  • Redshift: Data warehouse, columnar storage, massively parallel processing
  • QuickSight: BI and visualization, SPICE in-memory engine

Decision Frameworks

Choosing Storage Service:

What type of data?
ā”œā”€ Objects (files, images, videos)?
│  ā”œā”€ Frequent access? → S3 Standard
│  ā”œā”€ Infrequent access? → S3 IA or Intelligent-Tiering
│  └─ Archive? → Glacier or Glacier Deep Archive
ā”œā”€ Block storage (databases, boot volumes)?
│  ā”œā”€ Need highest IOPS? → io2 Block Express (256,000 IOPS)
│  ā”œā”€ Consistent performance? → io2 (64,000 IOPS)
│  └─ General purpose? → gp3 (16,000 IOPS, cost-effective)
ā”œā”€ Shared file storage?
│  ā”œā”€ Linux/NFS? → EFS
│  ā”œā”€ Windows/SMB? → FSx Windows
│  ā”œā”€ HPC/ML? → FSx Lustre
│  └─ NetApp ONTAP? → FSx NetApp
└─ Hybrid cloud? → Storage Gateway

Choosing Compute Service:

What's the workload?
ā”œā”€ Short-lived functions (<15 min)? → Lambda
ā”œā”€ Containers?
│  ā”œā”€ Need Kubernetes? → EKS
│  ā”œā”€ Simple containers? → ECS on Fargate
│  └─ Need EC2 control? → ECS on EC2
ā”œā”€ Batch processing?
│  ā”œā”€ Big data (Hadoop/Spark)? → EMR
│  └─ General batch jobs? → AWS Batch
ā”œā”€ Long-running applications?
│  ā”œā”€ Need full control? → EC2
│  └─ Want managed platform? → Elastic Beanstalk
└─ High-performance computing? → EC2 with placement groups

Choosing Database Service:

Requirement Solution Performance Use Case
Relational, high performance Aurora 5x MySQL, 3x PostgreSQL OLTP, high concurrency
Relational, standard RDS Standard MySQL/PostgreSQL General purpose
NoSQL, key-value DynamoDB Single-digit ms High scale, flexible schema
NoSQL, document DocumentDB MongoDB compatible Document storage
In-memory cache ElastiCache Sub-millisecond Session store, caching
Graph database Neptune Graph queries Social networks, fraud detection
Time series Timestream Optimized for time series IoT, metrics, logs
Data warehouse Redshift Columnar, MPP Analytics, BI

Choosing Caching Strategy:

Layer Service TTL Use Case
Edge CloudFront Hours-days Static content, videos, images
API API Gateway Seconds-hours API responses, reduce backend load
Application ElastiCache Minutes-hours Session data, database queries
Database DAX Milliseconds DynamoDB queries, hot keys
Query Athena N/A Query results (automatic)

Choosing Data Ingestion Service:

Requirement Service Latency Throughput Use Case
Real-time streaming Kinesis Data Streams <1 second 1 MB/s per shard Real-time analytics, log processing
Near real-time delivery Kinesis Firehose 60 seconds Auto-scaling ETL to S3/Redshift/ES
Batch transfer DataSync Minutes 10 Gbps On-premises to AWS migration
Large datasets Snow Family Days Petabytes Offline data transfer
Database migration DMS Continuous Varies Homogeneous/heterogeneous migration

Common Exam Patterns

Pattern 1: "Highest Performance" Questions

  • Look for: io2 Block Express, Aurora, DynamoDB, ElastiCache, CloudFront
  • Eliminate: Standard storage, single instance, no caching
  • Choose: Highest IOPS, lowest latency, distributed architecture

Pattern 2: "Optimize Database Performance" Questions

  • Look for: Read replicas, caching (ElastiCache/DAX), RDS Proxy, Aurora
  • Eliminate: Single database instance, no caching, synchronous reads
  • Choose: Read replicas for reads, caching for hot data, connection pooling

Pattern 3: "Global Performance" Questions

  • Look for: CloudFront, Global Accelerator, Route 53 latency routing, multi-region
  • Eliminate: Single region, no edge caching, no geographic routing
  • Choose: Edge services for content delivery, multi-region for data locality

Pattern 4: "Real-Time Processing" Questions

  • Look for: Kinesis Data Streams, Lambda, DynamoDB Streams, ElastiCache
  • Eliminate: Batch processing, high latency, polling
  • Choose: Streaming data services with sub-second latency

Pattern 5: "Cost-Effective Performance" Questions

  • Look for: S3 Intelligent-Tiering, gp3 volumes, Aurora Serverless, DynamoDB on-demand
  • Eliminate: Over-provisioned resources, always-on when not needed
  • Choose: Auto-scaling, pay-per-use, right-sized resources

Self-Assessment Checklist

Test yourself before moving to the next chapter:

Storage Performance:

  • I can choose the right S3 storage class based on access patterns
  • I understand EBS volume types and when to use each (gp3, io2, st1, sc1)
  • I know when to use EFS vs FSx for shared file storage
  • I can optimize S3 performance with prefixes and multipart upload
  • I understand storage performance metrics (IOPS, throughput, latency)

Compute Performance:

  • I can select the right EC2 instance type for different workloads
  • I understand Lambda optimization techniques (memory, concurrency, layers)
  • I know when to use ECS vs EKS vs Lambda vs EC2
  • I can design auto-scaling strategies for elastic compute
  • I understand placement groups for HPC workloads

Database Performance:

  • I can choose between RDS, Aurora, DynamoDB, and ElastiCache
  • I understand read replicas and when to use them
  • I know how to use DAX for DynamoDB caching
  • I can implement RDS Proxy for connection pooling
  • I understand database performance tuning (indexes, partitioning)

Network Performance:

  • I know when to use CloudFront vs Global Accelerator
  • I understand VPC endpoint types and performance implications
  • I can design Direct Connect for high-bandwidth connectivity
  • I know how to optimize inter-AZ and inter-region data transfer
  • I understand load balancer performance characteristics

Data Processing:

  • I can choose between Kinesis Data Streams and Firehose
  • I understand Glue for ETL and data cataloging
  • I know when to use Athena vs Redshift for analytics
  • I can design real-time data processing pipelines
  • I understand EMR for big data processing

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-20 (Storage and compute performance)
  • Domain 3 Bundle 2: Questions 21-40 (Database and network performance)
  • Domain 3 Bundle 3: Questions 41-60 (Data ingestion and transformation)
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you missed questions
  • Below 60%: Re-read the entire chapter and take detailed notes
  • Focus on:
    • EBS volume types and IOPS provisioning
    • Lambda optimization techniques
    • Database caching strategies (ElastiCache, DAX)
    • CloudFront vs Global Accelerator use cases
    • Kinesis Data Streams vs Firehose differences

Quick Reference Card

Copy this to your notes for quick review:

EBS Volume Types:

Type IOPS Throughput Use Case
gp3 3,000-16,000 125-1,000 MB/s General purpose, cost-effective
io2 100-64,000 1,000 MB/s High-performance databases
io2 Block Express 256,000 4,000 MB/s Highest performance
st1 500 500 MB/s Big data, data warehouses
sc1 250 250 MB/s Cold data, infrequent access

EC2 Instance Families:

  • C5: Compute optimized (CPU-intensive)
  • R5: Memory optimized (in-memory databases)
  • I3: Storage optimized (NoSQL databases, data warehouses)
  • M5: General purpose (balanced)
  • T3: Burstable (variable workloads)
  • P3: GPU (machine learning, HPC)

Lambda Optimization:

  • Memory: 128 MB - 10 GB (more memory = more CPU)
  • Timeout: Maximum 15 minutes
  • Concurrency: 1,000 per region (soft limit)
  • Provisioned concurrency: Pre-warmed instances for consistent latency
  • Layers: Share code across functions (up to 5 layers)
  • Cold start: ~100-500ms (reduce with provisioned concurrency)

Database Performance:

  • Aurora: 5x MySQL, 3x PostgreSQL, 15 read replicas, 30 sec failover
  • DynamoDB: Single-digit ms, unlimited throughput (on-demand)
  • DAX: Microsecond latency, 10x performance improvement
  • ElastiCache Redis: Sub-millisecond, persistence, replication
  • RDS Proxy: Connection pooling, 66% faster failover

Caching TTL Guidelines:

  • Static content (images, CSS, JS): 24 hours - 1 year
  • Semi-static (product pages): 1 hour - 24 hours
  • Dynamic (user-specific): 1 minute - 1 hour
  • Real-time (stock prices): No caching or <1 minute

Must Memorize:

  • S3 performance: 5,500 GET, 3,500 PUT per prefix per second
  • EBS gp3: 3,000 IOPS baseline, up to 16,000 IOPS
  • EBS io2 Block Express: 256,000 IOPS, 4,000 MB/s
  • Lambda timeout: Maximum 15 minutes
  • Lambda memory: 128 MB - 10 GB
  • Aurora read replicas: Up to 15
  • RDS read replicas: Up to 5
  • DynamoDB: Single-digit millisecond latency
  • DAX: Microsecond latency
  • CloudFront edge locations: 50+ globally
  • Kinesis Data Streams: 1 MB/s per shard

Congratulations! You've completed Domain 3 (24% of exam). Combined with Domains 1 and 2, you've now covered 80% of the exam content.

Next Chapter: 05_domain4_cost_optimized_architectures - Design Cost-Optimized Architectures (20% of exam)


Chapter Summary

What We Covered

This chapter covered Domain 3: Design High-Performing Architectures (24% of exam). You learned:

  • āœ… Storage Performance: S3, EBS, EFS, FSx performance characteristics and optimization
  • āœ… Compute Optimization: EC2 instance types, Lambda configuration, Auto Scaling, and container performance
  • āœ… Database Performance: RDS, Aurora, DynamoDB, ElastiCache, and database optimization strategies
  • āœ… Network Performance: CloudFront, Global Accelerator, VPC optimization, and Direct Connect
  • āœ… Caching Strategies: Application caching, content delivery, and database caching
  • āœ… Data Ingestion: Kinesis, Glue, Athena, EMR, and data pipeline optimization
  • āœ… Performance Monitoring: CloudWatch metrics, Performance Insights, and X-Ray tracing
  • āœ… Optimization Techniques: Right-sizing, placement groups, enhanced networking, and burst performance

Critical Takeaways

  1. Storage Selection: S3 for objects, EBS for block storage, EFS for shared file systems, FSx for specialized workloads
  2. EBS Performance: gp3 for general purpose (3,000 IOPS baseline), io2 Block Express for extreme performance (256,000 IOPS)
  3. S3 Performance: 5,500 GET and 3,500 PUT per prefix per second, use Transfer Acceleration for global uploads
  4. Compute Selection: EC2 for control, Lambda for serverless, Fargate for containers without servers
  5. Lambda Optimization: More memory = more CPU, use provisioned concurrency for consistent latency, layers for shared code
  6. Database Selection: Aurora for high performance relational, DynamoDB for single-digit ms NoSQL, ElastiCache for sub-ms caching
  7. DynamoDB Performance: On-demand for unpredictable, provisioned for predictable, DAX for microsecond latency
  8. Caching Strategy: CloudFront for content delivery, ElastiCache for application data, DAX for DynamoDB
  9. Network Optimization: CloudFront for global content, Global Accelerator for static IPs, Direct Connect for dedicated bandwidth
  10. Data Ingestion: Kinesis for real-time streaming, Glue for ETL, Athena for serverless queries, EMR for big data processing

Self-Assessment Checklist

Test yourself before moving on. Can you:

Storage Performance:

  • Choose the right storage service (S3, EBS, EFS, FSx) for different workloads?
  • Select the appropriate EBS volume type (gp3, io2, st1, sc1)?
  • Optimize S3 performance using prefixes and Transfer Acceleration?
  • Configure EFS performance modes (General Purpose, Max I/O)?
  • Use FSx for specialized workloads (Windows, Lustre, NetApp ONTAP)?

Compute Performance:

  • Select the right EC2 instance family (C5, R5, M5, T3, I3, P3)?
  • Configure Lambda memory and timeout for optimal performance?
  • Use EC2 placement groups for low latency (cluster, partition, spread)?
  • Implement Auto Scaling for elastic compute capacity?
  • Choose between EC2, Lambda, and Fargate for different workloads?

Database Performance:

  • Choose the right database service (RDS, Aurora, DynamoDB, ElastiCache)?
  • Configure RDS read replicas for read scaling?
  • Use Aurora for high-performance relational workloads?
  • Select DynamoDB capacity mode (on-demand vs provisioned)?
  • Implement DAX for DynamoDB caching?
  • Use ElastiCache Redis for application caching?

Network Performance:

  • Configure CloudFront for content delivery and edge caching?
  • Use Global Accelerator for static IPs and improved availability?
  • Optimize VPC networking (VPC endpoints, PrivateLink)?
  • Implement Direct Connect for dedicated bandwidth?
  • Choose the right load balancer for performance (ALB, NLB)?

Caching & Data Ingestion:

  • Implement multi-layer caching strategy (CloudFront, ElastiCache, DAX)?
  • Configure appropriate TTL values for different content types?
  • Use Kinesis Data Streams for real-time data ingestion?
  • Implement Glue for ETL and data transformation?
  • Use Athena for serverless SQL queries on S3?
  • Configure EMR for big data processing?

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-50 (Expected score: 70%+ to proceed)
  • Domain 3 Bundle 2: Questions 51-100 (Expected score: 75%+ to proceed)

If you scored below 70%:

  • Review storage service selection criteria
  • Focus on EBS volume types and performance characteristics
  • Study database service selection (RDS, Aurora, DynamoDB)
  • Practice caching strategy design

If you scored 70-80%:

  • Review advanced topics: Lambda optimization, EC2 placement groups
  • Study data ingestion patterns (Kinesis, Glue, Athena)
  • Practice network performance optimization
  • Focus on multi-layer caching strategies

If you scored 80%+:

  • Excellent! You're ready to move to Domain 4
  • Continue practicing with full practice tests
  • Review any specific topics where you made mistakes

Progress Check: You've now completed 80% of the exam content (Domains 1 + 2 + 3). One more domain to go!

Next Steps: Proceed to 05_domain4_cost_optimized_architectures to learn about designing cost-optimized architectures (20% of exam).


Chapter Summary

What We Covered

This chapter explored designing high-performing architectures on AWS, representing 24% of the SAA-C03 exam. We covered five major task areas:

Task 3.1: Determine High-Performing Storage Solutions

  • āœ… S3 storage classes and performance optimization
  • āœ… EBS volume types (gp3, io2, st1, sc1) and use cases
  • āœ… EFS performance modes and throughput modes
  • āœ… FSx file systems (Windows, Lustre, NetApp ONTAP, OpenZFS)
  • āœ… Hybrid storage with Storage Gateway and DataSync
  • āœ… S3 Transfer Acceleration and multipart upload

Task 3.2: Design High-Performing Compute Solutions

  • āœ… EC2 instance families and types selection
  • āœ… Auto Scaling policies (target tracking, step, scheduled)
  • āœ… Lambda performance optimization (memory, concurrency)
  • āœ… Container orchestration with ECS and EKS
  • āœ… Batch processing with AWS Batch
  • āœ… Big data processing with EMR
  • āœ… Placement groups for low latency

Task 3.3: Determine High-Performing Database Solutions

  • āœ… RDS instance types and storage options
  • āœ… Aurora performance features (Serverless v2, Parallel Query)
  • āœ… DynamoDB capacity modes and DAX caching
  • āœ… ElastiCache (Redis vs Memcached) for caching
  • āœ… Read replicas for read scaling
  • āœ… RDS Proxy for connection pooling
  • āœ… Database engine selection and optimization

Task 3.4: Determine High-Performing Network Architectures

  • āœ… CloudFront for content delivery and edge caching
  • āœ… Global Accelerator for global traffic management
  • āœ… Direct Connect for dedicated network connections
  • āœ… VPC design for optimal performance
  • āœ… Load balancing strategies (ALB, NLB, GLB)
  • āœ… VPC endpoints for private connectivity
  • āœ… Enhanced networking and placement groups

Task 3.5: Determine High-Performing Data Ingestion and Transformation

  • āœ… Kinesis Data Streams for real-time streaming
  • āœ… Kinesis Data Firehose for data delivery
  • āœ… Glue for ETL and data cataloging
  • āœ… Athena for serverless SQL queries
  • āœ… EMR for big data processing
  • āœ… Lake Formation for data lake management
  • āœ… QuickSight for data visualization

Critical Takeaways

Storage Performance Principles:

  1. Match Storage to Workload: Use gp3 for general purpose, io2 for high IOPS, st1 for throughput-intensive
  2. S3 Performance: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
  3. EFS Modes: Bursting for variable workloads, Provisioned for consistent throughput
  4. Caching Layers: CloudFront for static content, ElastiCache for dynamic data
  5. Multipart Upload: Required for objects >5GB, recommended for >100MB

Compute Performance Optimization:

  • Instance Selection: Match instance family to workload (compute, memory, storage, GPU)
  • Lambda Memory: More memory = more CPU, test to find optimal configuration
  • Placement Groups: Cluster for low latency, Spread for high availability, Partition for distributed systems
  • Auto Scaling: Use target tracking for most cases, step scaling for complex scenarios
  • Containers: Use Fargate for simplicity, EC2 for control and cost optimization

Database Performance Strategies:

  • Read Replicas: Offload read traffic, up to 15 replicas for Aurora
  • Caching: ElastiCache for frequently accessed data, DAX for DynamoDB
  • Connection Pooling: RDS Proxy reduces connection overhead
  • Partition Keys: Design DynamoDB partition keys for even distribution
  • Aurora Features: Parallel Query for analytics, Serverless v2 for variable workloads

Network Performance Optimization:

  • CloudFront: Edge caching reduces latency, origin shield reduces origin load
  • Global Accelerator: Static anycast IPs, automatic failover, health checks
  • Direct Connect: Consistent network performance, lower latency than internet
  • Enhanced Networking: SR-IOV for higher PPS, lower latency, lower jitter
  • VPC Endpoints: Eliminate internet gateway, reduce latency and data transfer costs

Data Ingestion Best Practices:

  • Kinesis Streams: Real-time processing, multiple consumers, 24-hour retention (up to 365 days)
  • Kinesis Firehose: Near real-time delivery to S3, Redshift, Elasticsearch, Splunk
  • Glue: Serverless ETL, data catalog, crawlers for schema discovery
  • Athena: Serverless SQL on S3, partition data for better performance
  • EMR: Managed Hadoop/Spark for big data processing

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Performance:

  • Select appropriate EBS volume type based on IOPS and throughput requirements
  • Choose S3 storage class based on access patterns and cost
  • Configure EFS performance and throughput modes
  • Implement S3 Transfer Acceleration for global uploads
  • Use multipart upload for large objects
  • Select appropriate FSx file system for workload
  • Design hybrid storage architecture with Storage Gateway

Compute Performance:

  • Select EC2 instance family and type for specific workloads
  • Configure Lambda memory and timeout for optimal performance
  • Implement Auto Scaling with appropriate policies
  • Choose between ECS and EKS for container orchestration
  • Use placement groups for low-latency applications
  • Configure AWS Batch for batch processing workloads
  • Optimize EMR clusters for big data processing

Database Performance:

  • Select appropriate RDS instance type and storage
  • Configure Aurora for high performance (Serverless v2, Parallel Query)
  • Design DynamoDB partition keys for even distribution
  • Implement ElastiCache for application caching
  • Use DAX for DynamoDB caching
  • Configure read replicas for read scaling
  • Implement RDS Proxy for connection pooling

Network Performance:

  • Configure CloudFront for content delivery
  • Implement Global Accelerator for global applications
  • Design VPC architecture for optimal performance
  • Select appropriate load balancer (ALB, NLB, GLB)
  • Use VPC endpoints for private connectivity
  • Configure Direct Connect for hybrid connectivity
  • Enable enhanced networking for high-performance instances

Data Ingestion and Analytics:

  • Design real-time streaming architecture with Kinesis
  • Configure Kinesis Firehose for data delivery
  • Use Glue for ETL and data cataloging
  • Query data in S3 with Athena
  • Process big data with EMR
  • Build data lakes with Lake Formation
  • Create dashboards with QuickSight

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 3 Bundle 1: Questions 1-20 (storage types, compute basics, database fundamentals)
  • Storage Services Bundle: Questions 1-15 (S3, EBS, EFS basics)
  • Compute Services Bundle: Questions 1-15 (EC2, Lambda basics)

Intermediate Level (Target: 70%+ correct):

  • Domain 3 Bundle 2: Questions 21-40 (performance optimization, caching, scaling)
  • Database Services Bundle: Questions 1-25 (RDS, Aurora, DynamoDB, ElastiCache)
  • Full Practice Test 1: Domain 3 questions (mixed difficulty)

Advanced Level (Target: 60%+ correct):

  • Full Practice Test 2: Domain 3 questions (complex performance scenarios)
  • Full Practice Test 3: Domain 3 questions (data ingestion and analytics)

If you scored below target:

  • Below 60%: Review storage types, compute options, and database fundamentals
  • 60-70%: Focus on performance optimization techniques and caching strategies
  • 70-80%: Study advanced features (Aurora Parallel Query, DynamoDB DAX, placement groups)
  • Above 80%: Excellent! Move to next domain

Quick Reference Card

Copy this to your notes for quick review:

EBS Volume Types

Type Use Case Max IOPS Max Throughput Cost
gp3 General purpose 16,000 1,000 MB/s Low
gp2 General purpose (legacy) 16,000 250 MB/s Low
io2 High performance, mission-critical 64,000 1,000 MB/s High
io2 Block Express Highest performance 256,000 4,000 MB/s Highest
st1 Throughput-optimized (big data) 500 500 MB/s Medium
sc1 Cold HDD (infrequent access) 250 250 MB/s Lowest

EC2 Instance Families

  • C: Compute-optimized (CPU-intensive workloads)
  • M: General purpose (balanced compute, memory, networking)
  • R: Memory-optimized (in-memory databases, caching)
  • X: Memory-optimized (large-scale in-memory applications)
  • I: Storage-optimized (high IOPS, NVMe SSD)
  • D: Storage-optimized (high sequential throughput, HDD)
  • G: GPU instances (machine learning, graphics)
  • P: GPU instances (deep learning, HPC)
  • T: Burstable (variable workloads, development)

Database Caching Strategies

Solution Use Case Latency Complexity
CloudFront Static content, API responses Lowest (edge) Low
ElastiCache Redis Session store, leaderboards, pub/sub Low (in-memory) Medium
ElastiCache Memcached Simple caching, horizontal scaling Low (in-memory) Low
DAX DynamoDB caching Microseconds Low
RDS Read Replicas Read scaling, reporting Medium (network) Medium

CloudFront vs Global Accelerator

Feature CloudFront Global Accelerator
Purpose Content delivery Application acceleration
Protocol HTTP/HTTPS TCP/UDP
Caching Yes (edge caching) No (proxying)
Static IP No Yes (2 anycast IPs)
Use Case Static/dynamic content Non-HTTP applications, gaming

Kinesis Services Comparison

Service Use Case Latency Consumers Retention
Data Streams Real-time processing Real-time Multiple 24h-365d
Data Firehose Data delivery Near real-time (60s) Single destination None
Data Analytics SQL on streams Real-time N/A N/A
Video Streams Video ingestion Real-time Multiple Configurable

Performance Optimization Checklist

  • āœ… Use appropriate storage type for workload (gp3, io2, st1, sc1)
  • āœ… Implement caching at multiple layers (CloudFront, ElastiCache, DAX)
  • āœ… Configure Auto Scaling for elasticity
  • āœ… Use read replicas for read-heavy workloads
  • āœ… Enable enhanced networking for high-performance instances
  • āœ… Use placement groups for low-latency applications
  • āœ… Implement connection pooling with RDS Proxy
  • āœ… Partition data for better query performance (Athena, DynamoDB)

Common Exam Scenarios

  • Scenario: High IOPS database → Solution: io2 or io2 Block Express EBS volumes
  • Scenario: Reduce database load → Solution: ElastiCache or DAX for caching
  • Scenario: Global content delivery → Solution: CloudFront with edge locations
  • Scenario: Low-latency HPC → Solution: Cluster placement group with enhanced networking
  • Scenario: Variable Lambda workload → Solution: Provisioned concurrency for predictable latency
  • Scenario: Read-heavy database → Solution: Read replicas (up to 15 for Aurora)
  • Scenario: Real-time analytics → Solution: Kinesis Data Streams + Lambda or Kinesis Data Analytics
  • Scenario: Large file uploads → Solution: S3 multipart upload + Transfer Acceleration

Next Chapter: 05_domain4_cost_optimized_architectures - Design Cost-Optimized Architectures (20% of exam)

Chapter Summary

What We Covered

This chapter covered Domain 3: Design High-Performing Architectures (24% of the exam), focusing on five critical task areas:

āœ… Task 3.1: Determine high-performing and/or scalable storage solutions

  • S3 storage classes and performance optimization
  • EBS volume types (gp3, io2, st1, sc1) and performance tuning
  • EFS performance modes and throughput modes
  • FSx file systems for specialized workloads
  • Storage Gateway for hybrid cloud storage
  • S3 Transfer Acceleration and multipart upload
  • DataSync for large-scale data migration

āœ… Task 3.2: Design high-performing and elastic compute solutions

  • EC2 instance types and families (compute, memory, storage optimized)
  • Placement groups for low-latency applications
  • Auto Scaling for elastic compute capacity
  • Lambda performance optimization (memory, concurrency, provisioned concurrency)
  • ECS and EKS for container orchestration
  • Fargate for serverless containers
  • Batch for large-scale batch processing
  • EMR for big data processing

āœ… Task 3.3: Determine high-performing database solutions

  • RDS instance types and storage optimization
  • Aurora performance features (parallel query, serverless)
  • DynamoDB capacity modes and partition key design
  • ElastiCache for caching (Redis vs Memcached)
  • DynamoDB Accelerator (DAX) for microsecond latency
  • RDS Proxy for connection pooling
  • Read replicas for read-heavy workloads
  • Database performance monitoring with Performance Insights

āœ… Task 3.4: Determine high-performing and/or scalable network architectures

  • CloudFront for global content delivery
  • Global Accelerator for low-latency global access
  • VPC design for optimal network performance
  • Direct Connect for dedicated network connections
  • Load balancing strategies (ALB, NLB, GLB)
  • VPC endpoints for private connectivity
  • Enhanced networking for high-performance instances

āœ… Task 3.5: Determine high-performing data ingestion and transformation solutions

  • Kinesis Data Streams for real-time data ingestion
  • Kinesis Data Firehose for data delivery
  • Kinesis Data Analytics for real-time analytics
  • AWS Glue for ETL and data cataloging
  • Athena for serverless SQL queries on S3
  • EMR for big data processing frameworks
  • Lake Formation for data lake management
  • QuickSight for business intelligence

Critical Takeaways

Performance is about choosing the right tool for the job:

  • Storage: Match storage type to access patterns (frequent vs infrequent, sequential vs random)
  • Compute: Choose instance type based on workload characteristics (CPU, memory, network)
  • Database: Select database engine based on data model and access patterns
  • Network: Use edge services (CloudFront, Global Accelerator) for global performance
  • Caching: Implement caching at multiple layers to reduce latency

Key Performance Principles:

  1. Right-Sizing: Choose appropriate resource sizes based on actual workload needs
  2. Caching: Cache at multiple layers (CloudFront, ElastiCache, DAX, application)
  3. Parallelization: Use parallel processing for large-scale workloads
  4. Proximity: Place resources close to users (edge locations, regional endpoints)
  5. Monitoring: Continuously monitor performance metrics and optimize

Most Important Services to Master:

  • S3: Storage classes, Transfer Acceleration, multipart upload
  • EBS: Volume types (gp3, io2), IOPS provisioning
  • Lambda: Memory configuration, concurrency limits, provisioned concurrency
  • ElastiCache: Redis vs Memcached, cluster mode
  • CloudFront: Edge caching, origin shield, signed URLs
  • RDS/Aurora: Read replicas, Performance Insights, RDS Proxy
  • DynamoDB: Partition key design, GSI, DAX

Common Exam Patterns:

  • Questions about high IOPS → io2 or io2 Block Express EBS volumes
  • Questions about caching → ElastiCache (Redis for complex data, Memcached for simple)
  • Questions about global content delivery → CloudFront with edge locations
  • Questions about low-latency compute → Cluster placement group + enhanced networking
  • Questions about read-heavy database → Read replicas (up to 15 for Aurora)
  • Questions about real-time analytics → Kinesis Data Streams + Lambda or Analytics
  • Questions about large file uploads → S3 multipart upload + Transfer Acceleration

Self-Assessment Checklist

Test yourself before moving to the next chapter. You should be able to:

Storage Performance

  • Choose appropriate S3 storage class based on access patterns
  • Select correct EBS volume type for workload (gp3, io2, st1, sc1)
  • Configure EFS performance mode and throughput mode
  • Decide when to use FSx for Windows, Lustre, or NetApp ONTAP
  • Implement S3 Transfer Acceleration for global uploads
  • Use S3 multipart upload for large files
  • Configure Storage Gateway for hybrid cloud storage
  • Optimize storage performance with proper configuration

Compute Performance

  • Select appropriate EC2 instance type for workload
  • Use placement groups for low-latency applications
  • Configure Lambda memory for optimal performance
  • Implement Lambda provisioned concurrency for predictable latency
  • Choose between ECS and EKS for container workloads
  • Decide when to use Fargate vs EC2 launch type
  • Configure Auto Scaling for elastic compute capacity
  • Use Batch for large-scale batch processing

Database Performance

  • Choose appropriate RDS instance type and storage
  • Configure Aurora for high performance (parallel query, serverless)
  • Design DynamoDB partition keys for even distribution
  • Implement ElastiCache for database caching
  • Use DAX for DynamoDB microsecond latency
  • Configure RDS Proxy for connection pooling
  • Set up read replicas for read-heavy workloads
  • Monitor database performance with Performance Insights

Network Performance

  • Configure CloudFront for global content delivery
  • Use Global Accelerator for low-latency global access
  • Design VPC for optimal network performance
  • Implement Direct Connect for dedicated connectivity
  • Choose appropriate load balancer (ALB, NLB, GLB)
  • Use VPC endpoints to reduce latency and cost
  • Enable enhanced networking for high-performance instances

Data Ingestion and Analytics

  • Configure Kinesis Data Streams for real-time ingestion
  • Use Kinesis Data Firehose for data delivery to S3/Redshift
  • Implement Kinesis Data Analytics for real-time SQL analytics
  • Design ETL pipelines with AWS Glue
  • Query S3 data with Athena
  • Use EMR for big data processing (Spark, Hadoop)
  • Build data lakes with Lake Formation
  • Create dashboards with QuickSight

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-30 (Storage and compute performance)
  • Domain 3 Bundle 2: Questions 31-60 (Database and network performance)
  • Storage Services Bundle: All questions (S3, EBS, EFS, FSx)
  • Database Services Bundle: All questions (RDS, Aurora, DynamoDB, ElastiCache)
  • Compute Services Bundle: Questions on EC2, Lambda, ECS, EKS

Expected Score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you struggled, then retry
  • Below 60%: Re-read this entire chapter, focusing on performance comparisons
  • Focus on understanding when to use each service based on requirements

Quick Reference Card

Copy this to your notes for quick review:

Storage Performance Quick Facts

  • S3 Standard: Frequent access, millisecond latency, 99.99% availability
  • S3 Intelligent-Tiering: Unknown access patterns, automatic optimization
  • gp3: General purpose SSD, 3,000-16,000 IOPS, 125-1,000 MB/s
  • io2: High-performance SSD, up to 64,000 IOPS, 1,000 MB/s, 99.999% durability
  • EFS: Shared file system, automatic scaling, multiple performance modes
  • FSx Lustre: HPC workloads, sub-millisecond latency, 100s GB/s throughput

Compute Performance Quick Facts

  • Compute Optimized (C): High-performance processors, compute-intensive workloads
  • Memory Optimized (R, X): Large datasets in memory, in-memory databases
  • Storage Optimized (I, D): High sequential read/write, data warehousing
  • Placement Groups: Cluster (low latency), Spread (high availability), Partition (distributed)
  • Lambda: 128 MB - 10 GB memory, scales with memory, 15-minute timeout
  • Provisioned Concurrency: Pre-warmed Lambda instances, predictable latency

Database Performance Quick Facts

  • RDS: Managed relational database, Multi-AZ, read replicas, Performance Insights
  • Aurora: 5x MySQL, 3x PostgreSQL, up to 15 read replicas, <30s failover
  • Aurora Serverless: Auto-scaling, pay per second, good for variable workloads
  • DynamoDB: NoSQL, single-digit millisecond latency, auto-scaling
  • DAX: DynamoDB cache, microsecond latency, 10x performance improvement
  • ElastiCache Redis: Complex data structures, persistence, replication
  • ElastiCache Memcached: Simple key-value, multi-threaded, no persistence

Network Performance Quick Facts

  • CloudFront: 450+ edge locations, origin shield, field-level encryption
  • Global Accelerator: Static anycast IPs, 2 IPs for all regions, health checks
  • Direct Connect: 1 Gbps - 100 Gbps, dedicated connection, consistent latency
  • Enhanced Networking: SR-IOV, up to 100 Gbps, lower latency, higher PPS
  • VPC Endpoints: Private connectivity, no internet gateway, reduced latency

Data Ingestion Quick Facts

  • Kinesis Data Streams: Real-time, 1 MB/s per shard, 24h-365d retention
  • Kinesis Data Firehose: Near real-time (60s), auto-scaling, S3/Redshift delivery
  • Kinesis Data Analytics: SQL on streams, real-time analytics
  • Glue: Serverless ETL, data catalog, crawlers for schema discovery
  • Athena: Serverless SQL on S3, pay per query, Presto-based
  • EMR: Managed Hadoop/Spark, big data processing, auto-scaling

Decision Points

  • High IOPS database → io2 or io2 Block Express EBS volumes
  • Reduce database load → ElastiCache or DAX for caching
  • Global content delivery → CloudFront with edge locations
  • Low-latency HPC → Cluster placement group with enhanced networking
  • Variable Lambda workload → Provisioned concurrency for predictable latency
  • Read-heavy database → Read replicas (up to 15 for Aurora)
  • Real-time analytics → Kinesis Data Streams + Lambda or Kinesis Data Analytics
  • Large file uploads → S3 multipart upload + Transfer Acceleration

Congratulations! You've completed Domain 3: Design High-Performing Architectures. Performance optimization is critical for real-world applications, and this domain (24% of the exam) tests your ability to choose the right services for optimal performance.

Next Chapter: 05_domain4_cost_optimized_architectures - Design Cost-Optimized Architectures (20% of exam)


Chapter Summary

What We Covered

This chapter covered the five major task areas of Domain 3: Design High-Performing Architectures (24% of exam):

Task 3.1: Determine High-Performing Storage Solutions

  • āœ… S3 storage classes and performance optimization
  • āœ… EBS volume types (gp3, io2, st1, sc1)
  • āœ… EFS performance modes and throughput modes
  • āœ… FSx file systems (Windows, Lustre, NetApp ONTAP)
  • āœ… Storage Gateway for hybrid storage
  • āœ… S3 Transfer Acceleration and multipart upload

Task 3.2: Design High-Performing Compute Solutions

  • āœ… EC2 instance types and families
  • āœ… Placement groups for low latency
  • āœ… Auto Scaling policies (target tracking, step, predictive)
  • āœ… Lambda memory and concurrency optimization
  • āœ… ECS and EKS capacity providers
  • āœ… Batch for large-scale batch processing
  • āœ… EMR for big data analytics

Task 3.3: Determine High-Performing Database Solutions

  • āœ… RDS instance types and storage options
  • āœ… Aurora Serverless and Aurora Global Database
  • āœ… DynamoDB capacity modes and DAX caching
  • āœ… ElastiCache (Redis vs Memcached)
  • āœ… Read replicas for read scaling
  • āœ… RDS Proxy for connection pooling
  • āœ… Database caching strategies

Task 3.4: Determine High-Performing Network Architectures

  • āœ… CloudFront for global content delivery
  • āœ… Global Accelerator for static anycast IPs
  • āœ… Direct Connect for dedicated connectivity
  • āœ… Enhanced networking and placement groups
  • āœ… VPC endpoints for private connectivity
  • āœ… Load balancer selection and optimization

Task 3.5: Determine High-Performing Data Ingestion and Transformation

  • āœ… Kinesis Data Streams for real-time streaming
  • āœ… Kinesis Data Firehose for near real-time delivery
  • āœ… Kinesis Data Analytics for stream processing
  • āœ… Glue for serverless ETL
  • āœ… Athena for serverless SQL on S3
  • āœ… EMR for big data processing
  • āœ… Lake Formation for data lake management

Critical Takeaways

  1. Match Storage to Workload: Use gp3 for general purpose, io2 for high IOPS databases, st1 for throughput-intensive workloads, and sc1 for cold data.

  2. Cache Aggressively: Implement caching at multiple layers (CloudFront, ElastiCache, DAX) to reduce latency and database load.

  3. Choose the Right Compute: Use Lambda for event-driven, Fargate for containers without management, EC2 for full control, and Batch for large-scale batch jobs.

  4. Database Performance: Use read replicas for read scaling, Aurora for best performance, DynamoDB for single-digit millisecond latency, and caching for frequently accessed data.

  5. Global Performance: Use CloudFront for content delivery, Global Accelerator for static IPs and health checks, and multi-region deployments for global applications.

  6. Network Optimization: Use Direct Connect for consistent low latency, Enhanced Networking for high throughput, and VPC endpoints to avoid internet traffic.

  7. Real-Time Processing: Use Kinesis Data Streams for real-time analytics, Firehose for near real-time delivery, and Lambda for stream processing.

  8. Right-Size Everything: Use Compute Optimizer, Performance Insights, and CloudWatch metrics to continuously optimize resource sizing.

Self-Assessment Checklist

Test yourself before moving on. Can you:

Storage Performance

  • Choose the appropriate EBS volume type for different workloads?
  • Explain when to use EFS vs FSx vs S3?
  • Optimize S3 performance with multipart upload and Transfer Acceleration?
  • Select the right S3 storage class for access patterns?
  • Configure EFS performance and throughput modes?
  • Use Storage Gateway for hybrid storage scenarios?

Compute Performance

  • Select the appropriate EC2 instance type for workloads?
  • Configure placement groups for low-latency applications?
  • Implement Auto Scaling with appropriate policies?
  • Optimize Lambda memory and concurrency settings?
  • Choose between ECS and EKS for container workloads?
  • Use Batch for large-scale batch processing?

Database Performance

  • Choose between RDS, Aurora, and DynamoDB?
  • Configure read replicas for read scaling?
  • Implement database caching with ElastiCache or DAX?
  • Use RDS Proxy for connection pooling?
  • Optimize DynamoDB with partition key design?
  • Select appropriate database capacity modes?

Network Performance

  • Configure CloudFront for global content delivery?
  • Use Global Accelerator for static anycast IPs?
  • Implement Direct Connect for dedicated connectivity?
  • Enable Enhanced Networking for high throughput?
  • Choose the appropriate load balancer type?
  • Use VPC endpoints for private connectivity?

Data Ingestion and Analytics

  • Design real-time streaming architectures with Kinesis?
  • Use Glue for serverless ETL jobs?
  • Query S3 data with Athena?
  • Process big data with EMR?
  • Build data lakes with Lake Formation?

Practice Questions

Try these from your practice test bundles:

Beginner Level (Build Confidence):

  • Domain 3 Bundle 1: Questions 1-20
  • Storage Services Bundle: Questions 1-15
  • Expected score: 70%+ to proceed

Intermediate Level (Test Understanding):

  • Domain 3 Bundle 2: Questions 1-20
  • Compute Services Bundle: Questions 1-15
  • Database Services Bundle: Questions 1-15
  • Expected score: 75%+ to proceed

Advanced Level (Challenge Yourself):

  • Full Practice Test 2: Domain 3 questions
  • Expected score: 70%+ to proceed

If you scored below target:

  • Below 60%: Review storage and compute fundamentals
  • 60-70%: Focus on database and network optimization
  • 70-80%: Review quick facts and decision points
  • 80%+: Outstanding! Move to next domain

Quick Reference Card

Copy this to your notes for quick review:

Storage Performance

  • gp3: 3,000-16,000 IOPS, 125-1,000 MB/s, general purpose
  • io2: Up to 64,000 IOPS, 1,000 MB/s, high-performance databases
  • io2 Block Express: Up to 256,000 IOPS, 4,000 MB/s, largest databases
  • st1: 500 IOPS, 500 MB/s, throughput-intensive (big data, data warehouses)
  • sc1: 250 IOPS, 250 MB/s, cold data, lowest cost

Compute Performance

  • General Purpose: t3, t4g (burstable), m5, m6g (balanced)
  • Compute Optimized: c5, c6g (high CPU, batch processing, gaming)
  • Memory Optimized: r5, r6g, x1e (in-memory databases, big data)
  • Storage Optimized: i3, d2 (NoSQL, data warehouses, Hadoop)
  • Accelerated Computing: p3, p4 (ML training), g4 (ML inference, graphics)

Database Performance

  • RDS: Managed relational, Multi-AZ, read replicas, up to 64 TB
  • Aurora: 5x MySQL, 3x PostgreSQL, 128 TB, 15 read replicas
  • Aurora Serverless: Auto-scaling, pay per second, intermittent workloads
  • DynamoDB: Single-digit ms latency, unlimited scale, DAX for caching
  • ElastiCache Redis: In-memory, persistence, replication, pub/sub
  • ElastiCache Memcached: In-memory, multi-threaded, simple caching

Network Performance

  • CloudFront: 450+ edge locations, origin shield, field-level encryption
  • Global Accelerator: Static anycast IPs, health checks, DDoS protection
  • Direct Connect: 1-100 Gbps, dedicated connection, consistent latency
  • Enhanced Networking: SR-IOV, up to 100 Gbps, lower latency
  • VPC Endpoints: Private connectivity, no internet, reduced latency

Data Ingestion

  • Kinesis Data Streams: Real-time, 1 MB/s per shard, 24h-365d retention
  • Kinesis Firehose: Near real-time (60s), auto-scaling, S3/Redshift delivery
  • Kinesis Analytics: SQL on streams, real-time analytics
  • Glue: Serverless ETL, data catalog, crawlers
  • Athena: Serverless SQL on S3, pay per query
  • EMR: Managed Hadoop/Spark, big data, auto-scaling

Key Decision Points

Scenario Solution
High IOPS database io2 or io2 Block Express EBS
Reduce database load ElastiCache or DAX caching
Global content delivery CloudFront with edge locations
Low-latency HPC Cluster placement group + enhanced networking
Variable Lambda workload Provisioned concurrency
Read-heavy database Read replicas (up to 15 for Aurora)
Real-time analytics Kinesis Data Streams + Lambda
Large file uploads S3 multipart + Transfer Acceleration

Chapter Summary

What We Covered

This chapter explored Design High-Performing Architectures (24% of the exam), covering five major task areas:

āœ… Task 3.1: Determine high-performing storage solutions

  • S3 storage classes and performance optimization
  • EBS volume types (gp3, io2, st1, sc1)
  • EFS performance modes and throughput modes
  • FSx file systems (Windows, Lustre, NetApp ONTAP)
  • Hybrid storage with Storage Gateway and DataSync

āœ… Task 3.2: Design high-performing compute solutions

  • EC2 instance families and types
  • Auto Scaling policies and strategies
  • Lambda memory and concurrency optimization
  • Container orchestration with ECS and EKS
  • Batch processing with AWS Batch
  • Big data with EMR

āœ… Task 3.3: Determine high-performing database solutions

  • RDS instance types and storage options
  • Aurora performance features (Serverless, Parallel Query)
  • DynamoDB capacity modes and DAX caching
  • ElastiCache (Redis vs Memcached)
  • Database read replicas and connection pooling

āœ… Task 3.4: Determine high-performing network architectures

  • CloudFront edge caching and optimization
  • Global Accelerator for static anycast IPs
  • Direct Connect for dedicated connectivity
  • VPC design and endpoint optimization
  • Load balancer selection and configuration

āœ… Task 3.5: Determine high-performing data ingestion and transformation

  • Kinesis Data Streams for real-time ingestion
  • Kinesis Firehose for near real-time delivery
  • Glue for serverless ETL
  • Athena for serverless SQL on S3
  • EMR for big data processing

Critical Takeaways

  1. Storage Performance: Use gp3 for general purpose (16,000 IOPS), io2 Block Express for extreme performance (256,000 IOPS), EFS for shared access.

  2. Compute Optimization: Choose instance types based on workload (compute-optimized for CPU, memory-optimized for RAM, storage-optimized for I/O).

  3. Database Performance: Use read replicas for read-heavy workloads, Aurora for high performance, DynamoDB for single-digit millisecond latency.

  4. Caching Layers: Implement caching at multiple layers (CloudFront for content, ElastiCache for data, DAX for DynamoDB) to reduce latency.

  5. Network Performance: Use CloudFront for global content delivery, Global Accelerator for static IPs, Direct Connect for consistent low latency.

  6. Lambda Optimization: Increase memory to get more CPU, use provisioned concurrency for predictable latency, optimize cold starts.

  7. Real-Time Processing: Use Kinesis Data Streams for real-time (sub-second), Firehose for near real-time (60s), Glue for batch ETL.

  8. Placement Groups: Use cluster placement for HPC (low latency), spread for critical instances (different hardware), partition for distributed systems.

Self-Assessment Checklist

Test yourself before moving on:

  • I can select the appropriate EBS volume type for different workloads
  • I understand when to use EFS vs FSx vs S3
  • I can choose the right EC2 instance family for a workload
  • I know how to optimize Lambda performance and cost
  • I understand database performance tuning (read replicas, caching)
  • I can design a multi-layer caching strategy
  • I know when to use CloudFront vs Global Accelerator
  • I understand Kinesis Data Streams vs Firehose
  • I can optimize S3 performance for high throughput
  • I know how to select the right database for performance requirements

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-50 (Expected score: 70%+)
  • Domain 3 Bundle 2: Questions 1-50 (Expected score: 70%+)
  • Storage Services Bundle: Questions 1-50 (Expected score: 70%+)
  • Database Services Bundle: Questions 1-50 (Expected score: 70%+)
  • Full Practice Test 1: Domain 3 questions (Expected score: 75%+)

If you scored below 70%:

  • Review storage performance characteristics
  • Focus on compute instance type selection
  • Study database performance optimization techniques
  • Practice caching strategy design

Quick Reference Card

Storage Performance:

  • gp3: 3,000-16,000 IOPS, 125-1,000 MB/s, $0.08/GB-month
  • io2: 64,000 IOPS, 1,000 MB/s, 99.999% durability
  • io2 Block Express: 256,000 IOPS, 4,000 MB/s, sub-millisecond latency
  • EFS: Shared access, auto-scaling, bursting or provisioned throughput
  • FSx Lustre: HPC, 100s GB/s, sub-millisecond latency

Compute Performance:

  • C-family: Compute-optimized (CPU-intensive)
  • M-family: General purpose (balanced)
  • R-family: Memory-optimized (RAM-intensive)
  • I-family: Storage-optimized (I/O-intensive)
  • Lambda: 128 MB-10 GB memory, scales with CPU

Database Performance:

  • Aurora: 5x MySQL, 3x PostgreSQL, 15 read replicas, <30s failover
  • Aurora Serverless: Auto-scaling, pay per second
  • DynamoDB: Single-digit millisecond, unlimited throughput
  • DAX: DynamoDB cache, microsecond latency, 10x performance
  • ElastiCache Redis: In-memory, persistence, replication
  • ElastiCache Memcached: In-memory, multi-threaded, simple caching

Network Performance:

  • CloudFront: 450+ edge locations, origin shield, field-level encryption
  • Global Accelerator: Static anycast IPs, health checks, DDoS protection
  • Direct Connect: 1-100 Gbps, dedicated connection, consistent latency
  • Enhanced Networking: SR-IOV, up to 100 Gbps, lower latency
  • VPC Endpoints: Private connectivity, no internet, reduced latency

Data Ingestion:

  • Kinesis Data Streams: Real-time, 1 MB/s per shard, 24h-365d retention
  • Kinesis Firehose: Near real-time (60s), auto-scaling, S3/Redshift delivery
  • Kinesis Analytics: SQL on streams, real-time analytics
  • Glue: Serverless ETL, data catalog, crawlers
  • Athena: Serverless SQL on S3, pay per query
  • EMR: Managed Hadoop/Spark, big data, auto-scaling

Decision Points:

  • Need high IOPS? → io2 or io2 Block Express EBS
  • Need to reduce database load? → ElastiCache or DAX caching
  • Need global content delivery? → CloudFront with edge locations
  • Need low-latency HPC? → Cluster placement group + enhanced networking
  • Need variable Lambda workload? → Provisioned concurrency
  • Need read-heavy database? → Read replicas (up to 15 for Aurora)
  • Need real-time analytics? → Kinesis Data Streams + Lambda
  • Need large file uploads? → S3 multipart + Transfer Acceleration

Next Chapter: Proceed to 05_domain4_cost_optimized_architectures to learn about designing cost-optimized architectures.

Chapter Summary

What We Covered

This chapter covered high-performance architecture design, representing 24% of the exam content. You learned:

  • āœ… Storage Performance: S3, EBS, EFS, FSx, and storage optimization techniques
  • āœ… Compute Performance: EC2 instance types, Lambda optimization, and container performance
  • āœ… Database Performance: RDS, Aurora, DynamoDB, ElastiCache, and caching strategies
  • āœ… Network Performance: CloudFront, Global Accelerator, Direct Connect, and network optimization
  • āœ… Data Ingestion: Kinesis, Glue, Athena, EMR, and real-time analytics
  • āœ… Performance Monitoring: CloudWatch, X-Ray, and performance troubleshooting

Critical Takeaways

  1. Choose the Right Storage: Match storage type to access pattern - S3 for objects, EBS for block, EFS for shared file, FSx for specialized workloads
  2. Optimize Compute: Use appropriate instance types (compute-optimized for CPU, memory-optimized for RAM), placement groups for HPC, and provisioned concurrency for Lambda
  3. Cache Aggressively: Implement caching at multiple layers (CloudFront edge, ElastiCache/DAX, application) to reduce latency and database load
  4. Scale Databases Properly: Use read replicas for read-heavy workloads, Aurora for high performance, DynamoDB for massive scale
  5. Leverage Edge Services: Use CloudFront for global content delivery, Global Accelerator for static IPs and health checks
  6. Monitor and Optimize: Use CloudWatch metrics, X-Ray tracing, and Compute Optimizer recommendations to continuously improve performance

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Performance:

  • Choose between S3 storage classes based on access patterns
  • Select appropriate EBS volume types (gp3, io2, st1, sc1)
  • Configure EFS performance modes (General Purpose vs Max I/O)
  • Implement S3 Transfer Acceleration for global uploads
  • Use S3 multipart upload for large files (>100 MB)

Compute Performance:

  • Select EC2 instance families based on workload (C, M, R, T, etc.)
  • Configure placement groups for low-latency HPC workloads
  • Optimize Lambda memory and timeout settings
  • Use provisioned concurrency for consistent Lambda performance
  • Choose between ECS and EKS for container workloads

Database Performance:

  • Configure RDS read replicas for read scaling
  • Choose between Aurora and RDS based on requirements
  • Design DynamoDB partition keys for even distribution
  • Implement ElastiCache or DAX for database caching
  • Use RDS Proxy for connection pooling

Network Performance:

  • Configure CloudFront with appropriate caching behaviors
  • Use Global Accelerator for static anycast IPs
  • Implement Direct Connect for consistent low latency
  • Choose between ALB and NLB based on performance needs
  • Use VPC endpoints to reduce latency and costs

Data Ingestion & Analytics:

  • Design streaming architectures with Kinesis Data Streams
  • Use Kinesis Firehose for near real-time delivery
  • Implement Glue ETL jobs for data transformation
  • Query S3 data with Athena using partitioning
  • Process big data with EMR clusters

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-25 (Storage and compute performance)
  • Domain 3 Bundle 2: Questions 26-50 (Database and network performance)
  • Storage Services Bundle: All questions
  • Database Services Bundle: All questions
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review EBS volume types and when to use each
  • Practice designing caching strategies at multiple layers
  • Focus on understanding database scaling patterns (read replicas, sharding)
  • Revisit CloudFront and Global Accelerator use cases

Quick Reference Card

Storage Performance:

  • S3 Standard: 3,500 PUT/s, 5,500 GET/s per prefix
  • EBS gp3: 16,000 IOPS, 1,000 MB/s, $0.08/GB-month
  • EBS io2: 64,000 IOPS, 1,000 MB/s, 99.999% durability
  • EFS: 10+ GB/s throughput, millions of IOPS
  • FSx Lustre: 100+ GB/s, sub-millisecond latency

Compute Performance:

  • C instances: Compute-optimized (3.5 GHz, high CPU)
  • M instances: General purpose (balanced CPU/memory)
  • R instances: Memory-optimized (up to 24 TB RAM)
  • T instances: Burstable (baseline + burst credits)
  • Lambda: 128 MB - 10 GB memory, 15 min timeout

Database Performance:

  • RDS: Up to 64 TB, 80,000 IOPS, 15 read replicas
  • Aurora: Up to 128 TB, 15 read replicas, <30s failover
  • DynamoDB: Unlimited storage, millions of requests/s
  • ElastiCache Redis: 250+ nodes, 340 GB RAM per node
  • DAX: 10x DynamoDB read performance, microsecond latency

Network Performance:

  • CloudFront: 450+ edge locations, <10ms latency
  • Global Accelerator: Static anycast IPs, 60% performance improvement
  • Direct Connect: 1-100 Gbps, <10ms latency
  • Enhanced Networking: Up to 100 Gbps, SR-IOV
  • VPC Endpoints: Private connectivity, no internet

Caching Strategies:

  1. CloudFront: Edge caching (TTL 0-365 days)
  2. ElastiCache: Application caching (Redis or Memcached)
  3. DAX: DynamoDB caching (microsecond latency)
  4. RDS Read Replicas: Read scaling (up to 15 replicas)
  5. API Gateway: Response caching (0.5 GB - 237 GB)

Data Ingestion:

  • Kinesis Data Streams: 1 MB/s per shard, 1,000 records/s
  • Kinesis Firehose: Auto-scaling, 60s buffer
  • Glue: Serverless ETL, $0.44/DPU-hour
  • Athena: $5/TB scanned, serverless SQL
  • EMR: Managed Hadoop/Spark, auto-scaling

Common Exam Scenarios:

  • Need high IOPS? → io2 or io2 Block Express EBS
  • Need to reduce database load? → ElastiCache or DAX caching
  • Need global content delivery? → CloudFront with edge locations
  • Need low-latency HPC? → Cluster placement group + enhanced networking
  • Need variable Lambda workload? → Provisioned concurrency
  • Need read-heavy database? → Read replicas (up to 15 for Aurora)
  • Need real-time analytics? → Kinesis Data Streams + Lambda
  • Need large file uploads? → S3 multipart + Transfer Acceleration

You're ready to proceed when you can:

  • Select appropriate storage and compute resources for performance requirements
  • Design multi-layer caching strategies to optimize performance
  • Choose the right database solution and scaling strategy
  • Implement global content delivery with CloudFront
  • Troubleshoot performance bottlenecks using CloudWatch and X-Ray

Next: Move to Chapter 4: Cost-Optimized Architectures to learn about cost optimization.


Chapter Summary

What We Covered

This chapter covered the essential concepts for designing high-performing architectures on AWS, which accounts for 24% of the SAA-C03 exam. We explored five major task areas:

Task 3.1: High-Performing Storage Solutions

  • āœ… S3 storage classes and performance optimization
  • āœ… EBS volume types (gp3, io2, st1, sc1) and use cases
  • āœ… EFS performance modes and throughput modes
  • āœ… FSx file systems (Windows, Lustre, NetApp ONTAP)
  • āœ… Storage Gateway for hybrid cloud storage
  • āœ… DataSync for large-scale data migration

Task 3.2: High-Performing Compute Solutions

  • āœ… EC2 instance families and types selection
  • āœ… Placement groups (Cluster, Spread, Partition)
  • āœ… Enhanced networking and ENA
  • āœ… Auto Scaling policies and strategies
  • āœ… Lambda memory and concurrency optimization
  • āœ… ECS and EKS capacity providers
  • āœ… Batch for large-scale batch processing

Task 3.3: High-Performing Database Solutions

  • āœ… RDS instance types and storage optimization
  • āœ… Aurora Serverless and performance features
  • āœ… DynamoDB capacity modes and DAX caching
  • āœ… ElastiCache (Redis vs Memcached)
  • āœ… Database read replicas and replication
  • āœ… RDS Proxy for connection pooling

Task 3.4: High-Performing Network Architectures

  • āœ… CloudFront edge locations and caching
  • āœ… Global Accelerator for global applications
  • āœ… Direct Connect for dedicated connectivity
  • āœ… VPC design and subnet optimization
  • āœ… Load balancer performance characteristics
  • āœ… PrivateLink for private connectivity

Task 3.5: Data Ingestion and Transformation

  • āœ… Kinesis Data Streams for real-time ingestion
  • āœ… Kinesis Firehose for serverless delivery
  • āœ… Glue for ETL and data cataloging
  • āœ… Athena for serverless SQL queries
  • āœ… EMR for big data processing
  • āœ… Lake Formation for data lake management

Critical Takeaways

  1. Storage Performance: Choose gp3 for general purpose (16,000 IOPS), io2 Block Express for extreme performance (256,000 IOPS), EFS for shared file systems.

  2. EBS Optimization: Use gp3 instead of gp2 (20% cheaper, configurable IOPS/throughput), enable EBS optimization on instances, use Fast Snapshot Restore for quick recovery.

  3. S3 Performance: Use multipart upload for files >100 MB, enable Transfer Acceleration for global uploads, implement request rate optimization (3,500 PUT/5,500 GET per prefix).

  4. Compute Selection: Memory-optimized (R/X) for databases, Compute-optimized (C) for batch processing, General purpose (M/T) for web servers, GPU (P/G) for ML/graphics.

  5. Placement Groups: Cluster for low-latency HPC (single AZ), Spread for critical instances (max 7 per AZ), Partition for distributed systems (Hadoop, Cassandra).

  6. Lambda Optimization: More memory = more CPU (1,769 MB = 1 vCPU), use Provisioned Concurrency for consistent latency, optimize package size for faster cold starts.

  7. Database Caching: ElastiCache for general caching, DAX for DynamoDB (microsecond latency), RDS Proxy for connection pooling (reduce connection overhead).

  8. Aurora Performance: Up to 5x faster than MySQL, 3x faster than PostgreSQL, 15 read replicas, automatic failover <30 seconds, parallel query for analytics.

  9. DynamoDB Optimization: Use On-Demand for unpredictable workloads, Provisioned for steady-state (cheaper), design partition keys for even distribution, use GSI for query flexibility.

  10. CloudFront Benefits: Reduce origin load by 60-90%, cache at 450+ edge locations, Origin Shield for additional caching layer, signed URLs for private content.

  11. Global Accelerator: Static anycast IPs, intelligent routing to optimal endpoint, instant regional failover, TCP/UDP support (not just HTTP).

  12. Kinesis Streams: 1 MB/s write per shard, 2 MB/s read per shard, 1,000 records/s per shard, 24-hour default retention (up to 365 days).

  13. Data Format Optimization: Convert CSV to Parquet (10x compression, 100x faster queries), use columnar formats for analytics, partition data by query patterns.

  14. Network Performance: Enhanced networking (25 Gbps), Elastic Fabric Adapter for HPC (100 Gbps), placement groups for low latency (<1 ms).

  15. Monitoring: Use CloudWatch for metrics, X-Ray for distributed tracing, Performance Insights for database bottlenecks, VPC Flow Logs for network analysis.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Performance:

  • Select appropriate EBS volume type based on IOPS and throughput requirements
  • Explain the difference between gp3 and io2 Block Express
  • Choose between EFS and FSx for different file system needs
  • Optimize S3 performance with multipart upload and Transfer Acceleration
  • Design hybrid storage solutions with Storage Gateway

Compute Optimization:

  • Select appropriate EC2 instance family for different workload types
  • Configure placement groups for HPC and distributed applications
  • Optimize Lambda function memory and concurrency settings
  • Choose between ECS on EC2 vs Fargate based on requirements
  • Design Auto Scaling policies for predictable and variable workloads

Database Performance:

  • Select appropriate RDS instance type and storage configuration
  • Explain when to use Aurora vs RDS vs DynamoDB
  • Configure DynamoDB partition keys for even distribution
  • Implement caching with ElastiCache or DAX
  • Design read replica strategy for read-heavy workloads
  • Use RDS Proxy to reduce connection overhead

Network Performance:

  • Configure CloudFront for optimal caching and performance
  • Explain when to use Global Accelerator vs CloudFront
  • Design Direct Connect for hybrid connectivity
  • Select appropriate load balancer based on performance needs
  • Optimize VPC design for high-throughput applications

Data Ingestion:

  • Design Kinesis Data Streams architecture with appropriate shard count
  • Choose between Kinesis Streams and Firehose
  • Configure Glue ETL jobs for data transformation
  • Optimize Athena queries with partitioning and columnar formats
  • Select appropriate EMR instance types for big data processing

Performance Monitoring:

  • Configure CloudWatch metrics and alarms for performance monitoring
  • Use X-Ray for distributed tracing and bottleneck identification
  • Analyze RDS Performance Insights for database optimization
  • Implement VPC Flow Logs for network performance analysis
  • Use Compute Optimizer for right-sizing recommendations

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-25 (Focus: Storage and compute)
  • Domain 3 Bundle 2: Questions 26-50 (Focus: Database and networking)
  • Full Practice Test 2: Domain 3 questions (Mixed difficulty)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review EBS volume types and use cases
  • Focus on database selection criteria (RDS vs Aurora vs DynamoDB)
  • Study CloudFront vs Global Accelerator differences
  • Practice Lambda optimization techniques
  • Review Kinesis architecture and shard calculations

Quick Reference Card

Copy this to your notes for quick review:

Storage Performance:

  • gp3: 3,000-16,000 IOPS, 125-1,000 MB/s, $0.08/GB-month
  • io2: 64,000 IOPS, 1,000 MB/s, 99.999% durability
  • io2 Block Express: 256,000 IOPS, 4,000 MB/s, sub-millisecond latency
  • EFS: Shared file system, auto-scaling, $0.30/GB-month (Standard)
  • FSx Lustre: HPC, 1,000s GB/s, sub-millisecond latency

Compute Families:

  • M (General): Balanced CPU/memory, web servers
  • C (Compute): High CPU, batch processing, gaming
  • R (Memory): High memory, databases, caching
  • X (Memory): Extreme memory, SAP HANA, in-memory DBs
  • P (GPU): ML training, HPC simulations
  • G (Graphics): Graphics workloads, video encoding

Database Performance:

  • Aurora: 5x MySQL, 3x PostgreSQL, 15 read replicas, <30s failover
  • RDS Read Replicas: Up to 15 replicas, async replication
  • DynamoDB: Single-digit millisecond latency, unlimited throughput
  • DAX: Microsecond latency, 10x performance for DynamoDB
  • ElastiCache Redis: Sub-millisecond, persistence, replication
  • ElastiCache Memcached: Sub-millisecond, multi-threaded, no persistence

Network Performance:

  • CloudFront: 450+ edge locations, cache at edge, $0.085/GB
  • Global Accelerator: Static IPs, intelligent routing, TCP/UDP
  • Direct Connect: 1-100 Gbps, private connectivity, consistent latency
  • Enhanced Networking: 25 Gbps, low latency, low jitter
  • EFA: 100 Gbps, HPC, MPI, NCCL

Data Ingestion:

  • Kinesis Streams: 1 MB/s write, 2 MB/s read per shard
  • Kinesis Firehose: Auto-scaling, 60s buffer, serverless
  • Glue: $0.44/DPU-hour, serverless ETL
  • Athena: $5/TB scanned, serverless SQL
  • EMR: Managed Hadoop/Spark, auto-scaling

Caching Layers:

  1. CloudFront: Edge caching (global)
  2. API Gateway: Response caching (regional)
  3. ElastiCache/DAX: Application caching (AZ)
  4. RDS Read Replicas: Read scaling (up to 15 replicas)
  5. DynamoDB DAX: Microsecond caching

Performance Optimization Checklist:

  • Use gp3 instead of gp2 for EBS (20% cheaper)
  • Enable S3 Transfer Acceleration for global uploads
  • Implement CloudFront for static content delivery
  • Use ElastiCache/DAX for frequently accessed data
  • Configure RDS read replicas for read-heavy workloads
  • Use Provisioned Concurrency for Lambda (consistent latency)
  • Enable enhanced networking on EC2 instances
  • Use placement groups for low-latency HPC
  • Convert data to Parquet for analytics (10x compression)
  • Partition data by query patterns in Athena

Congratulations! You've completed Chapter 3: Design High-Performing Architectures. You now understand how to optimize storage, compute, database, network, and data ingestion for maximum performance on AWS.

Next Steps:

  1. Complete the self-assessment checklist above
  2. Practice with Domain 3 test bundles
  3. Review any weak areas identified
  4. When ready, proceed to Chapter 4: Cost-Optimized Architectures

Chapter Summary

What We Covered

Task 3.1: High-Performing Storage Solutions

  • āœ… S3 storage classes and performance optimization
  • āœ… EBS volume types (gp3, io2, st1, sc1)
  • āœ… EFS performance modes and throughput modes
  • āœ… FSx file systems (Windows, Lustre, NetApp ONTAP)
  • āœ… Storage Gateway for hybrid cloud
  • āœ… DataSync for data migration

Task 3.2: High-Performing Compute Solutions

  • āœ… EC2 instance types and families
  • āœ… Placement groups for low latency
  • āœ… Auto Scaling policies (target tracking, step, predictive)
  • āœ… Lambda performance optimization
  • āœ… ECS and EKS capacity providers
  • āœ… Batch for large-scale processing
  • āœ… EMR for big data analytics

Task 3.3: High-Performing Database Solutions

  • āœ… RDS instance types and storage
  • āœ… Aurora Serverless v2 and parallel query
  • āœ… DynamoDB capacity modes and DAX
  • āœ… ElastiCache Redis vs Memcached
  • āœ… Database read replicas and connection pooling
  • āœ… RDS Proxy for connection management

Task 3.4: High-Performing Network Architectures

  • āœ… CloudFront edge caching and origin shield
  • āœ… Global Accelerator for global applications
  • āœ… Direct Connect for dedicated connectivity
  • āœ… VPC design and subnet sizing
  • āœ… Enhanced networking and placement groups
  • āœ… Load balancer performance optimization

Task 3.5: High-Performing Data Ingestion & Transformation

  • āœ… Kinesis Data Streams and Firehose
  • āœ… Glue ETL jobs and crawlers
  • āœ… Athena query optimization
  • āœ… EMR for big data processing
  • āœ… Lake Formation for data lakes
  • āœ… QuickSight for visualization

Critical Takeaways

  1. Right Storage for Right Workload: gp3 for general purpose, io2 for IOPS-intensive, S3 for objects
  2. Caching Everywhere: CloudFront (edge), API Gateway (regional), ElastiCache (application), DAX (DynamoDB)
  3. Compute Optimization: Choose instance type based on workload (compute, memory, storage, GPU)
  4. Database Performance: Use read replicas for read scaling, caching for frequently accessed data
  5. Network Optimization: CloudFront for global content, Direct Connect for consistent bandwidth
  6. Data Format Matters: Parquet/ORC for analytics (10x compression), partition data for query performance
  7. Provisioned Concurrency: Use for Lambda when consistent latency is required
  8. Enhanced Networking: Enable for high-throughput, low-latency workloads

Self-Assessment Checklist

Test yourself before moving on:

Storage Performance

  • I can choose the right EBS volume type for different workloads
  • I understand when to use EFS vs FSx vs S3
  • I know how to optimize S3 performance (multipart upload, Transfer Acceleration)
  • I can explain EFS performance modes (General Purpose vs Max I/O)
  • I understand FSx Lustre for HPC workloads

Compute Performance

  • I can select the right EC2 instance type for different workloads
  • I understand placement groups (cluster, partition, spread)
  • I know when to use Lambda vs Fargate vs EC2
  • I can optimize Lambda performance (memory, provisioned concurrency)
  • I understand Auto Scaling policies and when to use each

Database Performance

  • I can choose between RDS and Aurora based on performance needs
  • I understand when to use Aurora Serverless v2
  • I know how to optimize DynamoDB performance (partition keys, GSIs)
  • I can explain when to use DynamoDB DAX
  • I understand ElastiCache Redis vs Memcached use cases
  • I know how to use RDS Proxy for connection pooling

Network Performance

  • I can design CloudFront distributions for optimal caching
  • I understand when to use Global Accelerator vs CloudFront
  • I know how to optimize Direct Connect performance
  • I can explain enhanced networking benefits
  • I understand VPC endpoint performance implications

Data Ingestion & Analytics

  • I can design streaming architectures with Kinesis
  • I understand when to use Kinesis Data Streams vs Firehose
  • I know how to optimize Athena queries (partitioning, columnar formats)
  • I can explain Glue ETL job optimization
  • I understand EMR cluster sizing and optimization

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-50 (storage and compute performance)
  • Domain 3 Bundle 2: Questions 51-100 (database and network performance)
  • Storage Services Bundle: 50 questions on S3, EBS, EFS, FSx
  • Database Services Bundle: 50 questions on RDS, Aurora, DynamoDB, ElastiCache

Expected Score: 70%+ to proceed confidently

If you scored below 70%:

  • Review storage performance characteristics (IOPS, throughput, latency)
  • Practice choosing the right compute instance types
  • Focus on database performance optimization techniques
  • Revisit caching strategies at multiple layers

Quick Reference Card

Copy this to your notes for quick review:

Storage Performance:

  • gp3: 3,000 IOPS baseline, 125 MB/s, cost-effective general purpose
  • io2: Up to 64,000 IOPS, 1,000 MB/s, mission-critical workloads
  • EFS: Shared file system, auto-scaling, bursting or provisioned throughput
  • FSx Lustre: HPC, ML, 100s GB/s throughput, sub-millisecond latency

Compute Performance:

  • C-family: Compute-optimized (batch processing, HPC)
  • R-family: Memory-optimized (in-memory databases, caching)
  • I-family: Storage-optimized (NoSQL databases, data warehousing)
  • P/G-family: GPU-optimized (ML training, graphics rendering)

Database Performance:

  • Aurora: 5x MySQL, 3x PostgreSQL, up to 15 read replicas
  • DynamoDB: Single-digit millisecond latency, DAX for microseconds
  • ElastiCache Redis: Sub-millisecond latency, persistence, replication
  • RDS Proxy: Connection pooling, 66% faster failover

Caching Layers:

  1. CloudFront: Edge caching (global)
  2. API Gateway: Response caching (regional)
  3. ElastiCache/DAX: Application caching (AZ)
  4. RDS Read Replicas: Read scaling (up to 15 replicas)
  5. DynamoDB DAX: Microsecond caching

Performance Optimization Checklist:

  • Use gp3 instead of gp2 for EBS (20% cheaper)
  • Enable S3 Transfer Acceleration for global uploads
  • Implement CloudFront for static content delivery
  • Use ElastiCache/DAX for frequently accessed data
  • Configure RDS read replicas for read-heavy workloads
  • Use Provisioned Concurrency for Lambda (consistent latency)
  • Enable enhanced networking on EC2 instances
  • Use placement groups for low-latency HPC
  • Convert data to Parquet for analytics (10x compression)
  • Partition data by query patterns in Athena

Chapter Summary

What We Covered

This chapter covered the five critical task areas for designing high-performing architectures on AWS:

āœ… Task 3.1: High-Performing Storage Solutions

  • S3 storage classes and performance optimization
  • EBS volume types (gp3, gp2, io2, st1, sc1)
  • EFS performance modes and throughput modes
  • FSx file systems (Windows, Lustre, NetApp ONTAP)
  • S3 Transfer Acceleration and multipart upload
  • Storage Gateway for hybrid storage
  • DataSync for large-scale data migration

āœ… Task 3.2: High-Performing Compute Solutions

  • EC2 instance families and types
  • Lambda memory and concurrency configuration
  • Auto Scaling policies and warm pools
  • Fargate task sizing
  • Batch for large-scale batch processing
  • EMR for big data processing
  • Placement groups for low-latency HPC
  • Compute Optimizer for right-sizing

āœ… Task 3.3: High-Performing Database Solutions

  • RDS instance types and storage options
  • Aurora performance features (Serverless v2, Parallel Query)
  • DynamoDB capacity modes and DAX caching
  • ElastiCache Redis vs Memcached
  • RDS Proxy for connection pooling
  • Database read replicas for read scaling
  • MemoryDB for Redis with persistence

āœ… Task 3.4: High-Performing Network Architectures

  • CloudFront for edge caching and content delivery
  • Global Accelerator for global traffic optimization
  • VPC design for performance
  • Direct Connect for dedicated connectivity
  • Load balancer selection (ALB, NLB, GLB)
  • VPC endpoints for private connectivity
  • Enhanced networking for EC2

āœ… Task 3.5: High-Performing Data Ingestion and Transformation

  • Kinesis Data Streams for real-time streaming
  • Kinesis Firehose for data delivery
  • Glue for ETL and data cataloging
  • Athena for serverless SQL queries
  • EMR for big data processing
  • Lake Formation for data lakes
  • Data format optimization (Parquet, ORC)

Critical Takeaways

  1. Choose the Right Storage: Match storage type to access patterns. Use gp3 for general purpose, io2 for high IOPS, S3 for object storage, EFS for shared file systems.

  2. Right-Size Compute: Use Compute Optimizer recommendations. Choose instance families based on workload (C for compute, R for memory, I for storage).

  3. Implement Caching Everywhere: Cache at edge (CloudFront), application (ElastiCache), and database (DAX, read replicas) layers.

  4. Optimize Database Performance: Use Aurora for high performance, DynamoDB for single-digit millisecond latency, ElastiCache for sub-millisecond caching.

  5. Use CDN for Global Performance: CloudFront reduces latency for global users. Use Origin Shield for additional caching layer.

  6. Partition and Compress Data: Use Parquet format for analytics (10x compression). Partition data by query patterns in Athena.

  7. Scale Horizontally: Add more instances rather than bigger instances. Use read replicas for read-heavy workloads.

  8. Monitor Performance: Use CloudWatch for metrics, Performance Insights for databases, X-Ray for distributed tracing.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Performance:

  • Choose appropriate EBS volume type for workload (gp3, io2, st1, sc1)
  • Configure S3 Transfer Acceleration for global uploads
  • Implement S3 multipart upload for large files
  • Select EFS performance mode (General Purpose vs Max I/O)
  • Choose FSx file system type (Windows, Lustre, NetApp ONTAP)
  • Optimize S3 performance with prefixes and parallelization
  • Use Storage Gateway for hybrid storage scenarios
  • Configure DataSync for large-scale migrations

Compute Performance:

  • Select appropriate EC2 instance family (C, M, R, I, T, P, G)
  • Configure Lambda memory for optimal performance
  • Implement Lambda Provisioned Concurrency for consistent latency
  • Use EC2 placement groups for low-latency HPC
  • Configure Auto Scaling with appropriate policies
  • Choose between Fargate and EC2 launch type for containers
  • Use Batch for large-scale batch processing
  • Implement Compute Optimizer recommendations

Database Performance:

  • Choose between RDS and Aurora based on performance needs
  • Configure Aurora Serverless v2 for variable workloads
  • Implement DynamoDB DAX for microsecond caching
  • Design DynamoDB partition keys for even distribution
  • Use RDS Proxy for connection pooling
  • Configure read replicas for read-heavy workloads
  • Choose between ElastiCache Redis and Memcached
  • Optimize database queries with Performance Insights

Network Performance:

  • Configure CloudFront for edge caching
  • Use Global Accelerator for global traffic optimization
  • Choose appropriate load balancer (ALB, NLB, GLB)
  • Implement VPC endpoints for private connectivity
  • Configure Direct Connect for dedicated bandwidth
  • Use enhanced networking on EC2 instances
  • Optimize VPC design for performance
  • Implement CloudFront Origin Shield

Data Ingestion and Analytics:

  • Design streaming architecture with Kinesis Data Streams
  • Use Kinesis Firehose for data delivery to S3/Redshift
  • Configure Glue ETL jobs for data transformation
  • Optimize Athena queries with partitioning
  • Choose appropriate data format (Parquet, ORC, JSON)
  • Use EMR for big data processing
  • Implement Lake Formation for data lake governance
  • Design real-time analytics pipelines

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 3 Bundle 1: Questions 1-20 (Storage basics, compute types, database fundamentals)
  • Storage Services Bundle: Questions 1-15 (S3, EBS, EFS basics)
  • Compute Services Bundle: Questions 1-15 (EC2, Lambda basics)

Intermediate Level (Target: 70%+ correct):

  • Domain 3 Bundle 2: Questions 21-40 (Performance optimization, caching strategies, advanced configurations)
  • Database Services Bundle: Questions 1-25 (RDS, Aurora, DynamoDB, ElastiCache)
  • Full Practice Test 1: Domain 3 questions (Mixed difficulty, realistic scenarios)

Advanced Level (Target: 60%+ correct):

  • Full Practice Test 2: Domain 3 questions (Complex architectures, optimization scenarios)
  • Full Practice Test 3: Domain 3 questions (Advanced performance tuning)

If You Scored Below Target

Below 60% on Beginner Questions:

  • Review sections: Storage Types, EC2 Instance Families, Database Basics
  • Focus on: EBS volume types, instance family use cases, RDS vs DynamoDB
  • Practice: Create different EBS volumes, launch various instance types, compare database options

Below 60% on Intermediate Questions:

  • Review sections: Performance Optimization, Caching Strategies, Advanced Configurations
  • Focus on: S3 performance, Lambda optimization, database caching, CloudFront
  • Practice: Optimize S3 uploads, configure Lambda concurrency, implement ElastiCache

Below 50% on Advanced Questions:

  • Review sections: Complex Architectures, Multi-Layer Caching, Data Lake Design
  • Focus on: Global performance optimization, advanced database tuning, analytics pipelines
  • Practice: Design multi-region architecture, optimize database queries, build data lakes

Quick Reference Card

Copy this to your notes for quick review

Storage Performance

  • gp3: General purpose, 3000 IOPS baseline, 125 MB/s, configurable IOPS/throughput
  • io2: High IOPS, up to 64,000 IOPS, 1000 MB/s, 99.999% durability
  • st1: Throughput-optimized HDD, 500 MB/s, big data, data warehouses
  • sc1: Cold HDD, 250 MB/s, infrequent access, lowest cost
  • EFS: Shared file system, multi-AZ, General Purpose or Max I/O mode
  • S3 Transfer Acceleration: 50-500% faster uploads using CloudFront edge locations

Compute Performance

  • C-family: Compute-optimized (batch processing, HPC, gaming)
  • M-family: General purpose (web servers, app servers)
  • R-family: Memory-optimized (in-memory databases, caching)
  • I-family: Storage-optimized (NoSQL databases, data warehousing)
  • T-family: Burstable (variable workloads, dev/test)
  • P/G-family: GPU-optimized (ML training, graphics rendering)

Database Performance

  • Aurora: 5x MySQL, 3x PostgreSQL, up to 15 read replicas, 6 copies across 3 AZs
  • DynamoDB: Single-digit millisecond latency, DAX for microseconds
  • ElastiCache Redis: Sub-millisecond latency, persistence, replication, clustering
  • RDS Proxy: Connection pooling, 66% faster failover, IAM authentication

Caching Layers

  1. CloudFront: Edge caching (global), 216+ edge locations
  2. API Gateway: Response caching (regional), 300 seconds default TTL
  3. ElastiCache/DAX: Application caching (AZ), microsecond latency
  4. RDS Read Replicas: Read scaling, up to 15 replicas
  5. DynamoDB DAX: Microsecond caching, 10x performance improvement

Network Performance

  • CloudFront: Edge caching, Origin Shield, field-level encryption
  • Global Accelerator: Static anycast IPs, health checks, traffic dials
  • ALB: Layer 7, path/host routing, WebSocket, Lambda targets
  • NLB: Layer 4, ultra-low latency, static IP, millions req/sec
  • Direct Connect: Dedicated 1/10/100 Gbps, consistent latency
  • Enhanced Networking: SR-IOV, up to 100 Gbps, lower latency

Data Formats

  • Parquet: Columnar, 10x compression, best for analytics
  • ORC: Columnar, optimized for Hive, good compression
  • JSON: Human-readable, flexible schema, larger size
  • CSV: Simple, widely supported, no compression
  • Avro: Row-based, schema evolution, good for streaming

Performance Optimization Checklist

  • Use gp3 instead of gp2 (20% cheaper, better performance)
  • Enable S3 Transfer Acceleration for global uploads
  • Implement CloudFront for static content delivery
  • Use ElastiCache/DAX for frequently accessed data
  • Configure RDS read replicas for read-heavy workloads
  • Use Provisioned Concurrency for Lambda (consistent latency)
  • Enable enhanced networking on EC2 instances
  • Use placement groups for low-latency HPC
  • Convert data to Parquet for analytics (10x compression)
  • Partition data by query patterns in Athena

Decision Points

Scenario Solution
Need high IOPS (>16,000) io2 Block Express
Need shared file system EFS or FSx
Need global file caching FSx for Lustre with S3
Need consistent low latency Lambda Provisioned Concurrency
Need HPC with low latency Placement groups (cluster)
Need database caching ElastiCache Redis or DAX
Need global content delivery CloudFront
Need global traffic optimization Global Accelerator
Need real-time streaming Kinesis Data Streams
Need serverless analytics Athena
Need big data processing EMR

Common Exam Traps

  • āŒ Using gp2 instead of gp3 → āœ… gp3 is cheaper and more flexible
  • āŒ Not using caching → āœ… Implement multi-layer caching
  • āŒ Wrong instance family → āœ… Match family to workload (C, R, M, I)
  • āŒ Not using read replicas → āœ… Scale reads with replicas
  • āŒ Storing data in JSON → āœ… Convert to Parquet for analytics
  • āŒ Not partitioning data → āœ… Partition by query patterns
  • āŒ Using wrong database → āœ… Match database to access pattern
  • āŒ Not using CloudFront → āœ… Use CDN for global performance

Next Chapter: 05_domain4_cost_optimized_architectures - Learn how to design cost-optimized solutions.


Chapter Summary

What We Covered

This chapter covered the five critical task areas for designing high-performing architectures on AWS:

āœ… Task 3.1: High-Performing Storage Solutions

  • S3 storage classes and performance optimization
  • EBS volume types (gp3, gp2, io2, io1, st1, sc1)
  • EFS performance modes and throughput modes
  • FSx file systems (Windows, Lustre, NetApp ONTAP, OpenZFS)
  • S3 Transfer Acceleration and multipart upload
  • Storage Gateway for hybrid storage
  • DataSync for data migration

āœ… Task 3.2: High-Performing Compute Solutions

  • EC2 instance families and types
  • Placement groups for low-latency HPC
  • Auto Scaling policies (target tracking, step, scheduled)
  • Lambda memory and concurrency configuration
  • ECS and EKS for container orchestration
  • Fargate for serverless containers
  • Batch for batch processing workloads
  • EMR for big data processing

āœ… Task 3.3: High-Performing Database Solutions

  • RDS instance types and storage options
  • Aurora Serverless and Aurora Global Database
  • DynamoDB capacity modes and DAX caching
  • ElastiCache (Redis vs Memcached)
  • Read replicas for read scaling
  • RDS Proxy for connection pooling
  • Database performance monitoring with Performance Insights

āœ… Task 3.4: High-Performing Network Architectures

  • CloudFront for edge caching and content delivery
  • Global Accelerator for global traffic management
  • Direct Connect for dedicated network connections
  • VPC design and subnet sizing
  • Load balancing strategies (ALB, NLB, GLB)
  • VPC endpoints for private connectivity
  • Enhanced networking for high throughput

āœ… Task 3.5: High-Performing Data Ingestion and Transformation

  • Kinesis Data Streams for real-time streaming
  • Kinesis Data Firehose for data delivery
  • Glue for ETL and data cataloging
  • Athena for serverless SQL queries
  • EMR for big data processing
  • Lake Formation for data lake management
  • QuickSight for data visualization
  • Data format optimization (Parquet, ORC)

Critical Takeaways

  1. Choose the Right Storage: Use gp3 for general purpose (cheaper than gp2), io2 Block Express for high IOPS (>64,000), EFS for shared file systems, and FSx for specialized workloads.

  2. Instance Selection Matters: Match instance type to workload - compute-optimized (C) for CPU-intensive, memory-optimized (R/X) for in-memory databases, storage-optimized (I/D) for high IOPS.

  3. Cache Everything: Use CloudFront for static content, ElastiCache for application data, DAX for DynamoDB, and RDS read replicas for read-heavy workloads.

  4. Serverless for Variable Workloads: Lambda and Fargate automatically scale. Use Provisioned Concurrency for Lambda when you need consistent low latency.

  5. Database Performance: Use Aurora for high performance and scalability. Use DynamoDB for single-digit millisecond latency. Use ElastiCache to reduce database load.

  6. Network Optimization: Use CloudFront to reduce latency globally. Use Direct Connect for consistent network performance. Use VPC endpoints to avoid internet gateway.

  7. Data Format Matters: Convert to Parquet for analytics (10x compression). Partition data by query patterns in Athena. Use columnar formats for analytical workloads.

  8. Monitoring is Essential: Use CloudWatch for metrics, X-Ray for distributed tracing, and Performance Insights for database performance.

Self-Assessment Checklist

Test yourself before moving on:

  • I can choose the right EBS volume type for different workloads
  • I understand when to use EFS vs FSx vs S3
  • I know how to optimize S3 performance with Transfer Acceleration
  • I can select the appropriate EC2 instance type for a workload
  • I understand Lambda memory and concurrency configuration
  • I know when to use ECS vs EKS vs Fargate
  • I can design a caching strategy with multiple layers
  • I understand the difference between Aurora and RDS
  • I know when to use DynamoDB vs RDS
  • I can configure read replicas for read scaling
  • I understand CloudFront caching behaviors
  • I know when to use Global Accelerator vs CloudFront
  • I can design a data ingestion pipeline with Kinesis
  • I understand data format optimization for analytics

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-25 (Storage and compute performance)
  • Domain 3 Bundle 2: Questions 1-25 (Database and network performance)
  • Storage Services Bundle: Questions 1-30
  • Database Services Bundle: Questions 1-30
  • Compute Services Bundle: Questions 1-30

Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review EBS volume types and their IOPS limits
  • Focus on understanding Lambda concurrency and memory configuration
  • Study database caching strategies (ElastiCache, DAX, read replicas)
  • Practice CloudFront caching and invalidation scenarios

Quick Reference Card

Storage Performance:

  • gp3: 3,000-16,000 IOPS, 125-1,000 MB/s, $0.08/GB-month
  • io2 Block Express: Up to 256,000 IOPS, 4,000 MB/s, sub-millisecond latency
  • EFS: Bursting or Provisioned throughput, shared across instances
  • S3 Transfer Acceleration: 50-500% faster uploads over long distances

Compute Performance:

  • C instances: Compute-optimized (CPU-intensive)
  • R/X instances: Memory-optimized (in-memory databases)
  • I/D instances: Storage-optimized (high IOPS)
  • Placement groups: Cluster (low latency), Spread (high availability), Partition (distributed)

Database Performance:

  • Aurora: 5x faster than MySQL, 3x faster than PostgreSQL, up to 15 read replicas
  • DynamoDB: Single-digit millisecond latency, unlimited throughput with on-demand
  • ElastiCache Redis: Sub-millisecond latency, persistence, replication
  • DAX: Microsecond latency for DynamoDB, 10x performance improvement

Network Performance:

  • CloudFront: 225+ edge locations, cache static/dynamic content
  • Global Accelerator: Static anycast IPs, health checks, traffic dials
  • Direct Connect: 1-100 Gbps dedicated connection, consistent latency
  • Enhanced Networking: SR-IOV, up to 100 Gbps, lower latency

Data Ingestion:

  • Kinesis Data Streams: Real-time streaming, custom processing
  • Kinesis Firehose: Serverless delivery to S3, Redshift, Elasticsearch
  • Glue: Serverless ETL, data catalog, crawlers
  • Athena: Serverless SQL on S3, pay per query

Key Decision Points:

  • Need high IOPS (>16,000) → io2 Block Express
  • Need shared file system → EFS or FSx
  • Need consistent low latency → Lambda Provisioned Concurrency
  • Need HPC with low latency → Placement groups (cluster)
  • Need database caching → ElastiCache Redis or DAX
  • Need global content delivery → CloudFront
  • Need real-time streaming → Kinesis Data Streams
  • Need serverless analytics → Athena + Glue

Next Chapter: 05_domain4_cost_optimized_architectures - Learn how to design cost-optimized solutions.


Chapter 4: Design Cost-Optimized Architectures (20% of exam)

Chapter Overview

What you'll learn:

  • Cost-optimized storage solutions (S3 lifecycle, storage classes)
  • Cost-optimized compute solutions (Reserved Instances, Savings Plans, Spot)
  • Cost-optimized database solutions (right-sizing, Aurora Serverless)
  • Cost-optimized network architectures (data transfer, VPC endpoints)
  • Cost monitoring and optimization tools

Time to complete: 8-10 hours

Prerequisites: Chapters 1-3 (understanding of services before optimizing costs)

Exam Weight: 20% of exam questions (approximately 13 out of 65 questions)


Section 1: Cost-Optimized Storage Solutions

Introduction

The problem: Storage costs can spiral out of control without proper management. Storing infrequently accessed data in expensive storage, not using lifecycle policies, and paying for unnecessary data transfer all waste money.

The solution: AWS provides multiple storage classes with different price points. Understanding access patterns, implementing lifecycle policies, and optimizing data transfer enables significant cost savings without sacrificing availability or durability.

Why it's tested: Storage is often the largest AWS cost component. This domain represents 20% of the exam and tests your ability to optimize storage costs while meeting performance and availability requirements.

Core Concepts

S3 Storage Classes and Lifecycle Policies

What they are: S3 offers multiple storage classes optimized for different access patterns and durability requirements. Lifecycle policies automatically transition objects between storage classes based on age or access patterns.

Why they exist: Not all data needs the same level of access speed or durability. Frequently accessed data needs fast retrieval. Infrequently accessed data can tolerate slower retrieval for lower cost. Lifecycle policies automate cost optimization without manual intervention.

S3 Storage Classes:

S3 Standard - Frequent access:

  • Durability: 99.999999999% (11 9's)
  • Availability: 99.99%
  • Retrieval: Milliseconds
  • Cost: $0.023/GB-month (first 50 TB)
  • Use Case: Frequently accessed data, primary storage

S3 Intelligent-Tiering - Unknown/changing access:

  • Automatic: Moves objects between tiers based on access patterns
  • Tiers: Frequent (same as Standard), Infrequent (40% cheaper), Archive (68% cheaper), Deep Archive (95% cheaper)
  • Monitoring: $0.0025 per 1,000 objects
  • Cost: Same as Standard for frequent, cheaper for infrequent
  • Use Case: Unknown access patterns, automatic optimization

S3 Standard-IA - Infrequent access:

  • Durability: 99.999999999% (11 9's)
  • Availability: 99.9%
  • Retrieval: Milliseconds
  • Cost: $0.0125/GB-month (46% cheaper than Standard)
  • Retrieval Fee: $0.01/GB
  • Minimum: 30 days, 128 KB per object
  • Use Case: Backups, disaster recovery, infrequently accessed data

S3 One Zone-IA - Infrequent access, single AZ:

  • Durability: 99.999999999% (11 9's) within single AZ
  • Availability: 99.5%
  • Retrieval: Milliseconds
  • Cost: $0.01/GB-month (57% cheaper than Standard)
  • Retrieval Fee: $0.01/GB
  • Use Case: Reproducible data, secondary backups

S3 Glacier Instant Retrieval - Archive with instant access:

  • Durability: 99.999999999% (11 9's)
  • Availability: 99.9%
  • Retrieval: Milliseconds
  • Cost: $0.004/GB-month (83% cheaper than Standard)
  • Retrieval Fee: $0.03/GB
  • Minimum: 90 days, 128 KB per object
  • Use Case: Medical images, news archives (rarely accessed but need instant retrieval)

S3 Glacier Flexible Retrieval - Archive with flexible retrieval:

  • Durability: 99.999999999% (11 9's)
  • Availability: 99.99%
  • Retrieval: Minutes to hours (Expedited: 1-5 min, Standard: 3-5 hours, Bulk: 5-12 hours)
  • Cost: $0.0036/GB-month (84% cheaper than Standard)
  • Retrieval Fee: $0.01-0.03/GB depending on speed
  • Minimum: 90 days
  • Use Case: Long-term backups, compliance archives

S3 Glacier Deep Archive - Lowest cost archive:

  • Durability: 99.999999999% (11 9's)
  • Availability: 99.99%
  • Retrieval: 12-48 hours (Standard: 12 hours, Bulk: 48 hours)
  • Cost: $0.00099/GB-month (96% cheaper than Standard)
  • Retrieval Fee: $0.02/GB
  • Minimum: 180 days
  • Use Case: Regulatory archives, data retained for 7-10 years

Detailed Example 1: S3 Lifecycle Policy for Cost Optimization

Scenario: You're storing application logs in S3. Access patterns:

  • Days 0-30: Frequently accessed for debugging (accessed daily)
  • Days 31-90: Occasionally accessed for analysis (accessed weekly)
  • Days 91-365: Rarely accessed for compliance (accessed monthly)
  • Days 365+: Almost never accessed, kept for 7 years (compliance)

Current Cost (all in S3 Standard):

  • Storage: 10 TB Ɨ $0.023/GB Ɨ 1,024 GB/TB = $235/month
  • Annual: $235 Ɨ 12 = $2,820/year
  • 7 years: $2,820 Ɨ 7 = $19,740

Optimized with Lifecycle Policy:

Lifecycle Configuration:

<LifecycleConfiguration>
  <Rule>
    <ID>log-lifecycle</ID>
    <Status>Enabled</Status>
    <Prefix>logs/</Prefix>
    
    <!-- Days 0-30: S3 Standard (no transition) -->
    
    <!-- Days 31-90: Transition to Standard-IA -->
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    
    <!-- Days 91-365: Transition to Glacier Instant Retrieval -->
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER_IR</StorageClass>
    </Transition>
    
    <!-- Days 365+: Transition to Glacier Deep Archive -->
    <Transition>
      <Days>365</Days>
      <StorageClass>DEEP_ARCHIVE</StorageClass>
    </Transition>
    
    <!-- Delete after 7 years -->
    <Expiration>
      <Days>2555</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>

Cost Breakdown:

Month 1 (all data in Standard):

  • 10 TB Ɨ $0.023/GB Ɨ 1,024 = $235

Month 2 (30 days Standard, 30 days Standard-IA):

  • Standard: 5 TB Ɨ $0.023 Ɨ 1,024 = $118
  • Standard-IA: 5 TB Ɨ $0.0125 Ɨ 1,024 = $64
  • Total: $182 (23% savings)

Month 4 (30 days Standard, 60 days Standard-IA, 30 days Glacier IR):

  • Standard: 2.5 TB Ɨ $0.023 Ɨ 1,024 = $59
  • Standard-IA: 5 TB Ɨ $0.0125 Ɨ 1,024 = $64
  • Glacier IR: 2.5 TB Ɨ $0.004 Ɨ 1,024 = $10
  • Total: $133 (43% savings)

Month 13 (steady state):

  • Standard: 0.8 TB Ɨ $0.023 Ɨ 1,024 = $19
  • Standard-IA: 1.6 TB Ɨ $0.0125 Ɨ 1,024 = $20
  • Glacier IR: 7.5 TB Ɨ $0.004 Ɨ 1,024 = $31
  • Glacier Deep Archive: 0.1 TB Ɨ $0.00099 Ɨ 1,024 = $0.10
  • Total: $70 (70% savings)

7-Year Cost:

  • Without lifecycle: $19,740
  • With lifecycle: ~$6,000
  • Savings: $13,740 (70% reduction)

Retrieval Costs (occasional access):

  • Standard-IA: 100 GB/month Ɨ $0.01 = $1/month
  • Glacier IR: 50 GB/month Ɨ $0.03 = $1.50/month
  • Glacier Deep Archive: 10 GB/year Ɨ $0.02 = $0.20/year
  • Total: ~$30/year (negligible compared to storage savings)

Section 2: Cost-Optimized Compute Solutions

Introduction

The problem: Running EC2 instances 24/7 at On-Demand prices is expensive. Many workloads don't need continuous availability or can tolerate interruptions. Not using Reserved Instances, Savings Plans, or Spot Instances wastes money.

The solution: AWS provides multiple pricing models for EC2. Understanding workload characteristics and commitment levels enables 50-90% cost savings without sacrificing performance.

Why it's tested: Compute is typically the second-largest AWS cost. This section tests your ability to select appropriate pricing models and optimize compute costs.

Core Concepts

EC2 Pricing Models

On-Demand - Pay by the hour/second:

  • Pricing: Standard hourly rate (e.g., $0.096/hour for m5.xlarge)
  • Commitment: None
  • Flexibility: Start/stop anytime
  • Use Case: Short-term, unpredictable workloads, testing

Reserved Instances - 1 or 3-year commitment:

  • Discount: 40-60% vs On-Demand
  • Payment: All Upfront, Partial Upfront, No Upfront
  • Types:
    • Standard RI: Highest discount (60%), no flexibility
    • Convertible RI: Lower discount (54%), can change instance family
  • Use Case: Steady-state workloads, predictable usage

Savings Plans - 1 or 3-year commitment:

  • Discount: Up to 72% vs On-Demand
  • Flexibility: Apply to any instance family, size, region, OS
  • Types:
    • Compute Savings Plans: Most flexible, 66% discount
    • EC2 Instance Savings Plans: Less flexible, 72% discount
  • Use Case: Flexible workloads, multiple instance types

Spot Instances - Bid on spare capacity:

  • Discount: Up to 90% vs On-Demand
  • Interruption: Can be terminated with 2-minute warning
  • Use Case: Fault-tolerant, flexible workloads (batch, big data, CI/CD)

Detailed Example 2: Compute Cost Optimization Strategy

Scenario: You're running a web application with the following workload:

  • Baseline: 10 m5.xlarge instances (24/7)
  • Peak Hours (9 AM - 6 PM weekdays): Additional 20 m5.xlarge instances
  • Batch Processing (nightly): 50 c5.2xlarge instances (2 hours/night)

Current Cost (all On-Demand):

  • Baseline: 10 Ɨ $0.192/hour Ɨ 730 hours = $1,402/month
  • Peak: 20 Ɨ $0.192/hour Ɨ 200 hours = $768/month
  • Batch: 50 Ɨ $0.34/hour Ɨ 60 hours = $1,020/month
  • Total: $3,190/month = $38,280/year

Optimized Strategy:

1. Baseline: Use Savings Plans:

  • Commit to $1,000/month (covers ~10 instances)
  • Discount: 66% (Compute Savings Plan, 1-year, No Upfront)
  • Cost: $1,000/month (vs $1,402 On-Demand)
  • Savings: $402/month

2. Peak Hours: Use On-Demand:

  • No commitment needed (variable usage)
  • Cost: $768/month (same as before)

3. Batch Processing: Use Spot Instances:

  • Spot price: ~$0.068/hour (80% discount)
  • Interruption handling: Checkpoint progress, resume on new instance
  • Cost: 50 Ɨ $0.068/hour Ɨ 60 hours = $204/month
  • Savings: $816/month

Optimized Total:

  • Savings Plans: $1,000/month
  • On-Demand: $768/month
  • Spot: $204/month
  • Total: $1,972/month = $23,664/year
  • Savings: $14,616/year (38% reduction)

Further Optimization with 3-Year Commitment:

  • Savings Plans: 72% discount (EC2 Instance Savings Plan, 3-year, All Upfront)
  • Baseline cost: $1,402 Ɨ 0.28 = $393/month
  • Total: $1,365/month = $16,380/year
  • Savings: $21,900/year (57% reduction)

Implementation:

Step 1: Purchase Savings Plan:

aws savingsplans create-savings-plan \
  --savings-plan-type ComputeSavingsPlans \
  --commitment 1000 \
  --upfront-payment-amount 0 \
  --term-duration-in-years 1

Step 2: Configure Spot Fleet for Batch:

aws ec2 create-spot-fleet-request \
  --spot-fleet-request-config '{
    "IamFleetRole": "arn:aws:iam::123456789012:role/SpotFleetRole",
    "TargetCapacity": 50,
    "SpotPrice": "0.10",
    "LaunchSpecifications": [{
      "ImageId": "ami-12345678",
      "InstanceType": "c5.2xlarge",
      "KeyName": "my-key",
      "UserData": "base64-encoded-script"
    }],
    "AllocationStrategy": "lowestPrice",
    "InstanceInterruptionBehavior": "terminate"
  }'

Step 3: Handle Spot Interruptions:

# Batch processing script
import boto3
import time

ec2 = boto3.client('ec2')

def process_batch(items):
    for item in items:
        # Check for spot interruption warning
        try:
            response = requests.get(
                'http://169.254.169.254/latest/meta-data/spot/instance-action',
                timeout=1
            )
            if response.status_code == 200:
                # Spot interruption in 2 minutes
                print("Spot interruption warning, checkpointing...")
                checkpoint_progress(item)
                break
        except:
            pass
        
        # Process item
        process_item(item)
        mark_complete(item)

def checkpoint_progress(item):
    # Save progress to DynamoDB
    dynamodb.put_item(
        TableName='batch-progress',
        Item={'item_id': item, 'status': 'in-progress'}
    )

# Resume from checkpoint on new instance
def resume_batch():
    # Get incomplete items
    response = dynamodb.query(
        TableName='batch-progress',
        KeyConditionExpression='status = :status',
        ExpressionAttributeValues={':status': 'in-progress'}
    )
    incomplete_items = [item['item_id'] for item in response['Items']]
    process_batch(incomplete_items)

Chapter Summary

What We Covered

This chapter covered the "Design Cost-Optimized Architectures" domain, which represents 20% of the SAA-C03 exam. We explored two major areas:

āœ… Section 1: Cost-Optimized Storage Solutions

  • S3 storage classes and their use cases
  • S3 lifecycle policies for automatic transitions
  • Cost comparison between storage classes
  • Retrieval costs and minimum storage durations

āœ… Section 2: Cost-Optimized Compute Solutions

  • EC2 pricing models (On-Demand, Reserved, Savings Plans, Spot)
  • Savings Plans vs Reserved Instances
  • Spot Instance use cases and interruption handling
  • Cost optimization strategies for different workloads

Critical Takeaways

  1. S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes based on age. Can save 70-90% on storage costs for infrequently accessed data.

  2. Storage Class Selection: Use Standard for frequent access, Standard-IA for infrequent access (>30 days), Glacier for archives (>90 days), Deep Archive for long-term retention (>180 days).

  3. Savings Plans: Most flexible commitment option. Compute Savings Plans apply to any instance family/region. EC2 Instance Savings Plans offer higher discounts but less flexibility.

  4. Reserved Instances: Good for predictable workloads with specific instance requirements. Standard RIs offer highest discount (60%) but no flexibility. Convertible RIs offer flexibility (54% discount).

  5. Spot Instances: Up to 90% discount for fault-tolerant workloads. Must handle 2-minute interruption warnings. Best for batch processing, big data, CI/CD.

  6. Cost Optimization Strategy: Use Savings Plans for baseline, On-Demand for variable peaks, Spot for fault-tolerant batch workloads. Can achieve 40-60% total cost reduction.

  7. Intelligent-Tiering: Automatic cost optimization for unknown access patterns. Monitors access and moves objects between tiers. No retrieval fees, small monitoring fee.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand all S3 storage classes and their use cases
  • I can design S3 lifecycle policies for cost optimization
  • I know the minimum storage durations for each storage class
  • I understand the difference between Savings Plans and Reserved Instances
  • I know when to use Spot Instances
  • I can handle Spot Instance interruptions
  • I understand how to optimize costs for different workload patterns
  • I can calculate cost savings for different pricing models

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Storage and compute costs)
  • Domain 4 Bundle 2: Questions 26-50 (Cost optimization strategies)
  • Full Practice Test 1: Questions 54-65 (Domain 4 questions)

Expected score: 70%+ to proceed confidently

Quick Reference Card

S3 Storage Classes (by cost, cheapest to most expensive):

  • Deep Archive: $0.00099/GB-month, 12-48 hour retrieval
  • Glacier Flexible: $0.0036/GB-month, minutes-hours retrieval
  • Glacier Instant: $0.004/GB-month, millisecond retrieval
  • One Zone-IA: $0.01/GB-month, single AZ
  • Standard-IA: $0.0125/GB-month, multi-AZ
  • Standard: $0.023/GB-month, frequent access

EC2 Pricing Models (by discount):

  • Spot: Up to 90% discount, interruptible
  • Savings Plans: Up to 72% discount, 1-3 year commitment
  • Reserved Instances: Up to 60% discount, 1-3 year commitment
  • On-Demand: No discount, no commitment

Decision Points:

  • Infrequent access (>30 days) → Use S3 Standard-IA
  • Archive (>90 days) → Use S3 Glacier
  • Long-term archive (>180 days) → Use S3 Glacier Deep Archive
  • Unknown access pattern → Use S3 Intelligent-Tiering
  • Steady-state workload → Use Savings Plans or Reserved Instances
  • Fault-tolerant batch → Use Spot Instances
  • Variable workload → Use On-Demand

Next Chapter: 06_integration - Integration & Cross-Domain Scenarios


Section 3: Cost-Optimized Database Solutions

Introduction

The problem: Database costs can be significant, especially for high-throughput or large-storage workloads. Running oversized instances, not using serverless options, and paying for unused capacity waste money.

The solution: AWS provides multiple database pricing models and optimization strategies. Understanding workload patterns, using serverless options, and right-sizing instances enables significant cost savings.

Core Concepts

RDS Cost Optimization

RDS Pricing Factors:

  1. Instance Type: db.t3 (burstable) vs db.m5 (general) vs db.r5 (memory)
  2. Storage: gp2 vs gp3 vs io1 (IOPS costs)
  3. Multi-AZ: Doubles instance cost (but necessary for production)
  4. Backups: Automated backups (free up to DB size), manual snapshots (charged)
  5. Data Transfer: Cross-region replication, read replica traffic

Cost Optimization Strategies:

1. Right-Size Instances:

  • Monitor CPU, memory, IOPS utilization
  • Downsize if consistently under 50% utilization
  • Use CloudWatch metrics and RDS Performance Insights

2. Use Reserved Instances:

  • 1-year: 40% discount
  • 3-year: 60% discount
  • All Upfront payment: Highest discount

3. Use Aurora Serverless v2:

  • Pay per ACU (Aurora Capacity Unit) per second
  • Automatically scales based on load
  • No idle capacity costs

4. Optimize Storage:

  • Switch from gp2 to gp3 (20% cheaper)
  • Reduce provisioned IOPS if not needed
  • Delete old snapshots

Detailed Example 3: RDS Cost Optimization

Scenario: Running PostgreSQL on RDS:

  • Instance: db.m5.2xlarge (8 vCPU, 32 GB RAM)
  • Storage: gp2 500 GB
  • Multi-AZ: Yes
  • Utilization: CPU 30%, Memory 40%
  • Cost: $0.544/hour Ɨ 2 (Multi-AZ) Ɨ 730 hours = $794/month

Optimization Steps:

Step 1: Right-Size Instance:

  • Current: db.m5.2xlarge (8 vCPU, 32 GB RAM)
  • Actual need: 30% CPU = 2.4 vCPU, 40% memory = 12.8 GB
  • New: db.m5.large (2 vCPU, 8 GB RAM) - Slightly tight but acceptable
  • Cost: $0.192/hour Ɨ 2 (Multi-AZ) Ɨ 730 hours = $280/month
  • Savings: $514/month (65% reduction)

Step 2: Switch to gp3 Storage:

  • Current: gp2 500 GB = $0.115/GB Ɨ 500 = $57.50/month
  • New: gp3 500 GB = $0.08/GB Ɨ 500 = $40/month
  • Savings: $17.50/month (30% reduction)

Step 3: Purchase Reserved Instance:

  • 1-year, No Upfront: 40% discount
  • Cost: $280 Ɨ 0.6 = $168/month
  • Savings: $112/month (40% reduction)

Total Optimized Cost:

  • Instance: $168/month (Reserved)
  • Storage: $40/month (gp3)
  • Total: $208/month (vs $794 original)
  • Total Savings: $586/month (74% reduction)

Aurora Serverless v2

What it is: Aurora Serverless v2 is an on-demand, auto-scaling configuration for Amazon Aurora. It automatically scales database capacity based on application demand.

Why it exists: Traditional databases require provisioning fixed capacity. During low traffic, you pay for idle capacity. During spikes, you may not have enough capacity. Aurora Serverless eliminates this waste by scaling automatically.

How it works:

  1. Define Capacity Range: Set minimum and maximum ACUs (Aurora Capacity Units)
  2. Automatic Scaling: Aurora scales up/down in 0.5 ACU increments
  3. Pay Per Second: Only pay for ACUs used per second
  4. Instant Scaling: Scales in seconds (vs minutes for instance resizing)

Pricing:

  • ACU: $0.12 per ACU-hour (MySQL/PostgreSQL)
  • Storage: $0.10/GB-month
  • I/O: $0.20 per million requests

Detailed Example 4: Aurora Serverless Cost Comparison

Scenario: E-commerce database with variable traffic:

  • Baseline (nights/weekends): 2 ACUs needed
  • Normal (business hours): 8 ACUs needed
  • Peak (sales events): 32 ACUs needed
  • Pattern: 16 hours/day normal, 2 hours/day peak, 6 hours/day baseline

Option 1: Provisioned Aurora (db.r5.2xlarge):

  • Capacity: 32 ACUs (to handle peak)
  • Cost: $0.58/hour Ɨ 730 hours = $423/month
  • Waste: Paying for 32 ACUs 24/7, only need it 2 hours/day

Option 2: Aurora Serverless v2:

  • Min: 2 ACUs, Max: 32 ACUs
  • Usage:
    • Baseline (6 hours/day): 2 ACUs Ɨ 6 Ɨ 30 = 360 ACU-hours
    • Normal (16 hours/day): 8 ACUs Ɨ 16 Ɨ 30 = 3,840 ACU-hours
    • Peak (2 hours/day): 32 ACUs Ɨ 2 Ɨ 30 = 1,920 ACU-hours
    • Total: 6,120 ACU-hours/month
  • Cost: 6,120 Ɨ $0.12 = $734/month

Wait, that's more expensive! Let's recalculate with realistic scaling:

Realistic Scenario (gradual scaling):

  • Baseline (6 hours/day): 2 ACUs
  • Ramp up (2 hours/day): 4 ACUs average
  • Normal (14 hours/day): 8 ACUs
  • Peak (2 hours/day): 16 ACUs average (not full 32)
  • Usage:
    • 2 ACUs Ɨ 6 Ɨ 30 = 360 ACU-hours
    • 4 ACUs Ɨ 2 Ɨ 30 = 240 ACU-hours
    • 8 ACUs Ɨ 14 Ɨ 30 = 3,360 ACU-hours
    • 16 ACUs Ɨ 2 Ɨ 30 = 960 ACU-hours
    • Total: 4,920 ACU-hours/month
  • Cost: 4,920 Ɨ $0.12 = $590/month

Comparison:

  • Provisioned: $423/month (fixed capacity)
  • Serverless: $590/month (variable capacity)

When Serverless Wins:

  • If traffic is more variable (long idle periods)
  • If peak is rare (< 10% of time)
  • If you want to avoid over-provisioning

When Provisioned Wins:

  • If traffic is consistent (> 50% at peak capacity)
  • If you can use Reserved Instances (40-60% discount)
  • If predictable workload

Section 4: Cost-Optimized Network Architectures

Introduction

The problem: Data transfer costs can be significant, especially for high-traffic applications. Cross-region transfers, NAT Gateway costs, and unnecessary data movement waste money.

The solution: Understanding data transfer pricing, using VPC endpoints, optimizing NAT Gateway usage, and leveraging CloudFront enables significant cost savings.

Core Concepts

Data Transfer Pricing

AWS Data Transfer Costs:

Inbound (to AWS):

  • Free: All data transfer into AWS from internet

Outbound (from AWS to internet):

  • First 10 TB/month: $0.09/GB
  • Next 40 TB/month: $0.085/GB
  • Next 100 TB/month: $0.07/GB
  • Over 150 TB/month: $0.05/GB

Inter-Region (between AWS regions):

  • Cost: $0.02/GB (both directions)

Intra-Region (within same region):

  • Same AZ: Free (if using private IP)
  • Different AZ: $0.01/GB (each direction)

VPC Peering:

  • Same Region: $0.01/GB
  • Different Region: $0.02/GB

NAT Gateway:

  • Hourly: $0.045/hour
  • Data Processed: $0.045/GB

Detailed Example 5: Network Cost Optimization

Scenario: Web application with:

  • EC2 instances: Private subnets, need internet access for updates
  • S3 access: Frequent reads/writes to S3
  • Data transfer: 10 TB/month to internet, 5 TB/month to S3

Current Architecture (NAT Gateway):

  • NAT Gateway: $0.045/hour Ɨ 730 hours = $32.85/month
  • Data processed: 15 TB Ɨ 1,024 GB Ɨ $0.045 = $691/month
  • Data transfer out: 10 TB Ɨ 1,024 GB Ɨ $0.09 = $922/month
  • Total: $1,646/month

Optimized Architecture (VPC Endpoints):

Step 1: Add S3 VPC Endpoint (Gateway):

  • Cost: Free (no hourly or data charges)
  • Benefit: S3 traffic stays within AWS network
  • Savings: 5 TB Ɨ 1,024 GB Ɨ $0.045 = $230/month (NAT Gateway data processing)

Step 2: Add VPC Endpoint for Other Services:

  • DynamoDB Gateway Endpoint: Free
  • Interface Endpoints (EC2, SNS, SQS): $0.01/hour per AZ + $0.01/GB
  • For 2 AZs: $0.01 Ɨ 2 Ɨ 730 = $14.60/month
  • Data processed: Minimal (< 1 TB)

Step 3: Use CloudFront for Static Content:

  • Serve static assets from CloudFront instead of EC2
  • CloudFront: $0.085/GB (first 10 TB) vs $0.09/GB (EC2 data transfer)
  • Caching reduces origin requests by 80%
  • Data transfer: 10 TB Ɨ 0.2 (cache miss rate) Ɨ 1,024 GB Ɨ $0.085 = $174/month
  • Savings: $922 - $174 = $748/month

Optimized Total:

  • NAT Gateway: $32.85/month (still needed for updates)
  • NAT Gateway data: 0.5 TB Ɨ 1,024 GB Ɨ $0.045 = $23/month (only updates)
  • VPC Endpoints: $14.60/month
  • CloudFront: $174/month
  • Total: $244/month (vs $1,646 original)
  • Savings: $1,402/month (85% reduction)

VPC Endpoints

What they are: VPC endpoints enable private connections between your VPC and AWS services without using internet gateway, NAT device, VPN, or AWS Direct Connect.

Types:

Gateway Endpoints (Free):

  • Services: S3, DynamoDB
  • Cost: Free (no hourly or data charges)
  • Routing: Uses route table entries

Interface Endpoints (Paid):

  • Services: Most AWS services (EC2, SNS, SQS, etc.)
  • Cost: $0.01/hour per AZ + $0.01/GB data processed
  • Implementation: ENI in your subnet

When to Use:

  • āœ… High S3/DynamoDB traffic from private subnets
  • āœ… Want to avoid NAT Gateway data processing charges
  • āœ… Need private connectivity to AWS services
  • āœ… Security requirement (no internet access)

Chapter Summary

What We Covered

This chapter covered the "Design Cost-Optimized Architectures" domain, which represents 20% of the SAA-C03 exam. We explored four major areas:

āœ… Section 1: Cost-Optimized Storage Solutions

  • S3 storage classes and pricing
  • S3 lifecycle policies for automatic cost optimization
  • Cost comparison and use cases for each storage class

āœ… Section 2: Cost-Optimized Compute Solutions

  • EC2 pricing models (On-Demand, Reserved, Savings Plans, Spot)
  • Cost optimization strategies for different workload patterns
  • Spot Instance interruption handling

āœ… Section 3: Cost-Optimized Database Solutions

  • RDS right-sizing and Reserved Instances
  • Aurora Serverless v2 for variable workloads
  • Storage optimization (gp2 to gp3)

āœ… Section 4: Cost-Optimized Network Architectures

  • Data transfer pricing and optimization
  • VPC endpoints to reduce NAT Gateway costs
  • CloudFront for reduced data transfer costs

Critical Takeaways

  1. S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes. Can save 70-96% on storage costs for infrequently accessed data.

  2. Storage Class Selection: Standard ($0.023/GB) for frequent access, Standard-IA ($0.0125/GB) for infrequent, Glacier ($0.004/GB) for archives, Deep Archive ($0.00099/GB) for long-term.

  3. Compute Optimization: Use Savings Plans (66-72% discount) for baseline, On-Demand for variable peaks, Spot (90% discount) for fault-tolerant workloads.

  4. Database Right-Sizing: Monitor utilization, downsize if under 50% CPU/memory. Switch to gp3 storage (20% cheaper than gp2). Use Reserved Instances for 40-60% discount.

  5. Aurora Serverless: Best for variable workloads with long idle periods. Pay per ACU per second. Not always cheaper than provisioned for consistent workloads.

  6. Network Optimization: Use VPC endpoints (free for S3/DynamoDB) to avoid NAT Gateway data processing charges ($0.045/GB). Use CloudFront to reduce data transfer costs.

  7. Data Transfer: Inbound is free. Outbound starts at $0.09/GB. Cross-region is $0.02/GB. Cross-AZ is $0.01/GB. Optimize by keeping traffic within same AZ when possible.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand all S3 storage classes and their pricing
  • I can design S3 lifecycle policies for cost optimization
  • I know when to use Reserved Instances vs Savings Plans
  • I understand Spot Instance use cases and limitations
  • I can right-size RDS instances based on utilization
  • I know when Aurora Serverless is cost-effective
  • I understand data transfer pricing (inbound, outbound, cross-region, cross-AZ)
  • I know how VPC endpoints reduce costs
  • I can calculate cost savings for different optimization strategies

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25
  • Domain 4 Bundle 2: Questions 26-50
  • Full Practice Test 1: Questions 54-65

Expected score: 70%+ to proceed confidently

Quick Reference Card

S3 Storage Classes (by cost):

  • Deep Archive: $0.00099/GB-month (96% cheaper)
  • Glacier Flexible: $0.0036/GB-month (84% cheaper)
  • Glacier Instant: $0.004/GB-month (83% cheaper)
  • One Zone-IA: $0.01/GB-month (57% cheaper)
  • Standard-IA: $0.0125/GB-month (46% cheaper)
  • Standard: $0.023/GB-month (baseline)

EC2 Pricing Discounts:

  • Spot: Up to 90% discount
  • Savings Plans: Up to 72% discount
  • Reserved Instances: Up to 60% discount
  • On-Demand: 0% discount (baseline)

Data Transfer Costs:

  • Inbound: Free
  • Outbound (first 10 TB): $0.09/GB
  • Cross-region: $0.02/GB
  • Cross-AZ: $0.01/GB
  • Same AZ (private IP): Free

Cost Optimization Checklist:

  • Use S3 lifecycle policies for old data
  • Right-size EC2 instances (target 70-80% utilization)
  • Use Savings Plans for baseline compute
  • Use Spot Instances for fault-tolerant workloads
  • Switch EBS from gp2 to gp3
  • Use RDS Reserved Instances for production databases
  • Add VPC endpoints for S3/DynamoDB
  • Use CloudFront for static content delivery
  • Delete old snapshots and unused resources
  • Monitor costs with AWS Cost Explorer

Next Chapter: 06_integration - Integration & Cross-Domain Scenarios


Section 2: Cost-Optimized Compute Solutions

Introduction

The problem: Compute is often the largest AWS cost after storage. Running instances 24/7 when only needed during business hours, using On-Demand pricing for predictable workloads, and over-provisioning instances all waste money.

The solution: AWS provides multiple pricing models (On-Demand, Reserved Instances, Savings Plans, Spot Instances) and instance types optimized for different workloads. Understanding usage patterns and selecting appropriate pricing models can reduce compute costs by 50-90%.

Why it's tested: Compute cost optimization is critical for AWS cost management. This section tests your ability to select appropriate pricing models and instance types for different workload patterns.

Core Concepts

EC2 Pricing Models

What they are: AWS offers four pricing models for EC2 instances, each optimized for different usage patterns and commitment levels.

Why they exist: Different workloads have different characteristics. Production workloads run 24/7 and benefit from commitment discounts. Development workloads run during business hours and benefit from flexible pricing. Batch jobs tolerate interruptions and benefit from spot pricing.

EC2 Pricing Models Comparison:

Pricing Model Discount Commitment Flexibility Interruption Use Case
On-Demand 0% None Full No Variable workloads, short-term
Reserved Instances Up to 72% 1 or 3 years Limited No Steady-state workloads
Savings Plans Up to 72% 1 or 3 years High No Flexible compute usage
Spot Instances Up to 90% None Full Yes (2-min warning) Fault-tolerant workloads

Detailed Example 1: Production Web Application (Reserved Instances)

Scenario: You run a web application on 10 Ɨ m5.large instances (2 vCPUs, 8 GB RAM each) 24/7 for production. Application has been stable for 2 years and will continue for 3+ years.

Option 1: On-Demand Pricing:

  • Cost per instance: $0.096/hour
  • Total cost: 10 instances Ɨ $0.096/hour Ɨ 24 hours Ɨ 365 days = $8,410/year
  • 3-year cost: $25,230

Option 2: 1-Year Standard Reserved Instance (All Upfront):

  • Upfront cost: $561 per instance
  • Hourly cost: $0 (paid upfront)
  • Total cost: 10 instances Ɨ $561 = $5,610/year
  • Savings: $2,800/year (33% discount)
  • 3-year cost: $16,830 (need to renew each year)

Option 3: 3-Year Standard Reserved Instance (All Upfront):

  • Upfront cost: $1,424 per instance
  • Hourly cost: $0 (paid upfront)
  • Total cost: 10 instances Ɨ $1,424 = $14,240 for 3 years
  • Savings: $10,990 over 3 years (44% discount)
  • Annual equivalent: $4,747/year

Option 4: 3-Year Convertible Reserved Instance (All Upfront):

  • Upfront cost: $1,710 per instance
  • Hourly cost: $0 (paid upfront)
  • Total cost: 10 instances Ɨ $1,710 = $17,100 for 3 years
  • Savings: $8,130 over 3 years (32% discount)
  • Benefit: Can change instance type/family during term
  • Annual equivalent: $5,700/year

Recommendation: 3-Year Standard RI (All Upfront) for maximum savings if instance type won't change.

Reserved Instance Payment Options:

  • All Upfront: Highest discount, pay entire amount upfront
  • Partial Upfront: Medium discount, pay ~50% upfront + hourly rate
  • No Upfront: Lowest discount, pay only hourly rate (no upfront payment)

Detailed Example 2: Variable Workload (Savings Plans)

Scenario: You run multiple applications with varying compute needs:

  • Web servers: 5 Ɨ m5.large (always running)
  • Batch processing: 10 Ɨ c5.2xlarge (runs 8 hours/day)
  • Development: 3 Ɨ t3.medium (runs during business hours)

Total Compute Usage:

  • Web: 5 Ɨ $0.096/hour Ɨ 24 hours = $11.52/day
  • Batch: 10 Ɨ $0.34/hour Ɨ 8 hours = $27.20/day
  • Dev: 3 Ɨ $0.0416/hour Ɨ 10 hours = $1.25/day
  • Total: $40/day = $1,200/month = $14,400/year

Option 1: On-Demand (No Commitment):

  • Cost: $14,400/year
  • Flexibility: Full (can change anytime)

Option 2: Reserved Instances (Limited Flexibility):

  • Problem: Need separate RIs for each instance type
  • Complexity: 3 different RI purchases
  • Inflexibility: Can't easily shift between workloads

Option 3: Compute Savings Plan (Recommended):

  • Commitment: $30/day ($900/month, $10,800/year)
  • Discount: 40% on committed amount
  • Savings: $4,320/year (30% overall savings)
  • Flexibility: Applies to any instance family, size, region, OS
  • Overage: $10/day charged at On-Demand rates

How Savings Plans Work:

  1. Commit to $30/day of compute usage
  2. First $30/day gets 40% discount ($18/day actual cost)
  3. Usage above $30/day charged at On-Demand rates
  4. Commitment applies to any EC2, Fargate, or Lambda usage

Detailed Example 3: Batch Processing (Spot Instances)

Scenario: You run nightly batch jobs processing 1,000 files. Each file takes 10 minutes to process. Jobs can be interrupted and restarted without data loss.

Option 1: On-Demand Instances:

  • Instance: c5.4xlarge (16 vCPUs, 32 GB RAM)
  • Cost: $0.68/hour
  • Processing: 6 files/hour (10 min each)
  • Time: 1,000 files Ć· 6 = 167 hours
  • Total cost: 167 hours Ɨ $0.68 = $113.56/day

Option 2: Spot Instances (Recommended):

  • Instance: c5.4xlarge
  • Spot price: $0.068/hour (90% discount)
  • Processing: 6 files/hour
  • Time: 1,000 files Ć· 6 = 167 hours (may take longer due to interruptions)
  • Total cost: 167 hours Ɨ $0.068 = $11.36/day
  • Savings: $102.20/day (90% cheaper)

Handling Spot Interruptions:

  • Spot Instance Interruption Notice: 2-minute warning
  • Strategy: Save progress to S3 every 5 minutes
  • Resume: New spot instance picks up from last checkpoint
  • Result: Minimal wasted work (max 5 minutes lost per interruption)

Spot Fleet Strategy:

  • Diversification: Request multiple instance types (c5.4xlarge, c5.2xlarge, m5.4xlarge)
  • Availability Zones: Spread across multiple AZs
  • Result: Reduces interruption frequency (more capacity pools)

šŸ“Š EC2 Pricing Model Selection Diagram:

graph TD
    A[Select EC2 Pricing Model] --> B{Workload Characteristics?}
    
    B -->|Steady-State 24/7| C{Commitment Length?}
    C -->|3 Years| D[3-Year Reserved Instance<br/>44% discount]
    C -->|1 Year| E[1-Year Reserved Instance<br/>33% discount]
    C -->|Flexible| F[Compute Savings Plan<br/>40% discount]
    
    B -->|Variable Usage| G{Need Flexibility?}
    G -->|Yes| H[Compute Savings Plan<br/>Applies to any instance]
    G -->|No| I[On-Demand<br/>No commitment]
    
    B -->|Fault-Tolerant| J[Spot Instances<br/>Up to 90% discount]
    
    B -->|Short-Term| K[On-Demand<br/>No commitment]
    
    style D fill:#c8e6c9
    style E fill:#c8e6c9
    style F fill:#fff3e0
    style H fill:#fff3e0
    style J fill:#e1f5fe

See: diagrams/05_domain4_ec2_pricing_selection.mmd

Diagram Explanation:
This decision tree helps select the appropriate EC2 pricing model based on workload characteristics. For steady-state 24/7 workloads, use Reserved Instances (3-year for maximum savings, 1-year for shorter commitment) or Compute Savings Plans for flexibility. For variable usage, use Compute Savings Plans if you need flexibility across instance types, or On-Demand if you need no commitment. For fault-tolerant workloads that can handle interruptions, use Spot Instances for up to 90% discount. For short-term or unpredictable workloads, use On-Demand pricing.

⭐ Must Know (EC2 Cost Optimization):

  • Reserved Instances provide up to 72% discount for 1-3 year commitments
  • Savings Plans provide similar discounts with more flexibility (any instance type/region)
  • Spot Instances provide up to 90% discount but can be interrupted with 2-minute notice
  • Use Spot for fault-tolerant workloads (batch processing, data analysis, CI/CD)
  • Compute Optimizer provides right-sizing recommendations based on actual usage
  • Graviton instances (ARM-based) provide 20-40% better price/performance
  • Use Auto Scaling to match capacity to demand (avoid over-provisioning)
  • Stop instances when not needed (dev/test environments during off-hours)

AWS Lambda Cost Optimization

What it is: Lambda charges based on number of requests and duration (GB-seconds). Optimizing memory allocation and execution time directly reduces costs.

Why it matters: Lambda costs can add up quickly with millions of invocations. Understanding the relationship between memory, CPU, and execution time enables cost optimization.

Lambda Pricing:

  • Requests: $0.20 per 1 million requests
  • Duration: $0.0000166667 per GB-second
  • Free Tier: 1 million requests + 400,000 GB-seconds per month

Detailed Example: Lambda Memory Optimization

Scenario: You have a Lambda function that processes images (CPU-intensive). Function runs 10 million times per month.

Option 1: 128 MB Memory:

  • Execution time: 5 seconds
  • CPU: 0.07 vCPU (very slow)
  • Cost per invocation: 5 sec Ɨ 0.128 GB Ɨ $0.0000166667 = $0.0000107
  • Monthly cost: 10M Ɨ $0.0000107 = $107
  • Request cost: 10M Ɨ $0.20/1M = $2
  • Total: $109/month

Option 2: 1,024 MB Memory (Recommended):

  • Execution time: 0.625 seconds (8x faster)
  • CPU: 0.58 vCPU
  • Cost per invocation: 0.625 sec Ɨ 1.024 GB Ɨ $0.0000166667 = $0.0000107
  • Monthly cost: 10M Ɨ $0.0000107 = $107
  • Request cost: $2
  • Total: $109/month
  • Result: Same cost, 8x faster!

Option 3: 1,769 MB Memory (Full vCPU):

  • Execution time: 0.36 seconds (14x faster)
  • CPU: 1.0 vCPU
  • Cost per invocation: 0.36 sec Ɨ 1.769 GB Ɨ $0.0000166667 = $0.0000107
  • Monthly cost: 10M Ɨ $0.0000107 = $107
  • Request cost: $2
  • Total: $109/month
  • Result: Same cost, 14x faster!

Key Insight: For CPU-intensive workloads, increasing memory reduces execution time proportionally, resulting in same cost but better performance.

When Higher Memory Costs More:

  • I/O-bound workloads: Waiting for network/database doesn't use CPU
  • Example: Lambda waits 2 seconds for API response
    • 128 MB: 2 sec Ɨ 0.128 GB = 0.256 GB-sec
    • 1,024 MB: 2 sec Ɨ 1.024 GB = 2.048 GB-sec (8x more expensive)
  • Recommendation: Use minimum memory for I/O-bound workloads

Section 3: Cost-Optimized Database Solutions

Introduction

The problem: Database costs can be significant, especially for production workloads running 24/7. Over-provisioned instances, expensive storage, and inefficient capacity modes waste money.

The solution: AWS provides multiple database pricing models (On-Demand, Reserved Instances, Serverless) and storage options. Understanding workload patterns and selecting appropriate pricing models can reduce database costs by 40-70%.

Why it's tested: Database cost optimization is critical for overall AWS cost management. This section tests your ability to select appropriate database services and pricing models.

Core Concepts

RDS Cost Optimization

What it is: RDS offers Reserved Instances for 1-3 year commitments, providing significant discounts over On-Demand pricing.

RDS Reserved Instance Discounts:

  • 1-Year Standard RI: Up to 40% discount
  • 3-Year Standard RI: Up to 60% discount
  • Payment options: All Upfront, Partial Upfront, No Upfront

Detailed Example: Production Database

Scenario: You run a PostgreSQL database on db.r5.2xlarge (8 vCPUs, 64 GB RAM) 24/7 for production.

Option 1: On-Demand:

  • Cost: $1.008/hour
  • Annual cost: $1.008 Ɨ 24 Ɨ 365 = $8,830/year

Option 2: 1-Year Reserved Instance (All Upfront):

  • Upfront cost: $5,300
  • Hourly cost: $0
  • Annual cost: $5,300
  • Savings: $3,530/year (40% discount)

Option 3: 3-Year Reserved Instance (All Upfront):

  • Upfront cost: $12,700 (for 3 years)
  • Hourly cost: $0
  • Annual equivalent: $4,233/year
  • Savings: $4,597/year (52% discount)

Aurora Serverless Cost Optimization

What it is: Aurora Serverless automatically scales database capacity based on application demand. You pay only for the capacity used (measured in Aurora Capacity Units - ACUs).

Why it exists: Traditional databases require provisioning fixed capacity, resulting in over-provisioning for peak load. Aurora Serverless scales automatically, reducing costs for variable workloads.

Aurora Serverless v2 Pricing:

  • ACU: Aurora Capacity Unit (2 GB RAM, equivalent CPU/network)
  • Cost: $0.12 per ACU-hour
  • Scaling: 0.5 ACU minimum, 128 ACU maximum
  • Scaling speed: Instant (sub-second)

Detailed Example: Development Database

Scenario: You have a development database used during business hours (8 AM - 6 PM, Monday-Friday). Peak usage requires 8 ACUs, idle usage requires 0.5 ACUs.

Option 1: RDS db.r5.large (Provisioned):

  • Capacity: 2 vCPUs, 16 GB RAM (always running)
  • Cost: $0.252/hour Ɨ 24 hours Ɨ 365 days = $2,207/year
  • Utilization: 25% (only used 50 hours/week out of 168 hours)

Option 2: Aurora Serverless v2 (Recommended):

  • Business hours (50 hours/week): 8 ACUs Ɨ $0.12 = $0.96/hour
  • Off hours (118 hours/week): 0.5 ACUs Ɨ $0.12 = $0.06/hour
  • Weekly cost: (50 Ɨ $0.96) + (118 Ɨ $0.06) = $48 + $7.08 = $55.08
  • Annual cost: $55.08 Ɨ 52 = $2,864/year
  • Wait, that's more expensive!

Option 3: Aurora Serverless v2 with Pause (Best):

  • Business hours (50 hours/week): 8 ACUs Ɨ $0.12 = $0.96/hour
  • Off hours: Pause database (0 cost)
  • Weekly cost: 50 Ɨ $0.96 = $48
  • Annual cost: $48 Ɨ 52 = $2,496/year
  • Savings: $2,207 - $2,496 = -$289 (actually more expensive)

Correct Analysis:

  • Aurora Serverless is cost-effective when workload is highly variable
  • For predictable workloads (business hours only), stopping RDS instances is cheaper
  • Better option: RDS with scheduled stop/start
    • Run only 50 hours/week
    • Cost: $0.252 Ɨ 50 Ɨ 52 = $655/year
    • Savings: $1,552/year (70% cheaper)

When Aurora Serverless Makes Sense:

  • āœ… Unpredictable workload (traffic spikes)
  • āœ… Infrequent usage (few times per day)
  • āœ… New applications (unknown capacity needs)
  • āŒ Steady-state workloads (use RDS Reserved Instances)
  • āŒ Predictable schedules (use RDS with stop/start)

DynamoDB Cost Optimization

What it is: DynamoDB offers two capacity modes: On-Demand (pay per request) and Provisioned (pay for reserved capacity).

Capacity Modes Comparison:

Mode Pricing Scaling Use Case
On-Demand $1.25 per million writes, $0.25 per million reads Automatic Unpredictable traffic
Provisioned $0.00065/hour per WCU, $0.00013/hour per RCU Manual or Auto Scaling Predictable traffic

Detailed Example: E-Commerce Product Catalog

Scenario: Product catalog with 1 million reads/day and 10,000 writes/day.

Option 1: On-Demand:

  • Reads: 1M Ɨ $0.25/1M = $0.25/day
  • Writes: 10K Ɨ $1.25/1M = $0.0125/day
  • Total: $0.2625/day = $7.88/month

Option 2: Provisioned Capacity:

  • Reads: 1M/day Ć· 86,400 sec = 12 reads/sec = 12 RCU
  • Writes: 10K/day Ć· 86,400 sec = 0.12 writes/sec = 1 WCU
  • RCU cost: 12 Ɨ $0.00013/hour Ɨ 24 Ɨ 30 = $1.12/month
  • WCU cost: 1 Ɨ $0.00065/hour Ɨ 24 Ɨ 30 = $0.47/month
  • Total: $1.59/month
  • Savings: $6.29/month (80% cheaper)

Break-Even Analysis:

  • On-Demand: Good for <1M requests/month
  • Provisioned: Good for >1M requests/month
  • Rule of thumb: If traffic is predictable and >1M requests/month, use Provisioned

⭐ Must Know (Database Cost Optimization):

  • Use RDS Reserved Instances for production databases (40-60% discount)
  • Use Aurora Serverless for unpredictable or infrequent workloads
  • Stop RDS instances when not needed (dev/test environments)
  • Use DynamoDB Provisioned Capacity for predictable traffic (80% cheaper)
  • Use DynamoDB On-Demand for unpredictable traffic (no capacity planning)
  • Use read replicas to offload read traffic (cheaper than scaling primary)
  • Use Aurora for high-traffic applications (better price/performance than RDS)
  • Delete old database snapshots (storage costs add up)

Section 4: Cost-Optimized Network Architectures

Introduction

The problem: Data transfer costs can be significant, especially for applications with high traffic or multi-region architectures. Inefficient routing, unnecessary data transfer, and not using VPC endpoints waste money.

The solution: AWS provides multiple networking options to optimize costs. VPC endpoints eliminate data transfer charges for AWS services. CloudFront reduces origin requests. Proper network design minimizes cross-region and cross-AZ data transfer.

Why it's tested: Network costs are often overlooked but can be substantial. This section tests your ability to design cost-optimized network architectures.

Core Concepts

Data Transfer Costs

What they are: AWS charges for data transfer between regions, between AZs, and out to the internet. Understanding these costs is critical for cost optimization.

Data Transfer Pricing (simplified):

  • Inbound to AWS: Free
  • Within same AZ (private IP): Free
  • Between AZs (same region): $0.01/GB each direction
  • Between regions: $0.02/GB
  • Out to internet: $0.09/GB (first 10 TB)

Detailed Example 1: Multi-AZ Application

Scenario: You have a web application with EC2 instances in multiple AZs for high availability. Application transfers 1 TB/day between AZs.

Cost Analysis:

  • Data transfer: 1 TB/day Ɨ $0.01/GB Ɨ 1,024 GB/TB = $10.24/day
  • Monthly cost: $10.24 Ɨ 30 = $307/month
  • Annual cost: $3,686/year

Optimization Strategy:

  • Use private IPs: Ensure instances communicate via private IPs (not public)
  • Minimize cross-AZ traffic: Cache data locally, use read replicas in same AZ
  • Result: Reduce cross-AZ traffic by 80% = $2,949/year savings

VPC Endpoints Cost Optimization

What they are: VPC endpoints enable private connectivity to AWS services without using internet gateway, NAT gateway, or VPN. This eliminates data transfer charges and improves security.

VPC Endpoint Types:

  • Gateway Endpoints: Free (S3, DynamoDB)
  • Interface Endpoints: $0.01/hour per AZ + $0.01/GB data processed

Detailed Example: S3 Access from EC2

Scenario: You have 100 EC2 instances accessing S3. Each instance downloads 10 GB/day from S3.

Option 1: NAT Gateway (Without VPC Endpoint):

  • Data transfer: 100 instances Ɨ 10 GB/day = 1,000 GB/day
  • NAT Gateway cost: $0.045/hour Ɨ 24 Ɨ 30 = $32.40/month
  • Data processing: 1,000 GB/day Ɨ 30 days Ɨ $0.045/GB = $1,350/month
  • S3 data transfer: Free (within same region)
  • Total: $1,382.40/month

Option 2: S3 Gateway Endpoint (Recommended):

  • VPC Endpoint cost: Free
  • Data transfer: Free (private connection)
  • Total: $0/month
  • Savings: $1,382.40/month (100% savings)

When to Use VPC Endpoints:

  • āœ… Always use Gateway Endpoints for S3 and DynamoDB (free)
  • āœ… Use Interface Endpoints for other services if traffic is high
  • āœ… Eliminates NAT Gateway costs for AWS service access
  • āœ… Improves security (traffic stays within AWS network)

CloudFront Cost Optimization

What it is: CloudFront caches content at edge locations, reducing origin requests and data transfer costs.

Detailed Example: Static Website

Scenario: You have a static website hosted on S3 with 10 TB/month data transfer to users worldwide.

Option 1: Direct S3 Access:

  • Data transfer out: 10 TB Ɨ $0.09/GB Ɨ 1,024 GB/TB = $921.60/month
  • S3 requests: 10M requests Ɨ $0.0004/1K = $4/month
  • Total: $925.60/month

Option 2: CloudFront (80% cache hit rate):

  • Origin requests: 20% Ɨ 10M = 2M requests
  • S3 cost: (2M Ɨ $0.0004/1K) + (2 TB Ɨ $0.09/GB Ɨ 1,024) = $0.80 + $184.32 = $185.12
  • CloudFront data transfer: 10 TB Ɨ $0.085/GB Ɨ 1,024 = $870.40
  • CloudFront requests: 10M Ɨ $0.0075/10K = $7.50
  • Total: $1,063.02/month
  • Result: 15% more expensive but 3-10x faster for users

Optimization: Regional Edge Caches:

  • CloudFront automatically uses Regional Edge Caches
  • Reduces origin requests further (90% cache hit rate)
  • New origin requests: 10% Ɨ 10M = 1M requests
  • New S3 cost: $92.56
  • New total: $970.46/month
  • Result: 5% more expensive but much better performance

When CloudFront Saves Money:

  • āœ… High traffic from multiple regions (reduces cross-region transfer)
  • āœ… Frequently accessed content (high cache hit ratio)
  • āœ… Dynamic content with caching (API responses, personalized content)
  • āŒ Infrequently accessed content (low cache hit ratio)
  • āŒ Single-region traffic (no cross-region savings)

⭐ Must Know (Network Cost Optimization):

  • Use VPC Gateway Endpoints for S3 and DynamoDB (free, eliminates NAT costs)
  • Minimize cross-AZ data transfer (use private IPs, cache locally)
  • Minimize cross-region data transfer (use CloudFront, regional replicas)
  • Use CloudFront for global content delivery (reduces origin requests)
  • Data transfer within same AZ using private IPs is free
  • Data transfer out to internet is most expensive ($0.09/GB)
  • Use Direct Connect for high-volume data transfer (cheaper than internet)
  • Monitor data transfer costs with Cost Explorer (often overlooked)

Chapter Summary

What We Covered

āœ… Section 1: Cost-Optimized Storage Solutions

  • S3 storage classes and lifecycle policies
  • Intelligent-Tiering for automatic optimization
  • Glacier for long-term archival (96% cheaper)
  • EBS volume type selection (gp3 vs gp2)

āœ… Section 2: Cost-Optimized Compute Solutions

  • EC2 pricing models (On-Demand, Reserved, Savings Plans, Spot)
  • Reserved Instances for steady-state workloads (up to 72% discount)
  • Spot Instances for fault-tolerant workloads (up to 90% discount)
  • Lambda memory optimization for CPU-intensive workloads

āœ… Section 3: Cost-Optimized Database Solutions

  • RDS Reserved Instances for production (40-60% discount)
  • Aurora Serverless for variable workloads
  • DynamoDB capacity modes (On-Demand vs Provisioned)
  • Database right-sizing and stop/start strategies

āœ… Section 4: Cost-Optimized Network Architectures

  • Data transfer costs (cross-AZ, cross-region, internet)
  • VPC endpoints to eliminate NAT Gateway costs
  • CloudFront for global content delivery
  • Network design to minimize data transfer

Critical Takeaways

  1. S3 Lifecycle: Transition infrequently accessed data to cheaper storage classes (Standard-IA, Glacier). Use Intelligent-Tiering for unknown access patterns.

  2. EC2 Pricing: Use Reserved Instances or Savings Plans for steady-state workloads (40-72% discount). Use Spot for fault-tolerant workloads (up to 90% discount).

  3. Right-Sizing: Use Compute Optimizer to identify over-provisioned instances. Target 70-80% utilization. Stop instances when not needed.

  4. Database Optimization: Use RDS Reserved Instances for production databases. Use Aurora Serverless for variable workloads. Use DynamoDB Provisioned Capacity for predictable traffic.

  5. VPC Endpoints: Always use Gateway Endpoints for S3 and DynamoDB (free). Eliminates NAT Gateway costs and improves security.

  6. Data Transfer: Minimize cross-AZ and cross-region data transfer. Use private IPs within same AZ (free). Use CloudFront for global content delivery.

  7. Cost Monitoring: Use AWS Cost Explorer to identify cost trends. Set up billing alerts. Use cost allocation tags to track costs by project/team.

  8. Quick Wins: Switch EBS from gp2 to gp3 (20% cheaper). Delete old snapshots. Use S3 lifecycle policies. Add VPC endpoints for S3/DynamoDB.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand S3 storage classes and when to use each
  • I know how to create S3 lifecycle policies
  • I can explain the difference between Reserved Instances and Savings Plans
  • I understand when to use Spot Instances
  • I know how Lambda memory affects cost
  • I can calculate cost savings for different EC2 pricing models
  • I understand when to use Aurora Serverless vs RDS
  • I know the difference between DynamoDB On-Demand and Provisioned
  • I understand data transfer costs (cross-AZ, cross-region, internet)
  • I know when to use VPC endpoints
  • I can explain how CloudFront reduces costs
  • I understand cost optimization strategies for each service

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Storage and compute)
  • Domain 4 Bundle 2: Questions 26-50 (Database and network)
  • Full Practice Test 1: Questions 54-65 (Domain 4 questions)

Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • S3 storage class selection criteria
    • EC2 pricing model comparison
    • Reserved Instance vs Savings Plan differences
    • Spot Instance use cases
    • Database pricing optimization
    • Data transfer cost minimization

Quick Reference Card

S3 Storage Classes (by cost):

  • Deep Archive: $0.00099/GB-month (96% cheaper, 12-48 hour retrieval)
  • Glacier Flexible: $0.0036/GB-month (84% cheaper, 3-5 hour retrieval)
  • Glacier Instant: $0.004/GB-month (83% cheaper, instant retrieval)
  • One Zone-IA: $0.01/GB-month (57% cheaper, single AZ)
  • Standard-IA: $0.0125/GB-month (46% cheaper, infrequent access)
  • Standard: $0.023/GB-month (baseline, frequent access)

EC2 Pricing Discounts:

  • Spot: Up to 90% discount (can be interrupted)
  • 3-Year RI: Up to 72% discount (3-year commitment)
  • 1-Year RI: Up to 40% discount (1-year commitment)
  • Savings Plans: Up to 72% discount (flexible)
  • On-Demand: 0% discount (no commitment)

Data Transfer Costs:

  • Inbound: Free
  • Same AZ (private IP): Free
  • Cross-AZ: $0.01/GB
  • Cross-region: $0.02/GB
  • Out to internet: $0.09/GB (first 10 TB)

Cost Optimization Checklist:

  • Use S3 lifecycle policies for old data
  • Switch EBS from gp2 to gp3 (20% cheaper)
  • Use Reserved Instances for steady-state workloads
  • Use Spot Instances for fault-tolerant workloads
  • Right-size EC2 instances (target 70-80% utilization)
  • Stop instances when not needed (dev/test)
  • Use RDS Reserved Instances for production databases
  • Add VPC endpoints for S3/DynamoDB (eliminates NAT costs)
  • Use CloudFront for global content delivery
  • Delete old snapshots and unused resources
  • Set up billing alerts and cost allocation tags
  • Review Cost Explorer monthly for optimization opportunities

Next Chapter: 06_integration - Integration & Cross-Domain Scenarios


Chapter Summary

What We Covered

This chapter covered Domain 4: Design Cost-Optimized Architectures (20% of the exam). We explored four major task areas:

  • āœ… Task 4.1 - Cost-Optimized Storage Solutions: S3 lifecycle policies, storage class selection, EBS optimization, backup strategies, data transfer cost management
  • āœ… Task 4.2 - Cost-Optimized Compute Solutions: EC2 pricing models (On-Demand, Reserved, Savings Plans, Spot), right-sizing, Auto Scaling for cost efficiency, Lambda optimization
  • āœ… Task 4.3 - Cost-Optimized Database Solutions: RDS Reserved Instances, Aurora Serverless, DynamoDB capacity modes, caching to reduce database load, backup retention policies
  • āœ… Task 4.4 - Cost-Optimized Network Architectures: Data transfer costs, NAT Gateway optimization, VPC endpoints, CloudFront cost savings, Direct Connect vs. VPN

Critical Takeaways

  1. Storage Lifecycle Management Saves Money: Implement S3 lifecycle policies to automatically transition objects to cheaper storage classes (S3-IA, Glacier, Deep Archive) based on access patterns.

  2. Compute Pricing Models Matter: Use Reserved Instances or Savings Plans for steady-state workloads (up to 72% savings), Spot Instances for fault-tolerant workloads (up to 90% savings), and On-Demand for unpredictable workloads.

  3. Right-Sizing is Continuous: Use AWS Compute Optimizer and Cost Explorer to identify underutilized resources. Downsize or terminate idle resources regularly.

  4. Data Transfer Costs Add Up: Keep data within the same Region when possible, use VPC endpoints to avoid internet data transfer charges, and leverage CloudFront for content delivery.

  5. Serverless Can Be Cost-Effective: Lambda charges only for execution time, Aurora Serverless scales to zero when not in use, and DynamoDB On-Demand eliminates capacity planning.

  6. Monitoring and Budgets Prevent Surprises: Set up AWS Budgets with alerts, use Cost Allocation Tags for granular tracking, and review Cost Explorer regularly.

  7. Reserved Capacity Requires Planning: Commit to 1-year or 3-year terms for Reserved Instances, Savings Plans, or Reserved Capacity only after analyzing usage patterns.

Self-Assessment Checklist

Test yourself before moving to integration topics. You should be able to:

Cost-Optimized Storage:

  • Design S3 lifecycle policies to transition objects between storage classes
  • Choose appropriate S3 storage class based on access frequency and retrieval time
  • Optimize EBS volumes by selecting appropriate volume types (gp3 vs. gp2)
  • Implement EBS snapshot lifecycle policies to reduce backup costs
  • Use S3 Intelligent-Tiering for unpredictable access patterns
  • Calculate data transfer costs between Regions and to internet
  • Implement S3 Requester Pays for cost sharing

Cost-Optimized Compute:

  • Choose between On-Demand, Reserved Instances, Savings Plans, and Spot Instances
  • Calculate savings from Reserved Instances (Standard vs. Convertible)
  • Implement Spot Instances for fault-tolerant workloads
  • Use Auto Scaling to match capacity with demand
  • Right-size EC2 instances using Compute Optimizer recommendations
  • Optimize Lambda costs by adjusting memory and timeout settings
  • Choose between EC2 and Fargate based on cost and operational overhead

Cost-Optimized Databases:

  • Purchase RDS Reserved Instances for steady-state workloads
  • Use Aurora Serverless for variable workloads
  • Choose between DynamoDB On-Demand and Provisioned capacity
  • Implement caching with ElastiCache to reduce database load
  • Optimize backup retention periods to balance cost and compliance
  • Use read replicas to offload read traffic from primary database
  • Configure database auto-scaling to match demand

Cost-Optimized Networks:

  • Minimize data transfer costs by keeping traffic within same Region
  • Use VPC endpoints to avoid NAT Gateway and internet data transfer charges
  • Choose between NAT Gateway and NAT instance based on cost
  • Implement CloudFront to reduce origin data transfer costs
  • Calculate Direct Connect vs. VPN costs for hybrid connectivity
  • Optimize load balancer costs by choosing appropriate type (ALB vs. NLB)
  • Use Transit Gateway for hub-and-spoke network topology

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-50 (storage and compute cost optimization)
  • Domain 4 Bundle 2: Questions 1-50 (database and network cost optimization)
  • Full Practice Test 1-3: Questions covering all domains with cost considerations

Expected Score: 75%+ to proceed

If you scored below 75%:

  • Storage costs weak: Review S3 lifecycle policies, storage class selection, data transfer costs
  • Compute costs weak: Review pricing models (Reserved, Savings Plans, Spot), right-sizing
  • Database costs weak: Review Reserved Instances, Aurora Serverless, DynamoDB capacity modes
  • Network costs weak: Review data transfer costs, VPC endpoints, NAT Gateway optimization
  • Revisit diagrams: S3 lifecycle, EC2 pricing comparison, cost optimization workflow

Common Exam Traps

Watch out for these in Domain 4 questions:

  1. Reserved Instance Types: Standard (highest discount, no flexibility) vs. Convertible (lower discount, can change instance family)
  2. Savings Plans: Compute Savings Plans (most flexible) vs. EC2 Instance Savings Plans (highest discount)
  3. S3 Storage Classes: Glacier retrieval times: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours)
  4. Data Transfer Costs: Inbound is free, outbound to internet is charged, inter-AZ is charged
  5. NAT Gateway vs. NAT Instance: NAT Gateway is managed but more expensive; NAT instance is cheaper but requires management
  6. DynamoDB Capacity: On-Demand is more expensive per request but no capacity planning; Provisioned is cheaper with predictable workloads
  7. Spot Instance Interruption: Can be terminated with 2-minute warning; use for fault-tolerant workloads only

Quick Reference Card

S3 Storage Classes (Cost from Lowest to Highest Access):

  1. S3 Glacier Deep Archive: $0.00099/GB/month, 12-hour retrieval
  2. S3 Glacier Flexible Retrieval: $0.0036/GB/month, 1-5 hour retrieval
  3. S3 Glacier Instant Retrieval: $0.004/GB/month, millisecond retrieval
  4. S3 Intelligent-Tiering: $0.0025/GB/month + monitoring fee, automatic tiering
  5. S3 Standard-IA: $0.0125/GB/month, millisecond access, 30-day minimum
  6. S3 One Zone-IA: $0.01/GB/month, single AZ, 30-day minimum
  7. S3 Standard: $0.023/GB/month, frequent access

EC2 Pricing Models (Savings):

  • On-Demand: No commitment, highest cost (baseline)
  • Savings Plans: 1 or 3 years, up to 72% savings, flexible
  • Reserved Instances: 1 or 3 years, up to 72% savings, less flexible
  • Spot Instances: Unused capacity, up to 90% savings, can be interrupted

Database Cost Optimization:

  • RDS Reserved Instances: 1 or 3 years, up to 69% savings
  • Aurora Serverless: Pay per ACU-hour, scales to zero
  • DynamoDB On-Demand: $1.25 per million writes, $0.25 per million reads
  • DynamoDB Provisioned: $0.00065 per WCU-hour, $0.00013 per RCU-hour
  • ElastiCache Reserved Nodes: 1 or 3 years, up to 55% savings

Data Transfer Costs:

  • Inbound: Free
  • Outbound to Internet: $0.09/GB (first 10 TB)
  • Inter-Region: $0.02/GB
  • Inter-AZ: $0.01/GB (in and out)
  • Same AZ: Free
  • VPC Endpoint: $0.01/GB processed

Decision Frameworks

When to use which EC2 pricing:

  • On-Demand: Unpredictable workloads, short-term, development/testing
  • Reserved Instances: Steady-state workloads, 1-3 year commitment, highest savings
  • Savings Plans: Flexible workloads, can change instance family/region
  • Spot Instances: Fault-tolerant, flexible start/end times, batch processing

When to use which S3 storage class:

  • Standard: Frequently accessed data, low latency required
  • Intelligent-Tiering: Unpredictable access patterns, automatic optimization
  • Standard-IA: Infrequently accessed, millisecond access needed
  • One Zone-IA: Non-critical data, infrequent access, cost-sensitive
  • Glacier Instant Retrieval: Archive with immediate access needs
  • Glacier Flexible Retrieval: Archive with 1-5 hour retrieval acceptable
  • Glacier Deep Archive: Long-term archive, 12-hour retrieval acceptable

When to use which database pricing:

  • RDS On-Demand: Variable workloads, short-term, development
  • RDS Reserved: Steady-state production workloads, 1-3 year commitment
  • Aurora Serverless: Variable workloads, infrequent usage, scales to zero
  • DynamoDB On-Demand: Unpredictable traffic, new applications
  • DynamoDB Provisioned: Predictable traffic, cost-sensitive, can forecast capacity

Cost Optimization Best Practices

Immediate Actions:

  1. Delete unused resources (idle EC2, unattached EBS, old snapshots)
  2. Right-size over-provisioned instances
  3. Implement S3 lifecycle policies
  4. Enable S3 Intelligent-Tiering for unknown access patterns
  5. Use gp3 instead of gp2 for EBS volumes

Short-Term Actions (1-3 months):

  1. Analyze usage patterns with Cost Explorer
  2. Purchase Reserved Instances or Savings Plans for steady workloads
  3. Implement Auto Scaling for variable workloads
  4. Use Spot Instances for fault-tolerant workloads
  5. Set up AWS Budgets with alerts

Long-Term Actions (3-12 months):

  1. Implement cost allocation tags for chargeback
  2. Use AWS Organizations for consolidated billing
  3. Regularly review and optimize architectures
  4. Implement FinOps practices and culture
  5. Use AWS Trusted Advisor for ongoing recommendations

Integration with Other Domains

Cost optimization concepts from Domain 4 integrate with:

  • Domain 1 (Secure Architectures): Balance security controls with cost (e.g., Shield Advanced)
  • Domain 2 (Resilient Architectures): Use Spot Instances for fault-tolerant workloads
  • Domain 3 (High-Performing Architectures): Balance performance with cost (right-sizing)

Cost Monitoring Tools

AWS Cost Management Services:

  • Cost Explorer: Visualize and analyze costs, identify trends
  • AWS Budgets: Set custom budgets, receive alerts
  • Cost and Usage Report: Detailed billing data, integrate with analytics tools
  • Cost Allocation Tags: Track costs by project, team, environment
  • Compute Optimizer: Right-sizing recommendations for EC2, Lambda, EBS
  • Trusted Advisor: Cost optimization checks (part of Business/Enterprise Support)

Key Cost Metrics

Storage Costs:

  • S3 Standard: $0.023/GB/month
  • S3 Glacier Deep Archive: $0.00099/GB/month (96% cheaper)
  • EBS gp3: $0.08/GB/month
  • EBS io2: $0.125/GB/month + $0.065/IOPS/month

Compute Costs (example: m5.large):

  • On-Demand: $0.096/hour
  • 1-year Reserved (All Upfront): $0.058/hour (40% savings)
  • 3-year Reserved (All Upfront): $0.035/hour (64% savings)
  • Spot: $0.029/hour (70% savings, variable)

Data Transfer Costs:

  • Outbound to Internet: $0.09/GB (first 10 TB)
  • CloudFront to Internet: $0.085/GB (first 10 TB)
  • Inter-Region: $0.02/GB
  • NAT Gateway: $0.045/hour + $0.045/GB processed

Next Steps

You've now completed all four exam domains! Next, move to:

Chapter 6: Integration & Advanced Topics - Learn how to combine concepts from all domains in complex, real-world scenarios.

After that:

  • Chapter 7: Study Strategies - Test-taking techniques and study methods
  • Chapter 8: Final Checklist - Last-week preparation guide
  • Appendices - Quick reference tables and glossary

Chapter 4 Complete āœ… | Next: Chapter 5 - Integration & Advanced Topics


Chapter Summary

What We Covered

  • āœ… Cost-Optimized Storage Solutions
    • S3 storage classes and lifecycle policies
    • EBS volume optimization (gp3 vs gp2, right-sizing)
    • EFS Infrequent Access
    • Glacier and Deep Archive for long-term storage
  • āœ… Cost-Optimized Compute Solutions
    • EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
    • Right-sizing instances
    • Auto Scaling for cost efficiency
    • Lambda cost optimization
    • Fargate Spot
  • āœ… Cost-Optimized Database Solutions
    • RDS Reserved Instances
    • Aurora Serverless v2
    • DynamoDB capacity modes
    • ElastiCache Reserved Nodes
  • āœ… Cost-Optimized Network Architectures
    • Data transfer cost optimization
    • NAT Gateway vs NAT Instance
    • VPC Endpoints to avoid data transfer charges
    • CloudFront for reduced origin costs
    • Direct Connect for predictable costs

Critical Takeaways

  1. Storage Lifecycle: Use S3 Intelligent-Tiering for automatic cost optimization, transition to Glacier for archives (90% cheaper), use gp3 instead of gp2 (20% cheaper)
  2. Compute Savings: Reserved Instances save 40-60%, Spot Instances save 70-90%, Savings Plans offer flexibility, right-size instances to avoid over-provisioning
  3. Database Cost Control: Aurora Serverless v2 for variable workloads, DynamoDB On-Demand for unpredictable traffic, Reserved capacity for steady-state, use read replicas instead of larger instances
  4. Network Cost Reduction: Use VPC Endpoints to avoid NAT Gateway charges ($0.045/GB), CloudFront to reduce data transfer costs, keep traffic within same AZ when possible
  5. Cost Monitoring: Use Cost Explorer for analysis, AWS Budgets for alerts, Cost Allocation Tags for tracking, Trusted Advisor for recommendations

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain S3 storage classes and when to use each
  • I understand EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • I know how to optimize EBS costs (gp3, right-sizing, snapshots)
  • I can calculate savings from Reserved Instances vs On-Demand
  • I understand when to use Spot Instances and how to handle interruptions
  • I know the difference between Compute Savings Plans and EC2 Savings Plans
  • I can explain DynamoDB capacity modes and cost implications
  • I understand data transfer costs and how to minimize them
  • I know when to use NAT Gateway vs NAT Instance
  • I can design a cost-optimized architecture using multiple strategies

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Storage and compute costs)
  • Domain 4 Bundle 2: Questions 1-25 (Database and network costs)
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: S3 lifecycle policies, EC2 pricing models, Data transfer costs
  • Focus on: Understanding cost implications of architectural decisions

Quick Reference Card

Storage Cost Optimization:

  • S3 Standard: $0.023/GB/month (frequent access)
  • S3 Intelligent-Tiering: Auto-optimize, $0.023-$0.0125/GB
  • S3 Standard-IA: $0.0125/GB (infrequent access, 30-day min)
  • S3 One Zone-IA: $0.01/GB (single AZ, 30-day min)
  • S3 Glacier Instant: $0.004/GB (millisecond retrieval)
  • S3 Glacier Flexible: $0.0036/GB (minutes-hours retrieval)
  • S3 Glacier Deep Archive: $0.00099/GB (12-hour retrieval, 96% savings)
  • EBS gp3: $0.08/GB (20% cheaper than gp2)
  • EBS gp2: $0.10/GB

Compute Cost Optimization:

  • On-Demand: Pay per hour, no commitment, highest cost
  • Reserved (1-year): 40% savings, upfront payment
  • Reserved (3-year): 60% savings, upfront payment
  • Spot: 70-90% savings, can be interrupted
  • Savings Plans: Flexible, 1-3 year commitment
  • Lambda: $0.20 per 1M requests + $0.0000166667/GB-second

Database Cost Optimization:

  • RDS On-Demand: Pay per hour
  • RDS Reserved: 40-60% savings
  • Aurora Serverless v2: Pay per ACU-hour, auto-scaling
  • DynamoDB On-Demand: $1.25/million writes, $0.25/million reads
  • DynamoDB Provisioned: $0.00065/WCU/hour, $0.00013/RCU/hour
  • DynamoDB Reserved: 50-75% savings

Network Cost Optimization:

  • Data Transfer Out: $0.09/GB (first 10 TB)
  • CloudFront: $0.085/GB (cheaper than direct S3)
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • VPC Endpoint: $0.01/hour + $0.01/GB (saves NAT costs)
  • Inter-AZ: $0.01/GB (keep traffic in same AZ when possible)
  • Inter-Region: $0.02/GB

Cost Optimization Strategies:

  1. Right-size: Use Compute Optimizer recommendations
  2. Reserved capacity: For steady-state workloads (40-60% savings)
  3. Spot Instances: For fault-tolerant workloads (70-90% savings)
  4. Auto Scaling: Scale down during low usage
  5. S3 Lifecycle: Transition to cheaper storage classes
  6. VPC Endpoints: Avoid NAT Gateway data transfer charges
  7. CloudFront: Reduce origin data transfer costs
  8. Delete unused resources: Snapshots, volumes, load balancers

Decision Points:

  • Steady workload? → Reserved Instances or Savings Plans
  • Variable workload? → On-Demand or Spot
  • Fault-tolerant? → Spot Instances (70-90% savings)
  • Infrequent access? → S3 IA or Glacier
  • Archive data? → Glacier Deep Archive (96% savings)
  • Unpredictable database traffic? → DynamoDB On-Demand or Aurora Serverless
  • High data transfer? → CloudFront or VPC Endpoints


Chapter Summary

What We Covered

This chapter covered Domain 4: Design Cost-Optimized Architectures (20% of the exam). We explored four major task areas:

āœ… Task 4.1: Design Cost-Optimized Storage Solutions

  • S3 storage classes and lifecycle policies
  • EBS volume optimization and snapshot management
  • Storage tiering strategies (hot, warm, cold, archive)
  • Hybrid storage with Storage Gateway
  • Data transfer cost optimization

āœ… Task 4.2: Design Cost-Optimized Compute Solutions

  • EC2 purchasing options: On-Demand, Reserved, Spot, Savings Plans
  • Right-sizing instances with Compute Optimizer
  • Auto Scaling for cost efficiency
  • Serverless computing with Lambda and Fargate
  • Container optimization strategies

āœ… Task 4.3: Design Cost-Optimized Database Solutions

  • Database engine selection (RDS vs Aurora vs DynamoDB)
  • Aurora Serverless for variable workloads
  • DynamoDB on-demand vs provisioned capacity
  • Read replicas for cost-effective scaling
  • Database backup and retention optimization

āœ… Task 4.4: Design Cost-Optimized Network Architectures

  • Data transfer cost optimization
  • NAT Gateway vs NAT instance cost comparison
  • VPC endpoints to reduce data transfer costs
  • CloudFront for reduced origin data transfer
  • Direct Connect vs VPN cost analysis

Critical Takeaways

  1. Right-sizing is the #1 cost saver: Use Compute Optimizer to identify over-provisioned resources. Downsize instances that are consistently under 40% utilization.

  2. Reserved capacity for steady workloads: 40-60% savings with Reserved Instances or Savings Plans. Commit to 1 or 3 years for predictable workloads.

  3. Spot Instances for fault-tolerant workloads: 70-90% savings for batch processing, data analysis, containerized workloads. Not for databases or stateful applications.

  4. S3 lifecycle policies automate cost savings: Transition to IA after 30 days, Glacier after 90 days, Deep Archive after 180 days. Delete after retention period.

  5. Serverless reduces idle costs: Lambda and Fargate charge only for actual usage. No cost when idle. Perfect for variable or unpredictable workloads.

  6. Data transfer costs add up: Keep traffic within same AZ when possible ($0 vs $0.01/GB). Use VPC endpoints to avoid NAT Gateway charges. Use CloudFront to reduce origin data transfer.

  7. Delete unused resources: Unattached EBS volumes, old snapshots, unused load balancers, idle RDS instances. Set up AWS Budgets alerts to catch waste.

  8. Aurora Serverless for variable databases: Pay per second, auto-scales, pauses when idle. Perfect for dev/test, infrequent workloads, unpredictable traffic.

  9. DynamoDB on-demand for unpredictable traffic: No capacity planning, pay per request. Switch to provisioned when traffic becomes predictable for 20-30% savings.

  10. Monitor and optimize continuously: Use Cost Explorer to identify trends, Trusted Advisor for recommendations, AWS Budgets for alerts. Cost optimization is ongoing.

Key Services Quick Reference

Cost Management Tools:

  • Cost Explorer: Visualize and analyze costs, identify trends, forecast spending
  • AWS Budgets: Set custom budgets, receive alerts when exceeding thresholds
  • Cost and Usage Report: Detailed billing data, integrate with Athena/QuickSight
  • Compute Optimizer: ML-based recommendations for right-sizing EC2, Lambda, EBS
  • Trusted Advisor: Best practice checks, cost optimization recommendations
  • Cost Allocation Tags: Track costs by project, team, environment

Storage Cost Optimization:

  • S3 Standard: $0.023/GB, frequent access
  • S3 Intelligent-Tiering: Automatic cost optimization, $0.023-$0.004/GB
  • S3 IA: $0.0125/GB, infrequent access (>30 days)
  • S3 Glacier: $0.004/GB, archive (>90 days)
  • S3 Glacier Deep Archive: $0.00099/GB, long-term archive (>180 days)
  • EBS gp3: $0.08/GB, cost-effective general purpose
  • EBS Snapshots: Incremental, compress, delete old snapshots

Compute Cost Optimization:

  • On-Demand: $0.096/hour (t3.medium), pay as you go, no commitment
  • Reserved Instances: 40-60% savings, 1 or 3 year commitment
  • Savings Plans: 40-60% savings, flexible across instance families
  • Spot Instances: 70-90% savings, interruptible, fault-tolerant workloads
  • Lambda: $0.20 per 1M requests + $0.0000166667/GB-second
  • Fargate: Pay per vCPU and memory, no idle costs

Database Cost Optimization:

  • RDS: $0.017/hour (db.t3.micro), Reserved for 40-60% savings
  • Aurora: $0.041/hour (db.t3.small), 5x performance, cost-effective at scale
  • Aurora Serverless: $0.06/ACU-hour, auto-scales, pauses when idle
  • DynamoDB On-Demand: $1.25/million writes, $0.25/million reads
  • DynamoDB Provisioned: $0.00065/WCU-hour, $0.00013/RCU-hour (20-30% cheaper)
  • ElastiCache: $0.017/hour (cache.t3.micro), Reserved for 40-60% savings

Network Cost Optimization:

  • Data Transfer Out: $0.09/GB (first 10 TB), $0.085/GB (next 40 TB)
  • CloudFront: $0.085/GB, cheaper than direct S3, caching reduces origin requests
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • VPC Endpoint: $0.01/hour + $0.01/GB (saves NAT costs for S3/DynamoDB)
  • Inter-AZ: $0.01/GB (keep traffic in same AZ when possible)
  • Inter-Region: $0.02/GB (minimize cross-region traffic)

Decision Frameworks

Choosing EC2 Purchasing Option:

What's the workload pattern?
ā”œā”€ Steady, predictable (24/7)?
│  ā”œā”€ Specific instance type? → Reserved Instances (40-60% savings)
│  └─ Flexible instance family? → Compute Savings Plans (40-60% savings)
ā”œā”€ Variable, unpredictable?
│  ā”œā”€ Can't be interrupted? → On-Demand
│  └─ Fault-tolerant? → Spot Instances (70-90% savings)
ā”œā”€ Short-lived (<15 min)? → Lambda (pay per invocation)
└─ Containers?
   ā”œā”€ Long-running? → ECS on EC2 with Reserved/Spot
   └─ Variable? → Fargate (pay per task)

Choosing S3 Storage Class:

How often is data accessed?
ā”œā”€ Frequently (daily)? → S3 Standard
ā”œā”€ Infrequently (monthly)?
│  ā”œā”€ Predictable access? → S3 IA (50% cheaper)
│  └─ Unpredictable access? → S3 Intelligent-Tiering (automatic)
ā”œā”€ Rarely (quarterly)?
│  ā”œā”€ Need quick retrieval? → S3 Glacier Instant Retrieval
│  └─ Can wait minutes? → S3 Glacier Flexible Retrieval (90% cheaper)
└─ Archive (yearly)? → S3 Glacier Deep Archive (96% cheaper)

Choosing Database Pricing Model:

Workload Pattern Solution Cost Savings Use Case
Steady 24/7 RDS Reserved 40-60% Production databases
Variable, predictable DynamoDB Provisioned 20-30% vs on-demand Known traffic patterns
Variable, unpredictable Aurora Serverless Pay per second Dev/test, infrequent
Spiky, unpredictable DynamoDB On-Demand No capacity planning New applications
Infrequent queries Athena Pay per query Analytics on S3

Optimizing Data Transfer Costs:

Scenario Cost Optimization
Same AZ $0 Keep traffic local when possible
Inter-AZ $0.01/GB Use single AZ for non-critical workloads
Inter-Region $0.02/GB Minimize cross-region replication
To Internet $0.09/GB Use CloudFront ($0.085/GB)
S3 via NAT $0.045/GB Use VPC Endpoint ($0.01/GB)
DynamoDB via NAT $0.045/GB Use VPC Endpoint ($0.01/GB)

Common Exam Patterns

Pattern 1: "Most Cost-Effective" Questions

  • Look for: Reserved Instances, Spot Instances, S3 lifecycle, serverless, right-sizing
  • Eliminate: On-Demand for steady workloads, over-provisioned resources, expensive storage
  • Choose: Committed capacity for steady workloads, Spot for fault-tolerant, lifecycle policies

Pattern 2: "Reduce Data Transfer Costs" Questions

  • Look for: VPC endpoints, CloudFront, same-AZ deployment, Direct Connect
  • Eliminate: NAT Gateway for S3/DynamoDB, cross-region replication, inter-AZ traffic
  • Choose: VPC endpoints for AWS services, CloudFront for internet traffic, local traffic

Pattern 3: "Optimize Storage Costs" Questions

  • Look for: S3 lifecycle policies, Intelligent-Tiering, delete old snapshots, gp3 volumes
  • Eliminate: S3 Standard for infrequent access, keeping all snapshots, io2 for general purpose
  • Choose: Automatic tiering, lifecycle transitions, incremental snapshots, right-sized volumes

Pattern 4: "Variable Workload Costs" Questions

  • Look for: Auto Scaling, Lambda, Fargate, Aurora Serverless, DynamoDB on-demand
  • Eliminate: Always-on resources, over-provisioned capacity, fixed capacity
  • Choose: Pay-per-use services, auto-scaling, serverless options

Pattern 5: "Long-Term Cost Reduction" Questions

  • Look for: Reserved Instances, Savings Plans, 3-year commitments, Compute Optimizer
  • Eliminate: On-Demand for steady workloads, no commitment, ignoring recommendations
  • Choose: 1 or 3 year commitments for predictable workloads, right-sizing recommendations

Self-Assessment Checklist

Test yourself before moving to the next chapter:

Storage Cost Optimization:

  • I can choose the right S3 storage class based on access patterns
  • I understand S3 lifecycle policies and how to automate transitions
  • I know when to use EBS gp3 vs io2 for cost optimization
  • I can optimize EBS snapshot costs with lifecycle management
  • I understand data transfer costs and how to minimize them

Compute Cost Optimization:

  • I can choose between On-Demand, Reserved, Spot, and Savings Plans
  • I understand when to use Lambda vs Fargate vs EC2 for cost efficiency
  • I know how to right-size instances with Compute Optimizer
  • I can implement Auto Scaling to reduce idle costs
  • I understand Spot Instance best practices and use cases

Database Cost Optimization:

  • I can choose between RDS, Aurora, and DynamoDB for cost efficiency
  • I understand Aurora Serverless and when to use it
  • I know when to use DynamoDB on-demand vs provisioned capacity
  • I can optimize database backup and retention costs
  • I understand read replicas for cost-effective scaling

Network Cost Optimization:

  • I understand data transfer pricing (inter-AZ, inter-region, internet)
  • I know when to use VPC endpoints to reduce NAT Gateway costs
  • I can use CloudFront to reduce origin data transfer costs
  • I understand Direct Connect vs VPN cost trade-offs
  • I can optimize network architecture for cost efficiency

Cost Management:

  • I can use Cost Explorer to analyze spending trends
  • I know how to set up AWS Budgets with alerts
  • I understand cost allocation tags for tracking
  • I can use Trusted Advisor for cost optimization recommendations
  • I know how to use Compute Optimizer for right-sizing

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-20 (Storage and compute cost optimization)
  • Domain 4 Bundle 2: Questions 21-40 (Database and network cost optimization)
  • Domain 4 Bundle 3: Questions 41-50 (Cost management and monitoring)
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you missed questions
  • Below 60%: Re-read the entire chapter and take detailed notes
  • Focus on:
    • EC2 purchasing options and when to use each
    • S3 storage classes and lifecycle policies
    • Data transfer cost optimization strategies
    • Aurora Serverless vs RDS Reserved cost comparison
    • VPC endpoint cost savings vs NAT Gateway

Quick Reference Card

Copy this to your notes for quick review:

EC2 Purchasing Options:

Option Savings Commitment Use Case
On-Demand 0% None Variable, unpredictable
Reserved 40-60% 1 or 3 years Steady 24/7 workloads
Savings Plans 40-60% 1 or 3 years Flexible instance families
Spot 70-90% None Fault-tolerant, interruptible

S3 Storage Classes:

Class Cost/GB Retrieval Use Case
Standard $0.023 Instant Frequent access
IA $0.0125 Instant Infrequent (>30 days)
Glacier Instant $0.004 Instant Archive, instant retrieval
Glacier Flexible $0.0036 Minutes-hours Archive, flexible retrieval
Glacier Deep $0.00099 12 hours Long-term archive

Data Transfer Costs:

  • Same AZ: $0 (free)
  • Inter-AZ: $0.01/GB
  • Inter-Region: $0.02/GB
  • To Internet: $0.09/GB (first 10 TB)
  • CloudFront: $0.085/GB (cheaper than direct)
  • NAT Gateway: $0.045/hour + $0.045/GB
  • VPC Endpoint: $0.01/hour + $0.01/GB

Database Pricing:

  • RDS On-Demand: $0.017/hour (db.t3.micro)
  • RDS Reserved: 40-60% savings (1 or 3 years)
  • Aurora: $0.041/hour (db.t3.small), 5x performance
  • Aurora Serverless: $0.06/ACU-hour, auto-scales, pauses
  • DynamoDB On-Demand: $1.25/million writes, $0.25/million reads
  • DynamoDB Provisioned: 20-30% cheaper than on-demand

Cost Optimization Strategies:

  1. Right-size: Use Compute Optimizer (save 20-40%)
  2. Reserved capacity: For steady workloads (save 40-60%)
  3. Spot Instances: For fault-tolerant (save 70-90%)
  4. Auto Scaling: Scale down during low usage
  5. S3 Lifecycle: Transition to cheaper storage classes
  6. VPC Endpoints: Avoid NAT Gateway charges
  7. CloudFront: Reduce origin data transfer costs
  8. Delete unused: Snapshots, volumes, load balancers

Must Memorize:

  • Reserved Instances: 40-60% savings, 1 or 3 year commitment
  • Spot Instances: 70-90% savings, can be interrupted
  • S3 IA: 50% cheaper than Standard, $0.0125/GB
  • S3 Glacier Deep Archive: 96% cheaper, $0.00099/GB
  • Data transfer out: $0.09/GB (first 10 TB)
  • CloudFront: $0.085/GB (cheaper than direct S3)
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • VPC Endpoint: $0.01/hour + $0.01/GB (saves NAT costs)
  • Inter-AZ: $0.01/GB (keep traffic local)
  • Lambda: $0.20 per 1M requests + $0.0000166667/GB-second

Congratulations! You've completed all four exam domains! You've now covered 100% of the exam content:

  • āœ… Domain 1: Design Secure Architectures (30%)
  • āœ… Domain 2: Design Resilient Architectures (26%)
  • āœ… Domain 3: Design High-Performing Architectures (24%)
  • āœ… Domain 4: Design Cost-Optimized Architectures (20%)

Next Chapter: 06_integration - Integration & Advanced Topics (Cross-Domain Scenarios)


Chapter Summary

What We Covered

This chapter covered Domain 4: Design Cost-Optimized Architectures (20% of exam). You learned:

  • āœ… Storage Cost Optimization: S3 lifecycle policies, storage class selection, and data transfer optimization
  • āœ… Compute Cost Optimization: EC2 pricing models (On-Demand, Reserved, Savings Plans, Spot), right-sizing, and Auto Scaling
  • āœ… Database Cost Optimization: RDS pricing, Aurora Serverless, DynamoDB capacity modes, and reserved capacity
  • āœ… Network Cost Optimization: Data transfer costs, NAT Gateway vs NAT instance, VPC endpoints, and CloudFront
  • āœ… Cost Monitoring: Cost Explorer, Budgets, Cost and Usage Reports, and cost allocation tags
  • āœ… Cost Management Tools: Trusted Advisor, Compute Optimizer, and cost anomaly detection
  • āœ… Pricing Models: Understanding different pricing models and when to use each
  • āœ… Cost Allocation: Tagging strategies, multi-account billing, and chargeback/showback

Critical Takeaways

  1. S3 Lifecycle: Transition to IA after 30 days, Glacier after 90 days, Deep Archive for long-term retention
  2. EC2 Pricing: On-Demand for flexibility, Reserved for steady-state (up to 72% savings), Spot for fault-tolerant (up to 90% savings)
  3. Savings Plans: Compute Savings Plans (most flexible, 66% savings), EC2 Instance Savings Plans (72% savings, less flexible)
  4. Reserved Instances: Standard (highest discount, no flexibility), Convertible (lower discount, can change instance type)
  5. Spot Instances: Up to 90% savings, 2-minute interruption notice, use for fault-tolerant workloads
  6. Database Pricing: Aurora Serverless for variable workloads, DynamoDB on-demand for unpredictable, reserved capacity for steady-state
  7. Data Transfer: Most expensive between regions, free within same AZ, use VPC endpoints to avoid NAT Gateway costs
  8. NAT Gateway: $0.045/hour + data transfer, NAT instance cheaper but requires management
  9. CloudFront: Reduces data transfer costs, improves performance, free tier available
  10. Cost Monitoring: Use Cost Explorer for analysis, Budgets for alerts, tags for allocation

Self-Assessment Checklist

Test yourself before moving on. Can you:

Storage Cost Optimization:

  • Design S3 lifecycle policies to transition objects between storage classes?
  • Choose the right S3 storage class (Standard, IA, One Zone-IA, Glacier, Deep Archive)?
  • Use S3 Intelligent-Tiering for automatic cost optimization?
  • Optimize EBS costs by selecting the right volume type?
  • Implement EBS snapshot lifecycle policies?
  • Use EFS Infrequent Access for cost savings?

Compute Cost Optimization:

  • Choose the right EC2 pricing model (On-Demand, Reserved, Savings Plans, Spot)?
  • Calculate savings with Reserved Instances and Savings Plans?
  • Implement Spot Instances for fault-tolerant workloads?
  • Right-size EC2 instances using Compute Optimizer?
  • Use Auto Scaling to match capacity with demand?
  • Optimize Lambda costs by adjusting memory and timeout?

Database Cost Optimization:

  • Choose between RDS and Aurora based on cost and performance?
  • Use Aurora Serverless for variable workloads?
  • Select DynamoDB capacity mode (on-demand vs provisioned)?
  • Purchase DynamoDB reserved capacity for steady-state workloads?
  • Optimize RDS costs with Reserved Instances?
  • Use ElastiCache reserved nodes for cost savings?

Network Cost Optimization:

  • Minimize data transfer costs between regions and AZs?
  • Choose between NAT Gateway and NAT instance?
  • Use VPC endpoints to avoid NAT Gateway costs?
  • Implement CloudFront to reduce data transfer costs?
  • Optimize Direct Connect costs with appropriate bandwidth?
  • Choose the right load balancer based on cost (ALB vs NLB)?

Cost Monitoring & Management:

  • Use Cost Explorer to analyze spending patterns?
  • Set up AWS Budgets with alerts?
  • Implement cost allocation tags for chargeback?
  • Use Trusted Advisor for cost optimization recommendations?
  • Configure Compute Optimizer for right-sizing recommendations?
  • Create Cost and Usage Reports for detailed analysis?

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-50 (Expected score: 70%+ to proceed)
  • Domain 4 Bundle 2: Questions 51-100 (Expected score: 75%+ to proceed)

If you scored below 70%:

  • Review EC2 pricing models and when to use each
  • Focus on S3 storage class selection and lifecycle policies
  • Study data transfer costs and optimization strategies
  • Practice cost monitoring tool selection

If you scored 70-80%:

  • Review advanced topics: Savings Plans vs Reserved Instances
  • Study database cost optimization strategies
  • Practice network cost optimization
  • Focus on cost allocation and tagging strategies

If you scored 80%+:

  • Excellent! You've completed all four domains
  • Continue practicing with full practice tests
  • Review integration scenarios in the next chapter

Congratulations! You've completed all four exam domains (100% of exam content). You're now ready to practice integration scenarios and prepare for the exam.

Next Steps: Proceed to 06_integration to learn about cross-domain integration scenarios and advanced topics.


Chapter Summary

What We Covered

This chapter explored designing cost-optimized architectures on AWS, representing 20% of the SAA-C03 exam. We covered four major task areas:

Task 4.1: Design Cost-Optimized Storage Solutions

  • āœ… S3 storage classes and lifecycle policies
  • āœ… S3 Intelligent-Tiering for automatic cost optimization
  • āœ… Glacier and Glacier Deep Archive for long-term archival
  • āœ… EBS volume optimization (gp3 vs gp2, cold HDD)
  • āœ… EFS lifecycle management and Infrequent Access
  • āœ… Data transfer cost optimization
  • āœ… Backup retention policies and cost management

Task 4.2: Design Cost-Optimized Compute Solutions

  • āœ… EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • āœ… Reserved Instances types (Standard, Convertible, Scheduled)
  • āœ… Savings Plans (Compute vs EC2 Instance)
  • āœ… Spot Instances and Spot Fleet strategies
  • āœ… Lambda cost optimization (memory, timeout, concurrency)
  • āœ… Auto Scaling for cost efficiency
  • āœ… Right-sizing with Compute Optimizer

Task 4.3: Design Cost-Optimized Database Solutions

  • āœ… RDS pricing models and Reserved Instances
  • āœ… Aurora Serverless for variable workloads
  • āœ… DynamoDB pricing modes (On-Demand vs Provisioned)
  • āœ… DynamoDB Reserved Capacity
  • āœ… ElastiCache Reserved Nodes
  • āœ… Database right-sizing and storage optimization
  • āœ… Backup and snapshot cost management

Task 4.4: Design Cost-Optimized Network Architectures

  • āœ… Data transfer pricing and optimization
  • āœ… NAT Gateway vs NAT Instance cost comparison
  • āœ… VPC endpoints for cost savings
  • āœ… CloudFront for reduced data transfer costs
  • āœ… Direct Connect vs VPN cost analysis
  • āœ… Load balancer cost optimization
  • āœ… Transit Gateway and VPC peering costs

Critical Takeaways

Cost Optimization Principles:

  1. Right-Sizing: Use Compute Optimizer and Cost Explorer to identify oversized resources
  2. Reserved Capacity: Commit to 1-year or 3-year terms for predictable workloads (up to 72% savings)
  3. Spot Instances: Use for fault-tolerant workloads (up to 90% savings)
  4. Lifecycle Policies: Automatically transition data to cheaper storage classes
  5. Monitor and Optimize: Use Cost Explorer, Budgets, and Trusted Advisor regularly

Storage Cost Strategies:

  • S3 Lifecycle: Transition to IA after 30 days, Glacier after 90 days, Deep Archive after 180 days
  • Intelligent-Tiering: Automatic cost optimization for unknown or changing access patterns
  • EBS gp3: 20% cheaper than gp2 with better performance
  • EFS IA: 92% cost savings for infrequently accessed files
  • Delete Unused: Remove old snapshots, unattached volumes, incomplete multipart uploads

Compute Cost Strategies:

  • Savings Plans: Most flexible, up to 72% savings, applies to Lambda and Fargate
  • Reserved Instances: Up to 72% savings, specific instance type commitment
  • Spot Instances: Up to 90% savings, can be interrupted with 2-minute warning
  • Auto Scaling: Scale down during off-peak hours, use scheduled scaling
  • Lambda: Pay per request and duration, optimize memory allocation

Database Cost Strategies:

  • Aurora Serverless: Pay per ACU-second, automatic scaling, ideal for variable workloads
  • DynamoDB On-Demand: Pay per request, no capacity planning, good for unpredictable traffic
  • Reserved Capacity: 1-year or 3-year commitment for predictable workloads
  • Read Replicas: Cheaper than Multi-AZ for read scaling (but no automatic failover)
  • Storage Optimization: Use appropriate storage type, enable storage autoscaling

Network Cost Strategies:

  • VPC Endpoints: Eliminate data transfer costs to S3 and DynamoDB
  • CloudFront: Reduce origin data transfer costs, cheaper than direct S3 access
  • Same AZ: Keep traffic within same AZ to avoid inter-AZ charges
  • Direct Connect: Lower data transfer costs for high-volume hybrid connectivity
  • NAT Gateway: Consider NAT instance for low-traffic scenarios

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Cost Optimization:

  • Design S3 lifecycle policies for cost optimization
  • Choose appropriate S3 storage class based on access patterns
  • Implement S3 Intelligent-Tiering for automatic optimization
  • Select cost-effective EBS volume types (gp3 vs gp2)
  • Configure EFS lifecycle management for IA storage
  • Optimize data transfer costs with CloudFront and VPC endpoints
  • Implement backup retention policies to control costs

Compute Cost Optimization:

  • Compare EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • Select appropriate Reserved Instance type (Standard vs Convertible)
  • Choose between Compute Savings Plans and EC2 Instance Savings Plans
  • Implement Spot Instances for fault-tolerant workloads
  • Optimize Lambda costs (memory, timeout, concurrency)
  • Use Auto Scaling to match capacity with demand
  • Right-size instances with Compute Optimizer recommendations

Database Cost Optimization:

  • Select appropriate RDS pricing model (On-Demand vs Reserved)
  • Use Aurora Serverless for variable workloads
  • Choose DynamoDB pricing mode (On-Demand vs Provisioned)
  • Implement DynamoDB Reserved Capacity for predictable workloads
  • Use ElastiCache Reserved Nodes for long-term caching
  • Right-size database instances based on utilization
  • Optimize backup and snapshot retention

Network Cost Optimization:

  • Understand data transfer pricing (inter-AZ, inter-region, internet)
  • Choose between NAT Gateway and NAT Instance based on traffic
  • Use VPC endpoints to eliminate data transfer costs
  • Implement CloudFront to reduce origin data transfer
  • Compare Direct Connect and VPN costs for hybrid connectivity
  • Optimize load balancer costs (ALB vs NLB)
  • Use Transit Gateway or VPC peering appropriately

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 4 Bundle 1: Questions 1-20 (pricing models, storage classes, basic optimization)
  • Full Practice Test 1: Domain 4 questions (foundational cost concepts)

Intermediate Level (Target: 70%+ correct):

  • Domain 4 Bundle 2: Questions 21-40 (Reserved Instances, Savings Plans, lifecycle policies)
  • Full Practice Test 2: Domain 4 questions (cost optimization strategies)

Advanced Level (Target: 60%+ correct):

  • Full Practice Test 3: Domain 4 questions (complex cost scenarios)
  • Mixed difficulty: Cost optimization across all services

If you scored below target:

  • Below 60%: Review pricing models, storage classes, and basic cost concepts
  • 60-70%: Focus on Reserved Instances, Savings Plans, and lifecycle policies
  • 70-80%: Study advanced optimization techniques and cost allocation strategies
  • Above 80%: Excellent! You're ready for the exam

Quick Reference Card

Copy this to your notes for quick review:

EC2 Pricing Models Comparison

Model Discount Commitment Flexibility Best For
On-Demand 0% None Full Variable, short-term
Savings Plans (Compute) Up to 66% 1-3 years High (any instance, region) Flexible commitment
Savings Plans (EC2) Up to 72% 1-3 years Medium (instance family, region) Specific family
Reserved (Standard) Up to 72% 1-3 years Low (specific instance) Predictable workload
Reserved (Convertible) Up to 54% 1-3 years Medium (can change type) Changing needs
Spot Up to 90% None Low (can be interrupted) Fault-tolerant

S3 Storage Classes Cost Comparison

Class Cost Retrieval Min Duration Use Case
Standard Highest Free None Frequent access
Intelligent-Tiering Auto-optimized Free None Unknown/changing patterns
Standard-IA Low Per GB 30 days Infrequent access
One Zone-IA Lower Per GB 30 days Non-critical, infrequent
Glacier Instant Lower Per GB 90 days Archive, instant retrieval
Glacier Flexible Very low Per GB + time 90 days Archive, minutes-hours
Glacier Deep Archive Lowest Per GB + time 180 days Long-term archive, 12h

DynamoDB Pricing Modes

Mode Best For Pricing Capacity Planning
On-Demand Unpredictable traffic Per request None required
Provisioned Predictable traffic Per RCU/WCU Manual or auto-scaling
Reserved Capacity Steady, predictable Upfront discount 1-year commitment

Data Transfer Cost Optimization

  • Free: Data IN from internet, between services in same region (most cases)
  • Charged: Data OUT to internet, inter-region, inter-AZ (some services)
  • Optimization: Use VPC endpoints (S3, DynamoDB), CloudFront, same-AZ placement

Cost Monitoring Tools

Tool Purpose Features
Cost Explorer Analyze spending Historical data, forecasting, filtering
AWS Budgets Set spending limits Alerts, custom thresholds, forecasts
Cost and Usage Report Detailed billing Hourly granularity, comprehensive data
Compute Optimizer Right-sizing ML-based recommendations, savings estimates
Trusted Advisor Best practices Cost optimization checks, recommendations

Cost Optimization Checklist

  • āœ… Use Reserved Instances or Savings Plans for predictable workloads (up to 72% savings)
  • āœ… Implement Spot Instances for fault-tolerant workloads (up to 90% savings)
  • āœ… Configure S3 lifecycle policies to transition to cheaper storage classes
  • āœ… Use S3 Intelligent-Tiering for unknown access patterns
  • āœ… Right-size instances with Compute Optimizer recommendations
  • āœ… Enable Auto Scaling to match capacity with demand
  • āœ… Use VPC endpoints to eliminate data transfer costs
  • āœ… Implement CloudFront to reduce origin data transfer costs
  • āœ… Delete unused resources (snapshots, volumes, elastic IPs)
  • āœ… Set up AWS Budgets with alerts for cost control
  • āœ… Use cost allocation tags for detailed tracking
  • āœ… Review Trusted Advisor recommendations monthly

Common Exam Scenarios

  • Scenario: Predictable workload → Solution: Reserved Instances or Savings Plans (up to 72% savings)
  • Scenario: Fault-tolerant batch processing → Solution: Spot Instances (up to 90% savings)
  • Scenario: Unknown access patterns → Solution: S3 Intelligent-Tiering (automatic optimization)
  • Scenario: Infrequent access (>30 days) → Solution: S3 Standard-IA or One Zone-IA
  • Scenario: Long-term archive → Solution: Glacier Flexible or Glacier Deep Archive
  • Scenario: Variable database workload → Solution: Aurora Serverless or DynamoDB On-Demand
  • Scenario: High data transfer to S3 → Solution: VPC endpoint (eliminate transfer costs)
  • Scenario: Global content delivery → Solution: CloudFront (reduce origin transfer costs)
  • Scenario: Oversized instances → Solution: Compute Optimizer recommendations + right-sizing
  • Scenario: Unused resources → Solution: Delete unattached volumes, old snapshots, unused elastic IPs

Next Chapter: 06_integration - Integration & Advanced Topics

Chapter Summary

What We Covered

This chapter covered Domain 4: Design Cost-Optimized Architectures (20% of the exam), focusing on four critical task areas:

āœ… Task 4.1: Design cost-optimized storage solutions

  • S3 storage classes and lifecycle policies
  • S3 Intelligent-Tiering for automatic optimization
  • Glacier and Glacier Deep Archive for long-term archival
  • EBS volume optimization (gp3 vs gp2, right-sizing)
  • EFS lifecycle management and Infrequent Access
  • Storage Gateway for hybrid cloud cost optimization
  • Data transfer cost optimization strategies

āœ… Task 4.2: Design cost-optimized compute solutions

  • EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • Reserved Instances vs Savings Plans comparison
  • Spot Instances for fault-tolerant workloads (up to 90% savings)
  • Auto Scaling for matching capacity with demand
  • Lambda cost optimization (memory, timeout, concurrency)
  • Fargate Spot for cost-effective containers
  • Graviton instances for better price-performance
  • Compute Optimizer for right-sizing recommendations

āœ… Task 4.3: Design cost-optimized database solutions

  • RDS pricing models and Reserved Instances
  • Aurora Serverless for variable workloads
  • DynamoDB pricing modes (On-Demand vs Provisioned)
  • DynamoDB Reserved Capacity for predictable workloads
  • ElastiCache Reserved Nodes
  • Database right-sizing and storage optimization
  • Backup retention and snapshot lifecycle management

āœ… Task 4.4: Design cost-optimized network architectures

  • Data transfer pricing and optimization strategies
  • NAT Gateway vs NAT instance cost comparison
  • VPC endpoints to eliminate data transfer costs
  • CloudFront for reducing origin data transfer costs
  • Direct Connect vs VPN cost analysis
  • Transit Gateway and VPC peering cost considerations
  • Load balancer cost optimization (ALB vs NLB)

Critical Takeaways

Cost optimization is about maximizing value, not minimizing spend:

  • Right-Sizing: Use only the resources you need, not more
  • Elasticity: Scale up during peak, scale down during off-peak
  • Pricing Models: Choose the right pricing model for each workload
  • Monitoring: Track costs continuously and optimize proactively
  • Automation: Automate cost optimization (lifecycle policies, Auto Scaling)

Key Cost Optimization Principles:

  1. Pay for What You Use: Use Auto Scaling, Lambda, and serverless services
  2. Reserved Capacity: Commit to 1-3 years for predictable workloads (up to 72% savings)
  3. Spot Instances: Use for fault-tolerant workloads (up to 90% savings)
  4. Storage Tiering: Move infrequently accessed data to cheaper storage classes
  5. Data Transfer: Minimize cross-region and internet data transfer costs

Most Important Services to Master:

  • Cost Explorer: Visualize and analyze costs, identify optimization opportunities
  • AWS Budgets: Set cost and usage budgets with alerts
  • Compute Optimizer: ML-based recommendations for right-sizing
  • Trusted Advisor: Best practice checks including cost optimization
  • Cost and Usage Report: Detailed cost and usage data for analysis
  • Cost Allocation Tags: Track costs by project, team, or environment

Common Exam Patterns:

  • Questions about predictable workload → Reserved Instances or Savings Plans (up to 72% savings)
  • Questions about fault-tolerant batch processing → Spot Instances (up to 90% savings)
  • Questions about unknown access patterns → S3 Intelligent-Tiering (automatic optimization)
  • Questions about infrequent access → S3 Standard-IA or One Zone-IA
  • Questions about long-term archive → Glacier Flexible or Glacier Deep Archive
  • Questions about variable database workload → Aurora Serverless or DynamoDB On-Demand
  • Questions about data transfer to S3 → VPC endpoint (eliminate transfer costs)
  • Questions about oversized instances → Compute Optimizer recommendations + right-sizing

Self-Assessment Checklist

Test yourself before moving to the next chapter. You should be able to:

Storage Cost Optimization

  • Choose appropriate S3 storage class based on access patterns
  • Configure S3 lifecycle policies for automatic transitions
  • Use S3 Intelligent-Tiering for unknown access patterns
  • Select Glacier retrieval option based on urgency
  • Optimize EBS volumes (gp3 vs gp2, right-sizing)
  • Implement EFS lifecycle management
  • Calculate data transfer costs and optimize
  • Use VPC endpoints to reduce data transfer costs

Compute Cost Optimization

  • Compare EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • Choose between Reserved Instances and Savings Plans
  • Implement Spot Instances for fault-tolerant workloads
  • Configure Auto Scaling to match capacity with demand
  • Optimize Lambda costs (memory, timeout, concurrency)
  • Use Fargate Spot for cost-effective containers
  • Implement Graviton instances for better price-performance
  • Use Compute Optimizer for right-sizing recommendations

Database Cost Optimization

  • Choose appropriate RDS pricing model
  • Use Aurora Serverless for variable workloads
  • Select DynamoDB pricing mode (On-Demand vs Provisioned)
  • Implement DynamoDB Reserved Capacity for predictable workloads
  • Use ElastiCache Reserved Nodes for long-term workloads
  • Right-size database instances based on actual usage
  • Optimize backup retention and snapshot lifecycle
  • Monitor database costs with Cost Explorer

Network Cost Optimization

  • Understand data transfer pricing (inter-AZ, inter-region, internet)
  • Choose between NAT Gateway and NAT instance
  • Use VPC endpoints to eliminate data transfer costs
  • Implement CloudFront to reduce origin data transfer costs
  • Compare Direct Connect and VPN costs
  • Optimize Transit Gateway and VPC peering costs
  • Choose appropriate load balancer (ALB vs NLB) based on cost
  • Monitor network costs with Cost and Usage Report

Cost Monitoring and Governance

  • Use Cost Explorer to analyze spending patterns
  • Set up AWS Budgets with alerts
  • Implement cost allocation tags for tracking
  • Review Trusted Advisor recommendations
  • Use Cost and Usage Report for detailed analysis
  • Implement cost anomaly detection
  • Create cost optimization dashboards

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Storage and compute cost optimization)
  • Domain 4 Bundle 2: Questions 26-50 (Database and network cost optimization)
  • Full Practice Tests: Look for cost optimization questions across all domains

Expected Score: 75%+ to proceed confidently

If you scored below 75%:

  • 60-74%: Review specific sections where you struggled, then retry
  • Below 60%: Re-read this entire chapter, focusing on pricing comparisons
  • Focus on understanding cost trade-offs between different options

Quick Reference Card

Copy this to your notes for quick review:

EC2 Pricing Quick Facts

  • On-Demand: Pay per hour/second, no commitment, highest cost
  • Reserved Instances: 1-3 year commitment, up to 72% savings, specific instance type
  • Savings Plans: 1-3 year commitment, up to 72% savings, flexible instance family
  • Spot Instances: Bid on spare capacity, up to 90% savings, can be interrupted
  • Dedicated Hosts: Physical server, compliance requirements, most expensive

Storage Pricing Quick Facts

  • S3 Standard: $0.023/GB, frequent access, highest cost
  • S3 Standard-IA: $0.0125/GB, infrequent access (>30 days), retrieval fee
  • S3 One Zone-IA: $0.01/GB, single AZ, infrequent access, retrieval fee
  • S3 Glacier Flexible: $0.004/GB, archive, minutes-hours retrieval
  • S3 Glacier Deep Archive: $0.00099/GB, long-term archive, 12-48h retrieval
  • S3 Intelligent-Tiering: Automatic optimization, small monitoring fee

Database Pricing Quick Facts

  • RDS On-Demand: Pay per hour, no commitment
  • RDS Reserved: 1-3 year commitment, up to 69% savings
  • Aurora Serverless: Pay per second, auto-scaling, good for variable workloads
  • DynamoDB On-Demand: Pay per request, unpredictable workloads
  • DynamoDB Provisioned: Pay per hour, predictable workloads, cheaper at scale
  • DynamoDB Reserved: 1-3 year commitment, up to 77% savings

Network Pricing Quick Facts

  • Data Transfer In: Free (from internet to AWS)
  • Data Transfer Out: $0.09/GB (first 10 TB), decreases with volume
  • Inter-AZ: $0.01/GB in each direction
  • Inter-Region: $0.02/GB (varies by region pair)
  • VPC Endpoints: $0.01/GB processed, eliminates internet transfer costs
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • CloudFront: $0.085/GB (first 10 TB), cheaper than direct S3 transfer

Cost Optimization Tools Quick Facts

  • Cost Explorer: Visualize costs, identify trends, forecast spending
  • AWS Budgets: Set cost/usage budgets, alerts when exceeded
  • Compute Optimizer: ML-based right-sizing recommendations
  • Trusted Advisor: Best practice checks, cost optimization recommendations
  • Cost and Usage Report: Detailed hourly/daily cost data, S3 delivery
  • Cost Allocation Tags: Track costs by project, team, environment

Decision Points

  • Predictable workload → Reserved Instances or Savings Plans (up to 72% savings)
  • Fault-tolerant batch processing → Spot Instances (up to 90% savings)
  • Unknown access patterns → S3 Intelligent-Tiering (automatic optimization)
  • Infrequent access (>30 days) → S3 Standard-IA or One Zone-IA
  • Long-term archive → Glacier Flexible or Glacier Deep Archive
  • Variable database workload → Aurora Serverless or DynamoDB On-Demand
  • High data transfer to S3 → VPC endpoint (eliminate transfer costs)
  • Global content delivery → CloudFront (reduce origin transfer costs)
  • Oversized instances → Compute Optimizer recommendations + right-sizing
  • Unused resources → Delete unattached volumes, old snapshots, unused elastic IPs

Congratulations! You've completed Domain 4: Design Cost-Optimized Architectures. Cost optimization (20% of the exam) is critical for real-world AWS deployments, and understanding pricing models and optimization strategies will help you design cost-effective solutions.

Next Chapter: 06_integration - Integration & Advanced Topics


Chapter Summary

What We Covered

This chapter covered the four major task areas of Domain 4: Design Cost-Optimized Architectures (20% of exam):

Task 4.1: Design Cost-Optimized Storage Solutions

  • āœ… S3 storage classes and lifecycle policies
  • āœ… S3 Intelligent-Tiering for automatic optimization
  • āœ… Glacier and Glacier Deep Archive for long-term archival
  • āœ… EBS volume optimization (gp3 vs gp2, right-sizing)
  • āœ… EFS lifecycle management and Infrequent Access
  • āœ… Data transfer cost optimization
  • āœ… Backup retention and archival strategies

Task 4.2: Design Cost-Optimized Compute Solutions

  • āœ… EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • āœ… Reserved Instances (Standard, Convertible, Scheduled)
  • āœ… Savings Plans (Compute, EC2 Instance)
  • āœ… Spot Instances for fault-tolerant workloads
  • āœ… Lambda cost optimization
  • āœ… Fargate Spot for container cost savings
  • āœ… Auto Scaling for right-sizing
  • āœ… Compute Optimizer recommendations

Task 4.3: Design Cost-Optimized Database Solutions

  • āœ… RDS pricing and Reserved Instances
  • āœ… Aurora Serverless for variable workloads
  • āœ… DynamoDB On-Demand vs Provisioned capacity
  • āœ… DynamoDB Reserved Capacity
  • āœ… ElastiCache Reserved Nodes
  • āœ… Redshift Reserved Nodes and Spectrum
  • āœ… Database right-sizing and storage optimization

Task 4.4: Design Cost-Optimized Network Architectures

  • āœ… Data transfer pricing (inter-AZ, inter-region, internet)
  • āœ… NAT Gateway vs NAT Instance cost comparison
  • āœ… VPC endpoints to eliminate data transfer costs
  • āœ… CloudFront for reduced origin transfer costs
  • āœ… Direct Connect vs VPN cost analysis
  • āœ… Load balancer cost optimization
  • āœ… Network cost monitoring and allocation

Critical Takeaways

  1. Commitment Saves Money: Reserved Instances and Savings Plans offer up to 72% savings for predictable workloads. Commit for 1-3 years based on usage patterns.

  2. Spot for Fault-Tolerant: Use Spot Instances for batch processing, big data, and containerized workloads. Save up to 90% compared to On-Demand.

  3. Storage Lifecycle Management: Implement S3 lifecycle policies to automatically transition objects to cheaper storage classes. Use Intelligent-Tiering for unknown access patterns.

  4. Right-Size Everything: Use Compute Optimizer, Trusted Advisor, and CloudWatch metrics to identify oversized resources. Downsize or use burstable instances.

  5. Eliminate Data Transfer: Use VPC endpoints for AWS service access to avoid data transfer charges. Use CloudFront to reduce origin transfer costs.

  6. Serverless for Variable Workloads: Aurora Serverless, Lambda, and DynamoDB On-Demand automatically scale and you pay only for what you use.

  7. Monitor and Alert: Set up AWS Budgets with alerts, use Cost Explorer to identify trends, and implement cost allocation tags for accountability.

  8. Delete Unused Resources: Regularly audit and delete unattached EBS volumes, old snapshots, unused Elastic IPs, and idle load balancers.

Self-Assessment Checklist

Test yourself before moving on. Can you:

Storage Cost Optimization

  • Choose the appropriate S3 storage class for access patterns?
  • Implement S3 lifecycle policies for automatic transitions?
  • Use S3 Intelligent-Tiering for unknown access patterns?
  • Select the right EBS volume type for cost vs performance?
  • Implement EFS lifecycle management for cost savings?
  • Optimize data transfer costs with VPC endpoints?

Compute Cost Optimization

  • Explain the difference between Reserved Instances and Savings Plans?
  • Choose between Standard and Convertible Reserved Instances?
  • Identify workloads suitable for Spot Instances?
  • Optimize Lambda costs with appropriate memory settings?
  • Use Fargate Spot for container cost savings?
  • Implement Auto Scaling for right-sizing?
  • Use Compute Optimizer for recommendations?

Database Cost Optimization

  • Choose between RDS and Aurora based on cost?
  • Use Aurora Serverless for variable workloads?
  • Select DynamoDB On-Demand vs Provisioned capacity?
  • Purchase DynamoDB Reserved Capacity for predictable workloads?
  • Optimize database storage and backup retention?
  • Use read replicas vs caching for cost efficiency?

Network Cost Optimization

  • Understand data transfer pricing between AZs and regions?
  • Choose between NAT Gateway and NAT Instance?
  • Use VPC endpoints to eliminate data transfer costs?
  • Implement CloudFront to reduce origin transfer costs?
  • Choose between Direct Connect and VPN based on cost?
  • Optimize load balancer costs?

Cost Monitoring

  • Set up AWS Budgets with alerts?
  • Use Cost Explorer to analyze spending trends?
  • Implement cost allocation tags?
  • Use Trusted Advisor for cost optimization recommendations?
  • Analyze Cost and Usage Reports?

Practice Questions

Try these from your practice test bundles:

Beginner Level (Build Confidence):

  • Domain 4 Bundle 1: Questions 1-20
  • Expected score: 70%+ to proceed

Intermediate Level (Test Understanding):

  • Domain 4 Bundle 2: Questions 1-20
  • Full Practice Test 1: Domain 4 questions
  • Expected score: 75%+ to proceed

Advanced Level (Challenge Yourself):

  • Full Practice Test 3: Domain 4 questions
  • Expected score: 70%+ to proceed

If you scored below target:

  • Below 60%: Review pricing models and storage classes
  • 60-70%: Focus on Reserved Instances and Savings Plans
  • 70-80%: Review quick facts and decision points
  • 80%+: Perfect! Move to integration chapter

Quick Reference Card

Copy this to your notes for quick review:

Storage Cost Optimization

  • S3 Standard: $0.023/GB, frequent access
  • S3 Standard-IA: $0.0125/GB, infrequent access (>30 days)
  • S3 One Zone-IA: $0.01/GB, infrequent, non-critical
  • S3 Glacier Flexible: $0.004/GB, archive (minutes-hours retrieval)
  • S3 Glacier Deep Archive: $0.00099/GB, long-term (12-48h retrieval)
  • S3 Intelligent-Tiering: Automatic, $0.0025/1000 objects monitoring

Compute Cost Optimization

  • On-Demand: No commitment, highest cost, pay per hour/second
  • Reserved Instances: 1-3 year commitment, up to 72% savings
  • Savings Plans: Flexible commitment, up to 72% savings
  • Spot Instances: Bid on spare capacity, up to 90% savings
  • Dedicated Hosts: Physical server, compliance, highest cost

Reserved Instance Types

  • Standard RI: 75% discount, no flexibility, specific instance type
  • Convertible RI: 54% discount, change instance family
  • Scheduled RI: Specific time windows, predictable schedules

Savings Plans

  • Compute Savings Plans: Most flexible, any instance family, region, OS
  • EC2 Instance Savings Plans: Specific instance family, any size/AZ/OS

Database Cost Optimization

  • RDS On-Demand: No commitment, highest cost
  • RDS Reserved: 1-3 year, up to 69% savings
  • Aurora Serverless: Pay per second, auto-scaling
  • DynamoDB On-Demand: Pay per request, unpredictable workloads
  • DynamoDB Provisioned: Pay per RCU/WCU, predictable workloads
  • DynamoDB Reserved: 1-3 year, up to 77% savings

Network Cost Optimization

  • Same AZ: Free
  • Cross-AZ (same region): $0.01/GB in, $0.01/GB out
  • Cross-Region: $0.02/GB
  • Internet Out: $0.09/GB (first 10 TB)
  • VPC Endpoint: $0.01/GB processed, eliminates internet costs
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • CloudFront: $0.085/GB (first 10 TB), cheaper than S3 direct

Key Decision Points

Scenario Solution
Predictable workload Reserved Instances or Savings Plans (72% savings)
Fault-tolerant batch Spot Instances (90% savings)
Unknown access patterns S3 Intelligent-Tiering
Infrequent access (>30 days) S3 Standard-IA or One Zone-IA
Long-term archive Glacier Flexible or Deep Archive
Variable database workload Aurora Serverless or DynamoDB On-Demand
High S3 data transfer VPC endpoint (eliminate transfer costs)
Global content delivery CloudFront (reduce origin costs)
Oversized instances Compute Optimizer + right-sizing
Unused resources Delete unattached volumes, old snapshots

Chapter Summary

What We Covered

This chapter explored Design Cost-Optimized Architectures (20% of the exam), covering four major task areas:

āœ… Task 4.1: Design cost-optimized storage solutions

  • S3 lifecycle policies and storage class transitions
  • S3 Intelligent-Tiering for unknown access patterns
  • Glacier and Deep Archive for long-term archival
  • EBS volume optimization (gp3 vs gp2, right-sizing)
  • EFS lifecycle management and Infrequent Access
  • Data transfer cost optimization

āœ… Task 4.2: Design cost-optimized compute solutions

  • EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • Reserved Instances (Standard, Convertible, Scheduled)
  • Savings Plans (Compute, EC2 Instance)
  • Spot Instances for fault-tolerant workloads
  • Lambda cost optimization (memory, timeout)
  • Auto Scaling for right-sizing
  • Compute Optimizer recommendations

āœ… Task 4.3: Design cost-optimized database solutions

  • RDS pricing and Reserved Instances
  • Aurora Serverless for variable workloads
  • DynamoDB On-Demand vs Provisioned capacity
  • DynamoDB Reserved Capacity
  • ElastiCache Reserved Nodes
  • Database right-sizing and storage optimization

āœ… Task 4.4: Design cost-optimized network architectures

  • Data transfer pricing (same AZ, cross-AZ, cross-region, internet)
  • NAT Gateway vs NAT instance cost comparison
  • VPC endpoints to eliminate data transfer costs
  • CloudFront for reduced origin costs
  • Direct Connect vs VPN cost analysis
  • Load balancer cost optimization

Critical Takeaways

  1. Reserved Capacity: Use Reserved Instances or Savings Plans for predictable workloads (up to 72% savings over On-Demand).

  2. Spot Instances: Use Spot for fault-tolerant batch processing, data analysis, and containerized workloads (up to 90% savings).

  3. S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes (Standard → IA → Glacier → Deep Archive) based on access patterns.

  4. Right-Sizing: Use Compute Optimizer and Cost Explorer to identify oversized resources and right-size them.

  5. Data Transfer Optimization: Use VPC endpoints to eliminate data transfer costs to S3/DynamoDB, CloudFront to reduce origin costs.

  6. Serverless for Variable Workloads: Use Lambda, Aurora Serverless, or DynamoDB On-Demand for unpredictable workloads to pay only for what you use.

  7. Cost Monitoring: Enable cost allocation tags, set up AWS Budgets with alerts, use Cost Explorer for analysis.

  8. Delete Unused Resources: Regularly delete unattached EBS volumes, old snapshots, unused Elastic IPs, and idle resources.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand the difference between Reserved Instances and Savings Plans
  • I know when to use Spot Instances vs On-Demand
  • I can design S3 lifecycle policies for cost optimization
  • I understand data transfer pricing between AZs and regions
  • I know how to use VPC endpoints to reduce costs
  • I can select the right database pricing model for a workload
  • I understand NAT Gateway vs NAT instance cost trade-offs
  • I know how to use Cost Explorer and AWS Budgets
  • I can identify cost optimization opportunities in an architecture
  • I understand the cost implications of different design choices

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-50 (Expected score: 70%+)
  • Domain 4 Bundle 2: Questions 1-50 (Expected score: 70%+)
  • Full Practice Test 1: Domain 4 questions (Expected score: 75%+)

If you scored below 70%:

  • Review EC2 pricing models and when to use each
  • Focus on S3 storage class selection
  • Study data transfer pricing patterns
  • Practice cost optimization scenario analysis

Quick Reference Card

Compute Pricing:

  • On-Demand: Pay per hour/second, no commitment, highest cost
  • Reserved (1-3 year): Up to 72% savings, predictable workloads
  • Savings Plans: Up to 72% savings, flexible instance types
  • Spot: Up to 90% savings, fault-tolerant workloads
  • Lambda: $0.20/million requests + $0.0000166667/GB-second

Storage Pricing:

  • S3 Standard: $0.023/GB, frequent access
  • S3 Standard-IA: $0.0125/GB, infrequent access (>30 days)
  • S3 One Zone-IA: $0.01/GB, infrequent, non-critical
  • S3 Intelligent-Tiering: $0.023/GB + $0.0025/1000 objects, unknown patterns
  • Glacier Flexible: $0.004/GB, 3-5 hour retrieval
  • Glacier Deep Archive: $0.00099/GB, 12-hour retrieval

Database Pricing:

  • RDS On-Demand: Pay per hour, no commitment
  • RDS Reserved: 1-3 year, up to 69% savings
  • Aurora Serverless: Pay per second, auto-scaling
  • DynamoDB On-Demand: Pay per request, unpredictable workloads
  • DynamoDB Provisioned: Pay per RCU/WCU, predictable workloads
  • DynamoDB Reserved: 1-3 year, up to 77% savings

Network Pricing:

  • Same AZ: Free
  • Cross-AZ (same region): $0.01/GB in, $0.01/GB out
  • Cross-Region: $0.02/GB
  • Internet Out: $0.09/GB (first 10 TB)
  • VPC Endpoint: $0.01/GB processed, eliminates internet costs
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • CloudFront: $0.085/GB (first 10 TB), cheaper than S3 direct

Decision Points:

  • Predictable workload? → Reserved Instances or Savings Plans (72% savings)
  • Fault-tolerant batch? → Spot Instances (90% savings)
  • Unknown access patterns? → S3 Intelligent-Tiering
  • Infrequent access (>30 days)? → S3 Standard-IA or One Zone-IA
  • Long-term archive? → Glacier Flexible or Deep Archive
  • Variable database workload? → Aurora Serverless or DynamoDB On-Demand
  • High S3 data transfer? → VPC endpoint (eliminate transfer costs)
  • Global content delivery? → CloudFront (reduce origin costs)
  • Oversized instances? → Compute Optimizer + right-sizing
  • Unused resources? → Delete unattached volumes, old snapshots

Next Chapter: Proceed to 06_integration to learn about cross-domain integration patterns and advanced scenarios.

Chapter Summary

What We Covered

This chapter covered cost-optimized architecture design, representing 20% of the exam content. You learned:

  • āœ… Storage Cost Optimization: S3 lifecycle policies, storage classes, and data transfer optimization
  • āœ… Compute Cost Optimization: EC2 pricing models, Savings Plans, Reserved Instances, and Spot Instances
  • āœ… Database Cost Optimization: RDS pricing, Aurora Serverless, DynamoDB pricing modes, and right-sizing
  • āœ… Network Cost Optimization: Data transfer costs, NAT Gateway alternatives, VPC endpoints, and CloudFront
  • āœ… Cost Monitoring: Cost Explorer, Budgets, Cost and Usage Reports, and cost allocation tags
  • āœ… Cost Management: Right-sizing, resource cleanup, and continuous optimization

Critical Takeaways

  1. Use the Right Pricing Model: Reserved Instances and Savings Plans for predictable workloads (72% savings), Spot for fault-tolerant batch (90% savings), On-Demand for variable
  2. Optimize Storage Lifecycle: Use S3 Intelligent-Tiering for unknown patterns, transition to IA after 30 days, archive to Glacier for long-term retention
  3. Minimize Data Transfer: Use VPC endpoints to eliminate internet transfer costs, CloudFront to reduce origin costs, same-region transfers when possible
  4. Right-Size Resources: Use Compute Optimizer recommendations, delete unused resources (unattached volumes, old snapshots), and match instance types to workload
  5. Leverage Serverless: Use Lambda, Fargate, Aurora Serverless, and DynamoDB On-Demand for variable workloads to pay only for actual usage
  6. Monitor and Alert: Set up Cost Explorer for analysis, Budgets for alerts, and cost allocation tags for tracking spending by project/team

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Cost Optimization:

  • Design S3 lifecycle policies to transition objects between storage classes
  • Choose appropriate S3 storage class based on access patterns
  • Calculate cost savings from S3 Intelligent-Tiering
  • Optimize EBS volumes (gp3 vs gp2, delete unattached volumes)
  • Implement data transfer optimization strategies

Compute Cost Optimization:

  • Compare Reserved Instances, Savings Plans, and Spot Instances
  • Calculate cost savings from different pricing models
  • Design Spot Fleet strategies for fault-tolerant workloads
  • Right-size EC2 instances using Compute Optimizer
  • Optimize Lambda costs (memory, timeout, provisioned concurrency)

Database Cost Optimization:

  • Choose between RDS and Aurora based on cost requirements
  • Implement Aurora Serverless for variable workloads
  • Select DynamoDB pricing mode (On-Demand vs Provisioned)
  • Use Reserved Capacity for predictable DynamoDB workloads
  • Optimize database backup retention and snapshot lifecycle

Network Cost Optimization:

  • Calculate data transfer costs between regions and AZs
  • Use VPC endpoints to eliminate internet transfer costs
  • Optimize NAT Gateway usage (single vs per-AZ)
  • Implement CloudFront to reduce origin data transfer
  • Choose cost-effective connectivity (VPN vs Direct Connect)

Cost Monitoring & Management:

  • Set up Cost Explorer for spending analysis
  • Create Budgets with alerts for cost thresholds
  • Implement cost allocation tags for tracking
  • Use Trusted Advisor for cost optimization recommendations
  • Generate Cost and Usage Reports for detailed analysis

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Storage and compute cost optimization)
  • Domain 4 Bundle 2: Questions 26-50 (Database and network cost optimization)
  • Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review EC2 pricing models and when to use each
  • Practice calculating cost savings from Reserved Instances and Savings Plans
  • Focus on understanding S3 storage class transitions
  • Revisit data transfer costs and optimization strategies

Quick Reference Card

EC2 Pricing Models:

  • On-Demand: $0.096/hour (t3.medium), no commitment
  • Reserved (1-year): $0.062/hour (35% savings), upfront payment
  • Reserved (3-year): $0.043/hour (55% savings), upfront payment
  • Savings Plans: 72% savings, flexible instance types
  • Spot: $0.029/hour (70% savings), can be interrupted

S3 Storage Classes:

  • Standard: $0.023/GB, frequent access
  • Intelligent-Tiering: $0.023/GB + $0.0025/1000 objects, automatic
  • Standard-IA: $0.0125/GB, infrequent access (>30 days)
  • One Zone-IA: $0.01/GB, infrequent, single AZ
  • Glacier Flexible: $0.004/GB, archive (minutes-hours retrieval)
  • Glacier Deep Archive: $0.00099/GB, long-term (12 hours retrieval)

Database Pricing:

  • RDS: $0.017/hour (db.t3.micro), storage $0.115/GB-month
  • Aurora: $0.041/hour (db.t3.small), storage $0.10/GB-month
  • Aurora Serverless: $0.06/ACU-hour, auto-scaling
  • DynamoDB On-Demand: $1.25/million writes, $0.25/million reads
  • DynamoDB Provisioned: $0.00065/WCU-hour, $0.00013/RCU-hour
  • DynamoDB Reserved: Up to 77% savings (1-3 year)

Data Transfer Costs:

  • Same AZ: Free
  • Cross-AZ (same region): $0.01/GB in, $0.01/GB out
  • Cross-Region: $0.02/GB
  • Internet Out: $0.09/GB (first 10 TB)
  • VPC Endpoint: $0.01/GB processed, eliminates internet costs
  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • CloudFront: $0.085/GB (first 10 TB), cheaper than S3 direct

Cost Optimization Tools:

  • Cost Explorer: Visualize spending, forecast costs
  • Budgets: Set alerts at thresholds ($100, 80% of budget)
  • Cost and Usage Report: Detailed hourly/daily data
  • Compute Optimizer: Right-sizing recommendations
  • Trusted Advisor: Cost optimization checks (5 free, 50+ with Business/Enterprise)

Common Exam Scenarios:

  • Predictable workload? → Reserved Instances or Savings Plans (72% savings)
  • Fault-tolerant batch? → Spot Instances (90% savings)
  • Unknown access patterns? → S3 Intelligent-Tiering
  • Infrequent access (>30 days)? → S3 Standard-IA or One Zone-IA
  • Long-term archive? → Glacier Flexible or Deep Archive
  • Variable database workload? → Aurora Serverless or DynamoDB On-Demand
  • High S3 data transfer? → VPC endpoint (eliminate transfer costs)
  • Global content delivery? → CloudFront (reduce origin costs)
  • Oversized instances? → Compute Optimizer + right-sizing
  • Unused resources? → Delete unattached volumes, old snapshots

Cost Optimization Checklist:

  • Use Reserved Instances/Savings Plans for steady-state workloads
  • Implement Spot Instances for fault-tolerant batch processing
  • Configure S3 lifecycle policies to transition to cheaper storage
  • Delete unattached EBS volumes and old snapshots
  • Use VPC endpoints to eliminate data transfer costs
  • Right-size instances using Compute Optimizer
  • Enable S3 Intelligent-Tiering for unknown access patterns
  • Use CloudFront to reduce origin data transfer costs
  • Implement cost allocation tags for tracking
  • Set up Budgets with alerts for cost thresholds

You're ready to proceed when you can:

  • Choose the most cost-effective pricing model for each workload
  • Design S3 lifecycle policies to minimize storage costs
  • Calculate cost savings from different optimization strategies
  • Implement VPC endpoints and CloudFront to reduce data transfer costs
  • Use cost monitoring tools to track and optimize spending

Next: Move to Chapter 5: Integration & Advanced Topics to learn about cross-domain scenarios and real-world architectures.


Chapter Summary

What We Covered

This chapter covered the essential concepts for designing cost-optimized architectures on AWS, which accounts for 20% of the SAA-C03 exam. We explored four major task areas:

Task 4.1: Cost-Optimized Storage Solutions

  • āœ… S3 storage classes and lifecycle policies
  • āœ… S3 Intelligent-Tiering for automatic cost optimization
  • āœ… Glacier and Glacier Deep Archive for long-term archival
  • āœ… EBS volume types and cost optimization strategies
  • āœ… EFS lifecycle management and Infrequent Access
  • āœ… Data transfer cost optimization techniques
  • āœ… Backup retention policies and cost management

Task 4.2: Cost-Optimized Compute Solutions

  • āœ… EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • āœ… Reserved Instances types and payment options
  • āœ… Compute Savings Plans vs EC2 Instance Savings Plans
  • āœ… Spot Instances and Spot Fleet strategies
  • āœ… Lambda pricing and cost optimization
  • āœ… Fargate pricing and Fargate Spot
  • āœ… Auto Scaling for cost efficiency
  • āœ… EC2 right-sizing and Compute Optimizer

Task 4.3: Cost-Optimized Database Solutions

  • āœ… RDS pricing models and Reserved Instances
  • āœ… Aurora Serverless for variable workloads
  • āœ… DynamoDB On-Demand vs Provisioned capacity
  • āœ… DynamoDB Reserved Capacity
  • āœ… ElastiCache Reserved Nodes
  • āœ… Database backup and snapshot costs
  • āœ… Read replica cost considerations
  • āœ… Database migration cost optimization

Task 4.4: Cost-Optimized Network Architectures

  • āœ… Data transfer pricing and optimization
  • āœ… NAT Gateway vs NAT Instance cost comparison
  • āœ… VPC endpoints for eliminating data transfer costs
  • āœ… PrivateLink cost considerations
  • āœ… CloudFront for reducing origin costs
  • āœ… Direct Connect vs VPN cost analysis
  • āœ… Load balancer cost optimization
  • āœ… Transit Gateway and VPC peering costs

Critical Takeaways

  1. Compute Pricing Models: On-Demand (flexibility), Reserved Instances (up to 72% savings), Spot (up to 90% savings), Savings Plans (flexible commitment).

  2. Reserved Instances: Standard RI (highest discount, no flexibility), Convertible RI (lower discount, can change instance family), 1-year or 3-year terms.

  3. Savings Plans: Compute Savings Plans (most flexible, any instance family/region), EC2 Instance Savings Plans (higher discount, specific family/region).

  4. Spot Instances: Up to 90% discount, 2-minute interruption notice, best for fault-tolerant batch processing, not for databases or stateful apps.

  5. S3 Storage Classes: Standard ($0.023/GB), Standard-IA ($0.0125/GB, 30-day minimum), One Zone-IA ($0.01/GB, single AZ), Glacier ($0.004/GB, 90-day minimum), Glacier Deep Archive ($0.00099/GB, 180-day minimum).

  6. S3 Lifecycle Policies: Automatically transition objects to cheaper storage classes based on age (e.g., Standard → Standard-IA after 30 days → Glacier after 90 days).

  7. S3 Intelligent-Tiering: Automatic cost optimization for unknown access patterns, $0.0025/1,000 objects monitoring fee, no retrieval fees.

  8. EBS Cost Optimization: Use gp3 instead of gp2 (20% cheaper), delete unattached volumes, delete old snapshots, use st1/sc1 for throughput-intensive workloads.

  9. Lambda Pricing: $0.20 per 1M requests + $0.0000166667 per GB-second, optimize memory allocation (more memory = faster execution = lower cost).

  10. DynamoDB Pricing: On-Demand ($1.25/million writes, $0.25/million reads) for unpredictable, Provisioned ($0.00065/WCU-hour, $0.00013/RCU-hour) for steady-state.

  11. Aurora Serverless: Pay per ACU-hour ($0.06/ACU-hour), auto-scales from 0.5 to 128 ACUs, ideal for variable workloads, can pause when idle.

  12. Data Transfer Costs: Free inbound, $0.09/GB outbound to internet, $0.01/GB between regions, $0.01/GB between AZs, free within same AZ.

  13. VPC Endpoints: Gateway endpoints (S3, DynamoDB) are free, Interface endpoints cost $0.01/hour + $0.01/GB, eliminate data transfer costs to AWS services.

  14. NAT Gateway: $0.045/hour + $0.045/GB processed, NAT instance can be cheaper for low traffic but requires management.

  15. CloudFront Cost Savings: Reduces origin data transfer costs by 60-90%, caches at edge locations, $0.085/GB (cheaper than S3 direct access for global users).

  16. Cost Monitoring: Use Cost Explorer for analysis, Budgets for alerts, Cost Allocation Tags for tracking, Cost and Usage Report for detailed billing.

  17. Right-Sizing: Use Compute Optimizer for recommendations, can save 20-40% by downsizing over-provisioned instances.

  18. Unused Resources: Delete unattached EBS volumes, old snapshots, unused Elastic IPs, idle load balancers, stopped instances (still charged for EBS).

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Compute Cost Optimization:

  • Explain the difference between Reserved Instances and Savings Plans
  • Calculate cost savings from different pricing models
  • Choose appropriate Spot Instance strategies for different workloads
  • Determine when to use Standard vs Convertible Reserved Instances
  • Optimize Lambda costs through memory and timeout configuration
  • Use Compute Optimizer for right-sizing recommendations

Storage Cost Optimization:

  • Design S3 lifecycle policies for automatic cost optimization
  • Select appropriate S3 storage class based on access patterns
  • Explain when to use S3 Intelligent-Tiering
  • Calculate storage costs for different S3 storage classes
  • Optimize EBS costs by selecting appropriate volume types
  • Implement EFS lifecycle management for cost savings

Database Cost Optimization:

  • Choose between RDS On-Demand and Reserved Instances
  • Determine when to use Aurora Serverless vs provisioned Aurora
  • Select DynamoDB On-Demand vs Provisioned capacity mode
  • Calculate DynamoDB Reserved Capacity savings
  • Optimize database backup retention policies
  • Design cost-effective read replica strategies

Network Cost Optimization:

  • Explain data transfer pricing between regions and AZs
  • Calculate cost savings from VPC endpoints
  • Choose between NAT Gateway and NAT Instance
  • Determine when to use CloudFront for cost optimization
  • Compare Direct Connect vs VPN costs
  • Optimize load balancer costs (ALB vs NLB)

Cost Monitoring and Management:

  • Use Cost Explorer to analyze spending patterns
  • Configure Budgets with alerts for cost thresholds
  • Implement cost allocation tags for tracking
  • Analyze Cost and Usage Report for detailed billing
  • Use AWS Cost Anomaly Detection for unusual spending
  • Create cost optimization action plans

Cost Optimization Strategies:

  • Identify and delete unused resources
  • Right-size over-provisioned instances
  • Implement Auto Scaling for variable workloads
  • Use Spot Instances for fault-tolerant workloads
  • Configure S3 lifecycle policies for automatic tiering
  • Implement VPC endpoints to eliminate data transfer costs

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Focus: Storage and compute costs)
  • Domain 4 Bundle 2: Questions 26-50 (Focus: Database and network costs)
  • Full Practice Test 3: Domain 4 questions (Mixed difficulty)

Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review EC2 pricing models and Reserved Instances
  • Focus on S3 storage classes and lifecycle policies
  • Study DynamoDB capacity modes and pricing
  • Practice data transfer cost calculations
  • Review VPC endpoint cost savings

Quick Reference Card

Copy this to your notes for quick review:

EC2 Pricing Models:

  • On-Demand: $0.096/hour (t3.large), no commitment, pay as you go
  • Reserved (1-year): 40% discount, upfront payment options
  • Reserved (3-year): 72% discount, highest savings
  • Spot: Up to 90% discount, 2-min interruption notice
  • Savings Plans: 1-year (40% discount) or 3-year (66% discount)

S3 Storage Classes:

  • Standard: $0.023/GB, frequent access, 99.99% availability
  • Standard-IA: $0.0125/GB, 30-day minimum, $0.01/GB retrieval
  • One Zone-IA: $0.01/GB, single AZ, 30-day minimum
  • Intelligent-Tiering: $0.023/GB + $0.0025/1,000 objects
  • Glacier Flexible: $0.004/GB, 90-day minimum, 1-5 min retrieval
  • Glacier Deep Archive: $0.00099/GB, 180-day minimum, 12-hour retrieval

Database Pricing:

  • RDS On-Demand: $0.136/hour (db.t3.medium)
  • RDS Reserved (3-year): 60% discount
  • Aurora Serverless: $0.06/ACU-hour, auto-scaling
  • DynamoDB On-Demand: $1.25/million writes, $0.25/million reads
  • DynamoDB Provisioned: $0.00065/WCU-hour, $0.00013/RCU-hour
  • ElastiCache Reserved: Up to 55% discount (3-year)

Data Transfer Costs:

  • Inbound: Free
  • Outbound to Internet: $0.09/GB (first 10 TB)
  • Between Regions: $0.02/GB
  • Between AZs: $0.01/GB (in/out)
  • Within Same AZ: Free
  • VPC Endpoint: Free (Gateway), $0.01/hour + $0.01/GB (Interface)

Network Services:

  • NAT Gateway: $0.045/hour + $0.045/GB processed
  • VPC Endpoint (Interface): $0.01/hour + $0.01/GB
  • CloudFront: $0.085/GB (first 10 TB)
  • Direct Connect: $0.30/hour (1 Gbps) + $0.02/GB outbound
  • ALB: $0.0225/hour + $0.008/LCU-hour
  • NLB: $0.0225/hour + $0.006/NLCU-hour

Cost Optimization Checklist:

  • Use Reserved Instances/Savings Plans for steady-state (40-72% savings)
  • Implement Spot Instances for batch processing (up to 90% savings)
  • Configure S3 lifecycle policies (transition to cheaper storage)
  • Delete unattached EBS volumes and old snapshots
  • Use VPC endpoints (eliminate data transfer costs)
  • Right-size instances using Compute Optimizer (20-40% savings)
  • Enable S3 Intelligent-Tiering (automatic optimization)
  • Use CloudFront (reduce origin costs by 60-90%)
  • Implement cost allocation tags (track spending)
  • Set up Budgets with alerts (prevent overspending)
  • Use Aurora Serverless for variable workloads (pay per use)
  • Configure Auto Scaling (scale down during low usage)
  • Use gp3 instead of gp2 (20% cheaper)
  • Delete unused Elastic IPs ($0.005/hour when not attached)
  • Use DynamoDB On-Demand for unpredictable workloads

Common Cost Optimization Scenarios:

  • Steady-state workload? → Reserved Instances or Savings Plans
  • Variable workload? → Auto Scaling + On-Demand or Spot
  • Batch processing? → Spot Instances (up to 90% savings)
  • Infrequent access (>30 days)? → S3 Standard-IA or One Zone-IA
  • Long-term archive? → Glacier Flexible or Deep Archive
  • Variable database workload? → Aurora Serverless or DynamoDB On-Demand
  • High S3 data transfer? → VPC endpoint (eliminate transfer costs)
  • Global content delivery? → CloudFront (reduce origin costs)
  • Oversized instances? → Compute Optimizer + right-sizing
  • Unused resources? → Delete unattached volumes, old snapshots

Congratulations! You've completed Chapter 4: Design Cost-Optimized Architectures. You now understand how to minimize costs while maintaining performance, availability, and security on AWS.

Next Steps:

  1. Complete the self-assessment checklist above
  2. Practice with Domain 4 test bundles
  3. Review any weak areas identified
  4. When ready, proceed to Chapter 5: Integration & Advanced Topics

Chapter Summary

What We Covered

Task 4.1: Cost-Optimized Storage Solutions

  • āœ… S3 storage classes and lifecycle policies
  • āœ… S3 Intelligent-Tiering for automatic optimization
  • āœ… Glacier retrieval options and Deep Archive
  • āœ… EBS volume optimization (gp3 vs gp2)
  • āœ… EFS lifecycle management
  • āœ… Data transfer cost optimization

Task 4.2: Cost-Optimized Compute Solutions

  • āœ… EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • āœ… Reserved Instances vs Savings Plans
  • āœ… Spot Instances for fault-tolerant workloads
  • āœ… Lambda cost optimization
  • āœ… Auto Scaling for cost efficiency
  • āœ… Graviton instances for 20% cost savings

Task 4.3: Cost-Optimized Database Solutions

  • āœ… RDS pricing models and Reserved Instances
  • āœ… Aurora Serverless for variable workloads
  • āœ… DynamoDB On-Demand vs Provisioned capacity
  • āœ… ElastiCache Reserved Nodes
  • āœ… Database right-sizing and optimization

Task 4.4: Cost-Optimized Network Architectures

  • āœ… Data transfer costs (inter-AZ, inter-region, internet)
  • āœ… NAT Gateway vs NAT instance costs
  • āœ… VPC endpoints to eliminate data transfer costs
  • āœ… CloudFront for reduced origin costs
  • āœ… Direct Connect vs VPN cost comparison

Critical Takeaways

  1. Right-Sizing: Use Compute Optimizer to identify oversized resources
  2. Reserved Capacity: Commit to 1-3 years for 40-75% savings on steady workloads
  3. Spot Instances: Use for fault-tolerant workloads (up to 90% savings)
  4. Storage Lifecycle: Automatically transition data to cheaper storage classes
  5. VPC Endpoints: Eliminate data transfer costs for AWS service access
  6. Auto Scaling: Scale down during low usage to reduce costs
  7. Monitoring: Use Cost Explorer, Budgets, and Cost Allocation Tags
  8. Serverless: Pay only for what you use (Lambda, Aurora Serverless, DynamoDB On-Demand)

Self-Assessment Checklist

Test yourself before moving on:

Storage Cost Optimization

  • I can design S3 lifecycle policies for cost optimization
  • I understand when to use each S3 storage class
  • I know how to optimize EBS costs (gp3, snapshot lifecycle)
  • I can explain S3 Intelligent-Tiering benefits
  • I understand Glacier retrieval options and costs

Compute Cost Optimization

  • I can choose between Reserved Instances and Savings Plans
  • I understand when to use Spot Instances
  • I know how to optimize Lambda costs (memory, timeout)
  • I can explain EC2 hibernation for cost savings
  • I understand Graviton instance benefits (20% cheaper)

Database Cost Optimization

  • I can choose between RDS and Aurora based on cost
  • I understand when to use Aurora Serverless
  • I know when to use DynamoDB On-Demand vs Provisioned
  • I can explain database Reserved Instance benefits
  • I understand how to optimize database storage costs

Network Cost Optimization

  • I understand data transfer costs (inter-AZ, inter-region, internet)
  • I know when to use VPC endpoints to reduce costs
  • I can explain NAT Gateway vs NAT instance cost comparison
  • I understand CloudFront cost benefits
  • I know how to optimize Direct Connect costs

Cost Monitoring & Management

  • I can set up AWS Budgets with alerts
  • I understand how to use Cost Explorer for analysis
  • I know how to implement cost allocation tags
  • I can explain Trusted Advisor cost optimization checks
  • I understand how to use Compute Optimizer

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-50 (storage and compute cost optimization)
  • Domain 4 Bundle 2: Questions 51-100 (database and network cost optimization)
  • Full Practice Tests: Focus on cost-related questions (20% of each test)

Expected Score: 70%+ to proceed confidently

If you scored below 70%:

  • Review EC2 pricing models (Reserved, Spot, Savings Plans)
  • Practice calculating cost savings for different scenarios
  • Focus on understanding data transfer costs
  • Revisit storage lifecycle and tiering strategies

Quick Reference Card

Copy this to your notes for quick review:

EC2 Pricing Models:

  • On-Demand: Pay per hour/second, no commitment, highest cost
  • Reserved (1-3 years): 40-75% savings, steady workloads
  • Savings Plans: Flexible, 1-3 years, 40-72% savings
  • Spot: Up to 90% savings, fault-tolerant workloads

S3 Storage Classes:

  • Standard: $0.023/GB, frequent access
  • Standard-IA: $0.0125/GB, infrequent access (>30 days)
  • One Zone-IA: $0.01/GB, infrequent, non-critical
  • Glacier Flexible: $0.004/GB, archive (minutes-hours retrieval)
  • Glacier Deep Archive: $0.00099/GB, long-term archive (12 hours)

Database Cost Optimization:

  • Aurora Serverless: Pay per ACU-second, variable workloads
  • DynamoDB On-Demand: Pay per request, unpredictable workloads
  • RDS Reserved: 40-60% savings, steady workloads
  • Read Replicas: Offload reads, reduce primary instance size

Network Cost Optimization:

  • VPC Endpoints: Eliminate data transfer costs to AWS services
  • CloudFront: Reduce origin data transfer costs
  • NAT Gateway: $0.045/GB processed (consider NAT instance for high volume)
  • Inter-AZ: $0.01/GB (minimize cross-AZ traffic)
  • Inter-Region: $0.02/GB (use same region when possible)

Cost Monitoring Tools:

  • Cost Explorer: Visualize and analyze spending
  • Budgets: Set alerts for spending thresholds
  • Cost Allocation Tags: Track costs by project/team
  • Trusted Advisor: Automated cost optimization recommendations
  • Compute Optimizer: Right-sizing recommendations

Quick Wins:

  • Switch gp2 to gp3 (20% cheaper, better performance)
  • Implement S3 lifecycle policies (auto-tier old data)
  • Use VPC endpoints for S3/DynamoDB (eliminate transfer costs)
  • Enable S3 Intelligent-Tiering (automatic optimization)
  • Delete unattached EBS volumes and old snapshots
  • Use Spot Instances for batch processing (90% savings)
  • Implement cost allocation tags (track spending)
  • Set up Budgets with alerts (prevent overspending)
  • Use Aurora Serverless for variable workloads (pay per use)
  • Configure Auto Scaling (scale down during low usage)
  • Use gp3 instead of gp2 (20% cheaper)
  • Delete unused Elastic IPs ($0.005/hour when not attached)
  • Use DynamoDB On-Demand for unpredictable workloads

Common Cost Optimization Scenarios:

  • Steady-state workload? → Reserved Instances or Savings Plans
  • Variable workload? → Auto Scaling + On-Demand or Spot
  • Batch processing? → Spot Instances (up to 90% savings)
  • Infrequent access (>30 days)? → S3 Standard-IA or One Zone-IA
  • Long-term archive? → Glacier Flexible or Deep Archive
  • Variable database workload? → Aurora Serverless or DynamoDB On-Demand
  • High S3 data transfer? → VPC endpoint (eliminate transfer costs)
  • Global content delivery? → CloudFront (reduce origin costs)
  • Oversized instances? → Compute Optimizer + right-sizing
  • Unused resources? → Delete unattached volumes, old snapshots

Chapter Summary

What We Covered

This chapter covered the four critical task areas for designing cost-optimized architectures on AWS:

āœ… Task 4.1: Cost-Optimized Storage Solutions

  • S3 storage classes and lifecycle policies
  • S3 Intelligent-Tiering for automatic optimization
  • EBS volume optimization (gp3 vs gp2)
  • Glacier and Deep Archive for long-term storage
  • EFS lifecycle management
  • Storage Gateway cost optimization
  • Data transfer cost reduction strategies

āœ… Task 4.2: Cost-Optimized Compute Solutions

  • EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • Reserved Instances vs Savings Plans
  • Spot Instances for fault-tolerant workloads
  • Lambda cost optimization
  • Auto Scaling for cost efficiency
  • Graviton instances for 20% cost savings
  • Right-sizing with Compute Optimizer

āœ… Task 4.3: Cost-Optimized Database Solutions

  • RDS pricing models and Reserved Instances
  • Aurora Serverless for variable workloads
  • DynamoDB On-Demand vs Provisioned capacity
  • Database right-sizing and storage optimization
  • ElastiCache Reserved Nodes
  • Redshift Reserved Nodes and Spectrum
  • Database backup cost optimization

āœ… Task 4.4: Cost-Optimized Network Architectures

  • Data transfer pricing and optimization
  • NAT Gateway vs NAT Instance cost comparison
  • VPC endpoints to eliminate data transfer costs
  • CloudFront for cost-effective content delivery
  • Direct Connect vs VPN cost analysis
  • Load balancer cost optimization
  • Inter-AZ and inter-region data transfer costs

Critical Takeaways

  1. Use the Right Pricing Model: Reserved Instances and Savings Plans for steady-state workloads (up to 72% savings). Spot Instances for fault-tolerant workloads (up to 90% savings).

  2. Implement Lifecycle Policies: Automatically transition S3 objects to cheaper storage classes. Use Intelligent-Tiering for unpredictable access patterns.

  3. Right-Size Resources: Use Compute Optimizer and Trusted Advisor recommendations. Don't over-provision - scale horizontally instead.

  4. Eliminate Data Transfer Costs: Use VPC endpoints for S3 and DynamoDB. Keep data in same region when possible. Use CloudFront to reduce origin costs.

  5. Use Serverless for Variable Workloads: Lambda, Aurora Serverless, and DynamoDB On-Demand eliminate idle capacity costs.

  6. Monitor and Optimize Continuously: Use Cost Explorer to identify trends. Set up Budgets with alerts. Tag resources for cost allocation.

  7. Delete Unused Resources: Unattached EBS volumes, old snapshots, unused Elastic IPs, idle load balancers all cost money.

  8. Choose Cost-Effective Services: gp3 instead of gp2 (20% cheaper), Graviton instances (20% cheaper), S3 Standard-IA for infrequent access.

Self-Assessment Checklist

Test yourself before moving on. You should be able to:

Storage Cost Optimization:

  • Design S3 lifecycle policies to transition objects to cheaper storage classes
  • Choose appropriate S3 storage class for access patterns
  • Configure S3 Intelligent-Tiering for automatic optimization
  • Select Glacier retrieval option based on urgency (Expedited, Standard, Bulk)
  • Optimize EBS costs by switching gp2 to gp3
  • Implement EFS lifecycle management to move to Infrequent Access
  • Calculate data transfer costs and optimize with VPC endpoints
  • Use S3 Requester Pays for shared datasets

Compute Cost Optimization:

  • Choose between Reserved Instances and Savings Plans
  • Calculate break-even point for Reserved Instances
  • Implement Spot Instances for fault-tolerant workloads
  • Configure Spot Fleet with multiple instance types
  • Optimize Lambda costs by adjusting memory allocation
  • Use Auto Scaling to match capacity to demand
  • Implement scheduled scaling for predictable patterns
  • Right-size instances using Compute Optimizer

Database Cost Optimization:

  • Choose between RDS and Aurora based on cost and performance
  • Configure Aurora Serverless for variable workloads
  • Select DynamoDB On-Demand vs Provisioned capacity
  • Purchase DynamoDB Reserved Capacity for predictable workloads
  • Optimize RDS storage with autoscaling
  • Use RDS Reserved Instances for steady-state databases
  • Configure appropriate backup retention periods
  • Implement read replicas only when needed

Network Cost Optimization:

  • Calculate data transfer costs between regions and AZs
  • Use VPC endpoints to eliminate NAT Gateway data transfer costs
  • Choose between NAT Gateway and NAT Instance based on cost
  • Implement CloudFront to reduce data transfer from origin
  • Select appropriate Direct Connect bandwidth
  • Optimize load balancer costs (ALB vs NLB)
  • Use VPC peering instead of Transit Gateway when appropriate
  • Minimize inter-region data transfer

Practice Questions

Try these from your practice test bundles:

Beginner Level (Target: 80%+ correct):

  • Domain 4 Bundle 1: Questions 1-20 (Pricing models, storage classes, basic optimization)
  • Full Practice Test 1: Domain 4 questions (Cost fundamentals)

Intermediate Level (Target: 70%+ correct):

  • Domain 4 Bundle 2: Questions 21-40 (Advanced optimization, Reserved Instances, data transfer)
  • Full Practice Test 2: Domain 4 questions (Mixed difficulty, realistic scenarios)

Advanced Level (Target: 60%+ correct):

  • Full Practice Test 3: Domain 4 questions (Complex cost optimization scenarios)

If You Scored Below Target

Below 60% on Beginner Questions:

  • Review sections: EC2 Pricing Models, S3 Storage Classes, Basic Cost Optimization
  • Focus on: On-Demand vs Reserved vs Spot, S3 lifecycle policies, right-sizing basics
  • Practice: Calculate Reserved Instance savings, design lifecycle policies, use Cost Explorer

Below 60% on Intermediate Questions:

  • Review sections: Savings Plans, Data Transfer Costs, Database Optimization
  • Focus on: Compute vs EC2 Savings Plans, VPC endpoints, Aurora Serverless, DynamoDB capacity modes
  • Practice: Compare pricing models, optimize data transfer, right-size databases

Below 50% on Advanced Questions:

  • Review sections: Complex Cost Architectures, Multi-Service Optimization
  • Focus on: Hybrid pricing strategies, cross-region cost optimization, total cost of ownership
  • Practice: Design cost-optimized multi-tier architecture, calculate TCO, optimize for specific budgets

Quick Reference Card

Copy this to your notes for quick review

EC2 Pricing Models

  • On-Demand: No commitment, highest cost, pay per hour/second
  • Reserved Instances: 1 or 3 year commitment, up to 72% savings, specific instance type
  • Savings Plans: 1 or 3 year commitment, up to 72% savings, flexible instance family
  • Spot Instances: Bid on spare capacity, up to 90% savings, can be interrupted

S3 Storage Classes (Cost Order: Cheapest to Most Expensive)

  1. Glacier Deep Archive: $0.00099/GB/month, 12-hour retrieval, 180-day minimum
  2. Glacier Flexible Retrieval: $0.0036/GB/month, 1-5 minute retrieval, 90-day minimum
  3. Glacier Instant Retrieval: $0.004/GB/month, millisecond retrieval, 90-day minimum
  4. S3 One Zone-IA: $0.01/GB/month, single AZ, 30-day minimum
  5. S3 Standard-IA: $0.0125/GB/month, multi-AZ, 30-day minimum
  6. S3 Intelligent-Tiering: $0.0025/1000 objects monitoring, automatic tiering
  7. S3 Standard: $0.023/GB/month, frequent access, no minimum

Database Pricing

  • RDS On-Demand: Pay per hour, no commitment
  • RDS Reserved: 1 or 3 year, up to 69% savings
  • Aurora Serverless v2: Pay per ACU-hour, auto-scaling
  • DynamoDB On-Demand: Pay per request, unpredictable workloads
  • DynamoDB Provisioned: Pay per RCU/WCU, predictable workloads
  • DynamoDB Reserved: 1 year commitment, up to 77% savings

Data Transfer Costs

  • Within Same AZ: Free
  • Between AZs: $0.01/GB in, $0.01/GB out
  • Between Regions: $0.02/GB out (varies by region)
  • To Internet: $0.09/GB (first 10 TB)
  • From Internet: Free
  • Via VPC Endpoint: Free (S3, DynamoDB)
  • Via CloudFront: $0.085/GB (cheaper than direct)

Cost Optimization Quick Wins

  • Switch gp2 to gp3 (20% cheaper, better performance)
  • Implement S3 lifecycle policies (auto-tier old data)
  • Use VPC endpoints for S3/DynamoDB (eliminate transfer costs)
  • Enable S3 Intelligent-Tiering (automatic optimization)
  • Delete unattached EBS volumes and old snapshots
  • Use Spot Instances for batch processing (90% savings)
  • Implement cost allocation tags (track spending)
  • Set up Budgets with alerts (prevent overspending)
  • Use Aurora Serverless for variable workloads (pay per use)
  • Configure Auto Scaling (scale down during low usage)

Cost Monitoring Tools

  • Cost Explorer: Visualize spending, identify trends, forecast costs
  • AWS Budgets: Set custom budgets, receive alerts, track usage
  • Cost and Usage Report: Detailed billing data, export to S3, analyze with Athena
  • Cost Allocation Tags: Track costs by project, department, environment
  • Compute Optimizer: Right-sizing recommendations for EC2, EBS, Lambda
  • Trusted Advisor: Cost optimization checks, unused resources, Reserved Instance recommendations

Decision Points

Scenario Solution
Steady-state workload Reserved Instances or Savings Plans
Variable workload Auto Scaling + On-Demand or Spot
Batch processing Spot Instances (up to 90% savings)
Infrequent access (>30 days) S3 Standard-IA or One Zone-IA
Long-term archive Glacier Flexible or Deep Archive
Variable database workload Aurora Serverless or DynamoDB On-Demand
High S3 data transfer VPC endpoint (eliminate transfer costs)
Global content delivery CloudFront (reduce origin costs)
Oversized instances Compute Optimizer + right-sizing
Unused resources Delete unattached volumes, old snapshots

Common Cost Optimization Scenarios

  • Steady-state workload: Reserved Instances or Savings Plans (up to 72% savings)
  • Variable workload: Auto Scaling + On-Demand or Spot
  • Batch processing: Spot Instances (up to 90% savings)
  • Infrequent access (>30 days): S3 Standard-IA or One Zone-IA
  • Long-term archive: Glacier Flexible or Deep Archive
  • Variable database workload: Aurora Serverless or DynamoDB On-Demand
  • High S3 data transfer: VPC endpoint (eliminate transfer costs)
  • Global content delivery: CloudFront (reduce origin costs)
  • Oversized instances: Compute Optimizer + right-sizing
  • Unused resources: Delete unattached volumes, old snapshots

Common Exam Traps

  • āŒ Using On-Demand for steady workloads → āœ… Use Reserved Instances or Savings Plans
  • āŒ Not using lifecycle policies → āœ… Automatically tier S3 data
  • āŒ Paying for NAT Gateway data transfer → āœ… Use VPC endpoints
  • āŒ Using gp2 instead of gp3 → āœ… gp3 is 20% cheaper
  • āŒ Not deleting unused resources → āœ… Delete unattached volumes, old snapshots
  • āŒ Not using Spot for batch jobs → āœ… Spot saves up to 90%
  • āŒ Not monitoring costs → āœ… Use Cost Explorer and Budgets
  • āŒ Not tagging resources → āœ… Implement cost allocation tags

Next Chapter: 06_integration - Learn how to integrate concepts across all domains for complex scenarios.


Chapter Summary

What We Covered

This chapter covered the four critical task areas for designing cost-optimized architectures on AWS:

āœ… Task 4.1: Cost-Optimized Storage Solutions

  • S3 storage classes and lifecycle policies
  • S3 Intelligent-Tiering for automatic cost optimization
  • Glacier and Glacier Deep Archive for long-term archival
  • EBS volume optimization (gp3 vs gp2)
  • EFS lifecycle management and Infrequent Access
  • Snapshot lifecycle policies
  • Data transfer cost optimization
  • Storage Gateway for hybrid storage

āœ… Task 4.2: Cost-Optimized Compute Solutions

  • EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • Reserved Instances (Standard, Convertible, Scheduled)
  • Savings Plans (Compute, EC2 Instance)
  • Spot Instances for fault-tolerant workloads
  • Auto Scaling for elastic capacity
  • Lambda cost optimization
  • Fargate Spot for container workloads
  • Graviton instances for better price-performance
  • Right-sizing with Compute Optimizer

āœ… Task 4.3: Cost-Optimized Database Solutions

  • RDS pricing models and Reserved Instances
  • Aurora Serverless for variable workloads
  • DynamoDB pricing modes (On-Demand vs Provisioned)
  • DynamoDB Reserved Capacity
  • ElastiCache Reserved Nodes
  • Database right-sizing and storage autoscaling
  • Read replicas vs Multi-AZ cost considerations
  • Backup retention policies

āœ… Task 4.4: Cost-Optimized Network Architectures

  • Data transfer cost patterns
  • NAT Gateway vs NAT instance cost comparison
  • VPC endpoints to eliminate data transfer costs
  • PrivateLink for private connectivity
  • CloudFront to reduce origin costs
  • Direct Connect for high-volume data transfer
  • Transit Gateway vs VPC peering cost
  • Load balancer cost optimization

Critical Takeaways

  1. Reserved Capacity for Steady Workloads: Use Reserved Instances or Savings Plans for predictable workloads. Save up to 72% compared to On-Demand.

  2. Spot Instances for Fault-Tolerant Workloads: Use Spot for batch processing, data analysis, and stateless applications. Save up to 90% compared to On-Demand.

  3. Storage Lifecycle Policies: Automatically transition S3 objects to cheaper storage classes. Use Intelligent-Tiering when access patterns are unknown.

  4. Right-Size Everything: Use Compute Optimizer to identify oversized resources. Downsize or stop unused resources.

  5. Data Transfer is Expensive: Use VPC endpoints to avoid data transfer charges. Use CloudFront to reduce origin data transfer. Keep data in the same region when possible.

  6. Serverless for Variable Workloads: Aurora Serverless and DynamoDB On-Demand automatically scale and you only pay for what you use.

  7. Monitor and Alert: Use Cost Explorer to identify trends. Set up AWS Budgets to alert on overspending. Use cost allocation tags to track spending by project.

  8. Delete Unused Resources: Regularly audit and delete unattached EBS volumes, old snapshots, unused Elastic IPs, and idle load balancers.

Self-Assessment Checklist

Test yourself before moving on:

  • I understand the difference between Reserved Instances and Savings Plans
  • I know when to use Spot Instances and their limitations
  • I can design S3 lifecycle policies for cost optimization
  • I understand S3 storage class selection criteria
  • I know how to optimize EBS costs (gp3 vs gp2)
  • I can calculate cost savings with Reserved Instances
  • I understand DynamoDB pricing modes (On-Demand vs Provisioned)
  • I know when to use Aurora Serverless
  • I understand data transfer cost patterns
  • I know how VPC endpoints reduce costs
  • I can optimize NAT Gateway costs
  • I understand CloudFront cost benefits
  • I know how to use Cost Explorer and AWS Budgets
  • I can implement cost allocation tags

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle 1: Questions 1-25 (Storage and compute cost optimization)
  • Domain 4 Bundle 2: Questions 1-25 (Database and network cost optimization)
  • Full Practice Test 1: Questions focusing on cost optimization

Expected score: 75%+ to proceed confidently

If you scored below 75%:

  • Review Reserved Instance types and commitment terms
  • Focus on understanding S3 storage class transitions
  • Study data transfer cost patterns (inter-AZ, inter-region, internet)
  • Practice calculating cost savings scenarios

Quick Reference Card

EC2 Pricing Models:

  • On-Demand: Pay per hour/second, no commitment, highest cost
  • Reserved (1 or 3 year): Up to 72% savings, upfront payment options
  • Spot: Up to 90% savings, can be interrupted, for fault-tolerant workloads
  • Savings Plans: Flexible commitment, applies to Lambda and Fargate too

S3 Storage Classes (by cost, cheapest to most expensive):

  1. Glacier Deep Archive: $0.00099/GB-month, 12-hour retrieval
  2. Glacier Flexible Retrieval: $0.0036/GB-month, 1-5 minute retrieval
  3. Intelligent-Tiering Archive: Automatic tiering, no retrieval fees
  4. S3 One Zone-IA: $0.01/GB-month, single AZ, 99.5% availability
  5. S3 Standard-IA: $0.0125/GB-month, multi-AZ, 99.9% availability
  6. S3 Standard: $0.023/GB-month, frequent access, 99.99% availability

Database Cost Optimization:

  • RDS Reserved: Up to 69% savings for 1 or 3 year commitment
  • Aurora Serverless: Pay per ACU-hour, auto-scales, no idle costs
  • DynamoDB On-Demand: Pay per request, no capacity planning
  • DynamoDB Provisioned: Reserve capacity, up to 76% savings with Reserved Capacity

Network Cost Optimization:

  • VPC Endpoints: Eliminate data transfer costs to S3 and DynamoDB
  • CloudFront: Reduce origin data transfer, cache at edge
  • Direct Connect: Lower cost for high-volume data transfer (>1TB/month)
  • Same AZ: Free data transfer within same AZ
  • Cross-AZ: $0.01/GB data transfer
  • Cross-Region: $0.02/GB data transfer

Cost Monitoring Tools:

  • Cost Explorer: Visualize spending, identify trends
  • AWS Budgets: Set alerts, track against budget
  • Cost and Usage Report: Detailed billing data
  • Cost Allocation Tags: Track costs by project/department
  • Compute Optimizer: Right-sizing recommendations
  • Trusted Advisor: Cost optimization checks

Key Decision Points:

  • Steady-state workload → Reserved Instances or Savings Plans
  • Variable workload → Auto Scaling + On-Demand or Spot
  • Batch processing → Spot Instances (up to 90% savings)
  • Infrequent access (>30 days) → S3 Standard-IA or One Zone-IA
  • Long-term archive → Glacier Flexible or Deep Archive
  • Variable database workload → Aurora Serverless or DynamoDB On-Demand
  • High S3 data transfer → VPC endpoint (eliminate transfer costs)
  • Global content delivery → CloudFront (reduce origin costs)

Next Chapter: 06_integration - Learn how to integrate multiple services and design cross-domain solutions.


Integration & Advanced Topics: Putting It All Together

Chapter Overview

This chapter demonstrates how to combine concepts from all four domains to design complete, production-ready AWS architectures. You'll learn to integrate security, resilience, performance, and cost optimization into cohesive solutions.

What you'll learn:

  • Design complete three-tier web applications
  • Build serverless architectures from scratch
  • Implement event-driven systems
  • Create hybrid cloud solutions
  • Design microservices architectures
  • Build data processing pipelines
  • Solve complex cross-domain scenarios

Time to complete: 6-8 hours
Prerequisites: Chapters 1-5 (all domain chapters)


Section 1: Three-Tier Web Application Architecture

Complete Architecture Design

šŸ“Š Three-Tier Architecture Diagram:

graph TB
    subgraph "Presentation Tier"
        CF[CloudFront CDN]
        S3Web[S3 Static Website<br/>HTML/CSS/JS]
        CF --> S3Web
    end
    
    subgraph "Application Tier"
        ALB[Application Load Balancer]
        ASG[Auto Scaling Group]
        EC2_1[EC2 Instance 1]
        EC2_2[EC2 Instance 2]
        EC2_3[EC2 Instance 3]
        
        ALB --> ASG
        ASG --> EC2_1
        ASG --> EC2_2
        ASG --> EC2_3
    end
    
    subgraph "Data Tier"
        RDS[RDS Multi-AZ<br/>Primary + Standby]
        ElastiCache[ElastiCache Redis<br/>Session Store]
        S3Data[S3 Bucket<br/>User Uploads]
    end
    
    CF --> ALB
    EC2_1 --> RDS
    EC2_2 --> RDS
    EC2_3 --> RDS
    EC2_1 --> ElastiCache
    EC2_2 --> ElastiCache
    EC2_3 --> ElastiCache
    EC2_1 --> S3Data
    EC2_2 --> S3Data
    EC2_3 --> S3Data
    
    style CF fill:#ff9800
    style ALB fill:#4caf50
    style RDS fill:#2196f3
    style ElastiCache fill:#f44336

See: diagrams/06_integration_three_tier_architecture.mmd

Diagram Explanation (Comprehensive):
This diagram illustrates a complete three-tier web application architecture that integrates all four exam domains. The Presentation Tier uses CloudFront CDN (Domain 3: Performance) to cache and deliver static content (HTML, CSS, JavaScript) stored in an S3 bucket configured as a static website. CloudFront provides global low-latency access (10-50ms) and reduces load on the application tier. The S3 bucket uses server-side encryption (Domain 1: Security) and versioning for data protection. The Application Tier consists of an Application Load Balancer distributing traffic across an Auto Scaling Group of EC2 instances deployed across three Availability Zones (Domain 2: Resilience). The ALB performs health checks every 30 seconds and automatically removes unhealthy instances. Auto Scaling maintains 3-10 instances based on CPU utilization (target: 70%), ensuring the application handles traffic spikes while minimizing costs (Domain 4: Cost Optimization). EC2 instances run in private subnets with no direct internet access, using NAT Gateways for outbound connectivity. Security Groups allow only HTTPS traffic from the ALB. The Data Tier includes RDS Multi-AZ for the relational database (Domain 2: Resilience), providing automatic failover in 60-120 seconds if the primary fails. ElastiCache Redis stores user sessions, enabling stateless application servers and improving performance by caching frequently accessed data (Domain 3: Performance). S3 stores user-uploaded files with lifecycle policies to transition old files to Glacier after 90 days (Domain 4: Cost Optimization). All data is encrypted at rest using KMS (Domain 1: Security). This architecture achieves 99.99% availability, handles 10,000 requests per second, and costs approximately $2,000/month for a medium-sized application.

Detailed Example 1: E-commerce Platform Implementation
An e-commerce company needs to build a scalable online store that handles 50,000 concurrent users during Black Friday sales. They implement the three-tier architecture as follows: Presentation Tier: CloudFront caches product images, CSS, and JavaScript files for 24 hours (Cache-Control: max-age=86400), reducing origin requests by 95%. The S3 bucket hosts the React single-page application, which makes API calls to the application tier. CloudFront uses Origin Access Identity (OAI) to restrict S3 access, preventing direct bucket access. Application Tier: The ALB routes requests to 20 EC2 instances (m5.large) running Node.js application servers. Auto Scaling is configured with target tracking policy (CPU 70%) and scheduled scaling (scale to 50 instances at 8 AM on Black Friday). EC2 instances use IAM roles to access S3 and RDS without embedded credentials. Security Groups allow HTTPS (443) from ALB only. Data Tier: RDS PostgreSQL (db.r5.2xlarge) Multi-AZ stores product catalog, orders, and customer data. ElastiCache Redis (cache.r5.large) with 3 nodes stores shopping cart sessions and product cache, reducing database queries by 80%. S3 stores product images with CloudFront distribution. During Black Friday, the system handles 100,000 requests per second with 200ms average response time. Auto Scaling adds 30 instances in 10 minutes to handle the spike. Total cost for the day: $500 (mostly EC2 and data transfer), compared to $50,000 potential revenue loss from downtime.

Detailed Example 2: SaaS Application with Multi-Tenancy
A SaaS company provides project management software to 1,000 enterprise customers. They use the three-tier architecture with tenant isolation: Presentation Tier: CloudFront serves the Angular application with custom domain names per tenant (customer1.saas.com, customer2.saas.com) using alternate domain names (CNAMEs). Each tenant's static assets are stored in separate S3 prefixes (s3://saas-app/customer1/, s3://saas-app/customer2/). Application Tier: ALB uses host-based routing to route requests to different target groups based on subdomain. EC2 instances (c5.xlarge) run Java Spring Boot applications with tenant context extracted from JWT tokens. Auto Scaling maintains 5-20 instances based on request count (target: 1000 requests per instance). Data Tier: RDS MySQL (db.r5.xlarge) Multi-AZ uses separate databases per tenant (customer1_db, customer2_db) for data isolation. ElastiCache Redis stores tenant-specific cache with key prefixes (customer1:, customer2:). S3 stores tenant files with bucket policies enforcing tenant isolation. The architecture supports 10,000 concurrent users across all tenants with 99.95% uptime SLA. Cost per tenant: $50/month (shared infrastructure), enabling profitable pricing at $200/month per customer.

Detailed Example 3: Media Streaming Platform
A video streaming platform serves 1 million users watching videos simultaneously. They implement the three-tier architecture optimized for media delivery: Presentation Tier: CloudFront caches video segments (HLS .ts files) at 400+ edge locations worldwide, reducing latency to 10-30ms. S3 stores video files in multiple resolutions (1080p, 720p, 480p, 360p) using Intelligent-Tiering storage class to optimize costs. CloudFront uses signed URLs with 1-hour expiration to prevent unauthorized access. Application Tier: ALB routes API requests (user authentication, video metadata, playback tracking) to 30 EC2 instances (c5.2xlarge) running Python Flask applications. Auto Scaling uses custom CloudWatch metrics (concurrent streams) to scale from 10 to 100 instances during peak hours (8 PM - 11 PM). Data Tier: Aurora PostgreSQL Serverless (1-16 ACUs) stores user profiles, video metadata, and viewing history, automatically scaling based on load. ElastiCache Redis (cache.r5.2xlarge) with 5 read replicas caches video metadata and user sessions, handling 100,000 requests per second. S3 stores 10 PB of video content with lifecycle policies moving old content to Glacier Deep Archive after 2 years (96% cost savings). The platform delivers 10 Gbps of video traffic with 99.99% availability and costs $50,000/month (mostly CloudFront and S3 storage).

⭐ Must Know (Critical Facts):

  • Presentation tier: Use CloudFront + S3 for static content (HTML, CSS, JS, images) - reduces latency and costs
  • Application tier: Use ALB + Auto Scaling + EC2 in private subnets - provides resilience and scalability
  • Data tier: Use RDS Multi-AZ + ElastiCache + S3 - ensures data durability and performance
  • Security: Implement defense in depth (WAF, Security Groups, NACLs, encryption, IAM roles)
  • Resilience: Deploy across 3+ AZs, use Multi-AZ databases, implement health checks
  • Performance: Use caching at multiple layers (CloudFront, ElastiCache, application cache)
  • Cost optimization: Use Auto Scaling, Reserved Instances, S3 lifecycle policies, CloudFront caching

Section 2: Serverless Architecture

Complete Serverless Application

šŸ“Š Serverless Architecture Diagram:

graph TB
    subgraph "Frontend"
        User[Users]
        CF[CloudFront]
        S3[S3 Static Website]
        User --> CF --> S3
    end
    
    subgraph "API Layer"
        APIGW[API Gateway<br/>REST API]
        Cognito[Cognito<br/>Authentication]
        S3 --> APIGW
        APIGW --> Cognito
    end
    
    subgraph "Compute Layer"
        Lambda1[Lambda: Get Items]
        Lambda2[Lambda: Create Item]
        Lambda3[Lambda: Update Item]
        Lambda4[Lambda: Delete Item]
        
        APIGW --> Lambda1
        APIGW --> Lambda2
        APIGW --> Lambda3
        APIGW --> Lambda4
    end
    
    subgraph "Data Layer"
        DDB[DynamoDB Table]
        S3Data[S3 Bucket<br/>File Storage]
        
        Lambda1 --> DDB
        Lambda2 --> DDB
        Lambda3 --> DDB
        Lambda4 --> DDB
        Lambda2 --> S3Data
    end
    
    style CF fill:#ff9800
    style APIGW fill:#4caf50
    style Lambda1 fill:#9c27b0
    style Lambda2 fill:#9c27b0
    style Lambda3 fill:#9c27b0
    style Lambda4 fill:#9c27b0
    style DDB fill:#2196f3

See: diagrams/06_integration_serverless_architecture.mmd

Diagram Explanation (Comprehensive):
This diagram shows a complete serverless application architecture that eliminates server management and scales automatically. The Frontend consists of a React single-page application hosted on S3 and delivered via CloudFront CDN. Users access the application through CloudFront, which caches static assets (HTML, CSS, JavaScript) at edge locations worldwide. The API Layer uses API Gateway to expose RESTful endpoints (/items GET, POST, PUT, DELETE) that the frontend calls. API Gateway integrates with Cognito User Pools for authentication - users must include a JWT token in the Authorization header. API Gateway validates tokens and rejects unauthorized requests before invoking Lambda functions. The Compute Layer consists of four Lambda functions, each handling a specific operation (CRUD operations on items). Lambda functions are stateless and scale automatically - AWS can run 1,000 concurrent executions simultaneously to handle traffic spikes. Each function has an IAM execution role granting permissions to access DynamoDB and S3. The Data Layer uses DynamoDB for structured data (items table with partition key: itemId) and S3 for file storage (user-uploaded images). DynamoDB provides single-digit millisecond latency and scales automatically to handle any request volume. This architecture has zero servers to manage, scales from 0 to millions of requests automatically, and costs only for actual usage (no idle costs). A typical application with 1 million requests per month costs approximately $50 (API Gateway: $3.50, Lambda: $20, DynamoDB: $25, S3: $1, CloudFront: $0.50).

Detailed Example 1: Todo List Application
A startup builds a todo list application using serverless architecture. Frontend: React application hosted on S3 (s3://todo-app-frontend/) and delivered via CloudFront. The application makes API calls to API Gateway endpoints. Authentication: Cognito User Pool manages user registration, login, and password reset. Users sign up with email/password, receive verification emails, and get JWT tokens upon login. The frontend stores tokens in localStorage and includes them in API requests. API Layer: API Gateway exposes 5 endpoints: GET /todos (list todos), POST /todos (create todo), PUT /todos/{id} (update todo), DELETE /todos/{id} (delete todo), GET /todos/{id} (get single todo). Each endpoint has a Lambda authorizer that validates JWT tokens. Compute Layer: Five Lambda functions (Node.js 18) handle CRUD operations. Each function is allocated 512 MB memory (equivalent to 0.5 vCPU) and has a 30-second timeout. Functions use AWS SDK to interact with DynamoDB. Data Layer: DynamoDB table (todos) with partition key userId and sort key todoId, enabling efficient queries for all todos belonging to a user. The table uses on-demand billing, automatically scaling to handle any request volume. The application supports 10,000 users with 100,000 todos, costs $30/month, and requires zero server management. Deployment uses AWS SAM (Serverless Application Model) with infrastructure as code.

Detailed Example 2: Image Processing Service
A company builds an image processing service using serverless architecture. Frontend: Vue.js application on S3 allows users to upload images. Authentication: Cognito User Pool with social identity providers (Google, Facebook) for easy sign-up. API Layer: API Gateway exposes POST /images endpoint for image uploads. The endpoint returns a pre-signed S3 URL, allowing direct upload from browser to S3 (bypassing API Gateway's 10 MB payload limit). Compute Layer: Three Lambda functions: (1) Upload Lambda generates pre-signed URLs for S3 uploads, (2) Process Lambda (triggered by S3 event) creates thumbnails (100x100, 300x300, 600x600) using Sharp library, (3) Metadata Lambda extracts EXIF data and stores it in DynamoDB. Data Layer: S3 bucket (images-original) stores original images, S3 bucket (images-processed) stores thumbnails, DynamoDB table (image-metadata) stores metadata. The Process Lambda is allocated 3 GB memory (2 vCPUs) to handle image processing quickly. The service processes 10,000 images per day, costs $100/month (mostly Lambda compute for image processing), and scales automatically during traffic spikes. Users upload images directly to S3 (no API Gateway bottleneck), and processing completes in 5 seconds on average.

Detailed Example 3: Real-Time Chat Application
A company builds a real-time chat application using serverless architecture with WebSocket support. Frontend: React application on S3 uses WebSocket API to maintain persistent connections. Authentication: Cognito User Pool with MFA for secure authentication. API Layer: API Gateway WebSocket API with three routes: $connect (establish connection), $disconnect (close connection), sendMessage (send chat message). Compute Layer: Three Lambda functions: (1) Connect Lambda stores connection ID in DynamoDB when users connect, (2) Disconnect Lambda removes connection ID when users disconnect, (3) SendMessage Lambda receives messages, stores them in DynamoDB, and broadcasts to all connected users using API Gateway Management API. Data Layer: DynamoDB table (connections) stores active WebSocket connections (connectionId, userId, timestamp), DynamoDB table (messages) stores chat history (roomId, timestamp, userId, message). The SendMessage Lambda queries the connections table to find all users in the chat room and sends messages to each connection. The application supports 1,000 concurrent users with 10,000 messages per hour, costs $50/month, and provides real-time messaging with < 100ms latency. WebSocket connections can stay open for up to 2 hours before automatic reconnection.

⭐ Must Know (Critical Facts):

  • Serverless benefits: No server management, automatic scaling, pay-per-use pricing, high availability built-in
  • API Gateway: Exposes REST and WebSocket APIs, handles authentication, throttling, caching, CORS
  • Lambda: Stateless functions, 15-minute timeout, 10 GB memory max, 1,000 concurrent executions default
  • Cognito: User authentication, JWT tokens, social identity providers, MFA support
  • DynamoDB: NoSQL database, single-digit millisecond latency, automatic scaling, on-demand or provisioned billing
  • S3 pre-signed URLs: Allow direct uploads from browser to S3, bypassing API Gateway payload limits
  • Cold starts: First invocation takes 1-5 seconds (initialize runtime), subsequent invocations take 10-100ms
  • Cost model: API Gateway ($3.50 per million requests), Lambda ($0.20 per million requests + $0.0000166667 per GB-second), DynamoDB ($1.25 per million writes, $0.25 per million reads)

Section 3: Event-Driven Architecture

Event-Driven Processing Pipeline

šŸ“Š Event-Driven Architecture Diagram:

sequenceDiagram
    participant User
    participant S3
    participant EventBridge
    participant Lambda1 as Lambda: Thumbnail
    participant Lambda2 as Lambda: Metadata
    participant SQS
    participant Lambda3 as Lambda: ML Analysis
    participant DDB as DynamoDB

    User->>S3: Upload image
    S3->>EventBridge: ObjectCreated event
    
    EventBridge->>Lambda1: Trigger (sync)
    Lambda1->>S3: Create thumbnail
    Lambda1->>DDB: Store thumbnail URL
    
    EventBridge->>Lambda2: Trigger (sync)
    Lambda2->>DDB: Extract & store metadata
    
    EventBridge->>SQS: Queue for ML processing
    SQS->>Lambda3: Batch processing
    Lambda3->>Lambda3: ML image analysis
    Lambda3->>DDB: Store tags & labels
    
    DDB-->>User: Image fully processed

See: diagrams/06_integration_event_driven_architecture.mmd

Diagram Explanation (Comprehensive):
This sequence diagram illustrates an event-driven architecture where a single event (image upload) triggers multiple independent processing workflows. When a User uploads an image to S3, S3 emits an ObjectCreated event to EventBridge. EventBridge evaluates the event against multiple rules and routes it to three different targets simultaneously: (1) Lambda Thumbnail function is invoked synchronously to create thumbnail images (100x100, 300x300) and stores thumbnail URLs in DynamoDB, (2) Lambda Metadata function is invoked synchronously to extract EXIF data (camera model, GPS coordinates, timestamp) and stores it in DynamoDB, (3) SQS queue receives the event for asynchronous ML processing. The SQS queue buffers events and Lambda ML Analysis function polls the queue in batches of 10 messages. This function performs computationally expensive ML image analysis (object detection, facial recognition, scene classification) using Amazon Rekognition and stores results in DynamoDB. The event-driven pattern decouples components - if the ML function fails, it doesn't affect thumbnail generation or metadata extraction. Each component scales independently based on its workload. EventBridge provides at-least-once delivery with automatic retries, ensuring no events are lost. The architecture processes 10,000 images per hour with 5-second average latency for thumbnails and 30-second average latency for ML analysis. Cost is approximately $200/month (mostly Lambda compute for ML processing and Rekognition API calls).

Detailed Example 1: E-commerce Order Processing
An e-commerce platform uses event-driven architecture to process orders. When a customer places an order, the Order Service publishes an "OrderPlaced" event to EventBridge. EventBridge fans out to multiple subscribers: (1) Payment Lambda charges the credit card and publishes "PaymentCompleted" event, (2) Inventory Lambda reserves items and publishes "InventoryReserved" event, (3) Shipping Lambda creates shipping label and publishes "ShippingLabelCreated" event, (4) Email Lambda sends order confirmation to customer, (5) Analytics SQS queue receives event for business intelligence processing. Each service is independent and can be deployed, scaled, and updated separately. If the email service is down, it doesn't affect payment or shipping. EventBridge's event archive feature stores all events for 90 days, allowing replay for debugging or reprocessing. The system processes 10,000 orders per day with 2-second average order confirmation time (parallel processing) compared to 10 seconds with sequential processing. Event-driven architecture reduces coupling between services and improves resilience - if one service fails, others continue operating.

Detailed Example 2: IoT Data Processing
An IoT platform collects sensor data from 100,000 devices and processes it using event-driven architecture. Devices publish temperature readings to AWS IoT Core every minute. IoT Core routes events to EventBridge based on rules (e.g., temperature > 80°F triggers alert rule). EventBridge fans out to multiple targets: (1) Lambda Alert function sends SNS notifications to operations team for high temperatures, (2) Kinesis Firehose streams all data to S3 for long-term storage and analysis, (3) Lambda Aggregation function calculates hourly averages and stores them in DynamoDB, (4) SQS queue buffers events for ML anomaly detection. The ML Lambda function polls SQS in batches of 100 messages and uses Amazon Lookout for Equipment to detect anomalies. The event-driven pattern allows adding new consumers without modifying IoT devices or existing consumers. When the company adds a new dashboard, they simply add another EventBridge rule routing to a new Lambda function. The system processes 6 million events per hour (100,000 devices Ɨ 60 minutes) with < 1 second latency for alerts and costs $500/month (mostly IoT Core message processing and S3 storage).

Detailed Example 3: Video Transcoding Pipeline
A video platform uses event-driven architecture for video transcoding. When a user uploads a video to S3, S3 emits an ObjectCreated event to EventBridge. EventBridge routes the event to multiple targets: (1) Lambda Validation function checks video format and duration, rejecting invalid videos, (2) Step Functions workflow orchestrates the transcoding process: (a) Lambda Extract function extracts video metadata (resolution, codec, duration), (b) MediaConvert job transcodes video to multiple formats (1080p, 720p, 480p, 360p) and stores outputs in S3, (c) Lambda Thumbnail function generates video thumbnails at 10-second intervals, (d) Lambda Notification function sends completion email to user. (3) DynamoDB Streams captures changes to the video metadata table and triggers Lambda Analytics function to update video statistics. The event-driven pattern allows the transcoding workflow to scale independently - MediaConvert can process 100 videos simultaneously while Lambda functions scale to 1,000 concurrent executions. The system processes 1,000 videos per day with 10-minute average transcoding time and costs $1,000/month (mostly MediaConvert transcoding costs).

⭐ Must Know (Critical Facts):

  • Event-driven benefits: Loose coupling, independent scaling, asynchronous processing, resilience to failures
  • EventBridge: Central event bus, pattern matching, multiple targets per rule, event archive and replay
  • Decoupling: Producers don't know about consumers, consumers can be added/removed without affecting producers
  • Asynchronous processing: Long-running tasks (ML, transcoding) don't block user requests
  • Scalability: Each component scales independently based on its workload
  • Resilience: If one consumer fails, others continue processing (no cascading failures)
  • Event replay: Archive events and replay them for debugging or reprocessing
  • Cost model: Pay only for events processed, no idle costs for unused capacity

Section 4: Hybrid Cloud Architecture

On-Premises to AWS Integration

šŸ“Š Hybrid Cloud Architecture Diagram:

graph TB
    subgraph "On-Premises Data Center"
        OnPrem[Corporate Network]
        AD[Active Directory]
        App[Legacy Application]
        OnPrem --> AD
        OnPrem --> App
    end
    
    subgraph "AWS Cloud"
        subgraph "Connectivity"
            DX[Direct Connect<br/>10 Gbps]
            VPN[VPN Backup<br/>1.25 Gbps]
        end
        
        subgraph "Directory Services"
            ADConnector[AD Connector<br/>Proxy to On-Prem AD]
        end
        
        subgraph "Compute"
            EC2[EC2 Instances<br/>Cloud Applications]
        end
        
        subgraph "Storage"
            SGW[Storage Gateway<br/>File Gateway]
            S3[S3 Bucket<br/>Cloud Storage]
            SGW --> S3
        end
    end
    
    OnPrem --> DX
    OnPrem -.Backup.-> VPN
    DX --> ADConnector
    VPN -.-> ADConnector
    ADConnector --> AD
    EC2 --> ADConnector
    App --> SGW
    
    style DX fill:#ff9800
    style VPN fill:#fff3e0
    style ADConnector fill:#4caf50
    style SGW fill:#2196f3

See: diagrams/06_integration_hybrid_cloud_architecture.mmd

Diagram Explanation (Comprehensive):
This diagram shows a hybrid cloud architecture connecting on-premises infrastructure to AWS. The On-Premises Data Center contains the corporate network, Active Directory (AD) for user authentication, and legacy applications that can't be migrated to the cloud. Connectivity is established through AWS Direct Connect (10 Gbps dedicated connection) for primary connectivity and Site-to-Site VPN (1.25 Gbps over internet) as backup. Direct Connect provides consistent network performance (1-2ms latency) and reduced data transfer costs ($0.02/GB vs $0.09/GB for internet). The VPN backup ensures connectivity if Direct Connect fails. Directory Services uses AD Connector, which acts as a proxy to the on-premises Active Directory. EC2 instances in AWS can authenticate users against on-premises AD without replicating the directory to AWS. This enables single sign-on (SSO) - users log in with their corporate credentials. Compute consists of EC2 instances running cloud-native applications that need to authenticate users. Storage uses Storage Gateway File Gateway, which presents an NFS/SMB file share to on-premises applications. Files written to the gateway are automatically uploaded to S3 and cached locally for low-latency access. This allows legacy applications to use cloud storage without modification. The hybrid architecture enables gradual cloud migration - new applications run in AWS while legacy applications remain on-premises. Total cost: $5,000/month (Direct Connect: $2,000, VPN: $100, AD Connector: $200, Storage Gateway: $200, EC2: $2,000, S3: $500).

Detailed Example 1: Enterprise File Sharing
A company with 5,000 employees uses hybrid cloud for file sharing. On-Premises: Employees access file shares on Windows File Servers (10 TB of data). AWS: Storage Gateway File Gateway is deployed on-premises as a VM. The gateway presents an SMB file share to employees, caching frequently accessed files locally (1 TB cache). Files are automatically uploaded to S3 (s3://company-files/) with lifecycle policies moving old files to Glacier after 90 days. Connectivity: Direct Connect (10 Gbps) provides high-bandwidth connection for file uploads. Benefits: (1) Unlimited cloud storage - no need to provision additional on-premises storage, (2) Disaster recovery - files are replicated to S3 across multiple AZs, (3) Cost savings - Glacier storage costs $0.00099/GB-month vs $0.10/GB-month for on-premises SAN, (4) Remote access - employees can access files from AWS WorkSpaces or EC2 instances. The company saves $50,000/year on storage costs and improves disaster recovery (RPO: 1 hour, RTO: 4 hours).

Detailed Example 2: Hybrid Active Directory
A company with 10,000 employees uses hybrid cloud for identity management. On-Premises: Active Directory Domain Services (AD DS) manages user accounts, groups, and policies. AWS: AD Connector proxies authentication requests to on-premises AD. EC2 instances running Windows Server join the domain through AD Connector. Connectivity: Direct Connect (10 Gbps) with VPN backup ensures reliable connectivity. Use Cases: (1) EC2 instances authenticate users against corporate AD, (2) AWS Management Console uses AD credentials for SSO, (3) RDS SQL Server uses Windows Authentication with AD users, (4) Amazon WorkSpaces uses AD credentials for user login. Benefits: (1) Single source of truth - no need to replicate AD to AWS, (2) Centralized management - IT manages users in one place, (3) Compliance - meets requirements for centralized identity management, (4) Cost savings - no need for AWS Managed Microsoft AD ($2/hour). The company saves $15,000/year on directory services costs and simplifies user management.

Detailed Example 3: Disaster Recovery for On-Premises Applications
A company uses hybrid cloud for disaster recovery of on-premises applications. On-Premises: Production applications run on VMware vSphere (100 VMs). AWS: AWS Application Migration Service (MGN) continuously replicates VMs to AWS. Replicated VMs are stored as EBS snapshots in a staging area. Connectivity: Direct Connect (10 Gbps) provides high-bandwidth replication. DR Strategy: Pilot Light - only replication infrastructure runs in AWS (cost: $500/month). During a disaster, the company launches EC2 instances from EBS snapshots (RTO: 1 hour, RPO: 15 minutes). Testing: The company performs quarterly DR drills by launching test instances in an isolated VPC. Benefits: (1) Low cost - pay only for EBS snapshots ($0.05/GB-month) and replication, (2) Fast recovery - launch instances in 15 minutes, (3) No data loss - continuous replication with 15-minute RPO, (4) Compliance - meets regulatory requirements for disaster recovery. The company saves $100,000/year compared to maintaining a secondary data center.

⭐ Must Know (Critical Facts):

  • Direct Connect: Dedicated connection, 1-100 Gbps, consistent latency, reduced data transfer costs ($0.02/GB)
  • VPN: Encrypted tunnel over internet, up to 1.25 Gbps per tunnel, $0.05/hour, backup for Direct Connect
  • AD Connector: Proxy to on-premises AD, $0.05/hour per directory, supports SSO and domain join
  • Storage Gateway: File Gateway (NFS/SMB), Volume Gateway (iSCSI), Tape Gateway (VTL)
  • Hybrid benefits: Gradual migration, leverage existing investments, meet compliance requirements
  • Use cases: Disaster recovery, cloud bursting, data archival, hybrid applications
  • Cost optimization: Use Direct Connect for high-volume data transfer, VPN for low-volume or backup

Section 5: Microservices Architecture

Container-Based Microservices

šŸ“Š Microservices Architecture Diagram:

graph TB
    subgraph "API Gateway"
        APIGW[API Gateway<br/>Single Entry Point]
    end
    
    subgraph "Service Mesh"
        UserSvc[User Service<br/>ECS Fargate]
        OrderSvc[Order Service<br/>ECS Fargate]
        ProductSvc[Product Service<br/>ECS Fargate]
        PaymentSvc[Payment Service<br/>ECS Fargate]
    end
    
    subgraph "Data Stores"
        UserDB[(User DB<br/>RDS)]
        OrderDB[(Order DB<br/>DynamoDB)]
        ProductDB[(Product DB<br/>Aurora)]
        PaymentDB[(Payment DB<br/>RDS)]
    end
    
    subgraph "Messaging"
        SNS[SNS Topic<br/>Order Events]
        SQS1[SQS: Inventory]
        SQS2[SQS: Shipping]
        SQS3[SQS: Notifications]
    end
    
    APIGW --> UserSvc
    APIGW --> OrderSvc
    APIGW --> ProductSvc
    APIGW --> PaymentSvc
    
    UserSvc --> UserDB
    OrderSvc --> OrderDB
    ProductSvc --> ProductDB
    PaymentSvc --> PaymentDB
    
    OrderSvc --> SNS
    SNS --> SQS1
    SNS --> SQS2
    SNS --> SQS3
    
    style APIGW fill:#ff9800
    style UserSvc fill:#4caf50
    style OrderSvc fill:#4caf50
    style ProductSvc fill:#4caf50
    style PaymentSvc fill:#4caf50
    style SNS fill:#f44336

See: diagrams/06_integration_microservices_architecture.mmd

Diagram Explanation (Comprehensive):
This diagram illustrates a microservices architecture where the application is decomposed into independent services, each with its own database (database per service pattern). API Gateway serves as the single entry point, routing requests to appropriate microservices based on URL path (/users/* → User Service, /orders/* → Order Service, /products/* → Product Service, /payments/* → Payment Service). Each microservice runs on ECS Fargate (serverless containers), eliminating server management. Services scale independently - the Order Service can scale to 20 tasks during peak hours while the User Service maintains 5 tasks. Each service has its own database optimized for its use case: User Service uses RDS PostgreSQL for relational user data, Order Service uses DynamoDB for high-throughput order processing, Product Service uses Aurora for complex product catalog queries, Payment Service uses RDS MySQL for transactional payment data. Services communicate asynchronously through SNS/SQS for loose coupling. When an order is placed, the Order Service publishes an event to SNS, which fans out to three SQS queues: Inventory queue (reserve items), Shipping queue (create shipping label), Notifications queue (send confirmation email). This event-driven communication prevents cascading failures - if the shipping service is down, it doesn't affect order placement. The architecture enables independent deployment, scaling, and technology choices per service. Cost: $3,000/month (ECS Fargate: $2,000, RDS/Aurora: $800, DynamoDB: $100, API Gateway: $50, SNS/SQS: $50).

Detailed Example 1: E-commerce Platform Microservices
An e-commerce company decomposes its monolithic application into microservices. User Service (Node.js, 5 Fargate tasks, 0.5 vCPU, 1 GB RAM each) manages user registration, authentication, and profiles. It uses RDS PostgreSQL (db.t3.medium) for user data. Product Service (Java Spring Boot, 10 Fargate tasks, 1 vCPU, 2 GB RAM each) manages product catalog with complex search and filtering. It uses Aurora PostgreSQL (db.r5.large) with 2 read replicas for read-heavy workload. Order Service (Python Flask, 20 Fargate tasks, 1 vCPU, 2 GB RAM each) handles order placement and tracking. It uses DynamoDB (on-demand billing) for high-throughput writes (1,000 orders per minute). Payment Service (Go, 5 Fargate tasks, 0.5 vCPU, 1 GB RAM each) processes payments through Stripe API. It uses RDS MySQL (db.t3.small) for payment records. Benefits: (1) Independent scaling - Order Service scales to 50 tasks during Black Friday while others remain at baseline, (2) Independent deployment - Product Service can be updated without affecting Order Service, (3) Technology diversity - each service uses the best language/database for its needs, (4) Fault isolation - if Payment Service fails, users can still browse products and add to cart. Challenges: (1) Distributed transactions - order placement involves multiple services (order, payment, inventory), solved using Saga pattern with compensating transactions, (2) Service discovery - services find each other using AWS Cloud Map, (3) Monitoring - distributed tracing using AWS X-Ray to track requests across services.

⭐ Must Know (Critical Facts):

  • Microservices benefits: Independent deployment, scaling, technology choices, fault isolation
  • Database per service: Each service owns its database, no shared databases
  • API Gateway: Single entry point, routing, authentication, throttling, caching
  • ECS Fargate: Serverless containers, no EC2 management, pay per vCPU/GB-second
  • Asynchronous communication: SNS/SQS for loose coupling, prevents cascading failures
  • Service discovery: AWS Cloud Map or ECS Service Discovery for finding services
  • Distributed tracing: AWS X-Ray for tracking requests across services
  • Challenges: Distributed transactions, eventual consistency, increased complexity

Section 6: Data Processing Pipeline

Real-Time Data Analytics

šŸ“Š Data Pipeline Architecture Diagram:

graph LR
    subgraph "Data Sources"
        App[Application Logs]
        IoT[IoT Sensors]
        DB[Database CDC]
    end
    
    subgraph "Ingestion"
        Kinesis[Kinesis Data Streams<br/>Real-time ingestion]
        Firehose[Kinesis Firehose<br/>Batch delivery]
    end
    
    subgraph "Processing"
        Lambda[Lambda<br/>Transform]
        Glue[AWS Glue<br/>ETL Jobs]
    end
    
    subgraph "Storage"
        S3Raw[S3 Raw Data<br/>Data Lake]
        S3Processed[S3 Processed<br/>Parquet Format]
    end
    
    subgraph "Analytics"
        Athena[Athena<br/>SQL Queries]
        QuickSight[QuickSight<br/>Dashboards]
        Redshift[Redshift<br/>Data Warehouse]
    end
    
    App --> Kinesis
    IoT --> Kinesis
    DB --> Kinesis
    
    Kinesis --> Lambda
    Lambda --> Firehose
    Firehose --> S3Raw
    
    S3Raw --> Glue
    Glue --> S3Processed
    
    S3Processed --> Athena
    S3Processed --> Redshift
    Athena --> QuickSight
    Redshift --> QuickSight
    
    style Kinesis fill:#ff9800
    style Lambda fill:#9c27b0
    style S3Processed fill:#4caf50
    style QuickSight fill:#2196f3

See: diagrams/06_integration_data_pipeline_architecture.mmd

Diagram Explanation (Comprehensive):
This diagram shows a complete data processing pipeline for real-time analytics. Data Sources include application logs (web server access logs), IoT sensors (temperature, humidity readings), and database change data capture (CDC) from RDS. Ingestion uses Kinesis Data Streams to collect data in real-time. Producers send records to Kinesis shards (each shard handles 1 MB/sec input, 2 MB/sec output). Processing uses Lambda functions to transform data (parse logs, enrich with metadata, filter invalid records) and Kinesis Firehose to batch and deliver data to S3. Firehose buffers data for 60 seconds or 5 MB (whichever comes first) before writing to S3, reducing S3 PUT requests and costs. Storage uses S3 as a data lake. Raw data is stored in JSON format (s3://data-lake/raw/), and AWS Glue ETL jobs transform it to Parquet format (s3://data-lake/processed/) for efficient querying. Parquet is columnar format, reducing query costs by 90% compared to JSON. Analytics uses Athena for ad-hoc SQL queries on S3 data (serverless, pay per query), Redshift for complex analytics and aggregations (data warehouse), and QuickSight for interactive dashboards. The pipeline processes 1 million records per hour with < 5 minute latency from ingestion to availability in Athena. Cost: $1,000/month (Kinesis: $400, Lambda: $100, S3: $200, Glue: $100, Athena: $100, Redshift: $100).

Detailed Example 1: Web Analytics Pipeline
A media company processes web server logs for real-time analytics. Ingestion: Web servers (100 EC2 instances) send access logs to Kinesis Data Streams (10 shards, 10 MB/sec total throughput). Each log entry contains timestamp, user ID, page URL, response time, user agent. Processing: Lambda function (512 MB, 30-second timeout) parses logs, extracts fields, enriches with geolocation data (from IP address), and filters bot traffic. Kinesis Firehose buffers transformed logs and delivers to S3 every 60 seconds. Storage: S3 stores raw logs (JSON) and processed logs (Parquet). Glue Crawler automatically discovers schema and creates Glue Data Catalog tables. Analytics: Athena queries processed logs for ad-hoc analysis (e.g., "top 10 pages by traffic"). QuickSight dashboards show real-time metrics (page views per minute, average response time, geographic distribution). Redshift loads daily aggregates for historical analysis. Benefits: (1) Real-time visibility - dashboards update every minute, (2) Cost-effective - Athena charges $5 per TB scanned, Parquet reduces scans by 90%, (3) Scalable - handles 10x traffic spikes automatically, (4) Flexible - can add new analytics without changing ingestion. The pipeline processes 100 million log entries per day and costs $500/month.

⭐ Must Know (Critical Facts):

  • Kinesis Data Streams: Real-time ingestion, 1 MB/sec per shard, 24-hour to 365-day retention
  • Kinesis Firehose: Batch delivery to S3/Redshift/Elasticsearch, automatic scaling, 60-second buffer
  • Lambda: Transform data in real-time, 15-minute timeout, 10 GB memory max
  • AWS Glue: Serverless ETL, discovers schema, transforms data, creates Data Catalog
  • S3 Data Lake: Store raw and processed data, lifecycle policies for cost optimization
  • Parquet format: Columnar format, 90% smaller than JSON, 90% cheaper to query
  • Athena: Serverless SQL queries on S3, $5 per TB scanned, no infrastructure
  • QuickSight: BI dashboards, $9/user/month, integrates with Athena/Redshift

Chapter Summary

What We Covered

  • āœ… Three-tier web application architecture (presentation, application, data tiers)
  • āœ… Serverless architecture (API Gateway, Lambda, DynamoDB, Cognito)
  • āœ… Event-driven architecture (EventBridge, SNS, SQS, asynchronous processing)
  • āœ… Hybrid cloud architecture (Direct Connect, VPN, AD Connector, Storage Gateway)
  • āœ… Microservices architecture (ECS Fargate, database per service, API Gateway)
  • āœ… Data processing pipeline (Kinesis, Lambda, Glue, S3, Athena, QuickSight)

Critical Takeaways

  1. Integration patterns: Combine services from all domains to create complete solutions
  2. Loose coupling: Use messaging (SNS/SQS) and events (EventBridge) to decouple components
  3. Scalability: Design for independent scaling of components (Auto Scaling, Lambda, DynamoDB)
  4. Resilience: Deploy across multiple AZs, use Multi-AZ databases, implement health checks
  5. Security: Defense in depth (WAF, Security Groups, encryption, IAM roles)
  6. Cost optimization: Use serverless where possible, Reserved Instances for steady-state, lifecycle policies

Self-Assessment Checklist

Test yourself before moving on:

  • I can design a complete three-tier web application with all AWS services
  • I can explain when to use serverless vs container-based architectures
  • I can design event-driven systems with proper decoupling
  • I can integrate on-premises infrastructure with AWS (hybrid cloud)
  • I can decompose monoliths into microservices with appropriate patterns
  • I can design data processing pipelines for real-time analytics
  • I can explain trade-offs between different architectural patterns

Practice Questions

Try these from your practice test bundles:

  • Integration Bundle: Questions 1-20
  • Cross-Domain Scenarios: Questions 1-30
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: Focus on areas where you missed questions
  • Key topics to strengthen:
    • Service integration patterns
    • Event-driven vs request-response
    • Hybrid cloud connectivity options
    • Microservices communication patterns
    • Data pipeline components

Next Chapter: 07_study_strategies - Study Techniques & Test-Taking Strategies


Chapter Summary

What We Covered

This integration chapter brought together concepts from all four domains:

  • āœ… Cross-Domain Scenarios: Real-world architectures combining security, resilience, performance, and cost
  • āœ… Multi-Service Integration: How AWS services work together in production systems
  • āœ… Architecture Patterns: Three-tier, microservices, serverless, event-driven, and hybrid architectures
  • āœ… Migration Strategies: The 7 Rs (Rehost, Replatform, Repurchase, Refactor, Retire, Retain, Relocate)
  • āœ… Well-Architected Review: Applying all six pillars to complete solutions
  • āœ… Common Patterns: CI/CD pipelines, data lakes, disaster recovery, and hybrid cloud

Critical Takeaways

  1. Holistic Thinking: Exam questions often test multiple domains simultaneously
  2. Trade-offs: Every architecture decision involves trade-offs between security, performance, cost, and complexity
  3. Service Integration: Understanding how services work together is more important than knowing individual services
  4. Real-World Scenarios: Practice with realistic scenarios that combine multiple requirements
  5. Well-Architected: Use the framework to evaluate and improve architectures

Self-Assessment Checklist

Test your integration knowledge:

  • Can you design a complete three-tier web application with security, HA, and cost optimization?
  • Can you explain how to migrate an on-premises application to AWS?
  • Can you design a disaster recovery solution that meets specific RTO/RPO requirements?
  • Can you architect a serverless data processing pipeline?
  • Can you apply all six Well-Architected pillars to evaluate an architecture?
  • Can you identify and resolve architectural anti-patterns?

If you scored below 80% on practice tests: Review the specific domains where you're weak.

If you scored 80%+ on practice tests: You're ready for final exam preparation!


Next Steps: Proceed to 07_study_strategies to learn effective study techniques and test-taking strategies.


Study Strategies & Test-Taking Techniques

Effective Study Techniques

The 3-Pass Method

Pass 1: Understanding (Weeks 1-6)

  • Read each domain chapter thoroughly
  • Take notes on ⭐ Must Know items
  • Complete self-assessment checklists
  • Score 70%+ on domain-specific practice tests

Pass 2: Application (Week 7-8)

  • Review chapter summaries only
  • Focus on decision frameworks and comparison tables
  • Practice full-length tests (target: 75%+)
  • Review incorrect answers and understand why

Pass 3: Reinforcement (Week 9-10)

  • Review flagged items and weak areas
  • Memorize key facts (limits, pricing, features)
  • Final practice tests (target: 80%+)
  • Review cheat sheet daily

Active Learning Techniques

  1. Teach Someone: Explain concepts out loud (even to yourself)
  2. Draw Diagrams: Visualize architectures and data flows
  3. Write Scenarios: Create your own exam questions
  4. Compare Options: Use comparison tables to understand differences

Memory Aids

Mnemonic for S3 Storage Classes (by cost):
"Deep Glaciers Grow One Standard Size"

  • Deep Archive ($0.00099)
  • Glacier Flexible ($0.0036)
  • Glacier Instant ($0.004)
  • One Zone-IA ($0.01)
  • Standard-IA ($0.0125)
  • Standard ($0.023)

Mnemonic for EC2 Instance Families:
"Tiny Machines Compute Rapidly, Instances Process Graphics Fast"

  • T: Burstable (T3)
  • M: General Purpose (M5)
  • C: Compute Optimized (C5)
  • R: Memory Optimized (R5)
  • I: Storage Optimized (I3)
  • P: GPU (P3)
  • G: Graphics (G4)
  • F: FPGA (F1)

Test-Taking Strategies

Time Management

  • Total time: 130 minutes
  • Total questions: 65
  • Time per question: 2 minutes average

Strategy:

  • First pass (60 min): Answer all easy questions, flag difficult ones
  • Second pass (40 min): Tackle flagged questions
  • Final pass (30 min): Review marked answers, check for mistakes

Question Analysis Method

Step 1: Read the scenario (30 seconds)

  • Identify: Company type, current situation, problem
  • Note: Key requirements (security, cost, performance, resilience)

Step 2: Identify constraints (15 seconds)

  • Cost requirements ("most cost-effective")
  • Performance needs ("low latency", "high throughput")
  • Compliance ("PCI-DSS", "HIPAA")
  • Operational overhead ("least operational overhead")

Step 3: Eliminate wrong answers (30 seconds)

  • Remove options that violate constraints
  • Eliminate technically incorrect options
  • Remove options that don't address the problem

Step 4: Choose best answer (45 seconds)

  • Select option that best meets ALL requirements
  • If tied, choose option with least complexity/cost

Handling Difficult Questions

When stuck:

  1. Eliminate obviously wrong answers (reduce to 2-3 options)
  2. Look for constraint keywords ("most cost-effective" → cheapest option)
  3. Choose AWS-recommended solution (managed services, Multi-AZ, encryption)
  4. Flag and move on if unsure (don't waste time)

āš ļø Never: Spend more than 3 minutes on one question initially

Common Question Patterns

Pattern 1: "Most cost-effective solution"

  • Look for: Lifecycle policies, Spot Instances, Savings Plans, S3 storage classes
  • Eliminate: Expensive options (On-Demand, Provisioned IOPS, Standard storage)

Pattern 2: "Highest availability"

  • Look for: Multi-AZ, Multi-Region, Auto Scaling, Load Balancers
  • Eliminate: Single AZ, single instance, no redundancy

Pattern 3: "Lowest latency"

  • Look for: CloudFront, ElastiCache, DAX, Read Replicas, edge locations
  • Eliminate: Cross-region calls, no caching, single region

Pattern 4: "Most secure"

  • Look for: Encryption (KMS), IAM roles, Security Groups, WAF, MFA
  • Eliminate: Public access, embedded credentials, no encryption

Exam Day Preparation

Day Before Exam

  • Review: Cheat sheet (1 hour), chapter summaries (1 hour)
  • Don't: Try to learn new topics
  • Do: Get 8 hours sleep, prepare materials

Morning of Exam

  • Light review: Cheat sheet (30 minutes)
  • Eat: Good breakfast
  • Arrive: 30 minutes early

Brain Dump Strategy

When exam starts, immediately write down on scratch paper:

  • S3 storage class prices (Deep Archive $0.00099 → Standard $0.023)
  • EC2 pricing discounts (Spot 90%, Savings Plans 72%, Reserved 60%)
  • RDS Multi-AZ failover time (60-120 seconds)
  • Key service limits (5,500 GET/sec per S3 prefix, 3,500 PUT/sec)

During Exam

  • Follow time management strategy (first pass, second pass, final pass)
  • Use scratch paper for diagrams and calculations
  • Flag questions for review (don't get stuck)
  • Trust your preparation (first instinct often correct)

Advanced Study Techniques

Spaced Repetition System

What it is: Review material at increasing intervals to maximize retention.

How to implement:

  • Day 1: Learn new concept (read chapter section)
  • Day 2: Review concept (read notes)
  • Day 4: Test yourself (practice questions)
  • Day 7: Review again (if correct, move to next interval)
  • Day 14: Final review (if correct, consider mastered)

šŸ“Š Spaced Repetition Schedule:

gantt
    title Spaced Repetition Study Schedule
    dateFormat YYYY-MM-DD
    section Domain 1
    Initial Learning    :2025-01-01, 7d
    First Review       :2025-01-08, 1d
    Second Review      :2025-01-11, 1d
    Third Review       :2025-01-15, 1d
    Final Review       :2025-01-22, 1d
    section Domain 2
    Initial Learning    :2025-01-08, 7d
    First Review       :2025-01-15, 1d
    Second Review      :2025-01-18, 1d
    Third Review       :2025-01-22, 1d
    Final Review       :2025-01-29, 1d

See: diagrams/07_study_strategies_spaced_repetition.mmd

Why it works: Spacing reviews forces your brain to work harder to recall information, strengthening memory pathways.

The Feynman Technique

Step 1: Choose a concept (e.g., "RDS Multi-AZ")

Step 2: Explain it simply (as if teaching a 10-year-old):
"RDS Multi-AZ is like having two identical databases in different buildings. If one building has a problem, the other one automatically takes over so your application keeps working."

Step 3: Identify gaps (where you struggled to explain):

  • How does failover actually work?
  • How long does it take?
  • What triggers failover?

Step 4: Review and simplify (go back to study materials, fill gaps, try again)

Step 5: Use analogies (make it relatable):
"Multi-AZ is like having a backup generator that automatically kicks in when power fails."

Interleaved Practice

What it is: Mix different topics in one study session instead of focusing on one topic.

Traditional approach (blocked practice):

  • Monday: Study only S3 (2 hours)
  • Tuesday: Study only EC2 (2 hours)
  • Wednesday: Study only RDS (2 hours)

Interleaved approach (better retention):

  • Monday: S3 (40 min) → EC2 (40 min) → RDS (40 min)
  • Tuesday: EC2 (40 min) → RDS (40 min) → S3 (40 min)
  • Wednesday: RDS (40 min) → S3 (40 min) → EC2 (40 min)

Why it works: Forces your brain to discriminate between concepts and choose the right approach for each problem (like the actual exam).

Elaborative Interrogation

Technique: Ask yourself "why" questions about facts.

Example:

  • Fact: "S3 Standard-IA is cheaper than S3 Standard"
  • Why?: Because AWS assumes you'll access it less frequently, so they charge less for storage but more for retrieval
  • Why does that matter?: It helps me choose the right storage class based on access patterns
  • When would I use it?: For data accessed less than once a month but needs immediate access when requested

Practice questions to ask:

  • Why does this service exist?
  • Why would I choose this over alternatives?
  • Why does this limitation exist?
  • Why is this the best practice?

Retrieval Practice

What it is: Testing yourself BEFORE you feel ready (not just reviewing notes).

How to implement:

  1. Read a chapter section (e.g., "Lambda Concurrency")
  2. Close the book immediately
  3. Write down everything you remember (no peeking!)
  4. Check your notes (identify what you missed)
  5. Repeat (focus on what you missed)

Why it works: The act of retrieving information strengthens memory more than passive review.

Tools:

  • Flashcards (physical or digital)
  • Practice questions (from this package)
  • Self-quizzing (write questions for yourself)
  • Teach someone (forces retrieval)

Domain-Specific Study Strategies

Domain 1: Security (30% of exam)

Focus areas:

  • IAM policies (understand policy evaluation logic)
  • VPC security (Security Groups vs NACLs)
  • Encryption (KMS, at-rest, in-transit)
  • Compliance (AWS services for different frameworks)

Study approach:

  1. Master IAM first (foundation for everything)
  2. Draw VPC diagrams (visualize security layers)
  3. Practice policy writing (hands-on with IAM Policy Simulator)
  4. Memorize encryption options (which services support what)

Common mistakes to avoid:

  • Confusing Security Groups (stateful) with NACLs (stateless)
  • Forgetting that IAM is global (not region-specific)
  • Not understanding policy evaluation order (explicit deny wins)

šŸ“Š Security Study Priority:

graph TD
    A[Start Security Study] --> B[IAM Fundamentals]
    B --> C[VPC Security]
    C --> D[Encryption & KMS]
    D --> E[Compliance Services]
    E --> F[Practice Questions]
    
    B --> B1[Users, Groups, Roles]
    B --> B2[Policies & Permissions]
    B --> B3[MFA & Access Keys]
    
    C --> C1[Security Groups]
    C --> C2[NACLs]
    C --> C3[VPC Flow Logs]
    
    D --> D1[KMS Keys]
    D --> D2[S3 Encryption]
    D --> D3[EBS/RDS Encryption]
    
    style B fill:#ffcccc
    style C fill:#ffddcc
    style D fill:#ffeecc
    style E fill:#ffffcc

See: diagrams/07_study_strategies_security_priority.mmd

Domain 2: Resilience (26% of exam)

Focus areas:

  • Multi-AZ deployments
  • Auto Scaling
  • Load balancing
  • Disaster recovery strategies
  • Decoupling (SQS, SNS, EventBridge)

Study approach:

  1. Understand RTO/RPO (drives DR strategy selection)
  2. Practice architecture diagrams (draw HA architectures)
  3. Compare DR strategies (backup/restore vs pilot light vs warm standby vs active-active)
  4. Master decoupling patterns (when to use SQS vs SNS vs EventBridge)

Common mistakes to avoid:

  • Confusing Multi-AZ (HA) with Read Replicas (performance)
  • Not understanding Auto Scaling cooldown periods
  • Forgetting that ELB health checks can trigger Auto Scaling

šŸ“Š Resilience Study Progression:

graph LR
    A[Week 1-2: HA Basics] --> B[Week 3: Auto Scaling]
    B --> C[Week 4: Load Balancing]
    C --> D[Week 5: DR Strategies]
    D --> E[Week 6: Decoupling]
    E --> F[Week 7: Practice]
    
    A --> A1[Multi-AZ]
    A --> A2[Availability Zones]
    
    B --> B1[Dynamic Scaling]
    B --> B2[Predictive Scaling]
    
    C --> C1[ALB vs NLB]
    C --> C2[Health Checks]
    
    D --> D1[RTO/RPO]
    D --> D2[4 DR Strategies]
    
    E --> E1[SQS]
    E --> E2[SNS]
    E --> E3[EventBridge]
    
    style A fill:#c8e6c9
    style B fill:#a5d6a7
    style C fill:#81c784
    style D fill:#66bb6a
    style E fill:#4caf50
    style F fill:#388e3c

See: diagrams/07_study_strategies_resilience_progression.mmd

Domain 3: Performance (24% of exam)

Focus areas:

  • Storage performance (IOPS, throughput)
  • Compute optimization (instance types, Lambda)
  • Database performance (caching, read replicas)
  • Network optimization (CloudFront, Global Accelerator)
  • Data ingestion (Kinesis, Glue)

Study approach:

  1. Memorize instance types (T, M, C, R, I, P, G families)
  2. Understand IOPS calculations (gp3, io2, Provisioned IOPS)
  3. Compare caching options (ElastiCache, DAX, CloudFront)
  4. Practice service selection (when to use what)

Common mistakes to avoid:

  • Confusing EBS volume types (gp2 vs gp3 vs io2)
  • Not understanding Lambda memory = CPU allocation
  • Forgetting that CloudFront caches at edge locations (not origin)

šŸ“Š Performance Optimization Decision Tree:

graph TD
    A[Performance Issue?] --> B{What layer?}
    B -->|Storage| C{Access pattern?}
    B -->|Compute| D{Workload type?}
    B -->|Database| E{Read or write heavy?}
    B -->|Network| F{Geographic distribution?}
    
    C -->|Sequential| C1[HDD: st1, sc1]
    C -->|Random| C2[SSD: gp3, io2]
    
    D -->|Steady| D1[EC2 Reserved]
    D -->|Variable| D2[Auto Scaling]
    D -->|Event-driven| D3[Lambda]
    
    E -->|Read-heavy| E1[Read Replicas + ElastiCache]
    E -->|Write-heavy| E2[Provisioned IOPS + Write Sharding]
    
    F -->|Global| F1[CloudFront + Global Accelerator]
    F -->|Regional| F2[Regional Edge Caches]
    
    style C1 fill:#ffcccc
    style C2 fill:#ccffcc
    style D1 fill:#ccccff
    style D2 fill:#ffffcc
    style D3 fill:#ffccff
    style E1 fill:#ccffff
    style E2 fill:#ffddcc
    style F1 fill:#ddffcc
    style F2 fill:#ccddff

See: diagrams/07_study_strategies_performance_decision.mmd

Domain 4: Cost Optimization (20% of exam)

Focus areas:

  • EC2 pricing models (On-Demand, Reserved, Spot, Savings Plans)
  • S3 storage classes and lifecycle policies
  • Database cost optimization
  • Data transfer costs
  • Cost monitoring tools

Study approach:

  1. Memorize pricing discounts (Spot 90%, Savings Plans 72%, Reserved 60%)
  2. Understand S3 lifecycle transitions (Standard → IA → Glacier → Deep Archive)
  3. Compare Reserved Instance types (Standard, Convertible, Scheduled)
  4. Learn data transfer costs (inter-AZ, inter-region, internet)

Common mistakes to avoid:

  • Confusing Savings Plans (flexible) with Reserved Instances (specific)
  • Not understanding S3 minimum storage duration charges
  • Forgetting that NAT Gateway has data processing charges

šŸ“Š Cost Optimization Study Map:

mindmap
  root((Cost Optimization))
    Compute
      EC2 Pricing
        On-Demand
        Reserved Instances
        Spot Instances
        Savings Plans
      Lambda Pricing
        Requests
        Duration
        Memory
      Auto Scaling
        Right-sizing
        Scheduled scaling
    Storage
      S3 Storage Classes
        Standard
        IA
        Glacier
        Deep Archive
      EBS Optimization
        gp3 vs gp2
        Volume types
        Snapshot lifecycle
      Lifecycle Policies
        Transition rules
        Expiration rules
    Database
      RDS Pricing
        Instance types
        Reserved Instances
        Storage autoscaling
      DynamoDB
        On-Demand
        Provisioned
        Reserved Capacity
      Caching
        ElastiCache
        DAX
    Network
      Data Transfer
        Inter-AZ
        Inter-Region
        Internet egress
      NAT Gateway
        Data processing
        Hourly charges
      VPC Endpoints
        Cost savings

See: diagrams/07_study_strategies_cost_optimization_map.mmd

Exam Question Analysis Framework

The STAR Method for Scenario Questions

S - Situation: What's the current state?

  • Company type (startup, enterprise, government)
  • Current architecture (on-premises, hybrid, cloud)
  • Problem statement (what's not working)

T - Task: What needs to be achieved?

  • Business requirements (cost, time, compliance)
  • Technical requirements (performance, scalability, security)
  • Constraints (budget, timeline, skills)

A - Action: What solutions are proposed?

  • Evaluate each answer option
  • Check if it addresses the task
  • Verify it fits within constraints

R - Result: What's the expected outcome?

  • Does it solve the problem?
  • Does it meet all requirements?
  • Is it the BEST solution (not just A solution)?

šŸ“Š STAR Method Application:

sequenceDiagram
    participant Q as Question
    participant S as Situation
    participant T as Task
    participant A as Action
    participant R as Result
    
    Q->>S: Read scenario
    S->>S: Identify: Company, Current State, Problem
    S->>T: Extract requirements
    T->>T: List: Business + Technical + Constraints
    T->>A: Evaluate options
    A->>A: Check each answer against requirements
    A->>R: Select best option
    R->>R: Verify: Solves problem + Meets requirements + Best choice
    R->>Q: Choose answer

See: diagrams/07_study_strategies_star_method.mmd

Keyword Recognition Strategy

Cost keywords (choose cheapest option):

  • "most cost-effective"
  • "minimize cost"
  • "lowest cost"
  • "reduce expenses"

Performance keywords (choose fastest option):

  • "lowest latency"
  • "highest throughput"
  • "best performance"
  • "fastest"

Security keywords (choose most secure option):

  • "most secure"
  • "comply with"
  • "encrypt"
  • "least privilege"

Operational keywords (choose simplest option):

  • "least operational overhead"
  • "minimal management"
  • "fully managed"
  • "automated"

Availability keywords (choose most resilient option):

  • "highly available"
  • "fault-tolerant"
  • "disaster recovery"
  • "minimize downtime"

Elimination Strategy

Step 1: Eliminate obviously wrong answers (reduce to 2-3 options)

  • Technically impossible (service doesn't support that feature)
  • Violates stated constraints (too expensive, wrong region)
  • Doesn't address the problem (solves different issue)

Step 2: Eliminate "almost right" answers (reduce to 1-2 options)

  • Partially correct (addresses some requirements but not all)
  • Overengineered (more complex than needed)
  • Underengineered (doesn't meet scale requirements)

Step 3: Choose the BEST answer (final selection)

  • Meets ALL requirements
  • Follows AWS best practices
  • Most cost-effective among remaining options
  • Least operational overhead

šŸ“Š Elimination Process:

graph TD
    A[4 Answer Options] --> B{Step 1: Obviously Wrong?}
    B -->|Yes| C[Eliminate]
    B -->|No| D[Keep]
    
    D --> E{Step 2: Partially Correct?}
    E -->|Yes| F[Eliminate]
    E -->|No| G[Keep]
    
    G --> H{Step 3: Best Option?}
    H -->|Meets all requirements| I[SELECT]
    H -->|Missing requirements| J[Eliminate]
    
    C --> K[Remaining: 2-3 options]
    F --> L[Remaining: 1-2 options]
    I --> M[Final Answer]
    
    style C fill:#ffcccc
    style F fill:#ffddcc
    style I fill:#ccffcc
    style M fill:#66bb6a

See: diagrams/07_study_strategies_elimination_process.mmd

Practice Test Strategy

Progressive Difficulty Approach

Week 1-2: Beginner Tests

  • Take: Beginner Practice Test 1
  • Target score: 60%+
  • Focus: Understanding basic concepts
  • Review: All incorrect answers

Week 3-4: Intermediate Tests

  • Take: Intermediate Practice Test 1
  • Target score: 65%+
  • Focus: Applying concepts to scenarios
  • Review: Incorrect + flagged answers

Week 5-6: Advanced Tests

  • Take: Advanced Practice Test 1
  • Target score: 70%+
  • Focus: Complex multi-service scenarios
  • Review: All answers (even correct ones)

Week 7-8: Full Practice Tests

  • Take: Full Practice Test 1 (mixed difficulty)
  • Target score: 75%+
  • Focus: Time management + endurance
  • Review: Weak domains

Week 9: Final Practice

  • Take: Full Practice Test 2 & 3
  • Target score: 80%+
  • Focus: Exam readiness
  • Review: Only missed questions

Review Strategy for Incorrect Answers

For each incorrect answer:

  1. Read the explanation (understand why you were wrong)
  2. Identify the gap (what concept did you miss?)
  3. Review the chapter (go back to study guide)
  4. Create a flashcard (for future review)
  5. Find similar questions (practice the same concept)

Track your mistakes:

  • Keep a "mistake log" (question ID, topic, why you got it wrong)
  • Identify patterns (always miss IAM policy questions?)
  • Focus study time on weak areas

Simulated Exam Conditions

2 weeks before exam: Take practice tests under real conditions

  • Time limit: 130 minutes (no pausing)
  • Environment: Quiet room, no distractions
  • No resources: No notes, no internet, no study guide
  • Breaks: None (build endurance)
  • Review: Only after completing all 65 questions

Why it matters: Builds exam stamina and time management skills

Mental Preparation Strategies

Managing Test Anxiety

Before the exam:

  • Visualize success: Imagine yourself calmly answering questions
  • Positive self-talk: "I've prepared well, I know this material"
  • Physical preparation: Exercise, eat well, sleep 8 hours

During the exam:

  • Deep breathing: 4 counts in, hold 4, 4 counts out (if stressed)
  • Positive reframing: "This is challenging" not "This is impossible"
  • Focus on process: One question at a time, don't think about score

If you panic:

  1. Close your eyes (10 seconds)
  2. Take 3 deep breaths
  3. Read the question again (slowly)
  4. Eliminate one wrong answer (builds momentum)
  5. Continue (you've got this)

Building Confidence

Confidence comes from:

  • Preparation: You've studied 60,000+ words of content
  • Practice: You've answered 500+ practice questions
  • Knowledge: You understand the concepts deeply
  • Experience: You've taken multiple practice tests

Confidence boosters:

  • Review your practice test scores (see your improvement)
  • Skim chapter summaries (remind yourself what you know)
  • Read success stories (others have done this, so can you)

Growth Mindset

Fixed mindset (avoid):

  • "I'm not good at cloud computing"
  • "I'll never understand IAM policies"
  • "This exam is too hard for me"

Growth mindset (embrace):

  • "I'm learning cloud computing"
  • "IAM policies are challenging, but I'm improving"
  • "This exam is difficult, but I'm preparing well"

Remember: Intelligence and skills are developed through effort, not fixed traits.

Final Week Strategy

Day 7 (One week before)

  • Morning: Full Practice Test 3 (130 minutes)
  • Afternoon: Review all incorrect answers (2 hours)
  • Evening: Review Domain 1 chapter summary (1 hour)

Day 6

  • Morning: Domain-focused tests (weak domains)
  • Afternoon: Review Domain 2 chapter summary
  • Evening: Create final flashcards for weak areas

Day 5

  • Morning: Service-focused tests (weak services)
  • Afternoon: Review Domain 3 chapter summary
  • Evening: Review cheat sheet

Day 4

  • Morning: Timed practice (30 questions in 60 minutes)
  • Afternoon: Review Domain 4 chapter summary
  • Evening: Review integration patterns

Day 3

  • Morning: Review all chapter summaries (3 hours)
  • Afternoon: Review cheat sheet (1 hour)
  • Evening: Light review, early sleep

Day 2

  • Morning: Review cheat sheet only (1 hour)
  • Afternoon: Review flashcards (1 hour)
  • Evening: Relax, no studying

Day 1 (Day before exam)

  • Morning: Light review of cheat sheet (30 minutes)
  • Afternoon: Prepare materials (ID, confirmation, directions)
  • Evening: Relax, watch a movie, early sleep (8 hours)

Exam Day

  • Morning: Light breakfast, review brain dump items (15 minutes)
  • Arrive: 30 minutes early
  • During: Follow time management strategy
  • After: Celebrate! You've earned it!

Chapter Summary

What We Covered

  • āœ… Effective study techniques (spaced repetition, Feynman, interleaved practice)
  • āœ… Domain-specific study strategies (security, resilience, performance, cost)
  • āœ… Exam question analysis framework (STAR method, keyword recognition)
  • āœ… Elimination strategy (3-step process to find best answer)
  • āœ… Practice test strategy (progressive difficulty, review process)
  • āœ… Mental preparation (managing anxiety, building confidence)
  • āœ… Final week strategy (day-by-day plan)

Critical Takeaways

  1. Active learning beats passive review: Test yourself, teach others, draw diagrams
  2. Spaced repetition maximizes retention: Review at increasing intervals
  3. Interleaved practice improves discrimination: Mix topics in study sessions
  4. STAR method for scenarios: Situation → Task → Action → Result
  5. Keyword recognition guides answer selection: Cost, performance, security, operational, availability
  6. Elimination strategy: Remove obviously wrong → partially correct → choose best
  7. Progressive practice tests: Beginner → Intermediate → Advanced → Full
  8. Mental preparation matters: Manage anxiety, build confidence, growth mindset

Self-Assessment Checklist

Test yourself before exam day:

  • I have a study schedule and I'm following it
  • I'm using active learning techniques (not just reading)
  • I'm scoring 75%+ on practice tests
  • I can recognize question patterns and keywords
  • I can eliminate wrong answers systematically
  • I've reviewed all incorrect answers and understand why
  • I've identified and strengthened my weak areas
  • I'm managing test anxiety effectively
  • I have a final week plan and I'm ready to execute it

Practice Questions

Try these from your practice test bundles:

  • Take a full practice test under timed conditions
  • Review using the strategies from this chapter
  • Track your improvement over time
  • Expected score: 80%+ before exam day

If you scored below 80%:

  • Review sections: Focus on weak domains
  • Apply study techniques: Spaced repetition, Feynman technique
  • Practice more: Take additional domain-focused tests
  • Strengthen weak areas: Review relevant chapters

Next Chapter: 08_final_checklist - Final Week Preparation Checklist


Chapter Summary

What We Covered

This chapter provided strategies for effective learning and exam success:

  • āœ… Study Techniques: Active recall, spaced repetition, and hands-on practice
  • āœ… Time Management: Creating a study schedule and managing exam time
  • āœ… Question Analysis: How to read and interpret exam questions
  • āœ… Elimination Strategies: Identifying and eliminating wrong answers
  • āœ… Common Traps: Recognizing and avoiding common exam pitfalls
  • āœ… Practice Approach: How to use practice tests effectively
  • āœ… Final Preparation: Last-week strategies and exam day tips

Critical Takeaways

  1. Active Learning: Don't just read - practice, build, and teach concepts
  2. Spaced Repetition: Review material multiple times over weeks, not cramming
  3. Hands-On Practice: Build real architectures in AWS (use Free Tier)
  4. Question Keywords: Look for constraint keywords (cost-effective, highly available, secure)
  5. Elimination: Remove obviously wrong answers first, then choose best remaining option
  6. Time Management: 2 minutes per question, flag difficult ones for review
  7. Practice Tests: Take multiple full-length tests, review all incorrect answers
  8. Confidence: Trust your preparation, don't second-guess yourself

Self-Assessment Checklist

Evaluate your exam readiness:

  • Have you completed all four domain chapters?
  • Have you taken at least 3 full-length practice tests?
  • Are you scoring 75%+ consistently on practice tests?
  • Can you complete 65 questions in 130 minutes comfortably?
  • Have you reviewed all incorrect answers and understood why?
  • Do you understand common question patterns and traps?
  • Have you practiced hands-on with AWS services?
  • Are you confident in your test-taking strategies?

If you answered "no" to any: Address those areas before scheduling your exam.

If you answered "yes" to all: You're ready to schedule your exam!


Next Steps: Proceed to 08_final_checklist for your final week preparation checklist.

Chapter Summary

Key Study Strategies Covered

Effective Learning Techniques:

  • āœ… Active learning through hands-on practice
  • āœ… Spaced repetition for long-term retention
  • āœ… The 3-pass method (understanding, application, reinforcement)
  • āœ… Teaching concepts to solidify understanding
  • āœ… Drawing diagrams to visualize architectures

Test-Taking Strategies:

  • āœ… Time management (2 minutes per question)
  • āœ… Question analysis method (STEM, constraints, elimination)
  • āœ… Keyword recognition (cost-effective, highly available, secure)
  • āœ… Elimination techniques for difficult questions
  • āœ… Handling scenario-based questions

Exam Preparation:

  • āœ… Practice test strategy (multiple full-length tests)
  • āœ… Mistake analysis and learning from errors
  • āœ… Final week preparation checklist
  • āœ… Exam day tips and mental preparation

Critical Takeaways

  1. Active Learning: Don't just read - practice, build, and teach concepts
  2. Spaced Repetition: Review material multiple times over weeks, not cramming
  3. Hands-On Practice: Build real architectures in AWS (use Free Tier)
  4. Question Keywords: Look for constraint keywords (cost-effective, highly available, secure)
  5. Elimination: Remove obviously wrong answers first, then choose best remaining option
  6. Time Management: 2 minutes per question, flag difficult ones for review
  7. Practice Tests: Take multiple full-length tests, review all incorrect answers
  8. Confidence: Trust your preparation, don't second-guess yourself

Self-Assessment Checklist

Evaluate your exam readiness:

  • Have you completed all four domain chapters?
  • Have you taken at least 3 full-length practice tests?
  • Are you scoring 75%+ consistently on practice tests?
  • Can you complete 65 questions in 130 minutes comfortably?
  • Have you reviewed all incorrect answers and understood why?
  • Do you understand common question patterns and traps?
  • Have you practiced hands-on with AWS services?
  • Are you confident in your test-taking strategies?

If you answered "no" to any: Address those areas before scheduling your exam.

If you answered "yes" to all: You're ready to schedule your exam!



Final Week Checklist

7 Days Before Exam

Knowledge Audit

Go through this comprehensive checklist to identify any remaining gaps:

Domain 1: Design Secure Architectures (30%)

  • IAM: I can explain policy evaluation logic (explicit deny > explicit allow > implicit deny)
  • IAM: I understand when to use users vs groups vs roles
  • IAM: I can design cross-account access with roles and external IDs
  • Security Groups: I know they are stateful and allow rules only
  • NACLs: I know they are stateless and support both allow and deny rules
  • VPC Security: I can design multi-tier VPC architectures with public/private subnets
  • KMS: I understand customer managed keys vs AWS managed keys
  • Encryption: I know which services support encryption at rest and in transit
  • WAF: I can explain when to use WAF vs Shield vs Security Groups
  • Compliance: I know which AWS services help with compliance frameworks

Domain 2: Design Resilient Architectures (26%)

  • Multi-AZ: I understand the difference between Multi-AZ and Read Replicas
  • Auto Scaling: I can configure dynamic, predictive, and scheduled scaling policies
  • Load Balancing: I know when to use ALB vs NLB vs GWLB
  • DR Strategies: I can explain backup/restore, pilot light, warm standby, active-active
  • RTO/RPO: I can calculate and select appropriate DR strategy based on requirements
  • SQS: I understand standard vs FIFO queues and when to use each
  • SNS: I can design fan-out patterns with SNS and SQS
  • EventBridge: I know when to use EventBridge vs SNS vs SQS
  • Lambda: I understand concurrency limits and how to handle throttling
  • ECS/EKS: I can explain when to use Fargate vs EC2 launch type

Domain 3: Design High-Performing Architectures (24%)

  • S3 Performance: I know how to optimize with multipart upload and transfer acceleration
  • EBS Volume Types: I can select appropriate volume type (gp3, io2, st1, sc1)
  • EC2 Instance Types: I understand T, M, C, R, I, P, G families and their use cases
  • Lambda Performance: I know that memory allocation affects CPU and network
  • RDS Performance: I can design read-heavy architectures with read replicas
  • ElastiCache: I understand Redis vs Memcached and when to use each
  • CloudFront: I know how to optimize caching with TTL and cache behaviors
  • Global Accelerator: I understand when to use it vs CloudFront
  • Kinesis: I can design streaming data pipelines with Kinesis Data Streams
  • Athena: I know how to optimize queries with partitioning and columnar formats

Domain 4: Design Cost-Optimized Architectures (20%)

  • EC2 Pricing: I can explain On-Demand, Reserved, Spot, and Savings Plans
  • S3 Storage Classes: I know the cost and retrieval characteristics of each class
  • S3 Lifecycle: I can design lifecycle policies to transition between storage classes
  • RDS Pricing: I understand when to use Reserved Instances vs On-Demand
  • DynamoDB Pricing: I know the difference between On-Demand and Provisioned capacity
  • Data Transfer: I understand inter-AZ, inter-region, and internet egress costs
  • NAT Gateway: I know the cost implications vs NAT instance
  • VPC Endpoints: I understand how they reduce data transfer costs
  • Cost Tools: I can use Cost Explorer, Budgets, and Cost Allocation Tags
  • Trusted Advisor: I know what cost optimization checks it provides

If you checked fewer than 80%: Review those specific chapters and take domain-focused practice tests

Practice Test Marathon

šŸ“Š Final Week Practice Schedule:

gantt
    title Final Week Practice Test Schedule
    dateFormat YYYY-MM-DD
    section Practice Tests
    Full Practice Test 3       :2025-02-01, 1d
    Review & Study Weak Areas  :2025-02-02, 1d
    Domain-Focused Tests       :2025-02-03, 1d
    Service-Focused Tests      :2025-02-04, 1d
    Timed Practice (30Q)       :2025-02-05, 1d
    Review Summaries           :2025-02-06, 1d
    Light Review Only          :2025-02-07, 1d
    section Exam Day
    Exam Day                   :milestone, 2025-02-08, 0d

See: diagrams/08_final_checklist_practice_schedule.mmd

Day 7 (One week before):

  • Morning: Full Practice Test 3 (130 minutes, timed, no breaks)
  • Target score: 80%+ (if below, extend study by 1 week)
  • Afternoon: Review ALL incorrect answers (2-3 hours)
  • Evening: Review Domain 1 chapter summary (1 hour)
  • Track: Note weak areas for focused study

Day 6:

  • Morning: Take domain-focused tests for weak domains (2 hours)
  • Afternoon: Review Domain 2 chapter summary (1 hour)
  • Evening: Create final flashcards for weak areas (1 hour)
  • Focus: Strengthen identified weak areas

Day 5:

  • Morning: Take service-focused tests for weak services (2 hours)
  • Afternoon: Review Domain 3 chapter summary (1 hour)
  • Evening: Review cheat sheet (1 hour)
  • Focus: Service-specific knowledge gaps

Day 4:

  • Morning: Timed practice - 30 questions in 60 minutes (test time management)
  • Afternoon: Review Domain 4 chapter summary (1 hour)
  • Evening: Review integration patterns from Chapter 6 (1 hour)
  • Focus: Cross-domain scenarios

Day 3:

  • Morning: Review ALL chapter summaries (3 hours)
  • Afternoon: Review cheat sheet thoroughly (1 hour)
  • Evening: Light review of flashcards, early sleep (8 hours)
  • Focus: Consolidation and rest

Day 2:

  • Morning: Review cheat sheet only (1 hour)
  • Afternoon: Review flashcards for weak areas (1 hour)
  • Evening: Relax, no studying after 6 PM
  • Focus: Mental preparation and rest

Day 1 (Day before exam):

  • Morning: Light review of cheat sheet (30 minutes MAX)
  • Afternoon: Prepare exam day materials (see checklist below)
  • Evening: Relax, watch a movie, early sleep (8 hours minimum)
  • Focus: Rest and mental preparation

Day Before Exam

Final Review (2-3 hours max)

Morning Review Session (1 hour):

  • Skim cheat sheet (focus on ⭐ Must Know items)
  • Review brain dump items (see list below)
  • Quick review of service comparison tables

Afternoon Review (1 hour):

  • Skim chapter summaries (don't deep dive)
  • Review flagged flashcards
  • Quick review of common question patterns

Evening (30 minutes):

  • Review brain dump items one more time
  • Visualize exam success
  • Prepare materials for tomorrow

Don't:

  • āŒ Try to learn new topics
  • āŒ Take practice tests
  • āŒ Study past 6 PM
  • āŒ Stay up late cramming

Brain Dump Items to Memorize

Critical Numbers (write these down immediately when exam starts):

S3 Storage Class Pricing (per GB/month):

  • Deep Archive: $0.00099
  • Glacier Flexible: $0.0036
  • Glacier Instant: $0.004
  • One Zone-IA: $0.01
  • Standard-IA: $0.0125
  • Standard: $0.023

EC2 Pricing Discounts:

  • Spot Instances: Up to 90% off On-Demand
  • Savings Plans: Up to 72% off On-Demand
  • Reserved Instances: Up to 60% off On-Demand

RDS Multi-AZ:

  • Failover time: 60-120 seconds
  • Synchronous replication (zero data loss)
  • Automatic failover on primary failure

S3 Performance:

  • 5,500 GET/HEAD requests per second per prefix
  • 3,500 PUT/COPY/POST/DELETE requests per second per prefix
  • No limit on number of prefixes

Lambda Limits:

  • Memory: 128 MB to 10,240 MB
  • Timeout: 15 minutes maximum
  • Concurrent executions: 1,000 (default, can be increased)
  • Deployment package: 50 MB (zipped), 250 MB (unzipped)

EBS Volume Types:

  • gp3: 3,000-16,000 IOPS, 125-1,000 MB/s throughput
  • io2: Up to 64,000 IOPS, 1,000 MB/s throughput
  • st1: 500 IOPS, 500 MB/s throughput (HDD)
  • sc1: 250 IOPS, 250 MB/s throughput (HDD)

DR Strategy RTO/RPO:

  • Backup/Restore: Hours (RTO), Hours (RPO)
  • Pilot Light: 10s of minutes (RTO), Minutes (RPO)
  • Warm Standby: Minutes (RTO), Seconds (RPO)
  • Active-Active: Real-time (RTO), None (RPO)

Mental Preparation

Positive Affirmations (repeat these):

  • "I have prepared thoroughly and I am ready"
  • "I understand AWS services and can apply them to scenarios"
  • "I will read each question carefully and choose the best answer"
  • "I trust my preparation and my instincts"

Visualization Exercise (5 minutes):

  1. Close your eyes
  2. Imagine yourself at the testing center, calm and confident
  3. See yourself reading questions carefully
  4. Visualize yourself selecting correct answers
  5. Imagine the "Pass" result on your screen
  6. Feel the pride and accomplishment

Stress Management:

  • Practice deep breathing (4 counts in, hold 4, 4 counts out)
  • Do light exercise (walk, yoga, stretching)
  • Avoid caffeine after 2 PM
  • Avoid heavy meals before bed
  • Set multiple alarms for exam day

Exam Day Materials Checklist

Required Documents:

  • Government-issued photo ID (driver's license, passport)
  • Exam confirmation email (printed or on phone)
  • Testing center address and directions
  • Contact number for testing center

Optional Items:

  • Water bottle (if allowed by testing center)
  • Light snack (for after exam)
  • Jacket or sweater (testing rooms can be cold)
  • Earplugs (if allowed and you prefer quiet)

Not Allowed (leave at home or in car):

  • āŒ Study materials, notes, books
  • āŒ Electronic devices (phone, smartwatch, fitness tracker)
  • āŒ Bags, backpacks, purses
  • āŒ Food or drinks (except water, if allowed)

Sleep and Nutrition

Night Before:

  • Eat a light, healthy dinner (avoid heavy or spicy foods)
  • No caffeine after 2 PM
  • No alcohol
  • Go to bed early (aim for 8 hours of sleep)
  • Set multiple alarms (primary + backup)

Morning Of:

  • Wake up 2-3 hours before exam (don't rush)
  • Eat a balanced breakfast (protein + complex carbs)
  • Drink water (stay hydrated, but not too much)
  • Avoid excessive caffeine (one cup of coffee/tea is fine)
  • Arrive at testing center 30 minutes early

Exam Day

Morning Routine

2-3 hours before exam:

  • Wake up naturally (no snooze button)
  • Light review of brain dump items (15 minutes)
  • Eat a good breakfast (eggs, oatmeal, fruit)
  • Shower and dress comfortably
  • Double-check materials (ID, confirmation)

1 hour before exam:

  • Arrive at testing center (30 minutes early)
  • Use restroom
  • Do breathing exercises (calm nerves)
  • Review brain dump items one last time (5 minutes)
  • Positive self-talk ("I'm ready, I've got this")

At the Testing Center

Check-in Process:

  • Present ID and confirmation
  • Store personal items in locker
  • Review testing center rules
  • Get scratch paper and pen/pencil
  • Take a deep breath before starting

Before Starting Exam:

  • Read all instructions carefully
  • Adjust chair and monitor for comfort
  • Do a quick breathing exercise (calm nerves)
  • Start with confidence

Brain Dump Strategy

First 2-3 minutes of exam (before reading any questions):

  • Write down S3 storage class prices
  • Write down EC2 pricing discounts
  • Write down RDS Multi-AZ failover time
  • Write down Lambda limits
  • Write down EBS volume type characteristics
  • Write down DR strategy RTO/RPO
  • Write down any other numbers you tend to forget

Why this works: Frees up mental space and reduces anxiety about forgetting important numbers

During Exam

Time Management Strategy:

šŸ“Š Exam Time Allocation:

pie title 130 Minutes Exam Time Allocation
    "First Pass: Easy Questions" : 60
    "Second Pass: Flagged Questions" : 40
    "Final Pass: Review" : 30

See: diagrams/08_final_checklist_time_allocation.mmd

First Pass (60 minutes):

  • Answer all easy questions (ones you're confident about)
  • Flag difficult questions (don't spend more than 2 minutes)
  • Mark questions you want to review
  • Goal: Answer 45-50 questions

Second Pass (40 minutes):

  • Return to flagged questions
  • Use elimination strategy (remove obviously wrong answers)
  • Make educated guesses (no penalty for wrong answers)
  • Goal: Answer all remaining questions

Final Pass (30 minutes):

  • Review marked questions
  • Check for silly mistakes (misread question, wrong answer selected)
  • Verify you answered all questions
  • Don't second-guess yourself (first instinct usually correct)

Question Reading Strategy:

  • Read the scenario carefully (identify company, problem, requirements)
  • Identify constraint keywords ("most cost-effective", "lowest latency", "most secure")
  • Read all answer options before selecting
  • Eliminate obviously wrong answers first
  • Choose the BEST answer (not just A correct answer)

If You Get Stuck:

  1. Take a deep breath (5 seconds)
  2. Re-read the question (look for keywords you missed)
  3. Eliminate one wrong answer (builds momentum)
  4. Make an educated guess (no penalty for guessing)
  5. Flag for review (come back if time permits)
  6. Move on (don't waste time)

Common Traps to Avoid:

  • āŒ Misreading "NOT", "EXCEPT", "LEAST" in questions
  • āŒ Choosing technically correct but not BEST answer
  • āŒ Overthinking simple questions
  • āŒ Changing answers without good reason (first instinct often correct)
  • āŒ Spending too much time on one question

After Exam

Immediately After:

  • Take a deep breath (you did it!)
  • Don't discuss answers with others (causes unnecessary stress)
  • Celebrate your effort (regardless of how you feel about performance)

Waiting for Results:

  • Results typically available within 5 business days
  • Check your email for notification
  • Access results through AWS Certification portal
  • Passing score: 720/1000 (72%)

If You Pass:

  • Celebrate! You're now AWS Certified Solutions Architect - Associate!
  • Update your resume and LinkedIn profile
  • Download your digital badge
  • Consider next certification (Professional level or Specialty)

If You Don't Pass:

  • Don't be discouraged (many people need multiple attempts)
  • Review your score report (identifies weak domains)
  • Focus study on weak areas
  • Take more practice tests
  • Schedule retake (30-day waiting period)
  • You've learned a lot and you'll pass next time!

You're Ready When...

Knowledge Indicators:

  • You score 80%+ on all full practice tests
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You've completed all self-assessment checklists
  • You can draw architecture diagrams from memory
  • You understand WHY answers are correct, not just WHAT they are

Confidence Indicators:

  • You feel calm and prepared (not anxious)
  • You trust your preparation
  • You can manage test anxiety
  • You have a clear exam day plan
  • You've visualized success

Practical Indicators:

  • You've taken at least 3 full practice tests
  • You've reviewed all incorrect answers
  • You've strengthened weak areas
  • You've memorized brain dump items
  • You know the testing center location and rules

Remember

Trust Your Preparation:

  • You've studied 60,000+ words of comprehensive content
  • You've answered 500+ practice questions
  • You've reviewed 120+ diagrams
  • You've completed all self-assessments
  • You're ready!

Manage Your Time:

  • 2 minutes per question average
  • Don't spend more than 3 minutes on any question initially
  • Flag and move on if stuck
  • Save time for review

Read Carefully:

  • Watch for "NOT", "EXCEPT", "LEAST"
  • Identify constraint keywords
  • Read all answer options
  • Choose the BEST answer

Don't Overthink:

  • First instinct often correct
  • Don't change answers without good reason
  • Simple questions have simple answers
  • Trust your knowledge

Stay Calm:

  • Take deep breaths if stressed
  • Use positive self-talk
  • Focus on one question at a time
  • You've got this!

Final Thoughts

You've put in the work. You've studied hard. You've practiced extensively. You understand AWS services and how to apply them to real-world scenarios. You're ready for this exam.

Remember: This certification is a milestone, not the destination. Whether you pass on your first attempt or need to retake, you've learned valuable skills that will serve you throughout your career.

Believe in yourself. Trust your preparation. You've got this! šŸŽÆ

Good luck on your AWS Certified Solutions Architect - Associate exam!


Previous Chapter: 07_study_strategies - Study Techniques & Test-Taking Strategies

Appendices: 99_appendices - Quick Reference Tables, Glossary, Resources


Final Confidence Check

Are You Ready?

Answer honestly:

  • I consistently score 75%+ on full-length practice tests
  • I can complete 65 questions in 130 minutes with time to review
  • I understand all four exam domains thoroughly
  • I can explain AWS services and when to use them
  • I recognize common question patterns and traps
  • I've reviewed all my incorrect practice test answers
  • I'm confident in my test-taking strategies
  • I've had adequate rest and am mentally prepared

If you checked all boxes: You're ready! Trust your preparation and go ace that exam!

If you're missing any: Take an extra week to address those areas. It's better to be over-prepared than under-prepared.


Final Words of Encouragement

You've put in the work. You've studied the material. You've practiced the questions. You understand the concepts.

Trust yourself. You're ready for this.

Remember:

  • Read each question carefully
  • Eliminate wrong answers systematically
  • Choose the BEST answer, not just a correct answer
  • Manage your time wisely
  • Don't overthink - your first instinct is usually right
  • Stay calm and confident

Good luck on your AWS Certified Solutions Architect - Associate exam!

You've got this! šŸš€


After the exam: Whether you pass or not, be proud of the effort you put in. If you pass, celebrate! If not, review your score report, identify weak areas, and try again. Many successful architects didn't pass on their first attempt.


Exam Day Checklist

Morning of the Exam

3-4 Hours Before Exam:

  • Wake up at your normal time (don't disrupt sleep schedule)
  • Eat a healthy breakfast with protein and complex carbs
  • Avoid excessive caffeine (no more than your normal amount)
  • Do a light 15-minute review of your cheat sheet
  • Review your brain dump list one final time

2 Hours Before Exam:

  • Gather required items:
    • Two forms of ID (government-issued photo ID + secondary ID)
    • Confirmation email with exam appointment details
    • Water bottle (if allowed at test center)
    • Snack for after the exam
  • Dress comfortably (layers for temperature control)
  • Use the restroom before leaving

1 Hour Before Exam:

  • Arrive at test center 30 minutes early
  • Turn off phone and store in locker
  • Complete check-in process
  • Review test center rules and procedures
  • Take a few deep breaths to calm nerves

At the Test Station:

  • Adjust chair and monitor for comfort
  • Test headphones/earplugs if provided
  • Verify scratch paper and pen/pencil
  • Read all on-screen instructions carefully
  • Start the exam when ready

During the Exam

First 5 Minutes (Brain Dump):

  • Write down all memorized facts on scratch paper:
    • Port numbers (22, 80, 443, 3389, etc.)
    • Service limits (Lambda 15 min, S3 5 TB object, etc.)
    • Pricing comparisons (RI vs Spot vs On-Demand)
    • DR strategies (RTO/RPO for each)
    • Storage classes and costs
    • Any formulas or calculations

Time Management Strategy:

  • First Pass (60 minutes): Answer all questions you're confident about

    • Skip difficult questions (mark for review)
    • Aim to answer 40-45 questions in first pass
    • Build confidence with easy wins
  • Second Pass (40 minutes): Tackle marked questions

    • Use elimination method
    • Apply decision frameworks
    • Make educated guesses
    • Don't leave any blank
  • Final Pass (20 minutes): Review all answers

    • Check for misread questions
    • Verify you answered what was asked
    • Look for careless mistakes
    • Trust your first instinct (don't overthink)

Question-Answering Strategy:

  • Read the scenario carefully (identify key details)
  • Identify the question type:
    • "Most cost-effective" → Choose cheapest option
    • "Least operational overhead" → Choose managed service
    • "Best practice" → Choose AWS recommended approach
    • "Highest performance" → Choose fastest/most powerful option
  • Eliminate obviously wrong answers first
  • Choose the BEST answer (not just a correct answer)
  • Watch for qualifier words: "MOST", "LEAST", "BEST", "FIRST"

Common Traps to Avoid:

  • Don't overthink simple questions
  • Don't assume information not given in the scenario
  • Don't choose answers with absolute words ("always", "never")
  • Don't pick the longest answer just because it's detailed
  • Don't change answers unless you're certain (first instinct usually right)

Mental Strategies

If You Feel Overwhelmed:

  1. Take 3 deep breaths (in through nose, out through mouth)
  2. Close your eyes for 10 seconds
  3. Remind yourself: "I've prepared for this. I know this material."
  4. Skip the current question and come back to it
  5. Answer a few easy questions to rebuild confidence

If You're Running Out of Time:

  1. Don't panic - you have time
  2. Focus on answering remaining questions (don't leave blank)
  3. Use elimination method quickly
  4. Make educated guesses based on patterns
  5. Trust your preparation

If You Don't Know an Answer:

  1. Eliminate obviously wrong answers
  2. Look for AWS best practices in remaining options
  3. Choose the most managed/automated solution
  4. Choose the most secure option if security-related
  5. Choose the most cost-effective if cost-related
  6. Make a guess and move on (don't dwell)

After the Exam

Immediately After:

  • Take a deep breath - you did it!
  • Don't discuss questions with others (NDA violation)
  • Collect your belongings from locker
  • Review your preliminary pass/fail result (if shown)

Within 5 Business Days:

  • Check your email for official score report
  • Review your performance by domain
  • If you passed: Celebrate! Share your achievement!
  • If you didn't pass: Review weak areas, schedule retake

If You Passed:

  • Download your digital badge from AWS Certification portal
  • Add certification to LinkedIn profile
  • Update your resume
  • Request physical certificate (optional)
  • Consider next certification (SAP-C02, DVA-C02, SOA-C02)

If You Didn't Pass:

  • Don't be discouraged - many successful architects failed first attempt
  • Review your score report to identify weak domains
  • Focus study on domains where you scored lowest
  • Retake practice tests for those specific domains
  • Schedule retake after 14-day waiting period
  • You've got this - try again!

Final Confidence Boosters

You're Ready If...

  • You've completed all chapters in this study guide
  • You score 75%+ on practice tests consistently
  • You can explain concepts without looking at notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You've reviewed all domain summaries
  • You've practiced with all bundle types

Remember These Truths

  1. You've put in the work - Trust your preparation
  2. The exam is fair - It tests what you've studied
  3. You don't need 100% - 720/1000 is passing (72%)
  4. Educated guesses are okay - No penalty for wrong answers
  5. First instinct is usually right - Don't overthink
  6. You belong here - You've earned this opportunity

Final Mantras

  • "I am prepared and confident"
  • "I know this material"
  • "I will read each question carefully"
  • "I will choose the BEST answer"
  • "I trust my preparation"
  • "I've got this!"

Post-Exam Reflection

Regardless of Result

What You've Accomplished:

  • āœ… Studied 60,000+ words of comprehensive material
  • āœ… Learned 100+ AWS services and their use cases
  • āœ… Practiced 500+ exam-style questions
  • āœ… Mastered 4 major domains of cloud architecture
  • āœ… Developed critical thinking for cloud solutions
  • āœ… Invested weeks/months in professional development

This Knowledge is Valuable:

  • You now understand cloud architecture principles
  • You can design secure, resilient, high-performing, cost-optimized solutions
  • You've gained skills that are in high demand
  • You've proven your commitment to learning
  • You're better prepared for real-world AWS projects

Next Steps:

  • Apply this knowledge in your work
  • Build projects to reinforce learning
  • Share knowledge with others
  • Continue learning (cloud is always evolving)
  • Pursue additional certifications if desired

Closing Words

You've reached the end of this comprehensive study guide. Whether you're reading this the night before your exam or weeks in advance, know that you've invested significant time and effort into your professional development.

The exam is just one milestone in your cloud journey. The real value is in the knowledge you've gained and the skills you've developed. These will serve you throughout your career.

Trust yourself. You've prepared thoroughly. You understand the concepts. You can do this.

Good luck on your AWS Certified Solutions Architect - Associate exam!

You've got this! šŸš€


One Final Reminder:

  • Read each question carefully
  • Eliminate wrong answers systematically
  • Choose the BEST answer, not just a correct answer
  • Manage your time wisely
  • Stay calm and confident

Now go ace that exam!



Appendices

Appendix A: Quick Reference Tables

S3 Storage Classes Comparison

Storage Class Cost/GB-month Retrieval Time Retrieval Cost Min Duration Use Case
Standard $0.023 Milliseconds None None Frequent access
Intelligent-Tiering $0.023 + $0.0025/1K objects Milliseconds None None Unknown pattern
Standard-IA $0.0125 Milliseconds $0.01/GB 30 days Infrequent access
One Zone-IA $0.01 Milliseconds $0.01/GB 30 days Reproducible data
Glacier Instant $0.004 Milliseconds $0.03/GB 90 days Archive, instant
Glacier Flexible $0.0036 Minutes-hours $0.01-0.03/GB 90 days Archive, flexible
Glacier Deep Archive $0.00099 12-48 hours $0.02/GB 180 days Long-term archive

EC2 Instance Families

Family Type vCPU:Memory Ratio Use Case Example
T3 Burstable 1:2 Variable workloads Web servers, dev/test
M5 General Purpose 1:4 Balanced App servers, databases
C5 Compute Optimized 1:2 High CPU Batch, gaming, encoding
R5 Memory Optimized 1:8 High memory In-memory DBs, big data
I3 Storage Optimized 1:8 + NVMe High I/O NoSQL, data warehousing
P3 GPU GPU ML training Deep learning, HPC
G4 GPU GPU Graphics ML inference, rendering

RDS vs DynamoDB

Feature RDS DynamoDB
Type Relational (SQL) NoSQL (key-value)
Scaling Vertical (instance size) Horizontal (automatic)
Latency 5-10ms 1-5ms
Throughput Limited by instance Unlimited (on-demand)
Transactions ACID Eventually consistent (default)
Queries Complex SQL Simple key-based
Cost Instance hours Request-based
Use Case Complex queries, joins High-scale, simple queries

Load Balancer Types

Feature ALB NLB GWLB
Layer 7 (HTTP/HTTPS) 4 (TCP/UDP) 3 (IP)
Performance Moderate Ultra-high High
Routing Content-based Connection-based Transparent
Static IP No Yes Yes
WebSocket Yes Yes No
Use Case Web apps, microservices TCP/UDP, extreme performance Firewalls, IDS/IPS

Appendix B: Key Service Limits

S3 Limits

  • Buckets per account: 100 (soft limit)
  • Object size: 5 TB maximum
  • Single PUT: 5 GB maximum
  • Multipart upload: 5 TB maximum
  • Request rate: 5,500 GET/sec, 3,500 PUT/sec per prefix

EC2 Limits

  • On-Demand instances: 20 per region (soft limit)
  • Reserved Instances: No limit
  • Spot Instances: Dynamic (based on capacity)
  • EBS volumes: 5,000 per region
  • Elastic IPs: 5 per region (soft limit)

VPC Limits

  • VPCs per region: 5 (soft limit)
  • Subnets per VPC: 200
  • Security Groups per VPC: 2,500
  • Rules per Security Group: 60 inbound, 60 outbound
  • NACLs per VPC: 200
  • Rules per NACL: 20 (soft limit)

RDS Limits

  • DB instances: 40 per region
  • Read replicas: 15 per primary
  • Automated backups: 35 days retention
  • Manual snapshots: No limit
  • Storage: 64 TB maximum (most engines)

Lambda Limits

  • Concurrent executions: 1,000 per region (soft limit)
  • Function timeout: 15 minutes maximum
  • Memory: 128 MB - 10,240 MB
  • Deployment package: 50 MB (zipped), 250 MB (unzipped)
  • /tmp storage: 512 MB - 10,240 MB

Appendix C: Pricing Quick Reference

Compute Pricing (us-east-1)

  • t3.medium: $0.0416/hour
  • m5.xlarge: $0.192/hour
  • c5.xlarge: $0.17/hour
  • r5.xlarge: $0.252/hour
  • Lambda: $0.20 per 1M requests + $0.0000166667 per GB-second

Storage Pricing

  • S3 Standard: $0.023/GB-month
  • EBS gp3: $0.08/GB-month
  • EFS Standard: $0.30/GB-month
  • Glacier Deep Archive: $0.00099/GB-month

Database Pricing

  • RDS db.m5.large: $0.192/hour
  • DynamoDB On-Demand: $1.25 per 1M writes, $0.25 per 1M reads
  • ElastiCache cache.m5.large: $0.161/hour

Network Pricing

  • Data Transfer Out (first 10 TB): $0.09/GB
  • CloudFront (first 10 TB): $0.085/GB
  • NAT Gateway: $0.045/hour + $0.045/GB processed

Appendix D: Disaster Recovery Strategies Comparison

Strategy RPO RTO Cost Complexity Use Case
Backup and Restore Hours Hours $ Low Non-critical, cost-sensitive
Pilot Light Minutes 10s of minutes $$ Medium Core services only
Warm Standby Seconds Minutes $$$ Medium-High Business-critical
Active-Active Near-zero Seconds $$$$ High Mission-critical

Implementation Details:

  • Backup/Restore: Regular snapshots to S3, restore when needed
  • Pilot Light: Core services running at minimum, scale up during disaster
  • Warm Standby: Scaled-down replica running, scale up during disaster
  • Active-Active: Full production in multiple regions, Route 53 for failover

Appendix E: Security Best Practices Checklist

IAM Security

  • Enable MFA for root account
  • Delete root account access keys
  • Create individual IAM users (no shared accounts)
  • Use groups to assign permissions
  • Apply least privilege principle
  • Enable CloudTrail for audit logging
  • Rotate credentials regularly (90 days)
  • Use IAM roles for applications
  • Enable IAM Access Analyzer
  • Set password policy (length, complexity, rotation)

Network Security

  • Use security groups as primary firewall
  • Implement NACLs for subnet-level protection
  • Enable VPC Flow Logs
  • Use private subnets for databases
  • Implement bastion hosts or Systems Manager Session Manager
  • Enable AWS Shield Standard (automatic)
  • Configure AWS WAF for web applications
  • Use VPC endpoints to avoid internet traffic
  • Implement network segmentation
  • Enable GuardDuty for threat detection

Data Protection

  • Enable encryption at rest for all storage
  • Use KMS customer-managed keys for sensitive data
  • Enable encryption in transit (TLS/SSL)
  • Implement S3 bucket policies to enforce encryption
  • Enable S3 versioning for critical data
  • Configure S3 Object Lock for compliance
  • Enable automated backups
  • Test backup restoration regularly
  • Implement cross-region replication for critical data
  • Use Secrets Manager for credential management

Appendix F: Well-Architected Framework Pillars

1. Operational Excellence

Design Principles:

  • Perform operations as code
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from operational failures

Key Services:

  • CloudFormation (IaC)
  • CodePipeline (CI/CD)
  • CloudWatch (monitoring)
  • X-Ray (tracing)

2. Security

Design Principles:

  • Implement strong identity foundation
  • Enable traceability
  • Apply security at all layers
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data
  • Prepare for security events

Key Services:

  • IAM (identity)
  • KMS (encryption)
  • GuardDuty (threat detection)
  • Security Hub (centralized security)

3. Reliability

Design Principles:

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally
  • Stop guessing capacity
  • Manage change through automation

Key Services:

  • Auto Scaling (elasticity)
  • Route 53 (DNS failover)
  • RDS Multi-AZ (high availability)
  • S3 (durability)

4. Performance Efficiency

Design Principles:

  • Democratize advanced technologies
  • Go global in minutes
  • Use serverless architectures
  • Experiment more often
  • Consider mechanical sympathy

Key Services:

  • Lambda (serverless)
  • CloudFront (global CDN)
  • ElastiCache (caching)
  • RDS read replicas (scaling)

5. Cost Optimization

Design Principles:

  • Implement cloud financial management
  • Adopt consumption model
  • Measure overall efficiency
  • Stop spending on undifferentiated heavy lifting
  • Analyze and attribute expenditure

Key Services:

  • Cost Explorer (analysis)
  • Budgets (alerts)
  • Trusted Advisor (recommendations)
  • Compute Optimizer (right-sizing)

6. Sustainability

Design Principles:

  • Understand your impact
  • Establish sustainability goals
  • Maximize utilization
  • Anticipate and adopt new, more efficient hardware and software
  • Use managed services
  • Reduce downstream impact

Key Services:

  • Auto Scaling (efficient utilization)
  • Lambda (serverless efficiency)
  • S3 Intelligent-Tiering (storage optimization)

Appendix G: Common Exam Keywords and Their Meanings

Performance Keywords

  • "Lowest latency" → Use caching (ElastiCache, DAX, CloudFront)
  • "Highest throughput" → Use parallel processing, multiple instances
  • "Real-time" → Use Kinesis Data Streams, Lambda, DynamoDB
  • "Near real-time" → Use Kinesis Data Firehose (60 sec buffer)
  • "Batch processing" → Use EMR, Glue, Batch

Cost Keywords

  • "Most cost-effective" → Consider Spot Instances, S3 lifecycle, right-sizing
  • "Minimize costs" → Use Reserved Instances, Savings Plans, serverless
  • "Pay only for what you use" → Lambda, DynamoDB On-Demand, Fargate
  • "Predictable costs" → Reserved Instances, Savings Plans

Security Keywords

  • "Least privilege" → Minimal IAM permissions needed
  • "Encryption at rest" → KMS, S3 SSE, EBS encryption
  • "Encryption in transit" → TLS/SSL, ACM certificates
  • "Audit trail" → CloudTrail, Config, VPC Flow Logs
  • "Compliance" → Config Rules, Audit Manager, Artifact

Availability Keywords

  • "High availability" → Multi-AZ deployment
  • "Fault tolerant" → Automatic failover, no single point of failure
  • "Disaster recovery" → Multi-region, backup strategy
  • "Zero downtime" → Blue/green deployment, rolling updates
  • "Automatic failover" → RDS Multi-AZ, Route 53 health checks

Scalability Keywords

  • "Elastic" → Auto Scaling, Lambda, DynamoDB
  • "Horizontal scaling" → Add more instances
  • "Vertical scaling" → Increase instance size
  • "Unlimited scale" → S3, DynamoDB On-Demand, Lambda
  • "Burst capacity" → T3 instances, gp3 volumes

Appendix H: Service Selection Decision Trees

Storage Selection

Need storage?
ā”œā”€ Object storage (files, backups, static content)
│  └─ S3 (with appropriate storage class)
ā”œā”€ Block storage (databases, boot volumes)
│  └─ EBS (with appropriate volume type)
ā”œā”€ File storage (shared access)
│  ā”œā”€ Linux → EFS
│  └─ Windows → FSx for Windows File Server
└─ High-performance computing
   └─ FSx for Lustre

Compute Selection

Need compute?
ā”œā”€ Full control, custom OS
│  └─ EC2
ā”œā”€ Event-driven, <15 min execution
│  └─ Lambda
ā”œā”€ Containers
│  ā”œā”€ Serverless → Fargate
│  ā”œā”€ AWS-native → ECS
│  └─ Kubernetes → EKS
└─ Platform as a Service
   └─ Elastic Beanstalk

Database Selection

Need database?
ā”œā”€ Relational (SQL)
│  ā”œā”€ High performance, global → Aurora
│  ā”œā”€ Specific engine (MySQL, PostgreSQL, etc.) → RDS
│  └─ Data warehouse → Redshift
ā”œā”€ NoSQL
│  ā”œā”€ Key-value, document → DynamoDB
│  ā”œā”€ In-memory cache → ElastiCache
│  ā”œā”€ Graph → Neptune
│  └─ Time-series → Timestream
└─ Ledger (immutable)
   └─ QLDB

Appendix I: Troubleshooting Common Scenarios

EC2 Instance Won't Start

  1. Check service limits (On-Demand instance limit)
  2. Verify AMI availability in region
  3. Check subnet has available IP addresses
  4. Verify security group allows necessary traffic
  5. Check IAM instance profile permissions

Can't Connect to EC2 Instance

  1. Verify security group allows inbound traffic (SSH port 22 or RDP port 3389)
  2. Check NACL allows bidirectional traffic
  3. Verify instance has public IP (if accessing from internet)
  4. Check route table has route to internet gateway
  5. Verify key pair is correct

S3 Access Denied

  1. Check bucket policy allows access
  2. Verify IAM user/role has necessary permissions
  3. Check bucket is not in different region
  4. Verify bucket encryption settings
  5. Check for explicit DENY in policies

RDS Connection Issues

  1. Verify security group allows inbound traffic on database port
  2. Check RDS is in correct VPC/subnet
  3. Verify endpoint is correct
  4. Check database is in "available" state
  5. Verify credentials are correct

Lambda Function Timeout

  1. Increase timeout setting (max 15 minutes)
  2. Optimize function code
  3. Check for network latency (VPC configuration)
  4. Verify external dependencies are responsive
  5. Consider breaking into smaller functions

Appendix J: Exam Day Checklist

One Week Before

  • Complete all practice tests (target 80%+ score)
  • Review all chapter summaries
  • Focus on weak areas identified in practice tests
  • Review all diagrams and decision trees
  • Memorize key service limits and metrics

One Day Before

  • Light review of cheat sheet (2-3 hours max)
  • Skim chapter quick reference cards
  • Review common exam traps
  • Get 8 hours of sleep
  • Prepare exam day materials (ID, confirmation)

Exam Day Morning

  • Light breakfast
  • 30-minute review of critical topics
  • Arrive 30 minutes early
  • Use restroom before exam
  • Take deep breaths, stay calm

During Exam

  • Read each question carefully
  • Identify keywords and constraints
  • Eliminate obviously wrong answers
  • Flag difficult questions for review
  • Manage time (2 minutes per question)
  • Review flagged questions if time permits

Appendix K: Additional Resources

Official AWS Resources

Practice and Labs

Community Resources

Appendix L: Comprehensive Glossary

ACL (Access Control List): List of permissions attached to an object

AMI (Amazon Machine Image): Template for EC2 instance (OS, applications, configuration)

API Gateway: Managed service for creating, publishing, and managing APIs

Auto Scaling: Automatically adjusts compute capacity based on demand

Availability Zone (AZ): Isolated data center within a Region with redundant power, networking

Bastion Host: EC2 instance in public subnet used to access instances in private subnet

CIDR (Classless Inter-Domain Routing): IP address range notation (e.g., 10.0.0.0/16)

CloudFormation: Infrastructure as Code service for provisioning AWS resources

CloudFront: Content Delivery Network (CDN) for distributing content globally

CloudTrail: Service for logging and monitoring AWS API calls

CloudWatch: Monitoring and observability service for AWS resources

CMK (Customer Master Key): Encryption key managed by customer in KMS

Cognito: User authentication and authorization service

DDoS (Distributed Denial of Service): Attack overwhelming system with traffic

Direct Connect: Dedicated network connection from on-premises to AWS

DynamoDB: Fully managed NoSQL database service

EBS (Elastic Block Store): Block storage for EC2 instances

EC2 (Elastic Compute Cloud): Virtual servers in the cloud

ECR (Elastic Container Registry): Docker container registry

ECS (Elastic Container Service): Container orchestration service

EFS (Elastic File System): Managed file storage for EC2

EKS (Elastic Kubernetes Service): Managed Kubernetes service

Elastic IP: Static public IPv4 address

ElastiCache: In-memory caching service (Redis or Memcached)

ELB (Elastic Load Balancing): Distributes traffic across multiple targets

EMR (Elastic MapReduce): Managed Hadoop/Spark for big data processing

Fargate: Serverless compute engine for containers

FSx: Managed file systems (Windows, Lustre, NetApp, OpenZFS)

Glacier: Low-cost archival storage service

Glue: Serverless ETL (Extract, Transform, Load) service

GuardDuty: Threat detection service using machine learning

IAM (Identity and Access Management): Service for managing access to AWS resources

IOPS (Input/Output Operations Per Second): Storage performance metric

KMS (Key Management Service): Managed encryption key service

Lambda: Serverless compute service (run code without servers)

Macie: Data security service for discovering sensitive data

NAT (Network Address Translation): Allows private instances to access internet

NACL (Network Access Control List): Stateless firewall at subnet level

NLB (Network Load Balancer): Layer 4 load balancer for TCP/UDP traffic

RDS (Relational Database Service): Managed relational database service

Region: Geographic area containing multiple Availability Zones

Route 53: DNS and domain registration service

RPO (Recovery Point Objective): Maximum acceptable data loss (time)

RTO (Recovery Time Objective): Maximum acceptable downtime (time)

S3 (Simple Storage Service): Object storage service

SCP (Service Control Policy): Policy in AWS Organizations to restrict actions

Security Group: Stateful firewall at instance level

Secrets Manager: Service for managing secrets (passwords, API keys)

Shield: DDoS protection service (Standard free, Advanced paid)

SNS (Simple Notification Service): Pub/sub messaging service

SQS (Simple Queue Service): Message queuing service

SSE (Server-Side Encryption): Encryption of data at rest by AWS

Step Functions: Workflow orchestration service

STS (Security Token Service): Temporary security credentials

Transit Gateway: Hub for connecting VPCs and on-premises networks

VPC (Virtual Private Cloud): Isolated network in AWS

VPN (Virtual Private Network): Encrypted connection over internet

WAF (Web Application Firewall): Protects web applications from common attacks

X-Ray: Distributed tracing service for debugging applications


Final Words

You're Ready When...

  • You score 80%+ on all practice tests consistently
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You understand the "why" behind architectural choices
  • You can draw architecture diagrams from memory
  • You know when to use each AWS service

Remember on Exam Day

  • Trust your preparation: You've studied comprehensively through this guide
  • Manage your time: 130 minutes for 65 questions = 2 minutes per question
  • Read carefully: Watch for keywords like "most cost-effective," "lowest latency," "highest availability"
  • Identify constraints: Budget, time, compliance, performance requirements
  • Eliminate wrong answers: Usually 2 answers are obviously wrong
  • Don't overthink: Your first instinct is often correct
  • Flag and move on: Don't get stuck on one question
  • Review flagged questions: Use remaining time to revisit difficult questions

Exam Strategy Reminders

  1. Read the scenario first: Understand the business context
  2. Identify the question type: What is being asked? (security, cost, performance, availability)
  3. Look for keywords: "most," "least," "highest," "lowest," "cost-effective"
  4. Apply frameworks: Use decision trees from this guide
  5. Eliminate distractors: Remove obviously wrong answers
  6. Choose best answer: Not just correct, but BEST for the scenario

After the Exam

  • Results available immediately (pass/fail)
  • Detailed score report within 5 business days
  • Certificate available in AWS Certification account
  • Valid for 3 years from exam date
  • Consider next certification: Solutions Architect Professional, DevOps Engineer, Security Specialty

Final Encouragement

You've completed a comprehensive study guide covering:

  • āœ… 60,000+ words of detailed content
  • āœ… 129 visual diagrams for complex concepts
  • āœ… All four exam domains with deep explanations
  • āœ… Hundreds of examples and scenarios
  • āœ… Decision frameworks and best practices
  • āœ… Quick reference materials and cheat sheets

You are well-prepared. Trust your knowledge. Stay calm. You've got this!

Congratulations on completing this study guide! Best of luck on your AWS Certified Solutions Architect - Associate (SAA-C03) exam! šŸŽÆšŸš€


Study Guide Complete | Total Word Count: ~85,000 words | Diagrams: 129 files | Ready for Exam āœ…


Final Words

You're Ready When...

  • You score 75%+ on all practice tests consistently
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You understand trade-offs between different solutions
  • You can design complete architectures from scratch

Remember

On Exam Day:

  • Trust your preparation - you've put in the work
  • Read questions carefully - every word matters
  • Eliminate wrong answers systematically
  • Choose the BEST answer, not just a correct answer
  • Manage your time - 2 minutes per question
  • Don't overthink - your first instinct is usually right
  • Stay calm and confident throughout

The Exam Tests:

  • Your ability to design secure, resilient, high-performing, cost-optimized architectures
  • Your understanding of AWS services and when to use them
  • Your ability to make trade-off decisions
  • Your knowledge of best practices and design patterns

You've Learned:

  • 500+ practice questions with detailed explanations
  • 100,000+ words of comprehensive study material
  • 173 visual diagrams covering all key concepts
  • All four exam domains in depth
  • Integration patterns and real-world scenarios
  • Test-taking strategies and time management

You're Prepared!

Go into that exam with confidence. You've studied hard, practiced extensively, and you know this material.

Good luck on your AWS Certified Solutions Architect - Associate exam! šŸŽÆ


After Passing: Congratulations! You're now an AWS Certified Solutions Architect - Associate. Update your LinkedIn, celebrate your achievement, and start applying your knowledge to real-world projects.

If You Need to Retake: Don't be discouraged. Review your score report, identify weak areas, study those topics, and try again. Many successful architects didn't pass on their first attempt. Persistence pays off!