
ANS-C01 Study Guide & Reviewer

Comprehensive Study Materials & Key Concepts

AWS Certified Advanced Networking - Specialty (ANS-C01) Comprehensive Study Guide

Complete Learning Path for Certification Success

Overview

This study guide provides a structured learning path from networking fundamentals to advanced AWS networking mastery. Designed for networking professionals transitioning to cloud, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.

Target Audience: Networking professionals with 3-5+ years of traditional networking experience who are expanding into AWS cloud and hybrid networking architectures.

Exam Details:

  • Exam Code: ANS-C01
  • Question Count: 65 total (50 scored, 15 unscored)
  • Passing Score: 750/1000
  • Duration: 170 minutes
  • Question Types: Multiple choice (1 correct) and multiple response (2+ correct)
  • Exam Level: Advanced Specialty

Study Plan Overview

Total Time: 8-12 weeks (3-4 hours daily for professionals with networking background)

Phase 1: Foundation Building (Weeks 1-2)

  • Week 1: Chapter 0 - AWS Networking Fundamentals (section 01)
    • VPC architecture and components
    • Subnetting and IP addressing in AWS
    • Security groups vs NACLs
    • Route tables and routing basics
    • AWS networking services overview
  • Week 2: Domain 1 Part 1 - Edge Services & DNS (file 02, sections 1-2)
    • CloudFront and Global Accelerator
    • Route 53 and DNS architectures
    • Practice: 50 beginner questions

Phase 2: Core Network Design (Weeks 3-4)

  • Week 3: Domain 1 Part 2 - Load Balancing & Monitoring (file 02, sections 3-4)
    • ELB types and use cases
    • Network monitoring and logging
    • Practice: 50 intermediate questions
  • Week 4: Domain 1 Part 3 - Hybrid & Multi-Account (file 02, sections 5-6)
    • Direct Connect and VPN
    • Transit Gateway and VPC peering
    • Practice: Domain 1 focused bundle

Phase 3: Implementation Skills (Weeks 5-6)

  • Week 5: Domain 2 Part 1 - Hybrid Implementation (file 03, sections 1-2)
    • Implementing Direct Connect
    • Configuring VPN connections
    • Multi-account connectivity
    • Practice: 50 intermediate questions
  • Week 6: Domain 2 Part 2 - DNS & Automation (file 03, sections 3-4)
    • Complex DNS architectures
    • Infrastructure as Code for networking
    • Practice: Domain 2 focused bundle

Phase 4: Operations & Management (Weeks 7-8)

  • Week 7: Domain 3 - Network Management (section 04)
    • Maintaining routing and connectivity
    • Monitoring and troubleshooting
    • Network optimization
    • Practice: 50 intermediate questions
  • Week 8: Domain 4 Part 1 - Security Features (file 05, sections 1-2)
    • Network security implementation
    • Monitoring and logging for security
    • Practice: Domain 3 & 4 focused bundles

Phase 5: Advanced Topics (Weeks 9-10)

  • Week 9: Domain 4 Part 2 & Integration (sections 05-06)
    • Data confidentiality and encryption
    • Cross-domain scenarios
    • Complex architectures
    • Practice: Full practice test 1
  • Week 10: Advanced Scenarios & Review (section 06)
    • Multi-region architectures
    • Hybrid cloud patterns
    • SD-WAN integration
    • Practice: Full practice test 2

Phase 6: Final Preparation (Weeks 11-12)

  • Week 11: Intensive Practice & Weak Area Focus
    • Review flagged topics from practice tests
    • Complete service-focused bundles
    • Study strategies and test-taking techniques (section 07)
    • Practice: Full practice test 3
  • Week 12: Final Review & Exam Readiness
    • Final checklist completion (section 08)
    • Cheat sheet review
    • Appendices quick reference (section 99)
    • Mental preparation

Learning Approach

1. Read & Understand

  • Study each section thoroughly
  • Don't rush - this is advanced material
  • Take notes on complex topics
  • Draw your own diagrams to reinforce learning

2. Visualize & Diagram

  • Study all provided Mermaid diagrams carefully
  • Understand data flows and component interactions
  • Recreate diagrams from memory to test understanding
  • Use diagrams to explain concepts to others

3. Practice Hands-On (Recommended)

While this guide is comprehensive, hands-on practice reinforces learning:

  • Create VPCs and test connectivity
  • Configure Direct Connect (if available) or VPN
  • Set up Transit Gateway architectures
  • Implement Route 53 routing policies
  • Configure CloudFront distributions
  • Test Network Firewall rules

4. Test Knowledge

  • Complete practice questions after each section
  • Use practice test bundles to validate understanding
  • Review explanations for both correct and incorrect answers
  • Track weak areas and revisit those chapters

5. Review & Reinforce

  • Revisit marked sections regularly
  • Use spaced repetition for memorization
  • Review diagrams before practice tests
  • Complete self-assessment checklists

Progress Tracking

Chapter Completion Checklist

Track your progress through the study guide:

Foundation:

  • Chapter 0: Fundamentals (01_fundamentals)
  • Self-assessment passed (70%+)

Domain 1: Network Design (30%)

  • Section 1: Edge Services & Global Architectures
  • Section 2: DNS Solutions
  • Section 3: Load Balancing
  • Section 4: Logging & Monitoring
  • Section 5: Hybrid Connectivity
  • Section 6: Multi-Account/Multi-Region
  • Domain 1 practice test (70%+)

Domain 2: Network Implementation (26%)

  • Section 1: Hybrid Implementation
  • Section 2: Multi-Account Connectivity
  • Section 3: Complex DNS Architectures
  • Section 4: Network Automation
  • Domain 2 practice test (70%+)

Domain 3: Network Management (20%)

  • Section 1: Routing & Connectivity Maintenance
  • Section 2: Monitoring & Troubleshooting
  • Section 3: Network Optimization
  • Domain 3 practice test (70%+)

Domain 4: Network Security (24%)

  • Section 1: Security Features Implementation
  • Section 2: Security Monitoring & Validation
  • Section 3: Data Confidentiality
  • Domain 4 practice test (70%+)

Advanced Topics:

  • Integration chapter (06_integration)
  • Full practice test 1 (75%+)
  • Full practice test 2 (75%+)
  • Full practice test 3 (80%+)

Final Preparation:

  • Study strategies reviewed (07_study_strategies)
  • Final checklist completed (08_final_checklist)
  • Cheat sheet memorized
  • Appendices reviewed (99_appendices)

Legend & Visual Markers

Throughout this guide, you'll see these markers:

  • ⭐ Must Know: Critical for exam - memorize this
  • 💡 Tip: Helpful insight or shortcut
  • ⚠️ Warning: Common mistake to avoid
  • 🔗 Connection: Related to other topics
  • 📝 Practice: Hands-on exercise
  • 🎯 Exam Focus: Frequently tested concept
  • 📊 Diagram: Visual representation available
  • 🏗️ Architecture: Design pattern or solution architecture
  • 🔧 Configuration: Specific configuration detail
  • 📈 Performance: Performance optimization tip
  • 🔒 Security: Security best practice
  • 💰 Cost: Cost optimization consideration

How to Navigate This Guide

For Complete Beginners to AWS Networking:

  1. Start with Chapter 0 (Fundamentals) - don't skip this
  2. Progress sequentially through all chapters
  3. Spend extra time on diagrams and examples
  4. Complete all practice exercises
  5. Allow 10-12 weeks for full preparation

For Experienced AWS Practitioners:

  1. Skim Chapter 0 to identify any gaps
  2. Focus on domain chapters (1-4)
  3. Pay special attention to advanced topics
  4. Use practice tests to identify weak areas
  5. Allow 8-10 weeks for preparation

For Networking Professionals New to AWS:

  1. Study Chapter 0 carefully - AWS networking differs from traditional
  2. Focus on AWS-specific services and patterns
  3. Understand the shared responsibility model
  4. Practice with AWS console and CLI
  5. Allow 10-12 weeks for preparation

Study Tips for Success

1. Understand, Don't Memorize

  • Focus on understanding WHY solutions work
  • Learn the decision-making process
  • Understand trade-offs between options
  • Know when to use each service

2. Think in Architectures

  • Always consider the complete solution
  • Understand component interactions
  • Think about failure scenarios
  • Consider scalability and performance

3. Master the Fundamentals

  • Strong VPC knowledge is essential
  • Understand routing thoroughly
  • Know security groups and NACLs cold
  • Master subnetting and CIDR

4. Practice Decision-Making

  • Learn to eliminate wrong answers
  • Identify constraint keywords
  • Recognize question patterns
  • Practice time management

5. Use Multiple Learning Methods

  • Read explanations
  • Study diagrams
  • Practice hands-on
  • Teach concepts to others
  • Take practice tests

Exam Day Preparation

One Week Before:

  • Complete all practice tests
  • Review flagged topics
  • Study cheat sheet daily
  • Review final checklist

Day Before:

  • Light review only (2-3 hours max)
  • Review cheat sheet
  • Get 8 hours of sleep
  • Prepare exam day materials

Exam Day:

  • Arrive 30 minutes early
  • Brain dump key facts on scratch paper
  • Read questions carefully
  • Manage time effectively (roughly 2.5 minutes per question)
  • Flag uncertain questions for review

Additional Resources

Official AWS Resources:

  • AWS Documentation (docs.aws.amazon.com)
  • AWS Whitepapers (aws.amazon.com/whitepapers)
  • AWS Architecture Center (aws.amazon.com/architecture)
  • AWS Well-Architected Framework

Practice Materials:

  • Practice test bundles (included in this package)
  • AWS Skill Builder (skillbuilder.aws)
  • AWS re:Invent sessions on networking

Community Resources:

  • AWS Forums (forums.aws.amazon.com)
  • AWS subreddit (r/aws)
  • AWS Certification subreddit (r/AWSCertifications)

Important Notes

About This Guide:

  • Comprehensive: 120,000-240,000 words covering all exam topics
  • Visual: 120-200 diagrams for complex concepts
  • Self-Sufficient: No external resources required
  • Exam-Focused: Only exam-relevant content included
  • Progressive: Builds from fundamentals to advanced topics

Content-Heavy Certification:

ANS-C01 is an advanced specialty certification with extensive technical depth. This guide provides:

  • Extended explanations (300-1,200 words per concept)
  • Multiple examples per topic (3+ examples)
  • Detailed architecture diagrams
  • Step-by-step implementation guides
  • Comprehensive troubleshooting sections

Time Investment:

  • Minimum: 8 weeks (3-4 hours daily)
  • Recommended: 10-12 weeks (2-3 hours daily)
  • With hands-on practice: Add 2-4 weeks

Prerequisites:

  • Strong understanding of networking fundamentals (TCP/IP, routing, switching)
  • Basic AWS knowledge (VPC, EC2, S3)
  • Experience with network design and troubleshooting
  • Familiarity with command-line interfaces

Getting Started

Ready to begin? Here's your first step:

  1. Read this overview completely ✓ (you're here!)
  2. Review the study plan and adjust timeline to your schedule
  3. Set up progress tracking (use checkboxes above)
  4. Begin Chapter 0 (01_fundamentals)
  5. Stay consistent - daily study is more effective than cramming

Motivation & Mindset

Why This Certification Matters:

  • Career Growth: Advanced networking skills are in high demand
  • Salary Impact: Specialty certifications command premium compensation
  • Technical Depth: Deep expertise in AWS networking architectures
  • Problem-Solving: Ability to design complex hybrid and multi-region solutions
  • Industry Recognition: AWS specialty certifications are highly respected

Success Mindset:

  • Be Patient: This is advanced material - take time to understand
  • Stay Curious: Ask "why" and "how" for every concept
  • Practice Regularly: Consistent daily study beats weekend cramming
  • Learn from Mistakes: Wrong answers teach as much as correct ones
  • Trust the Process: Follow the study plan and you'll succeed

You Can Do This!

Thousands of networking professionals have earned this certification. With dedication, consistent study, and this comprehensive guide, you'll join them. The journey is challenging but rewarding.

Let's begin your AWS Advanced Networking certification journey!


Next: Chapter 0 - AWS Networking Fundamentals (01_fundamentals)


Chapter 0: AWS Networking Fundamentals

What You Need to Know First

This certification assumes you have a strong foundation in traditional networking. Before diving into AWS-specific networking, ensure you understand these prerequisite concepts:

Traditional Networking Prerequisites

  • TCP/IP Protocol Suite - Understanding of layers 3-7, IP addressing, subnetting, CIDR notation
  • Routing Protocols - Static routing, dynamic routing (BGP, OSPF), route tables, routing decisions
  • Switching & VLANs - Layer 2 switching, VLAN concepts, trunking, spanning tree
  • Network Security - Firewalls, ACLs, stateful vs stateless filtering, encryption (IPsec, TLS)
  • DNS - DNS hierarchy, record types (A, AAAA, CNAME, MX, TXT), DNS resolution process
  • Load Balancing - Load balancing algorithms, health checks, session persistence
  • WAN Technologies - MPLS, VPN, dedicated circuits, SD-WAN concepts

If you're missing any: Review traditional networking fundamentals before proceeding. This guide assumes strong networking knowledge and focuses on AWS-specific implementations.

Core Concepts Foundation

AWS Global Infrastructure

What it is: AWS operates a global infrastructure consisting of Regions, Availability Zones, Edge Locations, and Points of Presence that form the foundation for all AWS networking services.

Why it matters: Understanding AWS's physical infrastructure is essential for designing resilient, low-latency, and compliant network architectures. Every networking decision you make relates back to this infrastructure.

Real-world analogy: Think of AWS infrastructure like a global postal system. Regions are major distribution centers (like regional postal hubs), Availability Zones are local post offices within a city (providing redundancy), and Edge Locations are mailboxes on street corners (bringing services closer to users).

Key components explained:

  1. AWS Regions: Geographic areas containing multiple isolated Availability Zones. Each Region is completely independent, with its own power, cooling, and networking infrastructure. As of 2024, AWS operates 30+ Regions globally. Regions are connected via AWS's private global network backbone, not the public internet.

  2. Availability Zones (AZs): Physically separated data centers within a Region, each with independent power, cooling, and networking. AZs within a Region are connected via high-bandwidth, low-latency private fiber links (typically sub-millisecond latency). Each AZ is designed to be isolated from failures in other AZs, providing fault tolerance.

  3. Edge Locations: AWS Points of Presence (PoPs) distributed globally (400+ locations) that cache content and provide low-latency access to AWS services like CloudFront and Route 53. Edge Locations are not full AWS Regions but serve as entry points to the AWS network.

  4. Local Zones: Extensions of AWS Regions that place compute, storage, and database services closer to end users in specific geographic areas. Useful for latency-sensitive applications requiring single-digit millisecond latency.

  5. Wavelength Zones: AWS infrastructure embedded within telecommunications providers' 5G networks, enabling ultra-low latency applications for mobile devices.

📊 AWS Global Infrastructure Diagram:

graph TB
    subgraph "AWS Global Infrastructure"
        subgraph "Region: us-east-1 (N. Virginia)"
            subgraph "AZ-1a"
                DC1[Data Center 1]
                DC2[Data Center 2]
            end
            subgraph "AZ-1b"
                DC3[Data Center 3]
                DC4[Data Center 4]
            end
            subgraph "AZ-1c"
                DC5[Data Center 5]
                DC6[Data Center 6]
            end
        end
        
        subgraph "Region: eu-west-1 (Ireland)"
            subgraph "AZ-2a"
                DC7[Data Center 7]
            end
            subgraph "AZ-2b"
                DC8[Data Center 8]
            end
            subgraph "AZ-2c"
                DC9[Data Center 9]
            end
        end
        
        subgraph "Edge Network"
            EDGE1[Edge Location - New York]
            EDGE2[Edge Location - London]
            EDGE3[Edge Location - Tokyo]
            EDGE4[Edge Location - Sydney]
        end
    end
    
    DC1 -.High-bandwidth fiber.-> DC3
    DC3 -.High-bandwidth fiber.-> DC5
    DC1 -.High-bandwidth fiber.-> DC5
    
    DC7 -.High-bandwidth fiber.-> DC8
    DC8 -.High-bandwidth fiber.-> DC9
    
    DC1 ==AWS Global Network==> DC7
    
    EDGE1 -.Content Delivery.-> DC1
    EDGE2 -.Content Delivery.-> DC7
    EDGE3 -.Content Delivery.-> DC1
    EDGE4 -.Content Delivery.-> DC7
    
    style DC1 fill:#c8e6c9
    style DC3 fill:#c8e6c9
    style DC5 fill:#c8e6c9
    style DC7 fill:#fff3e0
    style DC8 fill:#fff3e0
    style DC9 fill:#fff3e0
    style EDGE1 fill:#e1f5fe
    style EDGE2 fill:#e1f5fe
    style EDGE3 fill:#e1f5fe
    style EDGE4 fill:#e1f5fe

See: diagrams/01_fundamentals_global_infrastructure.mmd

Diagram Explanation (detailed):

This diagram illustrates AWS's hierarchical global infrastructure. At the top level, we see two AWS Regions (us-east-1 in green and eu-west-1 in orange), each completely independent and isolated. Within each Region, there are three Availability Zones (AZ-1a, AZ-1b, AZ-1c for us-east-1). Each AZ contains multiple data centers (shown as DC1-DC9) for redundancy within the AZ itself. The dotted lines between AZs within a Region represent high-bandwidth, low-latency private fiber connections (typically 1-2ms latency, 25-100 Gbps bandwidth). These connections enable synchronous replication and high-availability architectures.

The thick double line between Regions represents AWS's private global network backbone, which interconnects all Regions without traversing the public internet. This provides secure, high-bandwidth, and predictable latency for inter-region traffic. The Edge Locations (shown in blue) are distributed globally and connect to the nearest Region via AWS's network. When a user in New York requests content from CloudFront, the Edge Location in New York serves cached content or fetches it from the origin Region (us-east-1) via AWS's private network.

Understanding this infrastructure is critical because your networking decisions must account for: (1) AZ-level failures requiring multi-AZ deployments, (2) Region-level isolation requiring cross-region replication for disaster recovery, (3) Edge Location distribution for global content delivery, and (4) the private AWS backbone for secure inter-region communication.

⭐ Must Know (Critical Facts):

  • Minimum 3 AZs per Region: Every Region has at least 3 AZs for high availability (some have 6+)
  • AZ Latency: Sub-2ms latency between AZs in same Region (typically <1ms)
  • AZ Independence: Each AZ has independent power, cooling, networking, and physical security
  • Region Isolation: Regions are completely isolated; data doesn't leave a Region unless you explicitly configure it
  • Edge Location Count: 400+ Edge Locations globally (vs 30+ Regions)
  • Private Backbone: All inter-region AWS traffic uses AWS's private fiber network, not public internet

💡 Tip: When designing for high availability, always deploy across at least 2 AZs (preferably 3). When designing for disaster recovery, deploy across at least 2 Regions.


Amazon Virtual Private Cloud (VPC)

What it is: A VPC is a logically isolated virtual network within AWS where you launch AWS resources. It's your own private section of the AWS cloud where you have complete control over IP addressing, subnets, route tables, and network gateways.

Why it exists: In traditional data centers, you have physical network isolation through VLANs, firewalls, and routers. In AWS's multi-tenant cloud environment, VPCs provide logical isolation using software-defined networking (SDN). Each VPC is isolated from other VPCs and the internet by default, giving you a secure environment to run your workloads. VPCs solve the problem of network isolation, security, and control in a shared cloud infrastructure.

Real-world analogy: Think of a VPC like an apartment building. The entire AWS Region is like a city, and your VPC is your private apartment within that city. You control who enters your apartment (security groups), what rooms exist (subnets), and how people move between rooms (route tables). Other tenants (other AWS customers) have their own apartments (VPCs) that are completely isolated from yours, even though you're all in the same building (Region).

How it works (Detailed step-by-step):

  1. VPC Creation: When you create a VPC, you specify an IPv4 CIDR block (e.g., 10.0.0.0/16), which defines the private IP address range for your VPC. This CIDR block can be between /16 (65,536 IPs) and /28 (16 IPs). AWS reserves 5 IP addresses in each subnet (first 4 and last 1) for networking purposes. You can optionally assign an IPv6 CIDR block (/56) for dual-stack networking.

  2. Subnet Creation: You divide your VPC CIDR block into smaller subnets, each associated with a specific Availability Zone. For example, from 10.0.0.0/16, you might create 10.0.1.0/24 in AZ-1a and 10.0.2.0/24 in AZ-1b. Subnets cannot span multiple AZs. Each subnet can be designated as public (has route to internet gateway) or private (no direct internet access).

  3. Route Table Association: Each subnet must be associated with a route table that controls traffic routing. The route table contains rules (routes) that determine where network traffic is directed. For example, a route of 10.0.0.0/16 → local keeps traffic within the VPC, while 0.0.0.0/0 → igw-xxx sends internet-bound traffic to an internet gateway.

  4. Gateway Attachment: To enable internet connectivity, you attach an Internet Gateway (IGW) to the VPC. For private subnet internet access, you deploy a NAT Gateway in a public subnet. For hybrid connectivity, you attach a Virtual Private Gateway (VGW) for VPN connections or use Direct Connect Gateway for dedicated connections.

  5. Security Configuration: You configure security groups (stateful, instance-level firewalls) and network ACLs (stateless, subnet-level firewalls) to control inbound and outbound traffic. Security groups use allow rules only, while NACLs support both allow and deny rules.

  6. Resource Deployment: You launch EC2 instances, RDS databases, Lambda functions, and other resources into specific subnets. Each resource receives a private IP from the subnet's CIDR range. Resources in public subnets can optionally receive public IPs or Elastic IPs for internet accessibility.
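
These six steps can be scripted. Below is a minimal boto3 (Python) sketch covering steps 1-4 for a single public subnet; it assumes credentials and a default Region are already configured, and all CIDRs, AZ names, and resource names are illustrative:

import boto3

ec2 = boto3.client("ec2")

# Create the VPC with a /16 CIDR
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]
vpc_id = vpc["VpcId"]

# Create one public subnet in a single AZ
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24",
                           AvailabilityZone="us-east-1a")["Subnet"]

# Create and attach an Internet Gateway
igw = ec2.create_internet_gateway()["InternetGateway"]
ec2.attach_internet_gateway(InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc_id)

# Custom route table with a default route to the IGW, associated with the subnet
rtb = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]
ec2.create_route(RouteTableId=rtb["RouteTableId"],
                 DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw["InternetGatewayId"])
ec2.associate_route_table(RouteTableId=rtb["RouteTableId"], SubnetId=subnet["SubnetId"])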

📊 VPC Architecture Diagram:

graph TB
    subgraph "AWS Region: us-east-1"
        subgraph "VPC: 10.0.0.0/16"
            IGW[Internet Gateway]
            
            subgraph "Availability Zone 1a"
                subgraph "Public Subnet: 10.0.1.0/24"
                    WEB1[Web Server<br/>10.0.1.10<br/>Public IP: 54.x.x.x]
                    NAT1[NAT Gateway<br/>10.0.1.20<br/>EIP: 52.x.x.x]
                end
                subgraph "Private Subnet: 10.0.3.0/24"
                    APP1[App Server<br/>10.0.3.10]
                    DB1[Database<br/>10.0.3.20]
                end
            end
            
            subgraph "Availability Zone 1b"
                subgraph "Public Subnet: 10.0.2.0/24"
                    WEB2[Web Server<br/>10.0.2.10<br/>Public IP: 54.y.y.y]
                    NAT2[NAT Gateway<br/>10.0.2.20<br/>EIP: 52.y.y.y]
                end
                subgraph "Private Subnet: 10.0.4.0/24"
                    APP2[App Server<br/>10.0.4.10]
                    DB2[Database<br/>10.0.4.20]
                end
            end
            
            VGW[Virtual Private Gateway]
        end
    end
    
    INTERNET((Internet))
    ONPREM[On-Premises<br/>Data Center<br/>192.168.0.0/16]
    
    INTERNET <-->|Public Traffic| IGW
    IGW --> WEB1
    IGW --> WEB2
    
    WEB1 --> APP1
    WEB2 --> APP2
    
    APP1 --> DB1
    APP2 --> DB2
    
    APP1 -.Outbound Internet.-> NAT1
    APP2 -.Outbound Internet.-> NAT2
    NAT1 --> IGW
    NAT2 --> IGW
    
    VGW <-->|VPN/Direct Connect| ONPREM
    VGW -.Private Connectivity.-> APP1
    VGW -.Private Connectivity.-> APP2
    
    style IGW fill:#e1f5fe
    style NAT1 fill:#fff3e0
    style NAT2 fill:#fff3e0
    style WEB1 fill:#c8e6c9
    style WEB2 fill:#c8e6c9
    style APP1 fill:#f3e5f5
    style APP2 fill:#f3e5f5
    style DB1 fill:#ffebee
    style DB2 fill:#ffebee
    style VGW fill:#fce4ec

See: diagrams/01_fundamentals_vpc_architecture.mmd

Diagram Explanation (detailed):

This diagram shows a production-ready VPC architecture implementing a three-tier application across two Availability Zones for high availability. The VPC uses the 10.0.0.0/16 CIDR block, providing 65,536 IP addresses. The architecture is divided into public and private subnets across two AZs.

Public Subnets (10.0.1.0/24 and 10.0.2.0/24): These subnets have a route table entry pointing 0.0.0.0/0 to the Internet Gateway (IGW), making them "public." Resources here can receive public IP addresses and communicate directly with the internet. The Web Servers (green) have both private IPs (10.0.1.10, 10.0.2.10) and public IPs (54.x.x.x, 54.y.y.y) for inbound internet traffic. The NAT Gateways (orange) also reside in public subnets and have Elastic IPs (52.x.x.x, 52.y.y.y) for outbound internet access from private subnets.

Private Subnets (10.0.3.0/24 and 10.0.4.0/24): These subnets have no direct route to the IGW, making them "private." Resources here (App Servers in purple, Databases in red) only have private IP addresses and cannot be directly accessed from the internet. For outbound internet access (e.g., downloading patches), traffic routes through the NAT Gateway in the same AZ (shown by dotted lines). This provides security by preventing inbound connections while allowing outbound.

Traffic Flows: (1) Internet users connect to Web Servers via the IGW using public IPs. (2) Web Servers communicate with App Servers using private IPs within the VPC (10.0.x.x). (3) App Servers query Databases using private IPs. (4) App Servers access the internet (for API calls, updates) via NAT Gateway in their AZ. (5) On-premises systems connect to App Servers via the Virtual Private Gateway (VGW) using VPN or Direct Connect, accessing private IPs directly.

High Availability Design: Each tier is deployed in both AZs. If AZ-1a fails completely, the application continues running in AZ-1b. The NAT Gateways are deployed per-AZ (not shared) because they're AZ-specific resources. If NAT1 fails, APP1 loses internet access, but APP2 continues via NAT2.

Security Layers: (1) Internet Gateway only allows traffic to/from resources with public IPs. (2) NAT Gateways provide one-way internet access (outbound only). (3) Security groups on each instance control allowed traffic. (4) Network ACLs at subnet boundaries provide additional filtering. (5) Virtual Private Gateway encrypts traffic to on-premises.

⭐ Must Know (Critical VPC Facts):

  • Default VPC: Each AWS account comes with a default VPC in each Region (172.31.0.0/16)
  • VPC CIDR Limits: Primary CIDR must be /16 to /28; can add up to 4 secondary CIDRs
  • Subnet Sizing: AWS reserves 5 IPs per subnet (the first four addresses and the last address; e.g., .0, .1, .2, .3, and .255 in a /24)
  • Subnet-AZ Binding: Each subnet exists in exactly one AZ; cannot span AZs
  • IGW Limit: One Internet Gateway per VPC (1:1 relationship)
  • VPC Peering: VPCs can peer within same Region or across Regions (no transitive routing)
  • DNS Resolution: VPC provides DNS at VPC_CIDR+2 (e.g., 10.0.0.2 for 10.0.0.0/16)

Detailed Example 1: E-Commerce Application Deployment

Imagine you're deploying an e-commerce platform on AWS. Your application has web servers, application servers, and a MySQL database. You need high availability, security, and the ability to connect to your on-premises inventory system.

Step 1 - VPC Design: You create a VPC with CIDR 10.0.0.0/16 in us-east-1, giving you 65,536 IP addresses. You plan for growth and multi-tier architecture.

Step 2 - Subnet Planning: You create 6 subnets:

  • Public Subnet AZ-1a: 10.0.1.0/24 (256 IPs, 251 usable)
  • Public Subnet AZ-1b: 10.0.2.0/24 (256 IPs, 251 usable)
  • Private App Subnet AZ-1a: 10.0.10.0/24 (256 IPs, 251 usable)
  • Private App Subnet AZ-1b: 10.0.11.0/24 (256 IPs, 251 usable)
  • Private DB Subnet AZ-1a: 10.0.20.0/24 (256 IPs, 251 usable)
  • Private DB Subnet AZ-1b: 10.0.21.0/24 (256 IPs, 251 usable)

Step 3 - Gateway Configuration: You attach an Internet Gateway to the VPC for public internet access. You create NAT Gateways in each public subnet (10.0.1.20 and 10.0.2.20) for private subnet internet access. You attach a Virtual Private Gateway and configure a VPN tunnel to your on-premises data center (192.168.0.0/16).

Step 4 - Route Table Setup:

  • Public route table: 10.0.0.0/16 → local, 0.0.0.0/0 → IGW, 192.168.0.0/16 → VGW
  • Private AZ-1a route table: 10.0.0.0/16 → local, 0.0.0.0/0 → NAT-1a, 192.168.0.0/16 → VGW
  • Private AZ-1b route table: 10.0.0.0/16 → local, 0.0.0.0/0 → NAT-1b, 192.168.0.0/16 → VGW

Step 5 - Resource Deployment: You launch Application Load Balancers in public subnets, EC2 web servers in private app subnets, and RDS MySQL in private DB subnets. The ALB has public IPs and distributes traffic to web servers. Web servers access the database using private IPs. The database replicates synchronously across AZs.

Step 6 - Security Configuration: Web server security group allows inbound 443 from ALB security group. Database security group allows inbound 3306 from web server security group. NAT Gateway security is implicit (managed by AWS). VPN traffic is encrypted with IPsec.
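
Step 6 can be expressed as two security-group rules. A minimal boto3 sketch follows; it assumes the three security groups already exist and uses placeholder group IDs:

import boto3

ec2 = boto3.client("ec2")

# Placeholder security group IDs (in practice returned by create_security_group)
alb_sg, web_sg, db_sg = "sg-0alb", "sg-0web", "sg-0db"

# Web tier: allow HTTPS only from the ALB security group
ec2.authorize_security_group_ingress(
    GroupId=web_sg,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": alb_sg}],
    }],
)

# Database tier: allow MySQL only from the web tier security group
ec2.authorize_security_group_ingress(
    GroupId=db_sg,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": web_sg}],
    }],
)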

Result: Your e-commerce platform is highly available (survives AZ failure), secure (databases not internet-accessible), and integrated with on-premises (VPN for inventory queries). Customers access the site via the ALB's public DNS, which resolves to public IPs in multiple AZs.

Detailed Example 2: Multi-Tier SaaS Application with Microservices

You're building a SaaS platform with multiple microservices: authentication service, API gateway, data processing service, and analytics service. Each service needs isolation, scalability, and secure communication.

Architecture Decision: You create a VPC with 10.0.0.0/16 and use subnet segmentation for service isolation. Public subnets (10.0.1.0/24, 10.0.2.0/24) host ALBs. Private subnets are divided by service: Auth (10.0.10.0/24, 10.0.11.0/24), API (10.0.20.0/24, 10.0.21.0/24), Processing (10.0.30.0/24, 10.0.31.0/24), Analytics (10.0.40.0/24, 10.0.41.0/24).

Service Communication: Each microservice has its own security group. The API Gateway security group allows inbound from ALB security group. The Auth service security group allows inbound from API Gateway security group. The Processing service security group allows inbound from API Gateway security group. This creates a security boundary where services can only communicate through defined paths.

Scaling Strategy: Each microservice uses Auto Scaling Groups spanning both AZs. The ALB performs health checks and routes traffic only to healthy instances. When load increases, Auto Scaling launches new instances in both AZs, maintaining balance. The ALB automatically includes new instances in its target group.

Data Flow: (1) User request hits ALB public IP. (2) ALB routes to API Gateway instance in private subnet. (3) API Gateway calls Auth service to validate token. (4) Auth service queries DynamoDB (via VPC endpoint, no internet). (5) API Gateway calls Processing service with validated request. (6) Processing service stores results in S3 (via VPC endpoint). (7) Response flows back through API Gateway to ALB to user.

Cost Optimization: You use VPC endpoints for S3 and DynamoDB access instead of routing through NAT Gateway, saving NAT Gateway data processing charges ($0.045/GB). For a service processing 10TB/month, this saves $450/month in NAT Gateway fees.
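
The endpoint-based optimization above can be sketched with boto3; the VPC and route table IDs below are placeholders, and the final line simply restates the arithmetic from the paragraph:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3: adds a prefix-list route to the given route tables,
# so S3 traffic from private subnets bypasses the NAT Gateway entirely
ec2.create_vpc_endpoint(
    VpcId="vpc-0example",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0private-a", "rtb-0private-b"],
)

# Rough saving estimate: 10 TB/month through NAT at $0.045/GB
print(10_000 * 0.045)  # ~= $450/month in NAT data processing charges avoided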

Detailed Example 3: Hybrid Cloud with Direct Connect

Your company has a large on-premises data center (10.0.0.0/8) and wants to extend to AWS for burst capacity and disaster recovery. You need private, high-bandwidth connectivity.

Challenge: Your on-premises network uses 10.0.0.0/8, which overlaps with typical AWS VPC ranges. You need to carefully plan IP addressing to avoid conflicts.

Solution: You create a VPC with non-overlapping CIDR 172.16.0.0/16. You provision AWS Direct Connect with a 10 Gbps connection from your data center to AWS. You create a Virtual Private Gateway and attach it to your VPC. You configure a private Virtual Interface (VIF) on the Direct Connect connection, associating it with the VGW.

Routing Configuration: On-premises routers advertise 10.0.0.0/8 via BGP over the Direct Connect connection. AWS VGW advertises 172.16.0.0/16 back to on-premises. Your VPC route tables have: 172.16.0.0/16 → local, 10.0.0.0/8 → VGW. On-premises route tables have: 10.0.0.0/8 → local, 172.16.0.0/16 → Direct Connect.

Traffic Flow: An on-premises application server (10.50.30.20) needs to query an AWS RDS database (172.16.10.50). The packet leaves the on-premises server, hits the on-premises router, which routes it to the Direct Connect connection based on destination 172.16.0.0/16. The packet traverses the Direct Connect private fiber link (not internet), arrives at the VGW, which routes it to the RDS subnet based on destination 172.16.10.50. The RDS database receives the query, processes it, and sends the response back via the same path.

Performance: Direct Connect provides consistent latency (typically 1-10ms depending on distance) and high bandwidth (10 Gbps). Unlike VPN over internet, there's no encryption overhead (you can add MACsec for layer 2 encryption if needed). You can burst to full 10 Gbps without internet congestion issues.

Redundancy: For production, you provision a second Direct Connect connection to a different AWS Direct Connect location. You configure BGP with AS path prepending to make one connection primary and the other backup. If the primary connection fails, BGP automatically fails over to the backup within seconds.
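
The VGW-plus-private-VIF setup described in this example can be sketched with boto3. This is a sketch only, assuming the Direct Connect connection already exists; all IDs, the VLAN, and the ASN are illustrative placeholders:

import boto3

ec2 = boto3.client("ec2")
dx = boto3.client("directconnect")

# Virtual Private Gateway attached to the 172.16.0.0/16 VPC
vgw = ec2.create_vpn_gateway(Type="ipsec.1")["VpnGateway"]
ec2.attach_vpn_gateway(VpcId="vpc-0example", VpnGatewayId=vgw["VpnGatewayId"])

# Private VIF on the existing Direct Connect connection, associated with the VGW
dx.create_private_virtual_interface(
    connectionId="dxcon-EXAMPLE",
    newPrivateVirtualInterface={
        "virtualInterfaceName": "onprem-private-vif",
        "vlan": 101,            # placeholder VLAN
        "asn": 65000,           # placeholder on-premises BGP ASN
        "virtualGatewayId": vgw["VpnGatewayId"],
    },
)

# Let BGP-learned on-premises prefixes (10.0.0.0/8) populate a VPC route table
ec2.enable_vgw_route_propagation(RouteTableId="rtb-0example",
                                 GatewayId=vgw["VpnGatewayId"])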

💡 Tips for Understanding VPCs:

  • Think in Layers: VPC (network boundary) → Subnets (AZ-specific segments) → Route Tables (traffic control) → Security Groups/NACLs (firewalls) → Resources (EC2, RDS, etc.)
  • Public vs Private: A subnet is "public" only if its route table has 0.0.0.0/0 → IGW; otherwise it's private
  • NAT Gateway Placement: Always place NAT Gateways in public subnets; they need IGW access
  • AZ Strategy: Deploy critical resources in at least 2 AZs; use 3 AZs for maximum availability
  • CIDR Planning: Plan for growth; use /16 for large VPCs, /24 for subnets; avoid overlaps with on-premises

āš ļø Common Mistakes & Misconceptions:

Mistake 1: Placing NAT Gateway in a private subnet

  • Why it's wrong: NAT Gateway needs internet access via IGW to function; it must be in a public subnet
  • Correct understanding: NAT Gateway sits in public subnet, has Elastic IP, routes to IGW; private subnets route to NAT Gateway for outbound internet

Mistake 2: Thinking VPC peering is transitive

  • Why it's wrong: If VPC-A peers with VPC-B, and VPC-B peers with VPC-C, VPC-A cannot reach VPC-C through VPC-B
  • Correct understanding: VPC peering is non-transitive; you need direct peering between each VPC pair, or use Transit Gateway for hub-and-spoke

Mistake 3: Assuming all resources in a public subnet are internet-accessible

  • Why it's wrong: Being in a public subnet doesn't automatically make a resource accessible from internet; it needs a public IP or Elastic IP
  • Correct understanding: Public subnet + public IP/EIP + security group allowing inbound = internet accessible

Mistake 4: Using the same NAT Gateway for all AZs

  • Why it's wrong: NAT Gateway is an AZ-specific resource; if the AZ fails, all dependent AZs lose internet access
  • Correct understanding: Deploy one NAT Gateway per AZ for high availability; costs more but prevents single point of failure

Mistake 5: Forgetting AWS reserves 5 IPs per subnet

  • Why it's wrong: A /24 subnet has 256 IPs, but only 251 are usable; .0 (network), .1 (VPC router), .2 (DNS), .3 (future), .255 (broadcast) are reserved
  • Correct understanding: Always subtract 5 from theoretical subnet size; plan accordingly for large deployments

🔗 Connections to Other Topics:

  • Relates to Direct Connect because: VPCs connect to on-premises via VGW attached to VPC; understanding VPC routing is essential for hybrid connectivity
  • Builds on Route 53 by: VPC provides internal DNS resolution at VPC_CIDR+2; Route 53 private hosted zones associate with VPCs for custom DNS
  • Often used with Transit Gateway to: Connect multiple VPCs without full-mesh peering; Transit Gateway attaches to VPCs and routes between them

IP Addressing and Subnetting in AWS

What it is: IP addressing in AWS follows standard IPv4 and IPv6 protocols but with AWS-specific constraints and best practices. Subnetting divides your VPC CIDR block into smaller network segments for organizational and security purposes.

Why it exists: Proper IP addressing prevents conflicts, enables efficient routing, supports growth, and facilitates hybrid connectivity. In AWS, IP addressing must account for multi-AZ deployments, VPC peering, Direct Connect, and potential future expansion. Poor IP planning leads to overlapping CIDRs, exhausted address space, and complex renumbering projects.

Real-world analogy: Think of IP addressing like a postal system. Your VPC CIDR (e.g., 10.0.0.0/16) is like a city's zip code range. Subnets (e.g., 10.0.1.0/24) are like neighborhoods within the city. Individual IPs (e.g., 10.0.1.50) are like specific street addresses. Just as cities plan neighborhoods for growth and organization, you plan subnets for scalability and isolation.

How it works (Detailed step-by-step):

  1. CIDR Block Selection: When creating a VPC, you choose a CIDR block from RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) or any other range. The CIDR must be between /16 (65,536 IPs) and /28 (16 IPs). AWS recommends /16 for large VPCs to allow for growth. You must avoid overlapping with on-premises networks, other VPCs you'll peer with, and AWS service ranges.

  2. Subnet Calculation: You divide the VPC CIDR into subnets using CIDR notation. For example, from 10.0.0.0/16, you can create 256 /24 subnets (10.0.0.0/24 through 10.0.255.0/24). Each /24 subnet provides 256 IPs, but AWS reserves 5, leaving 251 usable. The formula is: usable IPs = 2^(32-prefix) - 5. A /24 = 2^(32-24) - 5 = 256 - 5 = 251 usable IPs. (A quick Python check of this arithmetic appears right after this list.)

  3. Reserved IP Addresses: In each subnet, AWS reserves: (1) .0 = network address, (2) .1 = VPC router, (3) .2 = DNS server, (4) .3 = reserved for future use, (5) .255 = broadcast address (not used in VPC but reserved). For example, in 10.0.1.0/24: 10.0.1.0, 10.0.1.1, 10.0.1.2, 10.0.1.3, and 10.0.1.255 are unavailable. First usable IP is 10.0.1.4, last is 10.0.1.254.

  4. IP Assignment: When you launch a resource (EC2, RDS, Lambda in VPC), AWS assigns a private IP from the subnet's CIDR range. You can let AWS auto-assign (next available IP) or specify a particular IP. For public accessibility, you assign a public IP (ephemeral, changes on stop/start) or Elastic IP (static, persists). Public IPs are from AWS's public IP pool, not your VPC CIDR.

  5. Secondary CIDRs: If you exhaust your primary CIDR, you can add up to 4 secondary CIDR blocks to the VPC. Secondary CIDRs must not overlap with primary or each other. This is useful for growth without migrating to a new VPC. However, secondary CIDRs have limitations with VPN and Direct Connect (some services don't support them).

  6. IPv6 Addressing: You can enable IPv6 by requesting an AWS-provided /56 CIDR block (or bring your own). AWS assigns a /64 subnet from the /56 to each subnet. IPv6 addresses are globally unique and routable. All IPv6 addresses in AWS are public (no private IPv6 ranges). You configure dual-stack (IPv4 + IPv6) for compatibility.
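
The sizing arithmetic from step 2 (usable IPs = 2^(32-prefix) - 5) is easy to verify with Python's standard ipaddress module. A small sketch, with no AWS calls involved:

import ipaddress

def usable_ips(cidr: str) -> int:
    """Usable addresses in an AWS subnet: total size minus the 5 AWS-reserved IPs."""
    return ipaddress.ip_network(cidr).num_addresses - 5

print(usable_ips("10.0.1.0/24"))   # 251
print(usable_ips("10.0.1.0/27"))   # 27
print(usable_ips("10.0.1.0/28"))   # 11

# Carving a /16 VPC into /24 subnets yields 256 candidate subnets
vpc = ipaddress.ip_network("10.0.0.0/16")
print(sum(1 for _ in vpc.subnets(new_prefix=24)))  # 256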

📊 IP Addressing and Subnetting Diagram:

graph TB
    subgraph "VPC: 10.0.0.0/16 (65,536 IPs)"
        subgraph "Subnet Planning"
            S1["Public Subnet AZ-1a<br/>10.0.1.0/24<br/>256 IPs (251 usable)"]
            S2["Public Subnet AZ-1b<br/>10.0.2.0/24<br/>256 IPs (251 usable)"]
            S3["Private Subnet AZ-1a<br/>10.0.10.0/24<br/>256 IPs (251 usable)"]
            S4["Private Subnet AZ-1b<br/>10.0.11.0/24<br/>256 IPs (251 usable)"]
            S5["Reserved for Growth<br/>10.0.20.0/22<br/>1,024 IPs"]
        end
        
        subgraph "Reserved IPs in 10.0.1.0/24"
            R1["10.0.1.0<br/>Network Address"]
            R2["10.0.1.1<br/>VPC Router"]
            R3["10.0.1.2<br/>DNS Server"]
            R4["10.0.1.3<br/>Future Use"]
            R5["10.0.1.255<br/>Broadcast"]
        end
        
        subgraph "Usable IPs in 10.0.1.0/24"
            U1["10.0.1.4 - 10.0.1.254<br/>251 Usable IPs"]
        end
    end
    
    style S1 fill:#c8e6c9
    style S2 fill:#c8e6c9
    style S3 fill:#fff3e0
    style S4 fill:#fff3e0
    style S5 fill:#e1f5fe
    style R1 fill:#ffebee
    style R2 fill:#ffebee
    style R3 fill:#ffebee
    style R4 fill:#ffebee
    style R5 fill:#ffebee
    style U1 fill:#c8e6c9

See: diagrams/01_fundamentals_ip_addressing.mmd

Diagram Explanation (detailed):

This diagram illustrates IP address allocation and reservation within a VPC. Starting with a /16 VPC CIDR (10.0.0.0/16), we have 65,536 total IP addresses available. The diagram shows how this large block is subdivided into smaller, manageable subnets for different purposes.

The top section shows subnet planning across two Availability Zones. Public subnets (green) use 10.0.1.0/24 and 10.0.2.0/24, each providing 256 IPs. Private subnets (orange) use 10.0.10.0/24 and 10.0.11.0/24. Notice the gap between 10.0.2.0/24 and 10.0.10.0/24 - this is intentional planning for future public subnets. The blue block (10.0.20.0/22) represents reserved space for future growth, providing 1,024 IPs that can be subdivided later.

The middle section details the 5 reserved IPs in every subnet, using 10.0.1.0/24 as an example. These reservations (shown in red) are: (1) 10.0.1.0 is the network address identifying the subnet itself, (2) 10.0.1.1 is the VPC router that handles routing between subnets and to gateways, (3) 10.0.1.2 is the DNS server providing name resolution (Amazon-provided DNS), (4) 10.0.1.3 is reserved by AWS for future use (currently unused but unavailable), and (5) 10.0.1.255 is the broadcast address (not actually used in VPC but reserved per IP standards).

The bottom section shows the usable IP range (green): 10.0.1.4 through 10.0.1.254, totaling 251 addresses. These are the IPs you can assign to EC2 instances, RDS databases, Lambda ENIs, load balancers, and other resources. When you launch a resource in this subnet, AWS assigns an IP from this range.

Key Insight: The 5-IP reservation means a /28 subnet (16 IPs) only provides 11 usable IPs, and a /27 (32 IPs) provides 27 usable. For production workloads with auto-scaling, always use /24 or larger to avoid IP exhaustion. If you need 100 instances in a subnet, a /24 (251 usable) is sufficient, but a /25 (123 usable) provides less headroom for growth.

⭐ Must Know (Critical IP Addressing Facts):

  • VPC CIDR Range: /16 (65,536 IPs) to /28 (16 IPs); /16 recommended for production
  • Reserved IPs: 5 per subnet (.0, .1, .2, .3, .255); always subtract from total
  • Secondary CIDRs: Up to 4 additional CIDRs; must not overlap; some limitations with VPN/DX
  • Public IP Types: Auto-assigned public IP (ephemeral, changes on stop/start) vs Elastic IP (static, persists across stop/start); under current pricing, all public IPv4 addresses, including attached Elastic IPs, are billed hourly (about $0.005/hour)
  • IPv6 CIDR: AWS provides /56 for VPC, /64 per subnet; all IPv6 addresses are public
  • CIDR Overlap: Cannot peer VPCs with overlapping CIDRs; plan carefully
  • Bring Your Own IP (BYOIP): Can bring public IPv4 (/24 or larger) or IPv6 (/48 or larger) to AWS

Detailed Example 1: Enterprise VPC with Growth Planning

Your company is migrating a large application to AWS. Current on-premises deployment has 500 servers, but you expect 3x growth over 5 years. You need to plan IP addressing for current needs plus future expansion.

Requirements Analysis:

  • Current: 500 servers across 3 tiers (web, app, database)
  • Growth: 1,500 servers in 5 years
  • High availability: 3 Availability Zones
  • Tiers: Public (web), Private (app), Private (database)
  • On-premises: Uses 192.168.0.0/16 (must avoid overlap)

CIDR Selection: You choose 10.0.0.0/16 for the VPC, providing 65,536 IPs. This avoids overlap with on-premises (192.168.0.0/16) and provides ample room for growth.

Subnet Design:

  • Public subnets (web tier): 10.0.1.0/24 (AZ-a), 10.0.2.0/24 (AZ-b), 10.0.3.0/24 (AZ-c) = 753 usable IPs
  • Private app subnets: 10.0.10.0/23 (AZ-a), 10.0.12.0/23 (AZ-b), 10.0.14.0/23 (AZ-c) = 1,521 usable IPs
  • Private DB subnets: 10.0.20.0/24 (AZ-a), 10.0.21.0/24 (AZ-b), 10.0.22.0/24 (AZ-c) = 753 usable IPs
  • Reserved for future: 10.0.32.0/19 = 8,192 IPs

Rationale: Web tier uses /24 (251 usable per AZ) because web servers are typically fewer and behind load balancers. App tier uses /23 (507 usable per AZ) because this is where most compute happens. Database tier uses /24 (251 usable per AZ) because databases are fewer but need dedicated space. The /19 reserved block provides massive expansion capacity.

IP Assignment Strategy: You use auto-assigned private IPs for most resources. For critical infrastructure (NAT Gateways, bastion hosts, monitoring servers), you manually assign IPs from the low end of each subnet (e.g., 10.0.1.10, 10.0.1.11) for easy identification. Auto-scaling groups use auto-assigned IPs from the remaining pool.

Growth Accommodation: With current design, you can deploy 1,500 app servers (500 per AZ) within the /23 subnets. If you exceed this, you can subdivide the /19 reserved block into additional /23 subnets. The VPC has capacity for 65,000+ IPs, far exceeding your 5-year projection.
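
If the plan above is kept in code or a CIDR registry, a short ipaddress sketch can verify that every subnet fits inside the VPC CIDR and that none overlap (the CIDRs below mirror the plan in this example):

import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.0.0.0/16")
plan = [
    "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24",      # public / web
    "10.0.10.0/23", "10.0.12.0/23", "10.0.14.0/23",   # private app
    "10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24",   # private DB
    "10.0.32.0/19",                                    # reserved for growth
]
subnets = [ipaddress.ip_network(c) for c in plan]

assert all(s.subnet_of(vpc) for s in subnets), "subnet outside VPC CIDR"
assert not any(a.overlaps(b) for a, b in combinations(subnets, 2)), "overlap found"
print("plan is valid: every subnet fits inside", vpc, "with no overlaps")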

Detailed Example 2: Multi-VPC Architecture with Peering

You're building a multi-tenant SaaS platform where each major customer gets a dedicated VPC for isolation. You need to peer these VPCs with a shared services VPC (for monitoring, logging, authentication).

Challenge: With multiple VPCs, you must ensure no CIDR overlaps, as overlapping VPCs cannot peer.

Solution - CIDR Allocation Strategy:

  • Shared Services VPC: 10.0.0.0/16
  • Customer VPC 1: 10.1.0.0/16
  • Customer VPC 2: 10.2.0.0/16
  • Customer VPC 3: 10.3.0.0/16
  • Customer VPC 4-254: 10.4.0.0/16 through 10.254.0.0/16 (reserved for future customers)

Peering Configuration: Each customer VPC peers with the shared services VPC. Customer VPC 1 has peering connection to 10.0.0.0/16. Customer VPC 2 has peering connection to 10.0.0.0/16. Customers cannot peer with each other (no transitive routing), ensuring isolation.

Routing: Customer VPC 1 route table has: 10.1.0.0/16 → local, 10.0.0.0/16 → pcx-xxx (peering connection). Shared services VPC route table has: 10.0.0.0/16 → local, 10.1.0.0/16 → pcx-xxx, 10.2.0.0/16 → pcx-yyy, 10.3.0.0/16 → pcx-zzz.
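
A minimal boto3 sketch of one customer peering plus the two routes described above; the IDs are placeholders, and the accept call assumes both VPCs are in the same account and Region:

import boto3

ec2 = boto3.client("ec2")

# Request and accept a peering between Customer VPC 1 and the shared services VPC
pcx = ec2.create_vpc_peering_connection(
    VpcId="vpc-customer1",        # 10.1.0.0/16
    PeerVpcId="vpc-shared",       # 10.0.0.0/16
)["VpcPeeringConnection"]
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx["VpcPeeringConnectionId"])

# Customer VPC 1 route table: send shared-services traffic over the peering
ec2.create_route(RouteTableId="rtb-customer1",
                 DestinationCidrBlock="10.0.0.0/16",
                 VpcPeeringConnectionId=pcx["VpcPeeringConnectionId"])

# Shared services route table: return path to Customer VPC 1
ec2.create_route(RouteTableId="rtb-shared",
                 DestinationCidrBlock="10.1.0.0/16",
                 VpcPeeringConnectionId=pcx["VpcPeeringConnectionId"])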

Scalability: This design supports 254 customer VPCs (10.1.0.0/16 through 10.254.0.0/16). Each customer VPC has 65,536 IPs. If you need more customers, you can use 172.16.0.0/12 range (172.16.0.0/16 through 172.31.0.0/16) for an additional 16 customer VPCs, or use Transit Gateway instead of peering for better scalability.

Security Benefit: CIDR-based isolation means even if a customer VPC is compromised, the attacker cannot reach other customer VPCs (no routing path exists). They can only reach shared services VPC, which has strict security groups allowing only specific ports (e.g., 443 for API, 514 for logging).

Detailed Example 3: IPv6 Dual-Stack Deployment

Your application needs to support IPv6 for compliance and future-proofing. You want dual-stack (IPv4 + IPv6) to maintain backward compatibility.

IPv6 CIDR Request: You enable IPv6 on your VPC (10.0.0.0/16). AWS assigns an IPv6 CIDR block from its pool, for example: 2600:1f13:1234:5600::/56. This /56 block provides 256 /64 subnets (2^(64-56) = 256).

Subnet IPv6 Assignment: For each subnet, you assign a /64 from the VPC's /56:

  • Public Subnet AZ-1a: 10.0.1.0/24 (IPv4) + 2600:1f13:1234:5600::/64 (IPv6)
  • Public Subnet AZ-1b: 10.0.2.0/24 (IPv4) + 2600:1f13:1234:5601::/64 (IPv6)
  • Private Subnet AZ-1a: 10.0.10.0/24 (IPv4) + 2600:1f13:1234:5610::/64 (IPv6)

IPv6 Addressing: Each /64 subnet provides 18 quintillion IPs (2^64). AWS doesn't reserve IPs in IPv6 subnets like it does for IPv4. When you launch an EC2 instance, it receives both an IPv4 address (e.g., 10.0.1.50) and an IPv6 address (e.g., 2600:1f13:1234:5600::1a).

Routing Differences: For IPv4, you route 0.0.0.0/0 to IGW for internet access. For IPv6, you route ::/0 to IGW. There's no NAT for IPv6 (all IPv6 addresses are public and routable). If you need outbound-only IPv6 (like NAT for IPv4), you use an Egress-Only Internet Gateway, which allows outbound IPv6 but blocks inbound.

Security Considerations: Since all IPv6 addresses are public, security groups become critical. You must explicitly allow inbound IPv6 traffic. A security group rule allowing 0.0.0.0/0 (IPv4) doesn't automatically allow ::/0 (IPv6); you need separate rules.

Use Case: A mobile app needs to support IPv6 (required by some mobile carriers). The app connects to your ALB, which has both IPv4 and IPv6 addresses. IPv6 clients connect via IPv6, IPv4 clients via IPv4. The ALB translates to IPv4 when communicating with backend instances (which can be IPv4-only).
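
A hedged boto3 sketch of the dual-stack enablement described above; AWS chooses the actual IPv6 /56, so the subnet /64 shown here is illustrative and the resource IDs are placeholders:

import boto3

ec2 = boto3.client("ec2")
vpc_id, subnet_id, rtb_id = "vpc-0example", "subnet-0public", "rtb-0public"

# Ask AWS for an Amazon-provided IPv6 /56 for the VPC
ec2.associate_vpc_cidr_block(VpcId=vpc_id, AmazonProvidedIpv6CidrBlock=True)

# Assign one /64 from that /56 to a subnet (value would be read from the VPC after association)
ec2.associate_subnet_cidr_block(SubnetId=subnet_id,
                                Ipv6CidrBlock="2600:1f13:1234:5600::/64")

# Public IPv6 routing: ::/0 to the Internet Gateway...
ec2.create_route(RouteTableId=rtb_id, DestinationIpv6CidrBlock="::/0",
                 GatewayId="igw-0example")

# ...or outbound-only IPv6 for private subnets via an egress-only IGW
eigw = ec2.create_egress_only_internet_gateway(VpcId=vpc_id)["EgressOnlyInternetGateway"]
ec2.create_route(RouteTableId="rtb-0private", DestinationIpv6CidrBlock="::/0",
                 EgressOnlyInternetGatewayId=eigw["EgressOnlyInternetGatewayId"])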

💡 Tips for IP Addressing:

  • Plan for 3x Growth: Always allocate 3x your current needs; IP space is cheap, renumbering is expensive
  • Use /16 for VPCs: Provides maximum flexibility; you can always use less, but can't easily expand
  • Standardize Subnet Sizes: Use consistent sizes (/24 for most subnets) for easier management
  • Document CIDR Allocations: Maintain a spreadsheet of all VPC and subnet CIDRs to prevent overlaps
  • Reserve Blocks: Always reserve CIDR blocks for future use; don't allocate everything immediately

āš ļø Common Mistakes & Misconceptions:

Mistake 1: Choosing /24 for VPC CIDR

  • Why it's wrong: /24 provides only 256 IPs (251 usable); after subnetting across AZs, you have very few IPs per subnet
  • Correct understanding: Use /16 for VPCs (65,536 IPs); subnet into /24s or /23s; provides room for growth and multiple tiers

Mistake 2: Forgetting to check for CIDR overlaps before peering

  • Why it's wrong: VPCs with overlapping CIDRs cannot peer; you'll need to recreate one VPC with different CIDR
  • Correct understanding: Plan all VPC CIDRs upfront; maintain a CIDR registry; use non-overlapping ranges (10.0.0.0/16, 10.1.0.0/16, 10.2.0.0/16)

Mistake 3: Using public IP ranges for VPCs

  • Why it's wrong: Public IPs are routable on internet; using them in VPC causes routing conflicts and security issues
  • Correct understanding: Always use RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) for VPC CIDRs

Mistake 4: Not accounting for AWS's 5 reserved IPs when sizing subnets

  • Why it's wrong: You plan for 30 instances, create a /27 (32 IPs), but only get 27 usable IPs; you're 3 short
  • Correct understanding: Always subtract 5 from theoretical subnet size; for 30 instances, use /26 (64 - 5 = 59 usable)

🔗 Connections to Other Topics:

  • Relates to VPC Peering because: Peering requires non-overlapping CIDRs; IP planning must account for all VPCs you'll peer
  • Builds on Direct Connect by: Hybrid connectivity requires non-overlapping CIDRs between AWS and on-premises; careful IP planning prevents conflicts
  • Often used with Transit Gateway to: TGW can route between VPCs with overlapping CIDRs using separate route tables, but it's complex; better to avoid overlaps

Routing in AWS VPCs

What it is: Routing in AWS VPCs determines how network traffic flows between subnets, to the internet, to on-premises networks, and to other VPCs. Route tables contain rules (routes) that specify where to send traffic based on destination IP addresses.

Why it exists: In traditional networks, routers use routing protocols (OSPF, EIGRP, BGP) to learn and propagate routes. In AWS VPCs, routing is software-defined and controlled via route tables. This provides flexibility, automation, and integration with AWS services. Routing enables connectivity patterns like internet access, hybrid connectivity, multi-VPC architectures, and traffic segmentation.

Real-world analogy: Think of route tables like GPS navigation. When you enter a destination (IP address), the GPS (route table) tells you which road to take (gateway/interface). Just as GPS has rules like "for downtown, take Highway 101" or "for the airport, take Route 280," route tables have rules like "for 10.0.0.0/16, use local" or "for 0.0.0.0/0, use internet gateway."

How it works (Detailed step-by-step):

  1. Route Table Creation: Every VPC has a main route table created automatically. You can create custom route tables for different routing needs. Each subnet must be associated with exactly one route table (either main or custom). If you don't explicitly associate a subnet with a custom route table, it uses the main route table.

  2. Route Evaluation: When a packet leaves a resource (EC2, Lambda, RDS), AWS evaluates the route table associated with the source subnet. AWS uses longest prefix match to select the route. For example, if a packet is destined for 10.0.5.50, and the route table has both 10.0.0.0/16 → local and 10.0.5.0/24 → pcx-xxx, AWS chooses the more specific /24 route.

  3. Local Route: Every route table has an implicit local route for the VPC CIDR (e.g., 10.0.0.0/16 → local). This route cannot be deleted or modified. It enables communication between all subnets within the VPC. If you add secondary CIDRs, local routes are automatically added for them.

  4. Gateway Routes: You add routes pointing to gateways for external connectivity. Common gateway routes: (1) 0.0.0.0/0 → igw-xxx (internet gateway) for internet access, (2) 0.0.0.0/0 → nat-xxx (NAT gateway) for private subnet internet access, (3) 192.168.0.0/16 → vgw-xxx (virtual private gateway) for on-premises access, (4) 0.0.0.0/0 → tgw-xxx (transit gateway) for multi-VPC routing.

  5. Route Propagation: For VPN and Direct Connect, you can enable route propagation, which automatically adds routes learned via BGP to the route table. For example, if your on-premises router advertises 192.168.0.0/16 via BGP, route propagation automatically adds 192.168.0.0/16 → vgw-xxx to the route table. This eliminates manual route management.

  6. Route Priority: When multiple routes match a destination, AWS uses this priority order: (1) Longest prefix match (the most specific route wins), (2) If prefixes are equal, local routes win, and static routes in the route table take precedence over propagated routes, (3) Among propagated routes with the same prefix, priority runs Direct Connect BGP routes, then static routes configured on a Site-to-Site VPN connection, then VPN BGP routes. (A small Python sketch of the longest-prefix-match lookup follows this list.)
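
Route selection is essentially a longest-prefix-match lookup. Here is a small, AWS-free Python sketch of that decision logic using the standard ipaddress module, with a simplified route table based on the examples in this chapter:

import ipaddress

# Simplified route table: destination prefix -> target (local, IGW, VGW, peering, ...)
routes = {
    "10.0.0.0/16":    "local",
    "10.0.5.0/24":    "pcx-xxx",   # more specific route to a peered VPC
    "192.168.0.0/16": "vgw-xxx",
    "0.0.0.0/0":      "igw-xxx",
}

def lookup(dest_ip: str) -> str:
    """Return the target of the most specific (longest prefix) matching route."""
    dest = ipaddress.ip_address(dest_ip)
    best = None
    for cidr, target in routes.items():
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else "no route (dropped)"

print(lookup("10.0.5.50"))    # pcx-xxx  (the /24 beats the /16)
print(lookup("10.0.10.20"))   # local
print(lookup("8.8.8.8"))      # igw-xxx  (default route)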

📊 VPC Routing Diagram:

graph TB
    subgraph "VPC: 10.0.0.0/16"
        subgraph "Public Subnet: 10.0.1.0/24"
            WEB[Web Server<br/>10.0.1.10]
            RT_PUB["Public Route Table<br/>10.0.0.0/16 → local<br/>0.0.0.0/0 → igw-xxx<br/>192.168.0.0/16 → vgw-xxx"]
        end
        
        subgraph "Private Subnet: 10.0.10.0/24"
            APP[App Server<br/>10.0.10.20]
            RT_PRIV["Private Route Table<br/>10.0.0.0/16 → local<br/>0.0.0.0/0 → nat-xxx<br/>192.168.0.0/16 → vgw-xxx"]
        end
        
        IGW[Internet Gateway]
        NAT[NAT Gateway<br/>in Public Subnet]
        VGW[Virtual Private Gateway]
    end
    
    INTERNET((Internet<br/>0.0.0.0/0))
    ONPREM[On-Premises<br/>192.168.0.0/16]
    
    WEB -->|Destination: 8.8.8.8<br/>Match: 0.0.0.0/0| IGW
    WEB -->|Destination: 10.0.10.20<br/>Match: 10.0.0.0/16| APP
    WEB -->|Destination: 192.168.1.50<br/>Match: 192.168.0.0/16| VGW
    
    APP -->|Destination: 8.8.8.8<br/>Match: 0.0.0.0/0| NAT
    APP -->|Destination: 10.0.1.10<br/>Match: 10.0.0.0/16| WEB
    APP -->|Destination: 192.168.1.50<br/>Match: 192.168.0.0/16| VGW
    
    NAT --> IGW
    IGW <--> INTERNET
    VGW <--> ONPREM
    
    style WEB fill:#c8e6c9
    style APP fill:#f3e5f5
    style RT_PUB fill:#fff3e0
    style RT_PRIV fill:#fff3e0
    style IGW fill:#e1f5fe
    style NAT fill:#ffecb3
    style VGW fill:#fce4ec

See: diagrams/01_fundamentals_routing.mmd

Diagram Explanation (detailed):

This diagram illustrates how route tables control traffic flow in a VPC with multiple connectivity patterns. The VPC (10.0.0.0/16) has two subnets with different route tables, demonstrating the flexibility of AWS routing.

Public Subnet Routing (10.0.1.0/24): The Web Server (10.0.1.10) is associated with the Public Route Table (orange box), which contains three routes: (1) 10.0.0.0/16 → local enables communication with all resources in the VPC, (2) 0.0.0.0/0 → igw-xxx sends all internet-bound traffic to the Internet Gateway, and (3) 192.168.0.0/16 → vgw-xxx sends on-premises traffic to the Virtual Private Gateway. When the web server sends a packet to 8.8.8.8 (Google DNS), AWS evaluates the route table, matches 0.0.0.0/0 (default route), and forwards to the IGW. When it sends to 10.0.10.20 (app server), AWS matches 10.0.0.0/16 (local route) and routes within the VPC. When it sends to 192.168.1.50 (on-premises), AWS matches 192.168.0.0/16 and routes to VGW.

Private Subnet Routing (10.0.10.0/24): The App Server (10.0.10.20) is associated with the Private Route Table, which has the same local and VGW routes but differs in the default route: 0.0.0.0/0 → nat-xxx sends internet traffic to the NAT Gateway instead of directly to IGW. This is the key difference between public and private subnets. When the app server needs to download updates from the internet (e.g., 8.8.8.8), the packet goes to the NAT Gateway (10.0.1.20 in the public subnet), which translates the source IP to its own Elastic IP and forwards to the IGW. Return traffic comes back to the NAT Gateway, which translates back to the app server's private IP. This provides outbound internet access without exposing the app server to inbound connections.

Longest Prefix Match: If the app server sends a packet to 10.0.1.10 (web server), AWS evaluates: (1) Does 10.0.1.10 match 10.0.0.0/16? Yes (/16 match). (2) Does it match 0.0.0.0/0? Yes (/0 match). (3) Does it match 192.168.0.0/16? No. AWS chooses 10.0.0.0/16 because /16 is more specific than /0 (longest prefix match). The packet routes locally within the VPC.
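
The same longest-prefix-match selection can be reproduced in a few lines of Python with the standard ipaddress module (the routes below mirror the private route table in the diagram; the targets are illustrative strings, not real gateway IDs):

import ipaddress

routes = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "nat-xxx",
    "192.168.0.0/16": "vgw-xxx",
}

def select_route(destination_ip):
    dest = ipaddress.ip_address(destination_ip)
    # Keep every route whose prefix contains the destination, then pick the longest prefix.
    candidates = [ipaddress.ip_network(prefix) for prefix in routes
                  if dest in ipaddress.ip_network(prefix)]
    best = max(candidates, key=lambda net: net.prefixlen)
    return str(best), routes[str(best)]

print(select_route("10.0.1.10"))     # ('10.0.0.0/16', 'local')   - /16 beats /0
print(select_route("8.8.8.8"))       # ('0.0.0.0/0', 'nat-xxx')   - only the default route matches
print(select_route("192.168.1.50"))  # ('192.168.0.0/16', 'vgw-xxx')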

Hybrid Connectivity: Both route tables have 192.168.0.0/16 → vgw-xxx, enabling communication with on-premises resources. This route could be static (manually added) or propagated (learned via BGP). When either server sends traffic to 192.168.1.50, AWS matches this route and forwards to the VGW, which encrypts the traffic (if VPN) or sends it over Direct Connect to the on-premises data center.

Gateway Roles: (1) Internet Gateway (blue) provides bidirectional internet access for resources with public IPs, (2) NAT Gateway (yellow) provides outbound-only internet access for private resources, translating private IPs to its Elastic IP, (3) Virtual Private Gateway (pink) provides connectivity to on-premises networks via Site-to-Site VPN (encrypted) or Direct Connect (a private connection, but not encrypted by default).

⭐ Must Know (Critical Routing Facts):

  • Local Route: Automatically created for VPC CIDR; cannot be deleted; enables intra-VPC communication
  • Longest Prefix Match: Most specific route wins; /24 beats /16, /16 beats /0
  • Route Limit: 50 static routes per route table (can request increase to 100); propagated routes don't count toward limit
  • Route Propagation: Automatically adds BGP-learned routes; useful for VPN and Direct Connect
  • Main Route Table: Default for all subnets; best practice is to leave it minimal and use custom route tables
  • Route Priority: Local > VGW static > Direct Connect > VGW propagated > VPN static > VPN propagated
  • Blackhole Routes: If target gateway is deleted, route shows "blackhole" status; traffic is dropped

šŸ’” Tips for Understanding Routing:

  • Public = IGW Route: A subnet is public if its route table has 0.0.0.0/0 → IGW
  • Private = NAT Route: A subnet is private if its route table has 0.0.0.0/0 → NAT or no default route
  • One Route Table Per Subnet: Each subnet associates with exactly one route table, but one route table can serve multiple subnets
  • Test with Traceroute: Use VPC Reachability Analyzer to test routing paths without sending actual traffic
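
A hedged sketch of driving VPC Reachability Analyzer from boto3; the source and destination IDs are placeholders, and the analysis result reports whether a path exists plus the component (route table, Security Group, or NACL) that blocks traffic:

import boto3

ec2 = boto3.client("ec2")

# Define the path to test: source and destination can be instance, ENI, or gateway IDs.
path = ec2.create_network_insights_path(
    Source="i-0123456789abcdef0",
    Destination="i-0fedcba9876543210",
    Protocol="tcp",
    DestinationPort=443,
)["NetworkInsightsPath"]["NetworkInsightsPathId"]

# Run the analysis (no actual traffic is sent).
analysis = ec2.start_network_insights_analysis(NetworkInsightsPathId=path)
analysis_id = analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

# Later, inspect the result for NetworkPathFound and any blocking component.
result = ec2.describe_network_insights_analyses(NetworkInsightsAnalysisIds=[analysis_id])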

Security Groups and Network ACLs

What they are: Security Groups and Network ACLs (NACLs) are two layers of network security in AWS VPCs. Security Groups are stateful, instance-level firewalls that control traffic to and from ENIs (Elastic Network Interfaces). Network ACLs are stateless, subnet-level firewalls that control traffic entering and leaving subnets.

Why they exist: In traditional networks, you use firewalls, ACLs on routers, and host-based firewalls for defense in depth. AWS provides similar layered security with NACLs (subnet boundary) and Security Groups (instance boundary). This implements the principle of least privilege and defense in depth. Security Groups provide fine-grained, stateful control, while NACLs provide coarse-grained, stateless control and can explicitly deny traffic.

Real-world analogy: Think of a corporate office building. The Network ACL is like the building's main security gate - it checks everyone entering or leaving the building (subnet) based on a list of rules, treating each entry and exit separately. The Security Group is like the keycard access to individual offices - once you're granted access to an office (inbound allowed), you can leave freely (outbound automatically allowed due to statefulness), and the system remembers your session.

How they work (Detailed step-by-step):

Security Groups:

  1. Stateful Operation: When you allow an inbound connection, the return traffic is automatically allowed, regardless of outbound rules. For example, if you allow inbound TCP 443 from 0.0.0.0/0, responses to those connections are automatically allowed outbound, even if you have no outbound rules. This is because Security Groups track connection state (5-tuple: source IP, source port, dest IP, dest port, protocol).

  2. Default Behavior: Security Groups are deny-by-default. If no rule explicitly allows traffic, it's denied. You can only create allow rules, not deny rules. The default Security Group allows all outbound traffic and allows inbound traffic from other instances in the same Security Group.

  3. Rule Evaluation: Security Groups evaluate all rules before deciding to allow traffic. If any rule matches, traffic is allowed. Rules are not processed in order (unlike NACLs). You can specify sources/destinations as: (1) CIDR blocks (e.g., 10.0.0.0/16), (2) Other Security Groups (e.g., sg-xxx), (3) Prefix lists (e.g., pl-xxx for S3).

  4. Association: Security Groups attach to ENIs (Elastic Network Interfaces), not instances directly. Each ENI can have up to 5 Security Groups. Each Security Group can have up to 60 inbound and 60 outbound rules. Security Groups are VPC-specific and cannot span VPCs.

  5. Changes Take Effect Immediately: When you modify a Security Group rule, the change applies immediately to all associated ENIs. No restart or reconnection required.
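
As an illustration of points 2-4 above, a minimal boto3 sketch that creates a Security Group and allows inbound HTTPS (the VPC ID is a placeholder). No outbound rule is added for return traffic because Security Groups are stateful:

import boto3

ec2 = boto3.client("ec2")

# Create the Security Group in the VPC (deny-by-default: nothing is allowed inbound yet).
sg = ec2.create_security_group(
    GroupName="web-sg",
    Description="Allow inbound HTTPS",
    VpcId="vpc-0123456789abcdef0",
)

# Add a single inbound allow rule; responses to these connections are permitted automatically.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS from anywhere"}],
    }],
)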

Network ACLs:

  1. Stateless Operation: NACLs don't track connection state. You must explicitly allow both inbound and outbound traffic. For example, if you allow inbound TCP 443, you must also allow outbound TCP 1024-65535 (ephemeral ports) for return traffic. This is because clients use random high ports for responses.

  2. Rule Numbering: NACL rules are numbered 1-32766 and evaluated in order from lowest to highest. Once a rule matches, evaluation stops (first match wins). The final catch-all rule (shown as * in the console, effectively rule 32767) is the implicit deny-all and cannot be removed. Best practice is to number rules in increments of 100 (100, 200, 300) so you can insert rules later.

  3. Allow and Deny: Unlike Security Groups, NACLs support both allow and deny rules. This enables explicit blocking of specific IPs or ranges. For example, you can deny traffic from a known malicious IP while allowing all other traffic.

  4. Default NACL: Every VPC has a default NACL that allows all inbound and outbound traffic. Custom NACLs deny all traffic by default until you add allow rules.

  5. Association: Each subnet must be associated with exactly one NACL. If you don't explicitly associate a subnet with a custom NACL, it uses the default NACL. One NACL can be associated with multiple subnets.

  6. Ephemeral Ports: For stateless NACLs, you must allow ephemeral ports (1024-65535) for return traffic. Different operating systems use different ephemeral port ranges: Linux (32768-60999), Windows (49152-65535), NAT Gateway (1024-65535). To support all, allow 1024-65535.
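
Because NACLs are stateless, a working configuration always comes in pairs. A minimal boto3 sketch that allows inbound HTTPS and the outbound ephemeral-port range for its return traffic (the NACL ID is a placeholder; protocol "6" is TCP):

import boto3

ec2 = boto3.client("ec2")

# Inbound rule 100: allow HTTPS requests from anywhere.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",
    RuleNumber=100,
    Egress=False,
    Protocol="6",              # TCP
    RuleAction="allow",
    CidrBlock="0.0.0.0/0",
    PortRange={"From": 443, "To": 443},
)

# Outbound rule 100: allow ephemeral ports so responses to those requests can leave the subnet.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",
    RuleNumber=100,
    Egress=True,
    Protocol="6",
    RuleAction="allow",
    CidrBlock="0.0.0.0/0",
    PortRange={"From": 1024, "To": 65535},
)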

šŸ“Š Security Groups vs NACLs Diagram:

sequenceDiagram
    participant Client as Internet Client<br/>203.0.113.50
    participant NACL as Network ACL<br/>(Subnet Boundary)
    participant SG as Security Group<br/>(Instance Level)
    participant EC2 as EC2 Instance<br/>10.0.1.10
    
    Note over Client,EC2: Inbound Request (Client → EC2)
    Client->>NACL: TCP SYN to 10.0.1.10:443<br/>from 203.0.113.50:54321
    Note over NACL: Stateless: Check inbound rules<br/>Rule 100: Allow TCP 443 from 0.0.0.0/0<br/>āœ“ ALLOW
    NACL->>SG: Forward packet
    Note over SG: Stateful: Check inbound rules<br/>Allow TCP 443 from 0.0.0.0/0<br/>āœ“ ALLOW + Track connection
    SG->>EC2: Deliver packet
    
    Note over Client,EC2: Outbound Response (EC2 → Client)
    EC2->>SG: TCP SYN-ACK to 203.0.113.50:54321<br/>from 10.0.1.10:443
    Note over SG: Stateful: Connection tracked<br/>Automatically allow return traffic<br/>āœ“ ALLOW (no outbound rule needed)
    SG->>NACL: Forward packet
    Note over NACL: Stateless: Check outbound rules<br/>Rule 100: Allow TCP 1024-65535 to 0.0.0.0/0<br/>āœ“ ALLOW
    NACL->>Client: Deliver packet
    
    Note over Client,EC2: Blocked Traffic Example
    Client->>NACL: TCP SYN to 10.0.1.10:22<br/>from 203.0.113.50:54322
    Note over NACL: Check inbound rules<br/>Rule 100: Allow TCP 443 (no match)<br/>Rule 32767: Deny all<br/>āœ— DENY
    NACL--xClient: Packet dropped at NACL

See: diagrams/01_fundamentals_security_groups_nacls.mmd

Diagram Explanation (detailed):

This sequence diagram illustrates the critical differences between stateful Security Groups and stateless Network ACLs by showing packet flow for an HTTPS connection from an internet client to an EC2 instance.

Inbound Request Flow: An internet client (203.0.113.50) initiates an HTTPS connection to the EC2 instance (10.0.1.10:443). The client uses source port 54321 (randomly chosen ephemeral port). The packet first hits the Network ACL at the subnet boundary. The NACL is stateless, so it evaluates inbound rules without any connection tracking. It finds Rule 100 (Allow TCP 443 from 0.0.0.0/0) and allows the packet. The packet then reaches the Security Group attached to the EC2 instance's ENI. The Security Group is stateful, so it checks inbound rules and finds a match (Allow TCP 443 from 0.0.0.0/0). Critically, the Security Group now tracks this connection (remembering the 5-tuple: source IP, source port, dest IP, dest port, protocol). The packet is delivered to the EC2 instance.

Outbound Response Flow: The EC2 instance sends a response (TCP SYN-ACK) back to the client. The response packet has source 10.0.1.10:443 and destination 203.0.113.50:54321. When it reaches the Security Group, the SG recognizes this as return traffic for the tracked inbound connection and automatically allows it, even if there are no explicit outbound rules. This is the power of statefulness - you don't need to configure outbound rules for return traffic. However, when the packet reaches the NACL, the NACL is stateless and has no memory of the inbound connection. It must evaluate outbound rules. The NACL checks Rule 100 (Allow TCP 1024-65535 to 0.0.0.0/0), which matches the destination port 54321 (ephemeral port range). The packet is allowed and delivered to the client.

Blocked Traffic Example: If the client tries to connect to port 22 (SSH) instead of 443, the packet hits the NACL, which evaluates inbound rules. Rule 100 only allows TCP 443, so it doesn't match. The NACL continues to Rule 32767 (implicit deny all), which matches and denies the packet. The packet never reaches the Security Group or EC2 instance. This demonstrates how NACLs provide a first line of defense at the subnet boundary.

Key Insights: (1) Security Groups are more convenient because statefulness eliminates the need for explicit return traffic rules. (2) NACLs require careful configuration of both inbound and outbound rules, including ephemeral port ranges. (3) NACLs can explicitly deny traffic (useful for blocking specific IPs), while Security Groups can only allow. (4) Traffic must pass both NACL and Security Group checks; if either denies, traffic is blocked. (5) For outbound-initiated connections (e.g., EC2 downloading updates), the flow is reversed: the Security Group needs an outbound allow rule (the default allow-all outbound rule usually covers this) and the inbound return traffic is then permitted automatically, while the NACL needs an explicit outbound allow plus an inbound ephemeral-port allow for responses.

⭐ Must Know (Critical Security Facts):

  • Security Groups are Stateful: Return traffic automatically allowed; track connection state
  • NACLs are Stateless: Must explicitly allow both directions; no connection tracking
  • Security Groups = Allow Only: Cannot create deny rules; deny-by-default
  • NACLs = Allow and Deny: Can explicitly deny specific IPs or ranges
  • Rule Limits: Security Group: 60 inbound + 60 outbound rules; NACL: 20 inbound + 20 outbound (can request increase)
  • Ephemeral Ports: NACLs must allow 1024-65535 for return traffic (stateless)
  • Evaluation Order: NACL first (subnet boundary), then Security Group (instance level)
  • Changes: Security Group changes immediate; NACL changes immediate
  • Default Security Group: Allows all outbound, allows inbound from same SG
  • Default NACL: Allows all inbound and outbound

Comparison Table:

Feature | Security Group | Network ACL
Level | Instance (ENI) | Subnet
State | Stateful | Stateless
Rules | Allow only | Allow and Deny
Rule Processing | All rules evaluated | First match wins (ordered)
Return Traffic | Automatic | Must explicitly allow
Default | Deny all inbound, allow all outbound | Allow all (default NACL)
Association | ENI (up to 5 SGs per ENI) | Subnet (1 NACL per subnet)
Rule Limit | 60 inbound + 60 outbound | 20 inbound + 20 outbound
Use Case | Fine-grained instance control | Subnet-level filtering, explicit denies

Detailed Example 1: Web Application Security

You're deploying a three-tier web application (web, app, database) and need to implement defense in depth with both Security Groups and NACLs.

Security Group Design:

  • Web-SG: Inbound: Allow TCP 443 from 0.0.0.0/0 (internet), Allow TCP 80 from 0.0.0.0/0 (redirect to HTTPS). Outbound: Allow all (default, for return traffic and app tier communication).
  • App-SG: Inbound: Allow TCP 8080 from Web-SG (only web tier can access). Outbound: Allow all (for return traffic and database communication).
  • DB-SG: Inbound: Allow TCP 3306 from App-SG (only app tier can access). Outbound: Allow all (for return traffic).
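
The "only the app tier can reach the database" rule in DB-SG is expressed by referencing App-SG as the source instead of a CIDR block. A hedged boto3 sketch (both group IDs are placeholders):

import boto3

ec2 = boto3.client("ec2")

# Allow MySQL into DB-SG only from ENIs that carry App-SG; no IP ranges are involved,
# so the rule keeps working as app-tier instances scale in and out.
ec2.authorize_security_group_ingress(
    GroupId="sg-0db0db0db0db0db0d",        # DB-SG (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": "sg-0app0app0app0app0",  # App-SG (placeholder)
                              "Description": "MySQL from app tier only"}],
    }],
)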

NACL Design (Additional Layer):

  • Public Subnet NACL (Web Tier):

    • Inbound: Rule 100: Allow TCP 443 from 0.0.0.0/0, Rule 110: Allow TCP 80 from 0.0.0.0/0, Rule 120: Allow TCP 1024-65535 from 0.0.0.0/0 (return traffic), Rule 32767: Deny all.
    • Outbound: Rule 100: Allow TCP 1024-65535 to 0.0.0.0/0 (responses to internet clients; this range also covers TCP 8080 to the app tier), Rule 110: Allow TCP 8080 to 10.0.10.0/24 (explicit app-tier rule), Rule 32767: Deny all. Return traffic from the app tier arrives inbound and is covered by inbound Rule 120.
  • Private App Subnet NACL:

    • Inbound: Rule 100: Allow TCP 8080 from 10.0.1.0/24 (web tier), Rule 110: Allow TCP 1024-65535 from 10.0.20.0/24 (return from DB), Rule 32767: Deny all.
    • Outbound: Rule 100: Allow TCP 1024-65535 to 10.0.1.0/24 (return to web), Rule 110: Allow TCP 3306 to 10.0.20.0/24 (DB tier), Rule 32767: Deny all.
  • Private DB Subnet NACL:

    • Inbound: Rule 100: Allow TCP 3306 from 10.0.10.0/24 (app tier only), Rule 32767: Deny all.
    • Outbound: Rule 100: Allow TCP 1024-65535 to 10.0.10.0/24 (return to app), Rule 32767: Deny all.

Traffic Flow: (1) Internet user connects to web server on 443. NACL allows inbound 443, Security Group allows inbound 443. (2) Web server connects to app server on 8080. NACL allows outbound 8080 to app subnet, app subnet NACL allows inbound 8080 from web subnet, App-SG allows inbound 8080 from Web-SG. (3) App server connects to database on 3306. NACL allows outbound 3306 to DB subnet, DB subnet NACL allows inbound 3306 from app subnet, DB-SG allows inbound 3306 from App-SG. (4) All return traffic flows back through the same path, with Security Groups automatically allowing (stateful) and NACLs explicitly allowing ephemeral ports (stateless).

Security Benefit: Even if an attacker compromises the web server, they cannot directly access the database because: (1) DB-SG only allows traffic from App-SG, not Web-SG, (2) DB subnet NACL only allows traffic from app subnet, not web subnet. The attacker must compromise both web and app tiers to reach the database.

Detailed Example 2: Blocking Malicious IPs

Your web application is under attack from a specific IP range (198.51.100.0/24). You need to block this range while allowing all other traffic.

Why Not Security Groups: Security Groups only support allow rules, not deny rules. You cannot create a rule to deny 198.51.100.0/24. You would have to allow every other IP range explicitly, which is impractical.

NACL Solution: You modify the public subnet NACL to add a deny rule:

  • Rule 50: Deny TCP 443 from 198.51.100.0/24
  • Rule 60: Deny TCP 80 from 198.51.100.0/24
  • Rule 100: Allow TCP 443 from 0.0.0.0/0
  • Rule 110: Allow TCP 80 from 0.0.0.0/0

Rule Ordering: The deny rules (50, 60) are numbered lower than the allow rules (100, 110), so they're evaluated first. When a packet from 198.51.100.50 arrives, the NACL evaluates Rule 50, finds a match, and denies the packet. The packet never reaches Rule 100 or the Security Group. Packets from other IPs (e.g., 203.0.113.50) don't match Rules 50 or 60, so evaluation continues to Rule 100, which allows them.
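
A short boto3 sketch of the HTTPS deny rule (the NACL ID is a placeholder); because rule 50 is numbered below the allow rules at 100/110, it is evaluated first:

import boto3

ec2 = boto3.client("ec2")

# Rule 50: explicitly deny HTTPS from the attacking range before the broad allow at rule 100.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",
    RuleNumber=50,
    Egress=False,
    Protocol="6",              # TCP
    RuleAction="deny",
    CidrBlock="198.51.100.0/24",
    PortRange={"From": 443, "To": 443},
)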

Alternative - AWS WAF: For more sophisticated blocking (e.g., rate limiting, geo-blocking, SQL injection protection), use AWS WAF attached to your Application Load Balancer or CloudFront distribution. WAF provides layer 7 filtering, while NACLs provide layer 3/4 filtering.

Detailed Example 3: Troubleshooting Connectivity Issues

A developer reports that their EC2 instance (10.0.10.50) in a private subnet cannot download updates from the internet, even though a NAT Gateway is configured.

Troubleshooting Steps:

  1. Check Route Table: Verify the private subnet's route table has 0.0.0.0/0 → nat-xxx. āœ“ Correct.

  2. Check Security Group: The instance's Security Group still has the default outbound rule (allow all outbound traffic). āœ“ Correct - the default outbound rule lets the instance initiate connections, and return traffic is allowed automatically (stateful).

  3. Check NACL Outbound: The private subnet NACL has:

    • Rule 100: Allow TCP 443 to 0.0.0.0/0
    • Rule 32767: Deny all
    • Problem Found: The instance is trying to download updates via HTTP (port 80), but the NACL only allows HTTPS (port 443). Additionally, even if 443 is allowed outbound, there's no inbound rule for return traffic.
  4. Check NACL Inbound: The private subnet NACL has:

    • Rule 100: Allow TCP 8080 from 10.0.1.0/24
    • Rule 32767: Deny all
    • Problem Found: No rule allowing inbound ephemeral ports (1024-65535) for return traffic from the internet.

Solution: Modify the NACL:

  • Outbound: Add Rule 110: Allow TCP 80 to 0.0.0.0/0 (for HTTP downloads)
  • Inbound: Add Rule 110: Allow TCP 1024-65535 from 0.0.0.0/0 (for return traffic)

After Fix: The instance can now download updates. Outbound traffic (port 80 or 443) is allowed by the NACL and routes to the NAT Gateway. The NAT Gateway translates the source IP to its Elastic IP and forwards to the internet. Return traffic comes back to the NAT Gateway on an ephemeral port (e.g., 54321), which the NAT Gateway forwards to the instance. The NACL now allows this inbound ephemeral port traffic, and the packet reaches the instance.

Lesson: NACLs are stateless and require explicit rules for both directions. Always allow ephemeral ports (1024-65535) for return traffic. Security Groups are stateful and handle return traffic automatically.

šŸ’” Tips for Security Groups and NACLs:

  • Use Security Groups for Most Control: Stateful operation is simpler and more intuitive
  • Use NACLs for Explicit Denies: Block specific IPs or ranges at subnet boundary
  • Reference Security Groups: Use sg-xxx as source/destination instead of CIDR blocks for dynamic environments
  • Ephemeral Port Range: Always allow 1024-65535 in NACL outbound rules for return traffic
  • Rule Numbering: Number NACL rules in increments of 10 or 100 for easy insertion
  • Custom NACLs for Sensitive Subnets: Leave the default NACL allowing all; create custom NACLs with explicit rules for sensitive subnets

āš ļø Common Mistakes & Misconceptions:

Mistake 1: Forgetting to allow ephemeral ports in NACL outbound rules

  • Why it's wrong: Return traffic from internet uses random high ports (1024-65535); without allowing these, responses are blocked
  • Correct understanding: NACLs are stateless; must explicitly allow both request and response ports

Mistake 2: Trying to create deny rules in Security Groups

  • Why it's wrong: Security Groups only support allow rules; deny is implicit (anything not allowed is denied)
  • Correct understanding: Use NACLs for explicit denies; Security Groups for explicit allows

Mistake 3: Assuming Security Group outbound rules are needed for return traffic

  • Why it's wrong: Security Groups are stateful; return traffic for allowed inbound connections is automatically permitted
  • Correct understanding: Outbound Security Group rules are only needed for instance-initiated outbound connections, not for return traffic

Mistake 4: Applying Security Groups to subnets

  • Why it's wrong: Security Groups attach to ENIs (network interfaces), not subnets
  • Correct understanding: Security Groups = instance level (ENI); NACLs = subnet level

šŸ”— Connections to Other Topics:

  • Relates to VPC Flow Logs because: Flow Logs capture accepted and rejected traffic; use them to troubleshoot Security Group and NACL issues
  • Builds on AWS WAF by: NACLs provide layer 3/4 filtering; WAF provides layer 7 filtering; use both for defense in depth
  • Often used with Network Firewall to: Network Firewall provides stateful inspection and IDS/IPS; complements Security Groups and NACLs

Chapter Summary

What We Covered

This chapter established the foundational knowledge required for AWS Advanced Networking certification:

āœ… AWS Global Infrastructure: Regions, Availability Zones, Edge Locations, and how they form the physical foundation for networking
āœ… Amazon VPC: Logically isolated virtual networks, CIDR blocks, subnets, and multi-AZ architectures
āœ… IP Addressing: CIDR notation, subnet calculation, reserved IPs, IPv4 and IPv6 addressing strategies
āœ… Routing: Route tables, local routes, gateway routes, longest prefix match, and route propagation
āœ… Security: Security Groups (stateful, instance-level) vs Network ACLs (stateless, subnet-level)

Critical Takeaways

  1. Infrastructure Design: Always deploy across multiple AZs for high availability; use Regions for disaster recovery
  2. VPC Architecture: Public subnets have IGW routes; private subnets use NAT Gateway; plan CIDR blocks to avoid overlaps
  3. IP Planning: Use /16 for VPCs, /24 for subnets; remember AWS reserves 5 IPs per subnet; plan for 3x growth
  4. Routing Logic: Longest prefix match determines route selection; local routes enable intra-VPC communication
  5. Security Layers: Security Groups are stateful (easier); NACLs are stateless (require ephemeral ports); use both for defense in depth

Self-Assessment Checklist

Test yourself before moving to Domain 1:

  • I can explain the difference between Regions, AZs, and Edge Locations
  • I can design a multi-AZ VPC architecture with public and private subnets
  • I can calculate usable IPs in a subnet (accounting for AWS's 5 reserved IPs)
  • I can explain how route tables determine traffic flow
  • I can describe the difference between stateful Security Groups and stateless NACLs
  • I can troubleshoot connectivity issues using routing and security concepts
  • I can design IP addressing schemes that avoid overlaps and support growth
  • I understand when to use IGW vs NAT Gateway vs VGW

Practice Questions

Try these from your practice test bundles:

  • Fundamentals Bundle: Questions 1-20
  • Expected score: 80%+ to proceed confidently

If you scored below 80%:

  • Review sections on VPC architecture and routing
  • Practice subnet calculations and CIDR planning
  • Study the Security Groups vs NACLs comparison table
  • Redraw diagrams from memory to reinforce understanding

Quick Reference Card

VPC Essentials:

  • VPC CIDR: /16 to /28 (recommend /16 for production)
  • Subnets: AZ-specific, 5 IPs reserved per subnet
  • Public Subnet: Route table has 0.0.0.0/0 → IGW
  • Private Subnet: Route table has 0.0.0.0/0 → NAT or no default route

Routing Essentials:

  • Local route: Automatic for VPC CIDR, enables intra-VPC communication
  • Longest prefix match: Most specific route wins
  • Route priority: Local > VGW static > DX > VGW propagated > VPN

Security Essentials:

  • Security Groups: Stateful, instance-level, allow-only, return traffic automatic
  • NACLs: Stateless, subnet-level, allow and deny, must allow ephemeral ports (1024-65535)
  • Evaluation: NACL first (subnet boundary), then Security Group (instance level)

Key Limits:

  • Internet Gateways: 1 per VPC
  • Security Groups per ENI: 5
  • Rules per Security Group: 60 inbound + 60 outbound
  • Rules per NACL: 20 inbound + 20 outbound (default)
  • Routes per route table: 50 static (default)

Next Chapter: Domain 1 - Network Design (02_domain_1_network_design)

In the next chapter, we'll apply these fundamentals to design edge network services, DNS solutions, load balancing architectures, monitoring strategies, hybrid connectivity, and multi-account/multi-region networks.


Chapter 1: Network Design (30% of Exam)

Chapter Overview

What you'll learn:

  • Design edge network services using CloudFront and Global Accelerator for global performance
  • Architect DNS solutions with Route 53 for public, private, and hybrid requirements
  • Design load balancing solutions across layers 3, 4, and 7 for high availability
  • Define logging and monitoring strategies for network visibility and troubleshooting
  • Create hybrid connectivity architectures using Direct Connect and VPN
  • Design multi-account and multi-region network topologies with Transit Gateway and VPC peering

Time to complete: 20-25 hours (this is the largest domain)

Prerequisites: Chapter 0 (Fundamentals) - strong understanding of VPC, routing, and security

Exam Weight: 30% of scored questions (approximately 15 questions on the actual exam)

Domain Breakdown:

  • Task 1.1: Edge Network Services and Global Architectures (15% of domain)
  • Task 1.2: DNS Solutions (Public, Private, Hybrid) (15% of domain)
  • Task 1.3: Load Balancing Solutions (15% of domain)
  • Task 1.4: Logging and Monitoring Requirements (15% of domain)
  • Task 1.5: Hybrid Connectivity Routing Strategy (20% of domain)
  • Task 1.6: Multi-Account/Multi-Region Connectivity (20% of domain)

Section 1: Edge Network Services and Global Architectures

Introduction

The problem: Users accessing applications from different geographic locations experience varying latency, performance, and availability. A user in Tokyo accessing a server in Virginia experiences 150-200ms latency, while a user in New York experiences 10-20ms. Additionally, internet routing is unpredictable, with packets potentially taking suboptimal paths through congested networks. For global applications, this creates poor user experience, slow page loads, and potential revenue loss.

The solution: AWS provides edge network services that bring content and application endpoints closer to users globally. Amazon CloudFront caches static and dynamic content at 400+ edge locations worldwide, reducing latency by serving content from the nearest location. AWS Global Accelerator provides static anycast IP addresses that route traffic over AWS's private global network, bypassing congested internet paths and improving performance by up to 60%.

Why it's tested: The ANS-C01 exam heavily tests your ability to design global architectures that optimize performance, availability, and cost. You must understand when to use CloudFront vs Global Accelerator, how to integrate them with other AWS services, and how to configure them for specific use cases like video streaming, API acceleration, and multi-region failover.

Core Concepts

Amazon CloudFront

What it is: CloudFront is AWS's Content Delivery Network (CDN) service that caches and delivers content from 400+ edge locations globally. It supports static content (images, CSS, JavaScript), dynamic content (API responses, personalized pages), video streaming (live and on-demand), and software downloads. CloudFront integrates with AWS services (S3, EC2, ELB, API Gateway) and custom origins.

Why it exists: Traditional web architectures serve all content from a central location, causing high latency for distant users. If your application runs in us-east-1 and a user in Australia requests a page, every request travels 15,000+ miles round-trip, taking 200-300ms. CloudFront solves this by caching content at edge locations near users. The first request from Australia goes to the origin (200ms), but CloudFront caches the response. Subsequent requests from Australia are served from the Sydney edge location (5-10ms), a 95% latency reduction.

Real-world analogy: Think of CloudFront like a global network of convenience stores. Instead of everyone driving to a central warehouse (origin server) to buy milk, they go to their local convenience store (edge location). The store stocks popular items (cached content) and only goes to the warehouse for items not in stock (cache miss). This dramatically reduces travel time (latency) for customers.

How it works (Detailed step-by-step):

  1. Distribution Creation: You create a CloudFront distribution, specifying an origin (S3 bucket, ALB, EC2, API Gateway, or custom HTTP server). You configure cache behaviors (URL patterns and caching rules), SSL/TLS settings, geographic restrictions, and access controls. CloudFront assigns a domain name (e.g., d1234abcd.cloudfront.net) and optionally you can use a custom domain (e.g., cdn.example.com) with your SSL certificate.

  2. DNS Resolution: When a user requests content (e.g., https://cdn.example.com/image.jpg), their DNS query for cdn.example.com resolves to CloudFront's DNS servers. CloudFront's DNS uses anycast and GeoDNS to return the IP address of the edge location closest to the user based on their geographic location and network latency. For a user in London, this might be the London edge location (LHR50-C1).

  3. Edge Location Request: The user's browser connects to the London edge location. CloudFront checks if the requested object (image.jpg) is in the edge location's cache and if the cached copy is still valid (not expired based on TTL). If the object is in cache and valid (cache hit), CloudFront immediately returns it to the user with minimal latency (5-15ms). If the object is not in cache or expired (cache miss), CloudFront proceeds to fetch it from the origin.

  4. Origin Fetch: On a cache miss, the edge location connects to the origin server (e.g., ALB in us-east-1) over AWS's private network backbone (not public internet). CloudFront maintains persistent connections to origins, reducing connection overhead. The origin processes the request and returns the response. CloudFront caches the response at the edge location based on cache headers (Cache-Control, Expires) or distribution settings (default TTL).

  5. Response Delivery: CloudFront delivers the response to the user and stores it in the edge location's cache. The cache key is typically the URL, but can include query strings, cookies, or headers based on configuration. Subsequent requests for the same object from users near that edge location are served from cache until the TTL expires.

  6. Cache Invalidation: When you update content at the origin, cached copies at edge locations don't automatically update until TTL expires. You can create an invalidation request to remove specific objects from all edge locations immediately. Invalidations are processed within minutes but cost $0.005 per path (first 1,000 paths/month free). Alternatively, use versioned URLs (e.g., image-v2.jpg) to bypass cache without invalidation costs.
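
A minimal boto3 sketch of an invalidation request (the distribution ID and path are placeholders; CallerReference must be unique per request):

import boto3

cloudfront = boto3.client("cloudfront")

# Remove one object from every edge location; billed per path after the first 1,000 paths/month.
cloudfront.create_invalidation(
    DistributionId="E1234EXAMPLE",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/images/hero.jpg"]},
        "CallerReference": "deploy-2024-06-01-001",   # any unique string
    },
)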

šŸ“Š CloudFront Architecture Diagram:

graph TB
    subgraph "Users Worldwide"
        U1[User in London<br/>Latency: 10ms]
        U2[User in Tokyo<br/>Latency: 15ms]
        U3[User in Sydney<br/>Latency: 12ms]
    end
    
    subgraph "CloudFront Edge Network (400+ Locations)"
        E1[London Edge<br/>LHR50-C1<br/>Cache Hit: Serve<br/>Cache Miss: Fetch]
        E2[Tokyo Edge<br/>NRT57-C1<br/>Cache Hit: Serve<br/>Cache Miss: Fetch]
        E3[Sydney Edge<br/>SYD4-C1<br/>Cache Hit: Serve<br/>Cache Miss: Fetch]
    end
    
    subgraph "AWS Region: us-east-1"
        subgraph "Origin Infrastructure"
            ALB[Application Load Balancer<br/>Origin: alb.example.com]
            EC2_1[Web Server 1]
            EC2_2[Web Server 2]
            S3[S3 Bucket<br/>Static Assets]
        end
    end
    
    CF_DIST[CloudFront Distribution<br/>d1234abcd.cloudfront.net<br/>Custom: cdn.example.com<br/>Cache Behaviors:<br/>/*.jpg → S3<br/>/* → ALB]
    
    U1 -->|1. DNS Query<br/>cdn.example.com| CF_DIST
    CF_DIST -->|2. Return IP of<br/>Nearest Edge| U1
    U1 -->|3. HTTPS Request<br/>/image.jpg| E1
    
    U2 --> CF_DIST
    CF_DIST --> U2
    U2 --> E2
    
    U3 --> CF_DIST
    CF_DIST --> U3
    U3 --> E3
    
    E1 -.4a. Cache Hit<br/>Return Cached.-> U1
    E1 -.4b. Cache Miss<br/>Fetch from Origin.-> ALB
    E2 -.Cache Miss.-> ALB
    E3 -.Cache Miss.-> S3
    
    ALB --> EC2_1
    ALB --> EC2_2
    
    style U1 fill:#e1f5fe
    style U2 fill:#e1f5fe
    style U3 fill:#e1f5fe
    style E1 fill:#c8e6c9
    style E2 fill:#c8e6c9
    style E3 fill:#c8e6c9
    style CF_DIST fill:#fff3e0
    style ALB fill:#f3e5f5
    style S3 fill:#ffecb3

See: diagrams/02_domain_1_cloudfront_architecture.mmd

Diagram Explanation (detailed):

This diagram illustrates CloudFront's global content delivery architecture serving users from multiple continents. At the top, we have three users in different geographic locations: London, Tokyo, and Sydney. Each user experiences low latency (10-15ms) to their nearest edge location, compared to 150-250ms if they connected directly to the origin in us-east-1.

Request Flow: (1) When the London user requests content from cdn.example.com, their DNS query reaches CloudFront's authoritative DNS servers. (2) CloudFront's GeoDNS system determines the user's location (London) and returns the IP address of the nearest edge location (LHR50-C1 in London). (3) The user's browser establishes an HTTPS connection to the London edge location and requests /image.jpg. (4a) If the image is already cached at the London edge (cache hit), CloudFront immediately returns it with ~10ms latency. (4b) If the image is not cached (cache miss), the London edge location fetches it from the origin.

Origin Fetch Process: When a cache miss occurs, the edge location connects to the origin over AWS's private global network backbone (not public internet). The CloudFront distribution is configured with cache behaviors that route different URL patterns to different origins: /*.jpg requests go to the S3 bucket (for static images), while /* (default) requests go to the ALB (for dynamic content). The ALB distributes requests across multiple EC2 web servers for high availability. The origin returns the content along with cache headers (Cache-Control: max-age=86400), which CloudFront uses to determine how long to cache the object.

Multi-Origin Architecture: This distribution uses two origins: (1) S3 bucket for static assets (images, CSS, JavaScript) - highly cacheable with long TTLs (hours to days), and (2) ALB for dynamic content (HTML pages, API responses) - shorter TTLs (seconds to minutes) or no caching for personalized content. CloudFront's cache behaviors use path patterns to route requests: /images/* → S3, /api/* → ALB with no caching, /static/* → S3 with 1-year TTL.

Global Distribution: Users in Tokyo and Sydney follow the same pattern, connecting to their nearest edge locations (NRT57-C1 and SYD4-C1). All edge locations share the same cache configuration but maintain independent caches. If the London edge has cached /image.jpg, the Tokyo edge still needs to fetch it on first request (no cache sharing between edges). However, once cached, each edge serves its local users with minimal latency.

Performance Impact: Without CloudFront, the London user would connect directly to us-east-1 (80ms latency), fetch the image (200ms total), and repeat for every request. With CloudFront, the first request takes 80ms (cache miss), but subsequent requests take 10ms (cache hit), an 87.5% improvement. For a page with 50 images, this reduces load time from 10 seconds to 1.5 seconds.

Cost Optimization: CloudFront charges for data transfer out and requests. Caching reduces origin load and data transfer from origin to edge (free on AWS's private network), but you pay for edge-to-user transfer ($0.085/GB in North America, $0.140/GB in Asia). High cache hit ratios (>80%) significantly reduce costs by minimizing origin fetches.

⭐ Must Know (Critical CloudFront Facts):

  • Edge Locations: 400+ globally (vs 30+ Regions); not full AWS Regions, just caching infrastructure
  • Cache Key: Default is URL; can include query strings, cookies, headers based on configuration
  • TTL: Controlled by Cache-Control/Expires headers from origin or distribution default TTL (24 hours default)
  • Origin Types: S3, ALB, NLB, EC2, API Gateway, custom HTTP/HTTPS servers
  • SSL/TLS: Supports SNI (free) and dedicated IP ($600/month); can use ACM certificates
  • Price Classes: All edges (most expensive), exclude expensive regions (cheaper), or US/Europe only (cheapest)
  • Invalidation Cost: $0.005 per path after first 1,000 paths/month; use versioned URLs instead
  • Origin Shield: Additional caching layer between edge and origin; reduces origin load; $0.01/10,000 requests

Detailed Example 1: E-Commerce Website with Global Users

You're architecting a global e-commerce platform with users in North America, Europe, and Asia. The application has static assets (product images, CSS, JavaScript), dynamic content (product listings, user accounts), and an API for checkout.

Requirements:

  • Product images must load quickly globally (< 100ms)
  • Product listings are updated every 5 minutes
  • User account pages are personalized (cannot cache)
  • Checkout API must be secure and fast
  • Origin is in us-east-1 (ALB + EC2 + RDS)

CloudFront Design:

Distribution Configuration:

  • Origin 1: S3 bucket (static-assets.example.com) for images, CSS, JS
  • Origin 2: ALB (api.example.com) for dynamic content and API
  • Custom domain: www.example.com with ACM certificate
  • Price class: All edge locations (global reach)

Cache Behaviors (evaluated in order):

  1. Path: /images/* → Origin: S3, TTL: 1 year, Query strings: None, Cookies: None
  2. Path: /static/* → Origin: S3, TTL: 1 week, Query strings: None, Cookies: None
  3. Path: /api/checkout/* → Origin: ALB, TTL: 0 (no cache), HTTPS only, Forward all headers/cookies
  4. Path: /api/products/* → Origin: ALB, TTL: 5 minutes, Query strings: All, Cookies: None
  5. Path: /account/* → Origin: ALB, TTL: 0 (no cache), Forward session cookie
  6. Path: /* → Origin: ALB, TTL: 1 hour, Query strings: None, Cookies: None (default pages)

Rationale:

  • Images (/images/*): Static, rarely change, long TTL (1 year) maximizes cache hit ratio and reduces origin load. No query strings or cookies needed.
  • Static Assets (/static/*): CSS/JS files, versioned URLs (style-v123.css), 1-week TTL balances caching and updates.
  • Checkout API (/api/checkout/*): Sensitive, personalized, cannot cache (TTL: 0). HTTPS only for security. Forward all headers and cookies for authentication.
  • Products API (/api/products/*): Dynamic but cacheable for 5 minutes. Forward query strings for filtering (e.g., /api/products?category=electronics). Don't forward cookies (not personalized).
  • Account Pages (/account/*): Personalized, cannot cache. Forward session cookie for authentication.
  • Default (/*): Homepage and category pages, cache for 1 hour. No query strings or cookies (same for all users).
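
As an illustration, the products-API behavior above might be expressed roughly like this as one CacheBehavior entry in the CloudFront API (legacy ForwardedValues form shown for brevity; the origin ID and TTL values are assumptions, and a full DistributionConfig needs additional fields):

# One entry of the CacheBehaviors list inside a CloudFront DistributionConfig (Python dict).
products_api_behavior = {
    "PathPattern": "/api/products/*",
    "TargetOriginId": "alb-origin",               # must match an origin defined in the distribution
    "ViewerProtocolPolicy": "redirect-to-https",
    "MinTTL": 0,
    "DefaultTTL": 300,                            # cache for 5 minutes
    "MaxTTL": 300,
    "ForwardedValues": {
        "QueryString": True,                      # ?category=... becomes part of the cache key
        "Cookies": {"Forward": "none"},           # responses are not personalized
    },
}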

Performance Results:

  • Users in London: Image load time 15ms (vs 80ms direct to us-east-1), 81% improvement
  • Users in Tokyo: Image load time 20ms (vs 150ms direct), 87% improvement
  • Cache hit ratio: 85% (images and static assets heavily cached)
  • Origin requests reduced by 85%, lowering ALB and EC2 costs

Cost Analysis:

  • CloudFront data transfer: $0.085/GB (North America), $0.140/GB (Asia)
  • Origin data transfer: Free (within AWS)
  • Requests: $0.0075/10,000 HTTPS requests
  • For 10TB/month traffic with 85% cache hit ratio: CloudFront cost ~$1,000/month, saves ~$500/month in origin costs

Detailed Example 2: Video Streaming Platform

You're building a video streaming platform serving on-demand videos to global users. Videos are stored in S3, and you need to deliver them with low latency and high throughput.

Requirements:

  • Support HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming)
  • Videos range from 100MB to 5GB
  • Users expect instant playback (< 2 seconds to first frame)
  • Must support 100,000 concurrent viewers
  • Origin is S3 in us-east-1

CloudFront Design:

Distribution Configuration:

  • Origin: S3 bucket (videos.example.com) with Origin Access Identity (OAI) for security
  • Custom domain: stream.example.com
  • Price class: All edge locations
  • Smooth streaming: Enabled (optimizes for video delivery)

Cache Behaviors:

  • Path: /videos/*.m3u8 → TTL: 5 seconds (HLS manifests, frequently updated)
  • Path: /videos/*.ts → TTL: 1 year (video segments, immutable)
  • Path: /videos/*.mpd → TTL: 5 seconds (DASH manifests)
  • Path: /videos/*.m4s → TTL: 1 year (DASH segments)

Optimization Techniques:

  1. Segment Caching: Videos are split into small segments (2-10 seconds each). Each segment is cached independently with long TTL. Manifests (playlists) have short TTL to support live updates.

  2. Range Requests: CloudFront supports HTTP range requests, allowing players to seek to any point in the video without downloading the entire file. Edge locations cache ranges independently.

  3. Origin Shield: Enabled in us-east-1 to consolidate requests from all edge locations. When 100 edge locations request the same video segment simultaneously, Origin Shield fetches it once from S3 and serves all 100 edges, reducing S3 requests by 99%.

  4. Compression: CloudFront automatically compresses text-based files (manifests) with gzip/brotli, reducing transfer size by 70-80%.

Traffic Pattern:

  • Popular video released: 50,000 users start watching simultaneously
  • Without Origin Shield: every edge location (and regional edge cache) that misses fetches each segment from S3 independently, multiplying origin requests across hundreds of locations worldwide
  • With Origin Shield: requests collapse to roughly one S3 fetch per segment; Origin Shield then serves all downstream edge locations, which in turn serve the 50,000 viewers
  • S3 cost savings: origin request volume drops by well over 99%

Performance Results:

  • Time to first frame: 1.2 seconds (vs 3.5 seconds without CloudFront)
  • Buffering events: 0.1% (vs 5% without CloudFront)
  • Concurrent viewers supported: 100,000+ (limited by origin without CloudFront)

Detailed Example 3: API Acceleration with CloudFront

Your mobile app makes frequent API calls to a REST API hosted on ALB in us-east-1. Users in Asia experience 200ms latency per API call, and the app makes 20 API calls on launch, resulting in 4 seconds of loading time.

Challenge: API responses are dynamic and personalized (cannot cache), but you still want to reduce latency.

Solution - CloudFront with TTL: 0:

Distribution Configuration:

  • Origin: ALB (api.example.com)
  • Custom domain: api-cdn.example.com
  • Cache behavior: TTL: 0 (no caching), Forward all headers, cookies, query strings

How It Helps Without Caching:

  1. Connection Reuse: CloudFront maintains persistent connections to the origin ALB. Without CloudFront, each new connection from Asia pays the TCP and TLS handshake round trips all the way to us-east-1 (roughly 3 round trips, ~600ms at 200ms RTT). With CloudFront, the client handshakes with the nearby edge location (a few milliseconds of RTT), and the edge reuses already-established connections to the origin, eliminating most of the handshake overhead.

  2. AWS Private Network: Traffic from edge to origin travels over AWS's private global network, not public internet. This provides more consistent latency and avoids congested internet paths.

  3. TCP Optimization: CloudFront uses optimized TCP settings (larger windows, better congestion control) for edge-to-origin connections, improving throughput.

Performance Results:

  • API latency: 200ms → 120ms (40% improvement)
  • App launch time: 4 seconds → 2.4 seconds (40% improvement)
  • No caching required, all responses are fresh

When to Use:

  • Dynamic, personalized content that cannot be cached
  • Users are geographically distributed
  • Origin is in a single Region
  • Latency is critical

šŸ’” Tips for Understanding CloudFront:

  • Cache Hit Ratio: Aim for >80%; higher ratio = better performance and lower costs
  • Versioned URLs: Use /image-v2.jpg instead of invalidations; cheaper and instant
  • Origin Shield: Use for high-traffic origins to reduce load; costs $0.01/10,000 requests but saves origin costs
  • Price Classes: Exclude expensive regions (Australia, South America) if users are primarily in US/Europe
  • Lambda@Edge: Run code at edge locations for request/response manipulation; use for A/B testing, authentication, URL rewrites

āš ļø Common Mistakes & Misconceptions:

Mistake 1: Caching personalized content

  • Why it's wrong: If you cache /account/profile with TTL > 0, User A might see User B's profile
  • Correct understanding: Set TTL: 0 for personalized content; forward session cookies; use cache behaviors to separate cacheable and non-cacheable paths

Mistake 2: Not configuring cache behaviors for different content types

  • Why it's wrong: Using a single cache behavior with default TTL for all content results in either stale dynamic content or poor cache hit ratio for static content
  • Correct understanding: Create multiple cache behaviors: long TTL for static assets, short TTL for dynamic content, no caching for personalized content

Mistake 3: Forgetting to forward query strings for dynamic content

  • Why it's wrong: If you don't forward query strings, /api/products?category=electronics and /api/products?category=books return the same cached response
  • Correct understanding: Configure cache behaviors to forward query strings for dynamic content; use query strings as part of cache key

Mistake 4: Using invalidations instead of versioned URLs

  • Why it's wrong: Invalidations cost $0.005 per path after 1,000/month and take minutes to propagate; frequent invalidations are expensive
  • Correct understanding: Use versioned URLs (image-v2.jpg, style-v123.css) for instant updates at no cost; invalidations only for emergencies

šŸ”— Connections to Other Topics:

  • Relates to Route 53 because: CloudFront distributions use Route 53 for DNS resolution; can create alias records pointing to CloudFront
  • Builds on AWS WAF by: CloudFront integrates with WAF for layer 7 protection; attach WAF web ACL to distribution for DDoS protection, geo-blocking, rate limiting
  • Often used with S3 to: Serve static websites; use Origin Access Identity (OAI) to restrict S3 access to CloudFront only

AWS Global Accelerator

What it is: Global Accelerator is a network service that provides static anycast IP addresses (2 IPs per accelerator) that route traffic over AWS's private global network to optimal AWS endpoints (ALB, NLB, EC2, Elastic IP) in one or more Regions. Unlike CloudFront (which caches content), Global Accelerator routes every request to your application, providing consistent performance and instant regional failover.

Why it exists: Internet routing is unpredictable and often suboptimal. A user in Tokyo connecting to an ALB in us-east-1 might have their traffic route through multiple ISPs, experiencing packet loss, jitter, and high latency (150-250ms). Global Accelerator solves this by providing anycast IPs that route traffic to the nearest AWS edge location, then over AWS's private global network to your application. This reduces latency by up to 60% and provides consistent performance regardless of internet conditions.

Real-world analogy: Think of Global Accelerator like a private highway system. Without it, your traffic uses public roads (internet) with traffic jams, detours, and varying conditions. With Global Accelerator, your traffic enters a private highway (AWS network) at the nearest on-ramp (edge location) and travels on a dedicated, optimized route to your destination (application endpoint). The highway is faster, more reliable, and has no traffic jams.

How it works (Detailed step-by-step):

  1. Accelerator Creation: You create a Global Accelerator accelerator, which provisions 2 static anycast IPv4 addresses (e.g., 75.2.60.5 and 99.83.190.51). These IPs are announced from all AWS edge locations globally using BGP anycast. You configure listeners (TCP or UDP ports) and endpoint groups (Regions containing your application endpoints).

  2. Anycast Routing: When a user in Tokyo connects to 75.2.60.5, BGP anycast routing directs their traffic to the nearest AWS edge location (Tokyo). The same IP address is announced from all edge locations, but BGP selects the closest one based on network topology. This happens at the network layer, transparent to the user.

  3. Edge Location Ingress: The Tokyo edge location receives the user's traffic and immediately routes it over AWS's private global network to the configured endpoint group. If you have endpoint groups in us-east-1 and eu-west-1, Global Accelerator routes to the closest healthy endpoint group based on configured traffic dials and health checks.

  4. Endpoint Selection: Within the endpoint group, Global Accelerator distributes traffic across endpoints (ALBs, NLBs, EC2 instances, Elastic IPs) based on configured weights and health checks. If an endpoint is unhealthy, traffic is automatically routed to healthy endpoints. If all endpoints in a Region are unhealthy, traffic fails over to the next closest Region.

  5. Connection Persistence: Global Accelerator maintains connection state and provides client affinity (sticky sessions) based on source IP. This ensures that requests from the same client are routed to the same endpoint for the duration of the session, important for stateful applications.

  6. Health Checks: Global Accelerator performs health checks on all endpoints every 30 seconds. If an endpoint fails 3 consecutive health checks (90 seconds), it's marked unhealthy and removed from rotation. When it passes 3 consecutive checks, it's marked healthy and added back. This provides automatic failover without DNS TTL delays.
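
The creation workflow in steps 1-4 maps to three boto3 calls. A hedged sketch (the Global Accelerator API is served from us-west-2; the accelerator name and the ALB ARN are placeholders):

import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")  # GA control-plane endpoint

# 1. Accelerator: provisions the two static anycast IPv4 addresses.
accel = ga.create_accelerator(Name="global-app", IpAddressType="IPV4", Enabled=True)
accel_arn = accel["Accelerator"]["AcceleratorArn"]

# 2. Listener: accept TCP 443 and pin each client to one endpoint (source-IP affinity).
listener = ga.create_listener(
    AcceleratorArn=accel_arn,
    Protocol="TCP",
    PortRanges=[{"FromPort": 443, "ToPort": 443}],
    ClientAffinity="SOURCE_IP",
)

# Placeholder ALB ARN for the primary Region.
alb_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188"

# 3. Endpoint group: send 100% of traffic to us-east-1; Weight applies to the endpoint itself.
ga.create_endpoint_group(
    ListenerArn=listener["Listener"]["ListenerArn"],
    EndpointGroupRegion="us-east-1",
    TrafficDialPercentage=100,
    EndpointConfigurations=[{"EndpointId": alb_arn, "Weight": 100}],
)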

šŸ“Š Global Accelerator Architecture Diagram:

graph TB
    subgraph "Users Worldwide"
        U1[User in Tokyo<br/>Connects to 75.2.60.5]
        U2[User in London<br/>Connects to 75.2.60.5]
        U3[User in Sydney<br/>Connects to 75.2.60.5]
    end
    
    subgraph "AWS Edge Network (Anycast)"
        E1[Tokyo Edge<br/>Announces 75.2.60.5<br/>BGP Anycast]
        E2[London Edge<br/>Announces 75.2.60.5<br/>BGP Anycast]
        E3[Sydney Edge<br/>Announces 75.2.60.5<br/>BGP Anycast]
    end
    
    GA[Global Accelerator<br/>Static IPs:<br/>75.2.60.5<br/>99.83.190.51<br/>Listener: TCP 443]
    
    subgraph "Endpoint Group 1: us-east-1 (Weight: 100)"
        ALB1[Application Load Balancer<br/>Health: Healthy<br/>Weight: 100]
        EC2_1[App Server 1]
        EC2_2[App Server 2]
    end
    
    subgraph "Endpoint Group 2: eu-west-1 (Weight: 0 - Failover)"
        ALB2[Application Load Balancer<br/>Health: Healthy<br/>Weight: 0]
        EC2_3[App Server 3]
        EC2_4[App Server 4]
    end
    
    U1 -->|BGP Routes to<br/>Nearest Edge| E1
    U2 --> E2
    U3 --> E3
    
    E1 -.AWS Private Network<br/>Optimized Path.-> GA
    E2 -.AWS Private Network.-> GA
    E3 -.AWS Private Network.-> GA
    
    GA -->|Primary: 100%<br/>Traffic| ALB1
    GA -.Failover: 0%<br/>Unless Primary Unhealthy.-> ALB2
    
    ALB1 --> EC2_1
    ALB1 --> EC2_2
    ALB2 --> EC2_3
    ALB2 --> EC2_4
    
    style U1 fill:#e1f5fe
    style U2 fill:#e1f5fe
    style U3 fill:#e1f5fe
    style E1 fill:#c8e6c9
    style E2 fill:#c8e6c9
    style E3 fill:#c8e6c9
    style GA fill:#fff3e0
    style ALB1 fill:#f3e5f5
    style ALB2 fill:#ffebee

See: diagrams/02_domain_1_global_accelerator.mmd

Diagram Explanation (detailed):

This diagram illustrates Global Accelerator's anycast routing and multi-region failover architecture. Three users in different locations (Tokyo, London, Sydney) all connect to the same static IP address (75.2.60.5), but their traffic is automatically routed to different edge locations based on BGP anycast routing.

Anycast Routing: The Global Accelerator's static IP (75.2.60.5) is announced from all AWS edge locations worldwide using BGP anycast. When the Tokyo user's router performs a BGP lookup for 75.2.60.5, it receives multiple route advertisements (one from each edge location), but BGP's path selection algorithm chooses the shortest AS path, which is typically the geographically nearest edge location (Tokyo Edge). The same IP address is announced from London Edge and Sydney Edge, so users in those locations are routed to their nearest edges. This is fundamentally different from DNS-based routing (like Route 53 latency routing), which requires DNS resolution and is subject to DNS caching. Anycast routing happens at the network layer and is instant.

AWS Private Network Transit: Once traffic reaches the edge location, it immediately enters AWS's private global network backbone. The Tokyo Edge routes the user's traffic to the Global Accelerator service, which then routes it to the configured endpoint group. This transit over AWS's private network (shown as dotted lines) provides several benefits: (1) Consistent, predictable latency (AWS's network is engineered for low latency), (2) No internet congestion or packet loss, (3) Optimized routing (AWS controls the entire path), (4) Better throughput (AWS's network has high bandwidth capacity).

Endpoint Groups and Traffic Distribution: The accelerator is configured with two endpoint groups: (1) us-east-1 (primary) with a traffic dial of 100%, and (2) eu-west-1 (failover) with a traffic dial of 0%. (The diagram labels these values as weights; strictly, the traffic dial applies to an endpoint group, while weights apply to individual endpoints inside a group.) The traffic dial determines how much traffic a Region receives. With us-east-1 at 100% and eu-west-1 at 0%, all traffic goes to us-east-1 under normal conditions. If you set us-east-1 to 70 and eu-west-1 to 30, traffic would be split 70/30 for active-active load balancing across Regions. The 0% setting creates an active-passive failover setup where eu-west-1 only receives traffic if us-east-1 is completely unhealthy.

Health Checks and Failover: Global Accelerator performs health checks on the ALB in us-east-1 every 30 seconds. If the ALB fails 3 consecutive checks (90 seconds total), Global Accelerator marks it unhealthy and immediately routes all traffic to eu-west-1. This failover is instant (no DNS TTL delay) because the static IP doesn't change - only the backend routing changes. When us-east-1 recovers and passes 3 consecutive health checks, traffic automatically fails back. This provides sub-2-minute failover for regional disasters.

Client Affinity: Global Accelerator provides client affinity based on source IP address. If a user in Tokyo makes multiple requests, all requests are routed to the same endpoint (ALB1 in us-east-1) for the duration of their session. This is critical for stateful applications that maintain session state on specific servers. The affinity holds even if the client's packets enter AWS through different edge locations; if the client's source IP changes (as can happen on some mobile networks), a new affinity mapping is created.

Performance Comparison: Without Global Accelerator, the Tokyo user connects directly to the ALB in us-east-1 over the public internet. The path might be: Tokyo → ISP1 → ISP2 → ISP3 → us-east-1, with 150-200ms latency and potential packet loss. With Global Accelerator, the path is: Tokyo → Tokyo Edge (5ms) → AWS Private Network → us-east-1 (80ms total), a 50% latency reduction. Additionally, the AWS private network provides consistent latency (80ms ±5ms) vs variable internet latency (150-250ms).

⭐ Must Know (Critical Global Accelerator Facts):

  • Static Anycast IPs: 2 IPs per accelerator; announced from all edge locations via BGP anycast
  • Listeners: Support TCP and UDP; configure ports (e.g., 80, 443, custom ports)
  • Endpoint Types: ALB, NLB, EC2 instance, Elastic IP (cannot use CLB or Lambda)
  • Endpoint Groups: One per Region; configure traffic dial (0-100%) for traffic distribution
  • Health Checks: Every 30 seconds; 3 failures = unhealthy; 3 successes = healthy
  • Failover Time: Under 2 minutes (about 90 seconds to detect failure, then an instant routing change)
  • Client Affinity: Based on source IP; maintains session to same endpoint
  • Bring Your Own IP (BYOIP): Can use your own IP addresses instead of AWS-provided
  • Pricing: $0.025/hour per accelerator + $0.015/GB data transfer (in addition to standard data transfer)

Detailed Example 1: Gaming Application with Global Users

You're running a multiplayer gaming platform with players worldwide. The game requires low latency (< 100ms) and consistent performance for real-time gameplay. Your game servers run on EC2 instances behind NLBs in us-east-1, eu-west-1, and ap-southeast-1.

Requirements:

  • Players must connect to nearest Region for lowest latency
  • Instant failover if a Region becomes unavailable
  • Static IP addresses (players whitelist IPs in firewalls)
  • UDP protocol support (game uses UDP for low latency)
  • Session persistence (players must stay connected to same server)

Global Accelerator Design:

Accelerator Configuration:

  • Static IPs: 75.2.60.5, 99.83.190.51 (provided by AWS)
  • Listener: UDP port 7777 (game protocol)
  • Client affinity: Source IP (maintains session to same server)

Endpoint Groups:

  1. us-east-1: Weight 100, NLB with 10 EC2 game servers
  2. eu-west-1: Weight 100, NLB with 10 EC2 game servers
  3. ap-southeast-1: Weight 100, NLB with 10 EC2 game servers

Traffic Distribution: With all endpoint groups at weight 100, Global Accelerator routes each player to the nearest healthy Region based on network proximity. A player in New York connects to us-east-1 (20ms), a player in London connects to eu-west-1 (15ms), and a player in Tokyo connects to ap-southeast-1 (10ms). This is automatic - no DNS configuration or player selection required.
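
To make this configuration concrete, here is a minimal boto3 sketch (not an official reference) that creates the accelerator, the UDP listener with source-IP affinity, and one endpoint group per Region. The NLB ARNs are placeholders, health check settings are left at their defaults, and the per-Region "weight 100" described above maps to TrafficDialPercentage.

```python
import boto3

# The Global Accelerator control-plane API is served from us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

acc_arn = ga.create_accelerator(
    Name="game-accelerator", IpAddressType="IPV4", Enabled=True
)["Accelerator"]["AcceleratorArn"]

listener_arn = ga.create_listener(
    AcceleratorArn=acc_arn,
    Protocol="UDP",
    PortRanges=[{"FromPort": 7777, "ToPort": 7777}],   # game protocol port
    ClientAffinity="SOURCE_IP",                        # keep a player on the same endpoint
)["Listener"]["ListenerArn"]

# Placeholder NLB ARNs for the per-Region game-server fleets.
nlb_arns = {
    "us-east-1": "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/game-us/abc",
    "eu-west-1": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/game-eu/def",
    "ap-southeast-1": "arn:aws:elasticloadbalancing:ap-southeast-1:111111111111:loadbalancer/net/game-ap/ghi",
}

for region, nlb_arn in nlb_arns.items():
    ga.create_endpoint_group(
        ListenerArn=listener_arn,
        EndpointGroupRegion=region,
        TrafficDialPercentage=100.0,   # admit 100% of traffic routed to this Region
        EndpointConfigurations=[{"EndpointId": nlb_arn, "Weight": 128}],
    )
```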

Failover Scenario: During a game session, the us-east-1 Region experiences an outage. Global Accelerator's health checks detect the NLB is unhealthy after 90 seconds. All players connected to us-east-1 are automatically rerouted to the next nearest healthy Region (eu-west-1 for East Coast players). The static IP doesn't change, so players' firewall rules remain valid. Players experience a brief disconnection (< 2 minutes) and automatically reconnect to the new Region.

Performance Results:

  • Average latency: 35ms (vs 120ms without Global Accelerator)
  • Latency consistency: ±5ms (vs ±50ms on public internet)
  • Packet loss: 0.01% (vs 1-2% on public internet)
  • Failover time: 90 seconds (vs 5-10 minutes with DNS-based failover)

Cost Analysis:

  • Global Accelerator: $0.025/hour × 1 accelerator = $18/month
  • Data transfer: $0.015/GB × 10TB = $150/month
  • Total: $168/month additional cost
  • Benefit: ~70% latency reduction (35ms vs 120ms), instant failover, static IPs

Detailed Example 2: Financial Trading Platform

You're building a high-frequency trading platform where milliseconds matter. Traders in New York, London, and Singapore need the lowest possible latency to your trading engine in us-east-1.

Challenge: Even with optimized internet connections, traders in Singapore experience 180-220ms latency to us-east-1 over the public internet. This latency disadvantage costs millions in lost trading opportunities.

Global Accelerator Solution:

Configuration:

  • Static IPs: 75.2.60.5, 99.83.190.51
  • Listener: TCP port 8443 (trading protocol)
  • Endpoint Group: us-east-1 only (trading engine must be in one location for consistency)
  • Endpoint: NLB with trading engine servers

How It Helps:

  1. Optimized Path: Singapore traders connect to the Singapore edge location (5ms), then traffic routes over AWS's private network to us-east-1 (120ms total). This is 60ms faster than public internet (180ms) because AWS's network is optimized for low latency with direct fiber connections between Regions.

  2. Consistent Latency: Public internet latency varies (180-220ms) due to routing changes and congestion. AWS's private network provides consistent latency (120ms ±3ms), allowing traders to predict execution times accurately.

  3. Lower Packet Loss: Public internet has 1-2% packet loss during peak hours. AWS's private network has < 0.01% packet loss, reducing retransmissions and improving throughput.

Performance Results:

  • Singapore to us-east-1: 180ms → 120ms (33% improvement)
  • London to us-east-1: 80ms → 65ms (19% improvement)
  • Latency consistency: ±50ms → ±3ms (94% improvement)
  • Packet loss: 1.5% → 0.01% (99% improvement)

Business Impact: The 60ms latency reduction for Singapore traders translates into a competitive advantage in high-frequency trading. At 1,000 trades per second, shaving 60ms off each round trip removes roughly 60 seconds of cumulative waiting time for every second of trading, enabling faster execution and better prices.

Detailed Example 3: IoT Device Fleet with Static IPs

You have 100,000 IoT devices deployed globally that send telemetry data to your API in us-east-1. Devices are configured with hardcoded IP addresses (cannot use DNS) and must maintain connections for hours.

Requirements:

  • Static IP addresses (devices cannot resolve DNS)
  • Long-lived TCP connections (devices maintain persistent connections)
  • Automatic failover to eu-west-1 if us-east-1 fails
  • Minimize data transfer costs

Global Accelerator Design:

Configuration:

  • Static IPs: 75.2.60.5, 99.83.190.51 (hardcoded in device firmware)
  • Listener: TCP port 8883 (MQTT over TLS)
  • Client affinity: Source IP (maintains connection to same endpoint)

Endpoint Groups:

  1. us-east-1: Weight 100, NLB with API servers (primary)
  2. eu-west-1: Weight 0, NLB with API servers (failover)

Traffic Flow: Devices worldwide connect to 75.2.60.5. BGP anycast routes each device to its nearest edge location. The edge location routes traffic over AWS's private network to us-east-1 (weight 100). Devices maintain persistent TCP connections for hours, sending telemetry every 60 seconds.
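
As a sketch of the active-passive part of this design (assuming the accelerator and TCP listener already exist; all ARNs are placeholders), the only difference from an active-active setup is the traffic dial on the standby group:

```python
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

LISTENER_ARN = "arn:aws:globalaccelerator::111111111111:accelerator/example/listener/tcp8883"  # placeholder

# Primary Region receives all traffic under normal conditions.
ga.create_endpoint_group(
    ListenerArn=LISTENER_ARN,
    EndpointGroupRegion="us-east-1",
    TrafficDialPercentage=100.0,
    EndpointConfigurations=[{"EndpointId": "NLB_US_ARN_PLACEHOLDER", "Weight": 128}],
)

# Standby Region: dial 0 so, as described above, it only takes traffic when us-east-1 is unhealthy.
ga.create_endpoint_group(
    ListenerArn=LISTENER_ARN,
    EndpointGroupRegion="eu-west-1",
    TrafficDialPercentage=0.0,
    EndpointConfigurations=[{"EndpointId": "NLB_EU_ARN_PLACEHOLDER", "Weight": 128}],
)
```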

Failover Scenario: us-east-1 experiences an outage. Global Accelerator detects the NLB is unhealthy after 90 seconds and routes all traffic to eu-west-1. Devices experience connection drops and automatically reconnect to the same IP (75.2.60.5), which now routes to eu-west-1. No firmware update required because the IP address hasn't changed.

Cost Consideration: Global Accelerator does not replace standard data transfer charges; its DT-Premium fee (about $0.015/GB, varying by direction and Region pair) and the $0.025/hour accelerator charge are billed in addition to standard AWS data transfer. For 10TB/month of device traffic, budget roughly $18/month for the accelerator plus about $150/month in DT-Premium on top of existing transfer costs. The value in this design comes from the static anycast IPs (no firmware updates when backends change), consistent performance over the AWS backbone, and automatic regional failover, not from reduced data transfer rates.

💡 Tips for Understanding Global Accelerator:

  • Use for Non-HTTP/HTTPS: Global Accelerator supports any TCP/UDP protocol; CloudFront only supports HTTP/HTTPS
  • Static IPs: Critical for whitelisting, hardcoded configurations, or regulatory requirements
  • Active-Active: Set multiple endpoint groups to weight > 0 for load balancing across Regions
  • Active-Passive: Set primary to weight 100, secondary to weight 0 for failover only
  • Client Affinity: Essential for stateful applications; routes same client to same endpoint

⚠️ Common Mistakes & Misconceptions:

Mistake 1: Using Global Accelerator for cacheable content

  • Why it's wrong: Global Accelerator doesn't cache; every request goes to origin; expensive for high-traffic cacheable content
  • Correct understanding: Use CloudFront for cacheable content (images, videos, static assets); use Global Accelerator for dynamic, non-cacheable traffic (APIs, gaming, IoT)

Mistake 2: Expecting instant failover without health checks

  • Why it's wrong: Global Accelerator requires 3 failed health checks (90 seconds) to mark endpoint unhealthy
  • Correct understanding: Failover takes 90 seconds minimum; configure health check intervals and thresholds appropriately

Mistake 3: Not configuring client affinity for stateful applications

  • Why it's wrong: Without client affinity, requests from same client might route to different endpoints, breaking session state
  • Correct understanding: Enable client affinity (source IP) for stateful applications; disable for stateless applications to maximize load distribution

Mistake 4: Using Global Accelerator when all users are in one Region

  • Why it's wrong: Global Accelerator adds cost ($0.025/hour + $0.015/GB) without benefit if users are near your endpoint
  • Correct understanding: Use Global Accelerator for globally distributed users; use direct ALB/NLB for regional users

🔗 Connections to Other Topics:

  • Relates to Route 53 because: Both provide global traffic management, but Global Accelerator uses anycast (network layer) while Route 53 uses DNS (application layer)
  • Builds on NLB/ALB by: Global Accelerator routes to NLB/ALB endpoints; understanding load balancer health checks is essential
  • Often used with Shield to: Global Accelerator includes AWS Shield Standard for DDoS protection; upgrade to Shield Advanced for enhanced protection

CloudFront vs Global Accelerator Decision Framework

When to use CloudFront:

  • ✅ Content is cacheable (static assets, videos, API responses with TTL)
  • ✅ Protocol is HTTP/HTTPS
  • ✅ Need layer 7 features (URL rewriting, header manipulation, Lambda@Edge)
  • ✅ Cost optimization is priority (caching reduces origin load and data transfer)
  • ✅ Integration with S3, API Gateway, or web applications

When to use Global Accelerator:

  • ✅ Content is dynamic and non-cacheable (real-time APIs, gaming, IoT)
  • ✅ Protocol is TCP or UDP (not just HTTP/HTTPS)
  • ✅ Need static IP addresses (whitelisting, hardcoded configurations)
  • ✅ Need instant regional failover (< 2 minutes)
  • ✅ Need client affinity for stateful applications

When to use both:

  • ✅ CloudFront for static assets (images, CSS, JS) + Global Accelerator for dynamic API
  • ✅ CloudFront for website + Global Accelerator for WebSocket connections
  • ✅ CloudFront for video streaming + Global Accelerator for live video ingest

📊 CloudFront vs Global Accelerator Comparison Diagram:

graph TB
    subgraph "CloudFront Use Cases"
        CF1[Static Content<br/>Images, CSS, JS]
        CF2[Video Streaming<br/>HLS, DASH]
        CF3[Dynamic Content<br/>with Caching<br/>API responses, HTML]
        CF4[HTTP/HTTPS Only]
    end
    
    subgraph "Global Accelerator Use Cases"
        GA1[Non-HTTP Protocols<br/>TCP, UDP, Gaming]
        GA2[Dynamic Non-Cacheable<br/>Real-time APIs, IoT]
        GA3[Static IP Required<br/>Whitelisting, Hardcoded]
        GA4[Instant Failover<br/>Multi-region HA]
    end
    
    subgraph "Both Together"
        BOTH1[CloudFront: Static Assets<br/>Global Accelerator: API]
        BOTH2[CloudFront: Website<br/>Global Accelerator: WebSocket]
    end
    
    style CF1 fill:#c8e6c9
    style CF2 fill:#c8e6c9
    style CF3 fill:#c8e6c9
    style CF4 fill:#c8e6c9
    style GA1 fill:#fff3e0
    style GA2 fill:#fff3e0
    style GA3 fill:#fff3e0
    style GA4 fill:#fff3e0
    style BOTH1 fill:#e1f5fe
    style BOTH2 fill:#e1f5fe

See: diagrams/02_domain_1_cloudfront_vs_global_accelerator.mmd

Comparison Table:

| Feature | CloudFront | Global Accelerator |
|---|---|---|
| Primary Function | Content caching and delivery | Traffic routing and acceleration |
| Caching | Yes (edge locations cache content) | No (every request goes to origin) |
| Protocols | HTTP, HTTPS, WebSocket | TCP, UDP (any protocol) |
| IP Addresses | Dynamic (changes per edge) | Static anycast (2 IPs) |
| Routing | DNS-based (GeoDNS) | Anycast (BGP network layer) |
| Failover | Origin failover (minutes, DNS TTL) | Regional failover (< 2 minutes) |
| Use Case | Cacheable content, websites, APIs | Non-cacheable, gaming, IoT, static IPs |
| Layer 7 Features | Yes (URL rewrite, headers, Lambda@Edge) | No (layer 4 only) |
| Client Affinity | Via cookies | Via source IP |
| Pricing | $0.085/GB + $0.0075/10K requests | $0.025/hour + $0.015/GB |
| Best For | Cost optimization via caching | Performance optimization via AWS network |

Section 2: DNS Solutions (Public, Private, Hybrid)

Introduction

The problem: Domain Name System (DNS) is critical infrastructure that translates human-readable domain names (www.example.com) to IP addresses (192.0.2.1). In traditional networks, you manage DNS servers manually, configure zone files, and handle replication. In hybrid cloud environments, you need DNS resolution between on-premises and AWS, split-view DNS (different responses for internal vs external queries), and integration with AWS services. Managing this complexity while ensuring high availability, low latency, and security is challenging.

The solution: Amazon Route 53 is AWS's highly available and scalable DNS service that provides domain registration, DNS routing, and health checking. Route 53 supports public hosted zones (internet-facing domains), private hosted zones (VPC-internal domains), and Route 53 Resolver (hybrid DNS between AWS and on-premises). It offers advanced routing policies (latency, geolocation, weighted, failover) and integrates seamlessly with AWS services.

Why it's tested: The ANS-C01 exam extensively tests DNS architecture design, including hybrid DNS with Route 53 Resolver, complex routing policies, DNSSEC implementation, and multi-account DNS strategies. You must understand when to use each routing policy, how to configure conditional forwarding, and how to design split-view DNS for hybrid environments.

Core Concepts

Amazon Route 53 Basics

What it is: Route 53 is a globally distributed DNS service that runs on AWS's edge network (same infrastructure as CloudFront). It provides authoritative DNS for your domains, translating domain names to IP addresses, AWS resources (via alias records), or other DNS names. Route 53 is named after TCP/UDP port 53, the standard DNS port.

Why it exists: Traditional DNS servers are single points of failure and require manual management, replication, and scaling. Route 53 solves this by providing a fully managed, highly available DNS service with 100% uptime SLA. It's globally distributed across AWS's edge locations, ensuring low-latency DNS responses worldwide. Route 53 also provides advanced features like health checks, traffic management, and integration with AWS services that traditional DNS servers lack.

Real-world analogy: Think of Route 53 like a global phone directory service. When someone wants to call "Amazon" (domain name), they look it up in the directory (DNS query), which returns the phone number (IP address). Route 53 is like having directory offices in every city (edge locations) so lookups are fast, and if one office fails, others continue operating (high availability).

How it works (Detailed step-by-step):

  1. Hosted Zone Creation: You create a hosted zone for your domain (example.com). Route 53 assigns 4 name servers (e.g., ns-123.awsdns-12.com, ns-456.awsdns-34.net, ns-789.awsdns-56.org, ns-012.awsdns-78.co.uk). These name servers are distributed globally across AWS's edge network for high availability. You update your domain registrar to use these name servers, delegating DNS authority to Route 53.

  2. DNS Record Creation: You create DNS records in the hosted zone: A records (IPv4 addresses), AAAA records (IPv6 addresses), CNAME records (aliases to other domains), MX records (mail servers), TXT records (text data), and Route 53-specific alias records (point to AWS resources like ALB, CloudFront, S3).

  3. DNS Query Resolution: When a user types www.example.com in their browser, their device sends a DNS query to their configured DNS resolver (ISP's DNS or public DNS like 8.8.8.8). The resolver doesn't know the answer, so it starts recursive resolution: (1) Query root DNS servers for .com name servers, (2) Query .com name servers for example.com name servers, (3) Query example.com name servers (Route 53) for www.example.com.

  4. Route 53 Response: The query reaches one of Route 53's name servers (geographically close to the resolver for low latency). Route 53 looks up www.example.com in the hosted zone, finds an A record pointing to 192.0.2.1, and returns this IP address along with a TTL (Time To Live, e.g., 300 seconds). The resolver caches this response for the TTL duration and returns it to the user's device.

  5. Caching and TTL: The user's device and the resolver cache the DNS response for the TTL period. Subsequent queries for www.example.com within the TTL are answered from cache without querying Route 53. After TTL expires, the next query triggers a new Route 53 lookup. Lower TTLs (60 seconds) enable faster changes but increase query volume and cost. Higher TTLs (3600 seconds) reduce queries and cost but slow down changes.

  6. Health Checks and Failover: Route 53 can perform health checks on endpoints (IP addresses, domain names, AWS resources) every 10 or 30 seconds. If an endpoint fails health checks, Route 53 automatically stops returning it in DNS responses, routing traffic to healthy endpoints. This enables DNS-based failover without manual intervention.

📊 Route 53 DNS Resolution Diagram:

sequenceDiagram
    participant User as User's Device
    participant Resolver as DNS Resolver<br/>(ISP or 8.8.8.8)
    participant Root as Root DNS Servers<br/>(.)
    participant TLD as TLD DNS Servers<br/>(.com)
    participant R53 as Route 53<br/>(example.com)
    
    User->>Resolver: Query: www.example.com
    Note over Resolver: Check cache<br/>Not found or expired
    
    Resolver->>Root: Query: www.example.com
    Root->>Resolver: Refer to .com servers<br/>(a.gtld-servers.net)
    
    Resolver->>TLD: Query: www.example.com
    TLD->>Resolver: Refer to example.com servers<br/>(ns-123.awsdns-12.com)
    
    Resolver->>R53: Query: www.example.com
    Note over R53: Lookup in hosted zone<br/>Find A record: 192.0.2.1<br/>TTL: 300 seconds
    R53->>Resolver: Answer: 192.0.2.1<br/>TTL: 300 seconds
    
    Note over Resolver: Cache response<br/>for 300 seconds
    Resolver->>User: Answer: 192.0.2.1
    
    Note over User: Connect to 192.0.2.1<br/>(Web server)

See: diagrams/02_domain_1_route53_resolution.mmd

Diagram Explanation (detailed):

This sequence diagram illustrates the complete DNS resolution process from user query to final answer, showing how Route 53 fits into the global DNS hierarchy. The process demonstrates recursive DNS resolution, where the resolver performs multiple queries on behalf of the user.

Step 1 - User Query: The user's device needs to connect to www.example.com. It sends a DNS query to its configured DNS resolver (typically the ISP's DNS server or a public DNS like Google's 8.8.8.8). The user's device doesn't perform recursive resolution itself; it relies on the resolver.

Step 2 - Cache Check: The resolver first checks its cache to see if it recently resolved www.example.com. If the cached entry exists and hasn't expired (TTL hasn't elapsed), the resolver immediately returns the cached answer without further queries. If not cached or expired, the resolver begins recursive resolution.

Step 3 - Root Server Query: The resolver queries one of the 13 root DNS servers (actually hundreds of servers using anycast, but logically 13 addresses). The root servers don't know the answer for www.example.com, but they know which servers are authoritative for .com domains. The root server responds with a referral to the .com TLD (Top-Level Domain) servers (e.g., a.gtld-servers.net).

Step 4 - TLD Server Query: The resolver queries the .com TLD servers, asking for www.example.com. The TLD servers don't know the specific answer, but they know which name servers are authoritative for example.com. They respond with a referral to Route 53's name servers (ns-123.awsdns-12.com, ns-456.awsdns-34.net, etc.). This information was configured when you delegated your domain to Route 53.

Step 5 - Route 53 Query: The resolver queries one of Route 53's name servers for www.example.com. Route 53 looks up the record in the hosted zone for example.com, finds an A record for www pointing to 192.0.2.1, and returns this answer along with a TTL of 300 seconds (5 minutes). Route 53's response is authoritative (the AA flag is set in the DNS response).

Step 6 - Caching and Response: The resolver caches the answer (192.0.2.1) for 300 seconds and returns it to the user's device. The user's device also caches the answer. For the next 5 minutes, any queries for www.example.com from this user or other users using the same resolver are answered from cache without querying Route 53.

Step 7 - Connection: The user's device now has the IP address (192.0.2.1) and establishes a connection to the web server at that address. The DNS resolution is complete.

Performance Considerations: The entire recursive resolution process typically takes 100-200ms for the first query (cache miss). Subsequent queries within the TTL period are answered from cache in < 1ms. This is why TTL selection is important: shorter TTLs enable faster changes but increase query volume and latency for cache misses; longer TTLs reduce queries and improve performance but slow down changes.

Route 53 Advantages: Route 53's name servers are distributed globally across AWS's edge network, so the resolver typically queries a geographically close Route 53 server, reducing latency. Additionally, Route 53 has a 100% uptime SLA, so DNS resolution continues even if some name servers fail.
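
One way to observe resolver caching from any client is to time two consecutive lookups of the same name: the first query (cache miss) usually takes tens to hundreds of milliseconds while the resolver walks the hierarchy, and the repeat query is answered from cache almost immediately. A rough illustration in Python (results depend on your resolver and OS-level caching):

```python
import socket
import time

def timed_lookup(hostname):
    """Resolve hostname and return (first IPv4 address, elapsed milliseconds)."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return infos[0][4][0], elapsed_ms

for attempt in (1, 2):
    ip, ms = timed_lookup("www.example.com")
    print(f"Lookup {attempt}: {ip} in {ms:.1f} ms")   # second lookup is typically much faster
```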

⭐ Must Know (Critical Route 53 Facts):

  • Hosted Zones: Public (internet-facing) or Private (VPC-internal); $0.50/month per hosted zone
  • Name Servers: 4 per hosted zone; globally distributed; 100% uptime SLA
  • Record Types: A (IPv4), AAAA (IPv6), CNAME (alias), MX (mail), TXT (text), NS (name server), SOA (start of authority), PTR (reverse DNS), SRV (service), CAA (certificate authority)
  • Alias Records: Route 53-specific; point to AWS resources (ALB, CloudFront, S3, API Gateway); no charge for alias queries to AWS resources
  • TTL: Time To Live; controls caching duration; 60-86400 seconds typical; lower = faster changes, higher = better performance
  • Query Pricing: $0.40/million queries for standard queries; $0.60/million for latency-based routing queries; alias queries to AWS resources are free
  • Health Checks: $0.50/month per health check; checks every 10 or 30 seconds; supports HTTP, HTTPS, TCP
  • DNSSEC: Supported for domain registration and hosted zones; adds cryptographic signatures to DNS responses
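
Because alias records come up repeatedly (especially at the zone apex, where CNAME records are not allowed), here is a minimal boto3 sketch of an apex alias record pointing at a CloudFront distribution. The hosted zone ID and distribution domain are placeholders; Z2FDTNDATAQYW2 is the fixed hosted zone ID used for CloudFront alias targets.

```python
import boto3

r53 = boto3.client("route53")

r53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE_ZONE",                      # placeholder: your hosted zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "example.com",                      # zone apex - CNAME not allowed here
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "Z2FDTNDATAQYW2",       # CloudFront's alias hosted zone ID
                "DNSName": "d111111abcdef8.cloudfront.net",  # placeholder distribution domain
                "EvaluateTargetHealth": False,
            },
        },
    }]},
)
```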

Detailed Example 1: Multi-Region Web Application with Failover

You're running a web application in us-east-1 (primary) and eu-west-1 (failover). You need DNS-based failover that automatically routes traffic to eu-west-1 if us-east-1 becomes unhealthy.

Architecture:

  • Primary: ALB in us-east-1 (alb-us.example.com)
  • Failover: ALB in eu-west-1 (alb-eu.example.com)
  • Domain: www.example.com

Route 53 Configuration:

Hosted Zone: example.com

Health Checks:

  1. Health Check 1: Monitor alb-us.example.com on HTTPS port 443, path /health, every 30 seconds, 3 failures = unhealthy
  2. Health Check 2: Monitor alb-eu.example.com on HTTPS port 443, path /health, every 30 seconds, 3 failures = unhealthy

DNS Records (Failover Routing Policy):

  1. Record: www.example.com, Type: A, Routing: Failover Primary, Value: Alias to alb-us.example.com, Health Check: Health Check 1, Record ID: primary
  2. Record: www.example.com, Type: A, Routing: Failover Secondary, Value: Alias to alb-eu.example.com, Health Check: Health Check 2, Record ID: secondary

How It Works:

  • Under normal conditions, Route 53 returns the IP of alb-us.example.com for www.example.com queries
  • Route 53 performs health checks on alb-us.example.com every 30 seconds
  • If alb-us.example.com fails 3 consecutive health checks (90 seconds), Route 53 marks it unhealthy
  • Route 53 immediately starts returning the IP of alb-eu.example.com (failover secondary)
  • When alb-us.example.com recovers and passes 3 consecutive health checks, Route 53 fails back to primary

Failover Time: 90 seconds (health check detection) + TTL (DNS cache expiration). With TTL of 60 seconds, total failover time is 150 seconds (2.5 minutes). Users with cached DNS entries continue connecting to us-east-1 until their cache expires.

Optimization: Set TTL to 60 seconds for faster failover. Lower TTLs (e.g., 10 seconds) enable faster failover but increase query volume and cost. For critical applications, use Global Accelerator instead of DNS failover for sub-2-minute failover without DNS caching delays.
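
The health check and the two failover records can be sketched with boto3 roughly as follows; the hosted zone ID, ALB DNS names, and the ALBs' canonical hosted zone IDs are placeholders, and the secondary record would reference its own health check in the same way.

```python
import boto3

r53 = boto3.client("route53")

# Health Check 1: HTTPS /health on the primary ALB, 30-second interval, 3 failures = unhealthy.
hc_us = r53.create_health_check(
    CallerReference="hc-alb-us-2024",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "alb-us.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_record(role, set_id, alb_dns, alb_zone_id, health_check_id):
    """Build one failover alias record (role is 'PRIMARY' or 'SECONDARY')."""
    rrset = {
        "Name": "www.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,     # the ALB's canonical hosted zone ID (placeholder)
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

r53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE_ZONE",
    ChangeBatch={"Changes": [
        failover_record("PRIMARY", "primary", "alb-us.example.com", "Z_ALB_US", hc_us),
        failover_record("SECONDARY", "secondary", "alb-eu.example.com", "Z_ALB_EU", None),
    ]},
)
```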

Detailed Example 2: Latency-Based Routing for Global Users

Your application runs in us-east-1, eu-west-1, and ap-southeast-1. You want users automatically routed to the nearest Region for lowest latency.

Route 53 Configuration:

Hosted Zone: example.com

Health Checks:

  1. Health Check US: Monitor alb-us.example.com
  2. Health Check EU: Monitor alb-eu.example.com
  3. Health Check AP: Monitor alb-ap.example.com

DNS Records (Latency Routing Policy):

  1. Record: www.example.com, Type: A, Routing: Latency, Region: us-east-1, Value: Alias to alb-us.example.com, Health Check: Health Check US, Record ID: us
  2. Record: www.example.com, Type: A, Routing: Latency, Region: eu-west-1, Value: Alias to alb-eu.example.com, Health Check: Health Check EU, Record ID: eu
  3. Record: www.example.com, Type: A, Routing: Latency, Region: ap-southeast-1, Value: Alias to alb-ap.example.com, Health Check: Health Check AP, Record ID: ap

How It Works:

  • When a user in New York queries www.example.com, Route 53 measures latency from New York to each Region
  • Route 53 determines us-east-1 has lowest latency (20ms) and returns alb-us.example.com's IP
  • When a user in London queries www.example.com, Route 53 determines eu-west-1 has lowest latency (15ms) and returns alb-eu.example.com's IP
  • When a user in Tokyo queries www.example.com, Route 53 determines ap-southeast-1 has lowest latency (10ms) and returns alb-ap.example.com's IP

Latency Measurement: Route 53 uses historical latency data from AWS's edge network to determine which Region has lowest latency for each resolver. This is more accurate than geolocation because it accounts for actual network performance, not just geographic distance.

Failover Integration: If alb-us.example.com fails health checks, Route 53 stops returning it and routes US users to the next lowest-latency healthy Region (likely eu-west-1). This provides automatic failover without separate failover records.

Performance Results:

  • US users: 20ms latency (vs 150ms to ap-southeast-1)
  • EU users: 15ms latency (vs 80ms to us-east-1)
  • AP users: 10ms latency (vs 200ms to us-east-1)
  • Automatic failover if any Region becomes unhealthy

Detailed Example 3: Weighted Routing for Blue/Green Deployments

You're deploying a new version of your application and want to gradually shift traffic from the old version (blue) to the new version (green) to minimize risk.

Architecture:

  • Blue (old version): ALB in us-east-1 (alb-blue.example.com)
  • Green (new version): ALB in us-east-1 (alb-green.example.com)

Route 53 Configuration:

Phase 1 - Initial Deployment (100% Blue):

  1. Record: www.example.com, Type: A, Routing: Weighted, Weight: 100, Value: Alias to alb-blue.example.com, Record ID: blue
  2. Record: www.example.com, Type: A, Routing: Weighted, Weight: 0, Value: Alias to alb-green.example.com, Record ID: green

Phase 2 - Canary Testing (90% Blue, 10% Green):

  1. Update blue record: Weight: 90
  2. Update green record: Weight: 10

Phase 3 - Gradual Rollout (50% Blue, 50% Green):

  1. Update blue record: Weight: 50
  2. Update green record: Weight: 50

Phase 4 - Complete Migration (100% Green):

  1. Update blue record: Weight: 0
  2. Update green record: Weight: 100

How It Works:

  • Route 53 distributes queries based on weights: Weight / Total Weight
  • In Phase 2, 10% of queries return green's IP, 90% return blue's IP
  • Users are randomly assigned to blue or green based on weights
  • If issues are detected in green, immediately set green's weight to 0 to route all traffic back to blue

Advantages:

  • Gradual rollout reduces risk (only 10% of users affected initially)
  • Easy rollback (set green weight to 0)
  • No application changes required (DNS-based)
  • Can test green with real production traffic

Limitations:

  • DNS caching means weight changes take effect gradually (TTL delay)
  • Users might switch between blue and green if they query multiple times
  • Not suitable for stateful applications without session persistence
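
To make the phase transitions concrete, here is a minimal boto3 sketch of the Phase 2 update (90% blue / 10% green); the hosted zone ID and the ALBs' canonical hosted zone IDs are placeholders. Each later phase is just another call with different Weight values.

```python
import boto3

r53 = boto3.client("route53")

def weighted_alias(set_id, weight, alb_dns, alb_zone_id):
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.example.com",
            "Type": "A",
            "SetIdentifier": set_id,   # "blue" or "green"
            "Weight": weight,          # share of queries = weight / sum of all weights
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,
                "DNSName": alb_dns,
                "EvaluateTargetHealth": True,
            },
        },
    }

# Phase 2 - canary: 90% of DNS answers point at blue, 10% at green.
r53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE_ZONE",
    ChangeBatch={"Changes": [
        weighted_alias("blue", 90, "alb-blue.example.com", "Z_ALB_ZONE"),
        weighted_alias("green", 10, "alb-green.example.com", "Z_ALB_ZONE"),
    ]},
)
```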

💡 Tips for Understanding Route 53:

  • Alias Records: Use for AWS resources (ALB, CloudFront, S3); free queries, automatic IP updates
  • CNAME Limitations: Cannot create CNAME for zone apex (example.com); use alias records instead
  • TTL Selection: 60-300 seconds for frequently changing records; 3600+ seconds for stable records
  • Health Check Frequency: 30 seconds for most use cases; 10 seconds for faster failover (costs more)
  • Routing Policy Selection: Failover for active-passive, Latency for global performance, Weighted for gradual rollouts, Geolocation for compliance

⚠️ Common Mistakes & Misconceptions:

Mistake 1: Using CNAME records for zone apex (example.com)

  • Why it's wrong: DNS specification prohibits CNAME at zone apex; causes DNS resolution failures
  • Correct understanding: Use alias records for zone apex; they function like CNAME but are allowed at apex

Mistake 2: Setting TTL too high for failover scenarios

  • Why it's wrong: High TTL (3600 seconds) means users cache DNS for 1 hour; failover takes 1 hour + health check time
  • Correct understanding: Use 60-300 second TTL for failover scenarios; balance between failover speed and query cost

Mistake 3: Not configuring health checks for failover routing

  • Why it's wrong: Without health checks, Route 53 continues returning unhealthy endpoints; failover doesn't work
  • Correct understanding: Always configure health checks for failover, latency, and weighted routing policies

Mistake 4: Expecting instant DNS changes

  • Why it's wrong: DNS changes are subject to TTL caching; users with cached entries don't see changes until TTL expires
  • Correct understanding: DNS changes take TTL + propagation time (typically TTL + 60 seconds); plan accordingly

🔗 Connections to Other Topics:

  • Relates to CloudFront because: CloudFront distributions use Route 53 for DNS; create alias records pointing to CloudFront
  • Builds on VPC by: Private hosted zones associate with VPCs for internal DNS resolution
  • Often used with ACM to: Route 53 validates domain ownership for ACM certificate issuance via DNS validation


Route 53 Resolver and Hybrid DNS

What it is: Route 53 Resolver is a regional DNS service that provides DNS resolution for VPCs and enables hybrid DNS between AWS and on-premises networks. It consists of inbound endpoints (allow on-premises to query AWS DNS) and outbound endpoints (allow AWS to query on-premises DNS) with conditional forwarding rules.

Why it exists: In hybrid cloud environments, resources in AWS need to resolve on-premises domain names (e.g., internal.company.com), and on-premises resources need to resolve AWS private hosted zone names (e.g., internal.aws.company.com). Traditional DNS forwarding requires managing DNS servers in EC2, which adds operational overhead. Route 53 Resolver solves this by providing managed DNS endpoints and conditional forwarding rules, eliminating the need for custom DNS infrastructure.

Real-world analogy: Think of Route 53 Resolver like a bilingual translator at a border crossing. The inbound endpoint translates AWS DNS queries for on-premises systems (on-premises asks "where is db.internal.aws.company.com?" and the inbound endpoint answers). The outbound endpoint translates on-premises DNS queries for AWS systems (AWS asks "where is fileserver.internal.company.com?" and the outbound endpoint forwards to on-premises DNS).

How it works (Detailed step-by-step):

  1. Inbound Endpoint Creation: You create an inbound endpoint in your VPC, specifying 2+ subnets in different AZs for high availability. Route 53 Resolver provisions ENIs (Elastic Network Interfaces) in these subnets with private IP addresses (e.g., 10.0.1.10, 10.0.2.10). These IPs become DNS servers that on-premises systems can query.

  2. On-Premises DNS Configuration: You configure your on-premises DNS servers to forward queries for AWS domains (e.g., *.aws.company.com) to the inbound endpoint IPs (10.0.1.10, 10.0.2.10). This is typically done via conditional forwarders in Active Directory DNS or BIND.

  3. Inbound Query Flow: When an on-premises server queries db.internal.aws.company.com, the on-premises DNS server forwards the query to the Route 53 Resolver inbound endpoint (10.0.1.10) over Direct Connect or VPN. The inbound endpoint queries the Route 53 private hosted zone for internal.aws.company.com, retrieves the answer (e.g., 10.0.10.50), and returns it to the on-premises DNS server, which caches and returns it to the requesting server.

  4. Outbound Endpoint Creation: You create an outbound endpoint in your VPC, specifying 2+ subnets in different AZs. Route 53 Resolver provisions ENIs in these subnets. You then create resolver rules that define which domains to forward to on-premises DNS servers.

  5. Resolver Rules Configuration: You create forwarding rules specifying: (1) Domain name (e.g., internal.company.com), (2) Target IPs (on-premises DNS servers, e.g., 192.168.1.10, 192.168.1.11), (3) Rule type (forward or system). Forward rules send queries to target IPs; system rules use Route 53 Resolver's default resolution.

  6. Outbound Query Flow: When an EC2 instance in AWS queries fileserver.internal.company.com, the query goes to the VPC's DNS server (VPC_CIDR+2, e.g., 10.0.0.2). The VPC DNS server checks resolver rules, finds a match for internal.company.com, and forwards the query to the outbound endpoint. The outbound endpoint forwards to the on-premises DNS servers (192.168.1.10) over Direct Connect or VPN. The on-premises DNS server resolves the query and returns the answer, which flows back through the outbound endpoint to the EC2 instance.
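
A condensed boto3 sketch of these steps follows; the subnet IDs, security group, VPC ID, and IP addresses are placeholders, and the security group must allow DNS (TCP/UDP 53) in the appropriate direction.

```python
import boto3

resolver = boto3.client("route53resolver", region_name="us-east-1")

# Steps 1-2: inbound endpoint - on-premises DNS forwards *.aws.company.com to these IPs.
inbound = resolver.create_resolver_endpoint(
    CreatorRequestId="inbound-2024-01",
    Name="onprem-to-aws",
    Direction="INBOUND",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    IpAddresses=[
        {"SubnetId": "subnet-aaaa1111", "Ip": "10.0.1.10"},
        {"SubnetId": "subnet-bbbb2222", "Ip": "10.0.2.10"},
    ],
)["ResolverEndpoint"]

# Step 4: outbound endpoint - AWS forwards selected domains out to on-premises DNS.
outbound = resolver.create_resolver_endpoint(
    CreatorRequestId="outbound-2024-01",
    Name="aws-to-onprem",
    Direction="OUTBOUND",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    IpAddresses=[{"SubnetId": "subnet-aaaa1111"}, {"SubnetId": "subnet-bbbb2222"}],
)["ResolverEndpoint"]

# Step 5: forwarding rule - send internal.company.com queries to on-premises DNS servers.
rule = resolver.create_resolver_rule(
    CreatorRequestId="rule-internal-company-com",
    Name="forward-internal-company-com",
    RuleType="FORWARD",
    DomainName="internal.company.com",
    TargetIps=[{"Ip": "192.168.1.10", "Port": 53}, {"Ip": "192.168.1.11", "Port": 53}],
    ResolverEndpointId=outbound["Id"],
)["ResolverRule"]

# The rule only applies to VPCs that are explicitly associated with it.
resolver.associate_resolver_rule(
    ResolverRuleId=rule["Id"], VPCId="vpc-0123456789abcdef0", Name="app-vpc"
)
```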

📊 Route 53 Resolver Hybrid DNS Diagram:

graph TB
    subgraph "On-Premises Network (192.168.0.0/16)"
        ONPREM_SERVER[Application Server<br/>192.168.10.50<br/>Queries: db.aws.company.com]
        ONPREM_DNS[On-Premises DNS<br/>192.168.1.10<br/>Conditional Forwarder:<br/>*.aws.company.com → 10.0.1.10]
        ONPREM_FILE[File Server<br/>fileserver.company.com<br/>192.168.20.30]
    end
    
    DX[Direct Connect<br/>or VPN]
    
    subgraph "AWS VPC: 10.0.0.0/16"
        subgraph "Subnet 10.0.1.0/24 (AZ-a)"
            INBOUND_1[Inbound Endpoint<br/>ENI: 10.0.1.10]
            OUTBOUND_1[Outbound Endpoint<br/>ENI: 10.0.1.20]
        end
        
        subgraph "Subnet 10.0.2.0/24 (AZ-b)"
            INBOUND_2[Inbound Endpoint<br/>ENI: 10.0.2.10]
            OUTBOUND_2[Outbound Endpoint<br/>ENI: 10.0.2.20]
        end
        
        VPC_DNS[VPC DNS Resolver<br/>10.0.0.2<br/>Resolver Rules:<br/>company.com → Outbound]
        
        EC2[EC2 Instance<br/>10.0.10.50<br/>Queries: fileserver.company.com]
        
        RDS[RDS Database<br/>db.aws.company.com<br/>10.0.20.100]
    end
    
    R53_PRIVATE[Route 53<br/>Private Hosted Zone<br/>aws.company.com]
    
    ONPREM_SERVER -->|1. Query:<br/>db.aws.company.com| ONPREM_DNS
    ONPREM_DNS -->|2. Forward to<br/>Inbound Endpoint| DX
    DX -->|3. Route to AWS| INBOUND_1
    INBOUND_1 -->|4. Query Private<br/>Hosted Zone| R53_PRIVATE
    R53_PRIVATE -->|5. Return:<br/>10.0.20.100| INBOUND_1
    INBOUND_1 -->|6. Return Answer| DX
    DX -->|7. Return to<br/>On-Premises| ONPREM_DNS
    ONPREM_DNS -->|8. Return to<br/>Application| ONPREM_SERVER
    
    EC2 -->|1. Query:<br/>fileserver.company.com| VPC_DNS
    VPC_DNS -->|2. Match Rule:<br/>Forward to Outbound| OUTBOUND_1
    OUTBOUND_1 -->|3. Forward to<br/>On-Premises DNS| DX
    DX -->|4. Route to<br/>On-Premises| ONPREM_DNS
    ONPREM_DNS -->|5. Resolve:<br/>192.168.20.30| DX
    DX -->|6. Return Answer| OUTBOUND_1
    OUTBOUND_1 -->|7. Return to VPC DNS| VPC_DNS
    VPC_DNS -->|8. Return to EC2| EC2
    
    style ONPREM_SERVER fill:#e1f5fe
    style ONPREM_DNS fill:#fff3e0
    style ONPREM_FILE fill:#e1f5fe
    style INBOUND_1 fill:#c8e6c9
    style INBOUND_2 fill:#c8e6c9
    style OUTBOUND_1 fill:#f3e5f5
    style OUTBOUND_2 fill:#f3e5f5
    style VPC_DNS fill:#ffecb3
    style EC2 fill:#e1f5fe
    style RDS fill:#ffebee
    style R53_PRIVATE fill:#fff3e0

See: diagrams/02_domain_1_route53_resolver_hybrid.mmd

Diagram Explanation (detailed):

This diagram illustrates bidirectional DNS resolution in a hybrid cloud environment using Route 53 Resolver. The architecture enables on-premises systems to resolve AWS private DNS names and AWS systems to resolve on-premises DNS names.

Inbound Flow (On-Premises → AWS): An on-premises application server (192.168.10.50) needs to connect to an RDS database in AWS (db.aws.company.com). (1) The application queries its configured DNS server (on-premises DNS at 192.168.1.10). (2) The on-premises DNS server has a conditional forwarder configured: queries for *.aws.company.com are forwarded to the Route 53 Resolver inbound endpoint (10.0.1.10). (3) The query traverses the Direct Connect or VPN connection to AWS. (4) The inbound endpoint receives the query and forwards it to Route 53's private hosted zone for aws.company.com. (5) Route 53 looks up db.aws.company.com in the private hosted zone and returns 10.0.20.100 (the RDS endpoint's private IP). (6-8) The answer flows back through the inbound endpoint, over Direct Connect/VPN, to the on-premises DNS server, and finally to the application server. The application can now connect to the RDS database using its private IP.

Outbound Flow (AWS → On-Premises): An EC2 instance in AWS (10.0.10.50) needs to access a file server on-premises (fileserver.company.com). (1) The EC2 instance queries the VPC's DNS resolver (10.0.0.2, which is VPC_CIDR+2). (2) The VPC DNS resolver checks Route 53 Resolver rules and finds a match: queries for company.com should be forwarded via the outbound endpoint. (3) The outbound endpoint forwards the query to the configured target IPs (on-premises DNS servers at 192.168.1.10) over Direct Connect/VPN. (4) The query reaches the on-premises DNS server. (5) The on-premises DNS server resolves fileserver.company.com to 192.168.20.30 (the file server's IP). (6-8) The answer flows back through Direct Connect/VPN, to the outbound endpoint, to the VPC DNS resolver, and finally to the EC2 instance. The EC2 instance can now connect to the file server using its on-premises IP.

High Availability: Both inbound and outbound endpoints are deployed across multiple AZs (10.0.1.0/24 in AZ-a and 10.0.2.0/24 in AZ-b). If AZ-a fails, DNS resolution continues via AZ-b endpoints. On-premises DNS servers should be configured with both endpoint IPs (10.0.1.10 and 10.0.2.10) for redundancy.

Security: DNS queries between AWS and on-premises traverse the private Direct Connect or VPN connection, not the public internet. This ensures DNS queries are encrypted (if using VPN) and not exposed to internet-based attacks. Additionally, security groups on the inbound/outbound endpoint ENIs control which sources can query them.

Performance: Route 53 Resolver endpoints are regional services with low latency (typically < 5ms within the same Region). DNS queries from on-premises to AWS add the Direct Connect/VPN latency (typically 1-10ms for Direct Connect, 20-50ms for VPN) plus the resolver processing time.

⭐ Must Know (Critical Route 53 Resolver Facts):

  • Inbound Endpoints: Allow on-premises to query AWS DNS; provide IP addresses for on-premises DNS forwarders
  • Outbound Endpoints: Allow AWS to query on-premises DNS; use resolver rules to define forwarding
  • Resolver Rules: Forward (send to target IPs) or System (use Route 53 default resolution)
  • Rule Sharing: Can share resolver rules across accounts using AWS RAM (Resource Access Manager)
  • Endpoint Pricing: $0.125/hour per endpoint per AZ ($0.25/hour for 2-AZ HA setup)
  • Query Pricing: $0.40/million queries through endpoints
  • DNS Firewall: Can attach DNS Firewall rules to VPCs to block malicious domains
  • Query Logging: Can log all DNS queries for security analysis and troubleshooting

Detailed Example 1: Enterprise Hybrid DNS with Active Directory

Your company has Active Directory on-premises (company.com) and is migrating applications to AWS. You need seamless DNS resolution between on-premises and AWS.

Requirements:

  • On-premises AD servers must resolve AWS private hosted zone (aws.company.com)
  • AWS EC2 instances must resolve on-premises AD domain (company.com)
  • High availability across multiple AZs
  • Centralized DNS management

Architecture:

On-Premises:

  • Active Directory DNS servers: 192.168.1.10, 192.168.1.11
  • Domain: company.com
  • Conditional forwarder: aws.company.com → 10.0.1.10, 10.0.2.10

AWS VPC (10.0.0.0/16):

  • Inbound endpoints: 10.0.1.10 (AZ-a), 10.0.2.10 (AZ-b)
  • Outbound endpoints: 10.0.1.20 (AZ-a), 10.0.2.20 (AZ-b)
  • Private hosted zone: aws.company.com
  • Resolver rule: company.com → 192.168.1.10, 192.168.1.11

DNS Records:

  • Private hosted zone (aws.company.com):
    • db.aws.company.com → 10.0.20.100 (RDS)
    • api.aws.company.com → 10.0.10.50 (ALB)
    • *.aws.company.com → Various AWS resources

Traffic Flows:

  1. On-premises user accesses api.aws.company.com → AD DNS forwards to inbound endpoint → Route 53 private hosted zone returns ALB IP
  2. EC2 instance joins company.com domain → VPC DNS forwards to outbound endpoint → AD DNS returns domain controller IPs
  3. EC2 instance accesses fileserver.company.com → VPC DNS forwards to outbound endpoint → AD DNS returns file server IP

Benefits:

  • Seamless integration with existing AD infrastructure
  • No need to manage DNS servers in EC2
  • Automatic failover across AZs
  • Centralized DNS management in AD and Route 53

Detailed Example 2: Multi-Account DNS with Shared Resolver Rules

Your organization has 50 AWS accounts, each with VPCs that need to resolve on-premises DNS. You want to centralize DNS configuration instead of configuring resolver rules in each account.

Architecture:

Shared Services Account:

  • VPC with outbound endpoints
  • Resolver rules for on-premises domains (company.com, internal.company.com)
  • Share resolver rules with all accounts using AWS RAM

Application Accounts (50 accounts):

  • VPCs associated with shared resolver rules
  • No need to create outbound endpoints or rules in each account
  • Automatically inherit DNS forwarding configuration

Configuration Steps:

  1. In shared services account, create outbound endpoints in VPC
  2. Create resolver rules: company.com → 192.168.1.10, 192.168.1.11
  3. Share resolver rules with AWS Organization using AWS RAM
  4. In each application account, associate VPCs with shared resolver rules
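
Steps 3 and 4 can be sketched with boto3 roughly as follows (the resolver rule ARN/ID, organization ARN, and VPC ID are placeholders):

```python
import boto3

# Step 3 (shared services account): share the resolver rule with the organization via AWS RAM.
ram = boto3.client("ram", region_name="us-east-1")
ram.create_resource_share(
    name="shared-dns-forwarding-rules",
    resourceArns=[
        "arn:aws:route53resolver:us-east-1:111111111111:resolver-rule/rslvr-rr-exampleid",
    ],
    principals=["arn:aws:organizations::111111111111:organization/o-exampleorgid"],
    allowExternalPrincipals=False,
)

# Step 4 (each application account): associate the shared rule with that account's VPCs.
resolver = boto3.client("route53resolver", region_name="us-east-1")
resolver.associate_resolver_rule(
    ResolverRuleId="rslvr-rr-exampleid",
    VPCId="vpc-0123456789abcdef0",
    Name="app-vpc",
)
```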

Benefits:

  • Centralized DNS configuration (single source of truth)
  • Reduced operational overhead (no per-account configuration)
  • Consistent DNS behavior across all accounts
  • Easy updates (change rules in shared services account, applies to all)

Cost Optimization: Instead of 50 outbound endpoints ($0.25/hour × 50 = $12.50/hour = $9,000/month), you have 1 outbound endpoint ($0.25/hour = $180/month), saving $8,820/month.

Detailed Example 3: DNS Firewall for Security

You want to prevent EC2 instances from resolving known malicious domains and exfiltrating data to command-and-control servers.

Route 53 Resolver DNS Firewall Configuration:

Domain Lists:

  1. AWS Managed Threat List (malware, botnet, phishing domains)
  2. Custom block list (competitor domains, social media, file sharing)
  3. Custom allow list (approved external services)

Firewall Rules (evaluated in order):

  1. Rule 1: Allow list → ALLOW (priority 100)
  2. Rule 2: AWS Managed Threat List → BLOCK (priority 200)
  3. Rule 3: Custom block list → BLOCK (priority 300)
  4. Rule 4: Default → ALLOW (priority 1000)

Actions:

  • ALLOW: Resolve normally
  • BLOCK: Return NXDOMAIN (domain doesn't exist)
  • ALERT: Log but allow resolution

How It Works:

  • EC2 instance queries malware.example.com
  • VPC DNS resolver checks DNS Firewall rules
  • Domain matches AWS Managed Threat List (Rule 2)
  • DNS Firewall returns NXDOMAIN
  • Query is logged to CloudWatch Logs for security analysis
  • EC2 instance cannot connect to malicious domain
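
A minimal boto3 sketch of a custom block rule like Rule 3 is shown below (the rule group, domain list contents, and VPC ID are placeholders; AWS-managed threat lists are attached the same way by referencing their managed domain list IDs):

```python
import boto3

resolver = boto3.client("route53resolver", region_name="us-east-1")

# Custom domain list for the block rule.
block_list_id = resolver.create_firewall_domain_list(
    CreatorRequestId="blocklist-2024-01", Name="custom-block-list"
)["FirewallDomainList"]["Id"]

resolver.update_firewall_domains(
    FirewallDomainListId=block_list_id,
    Operation="ADD",
    Domains=["malware.example.com", "filesharing.example.net"],   # placeholder domains
)

# Rule group containing the block rule (priority 300, returns NXDOMAIN).
rule_group_id = resolver.create_firewall_rule_group(
    CreatorRequestId="rulegroup-2024-01", Name="dns-firewall-rules"
)["FirewallRuleGroup"]["Id"]

resolver.create_firewall_rule(
    CreatorRequestId="rule-block-custom",
    FirewallRuleGroupId=rule_group_id,
    FirewallDomainListId=block_list_id,
    Priority=300,
    Action="BLOCK",
    BlockResponse="NXDOMAIN",
    Name="block-custom-domains",
)

# Attach the rule group to the VPC so its resolver enforces the rules.
resolver.associate_firewall_rule_group(
    CreatorRequestId="assoc-2024-01",
    FirewallRuleGroupId=rule_group_id,
    VpcId="vpc-0123456789abcdef0",
    Priority=101,
    Name="app-vpc-dns-firewall",
)
```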

Monitoring:

  • CloudWatch Logs capture all blocked queries
  • CloudWatch Alarms alert on high block rates (potential compromise)
  • Security team reviews logs for threat intelligence

Benefits:

  • Prevents data exfiltration via DNS tunneling
  • Blocks connections to known malicious domains
  • Provides visibility into DNS-based threats
  • No changes required to EC2 instances or applications

💡 Tips for Route 53 Resolver:

  • Inbound for On-Premises → AWS: Configure conditional forwarders in on-premises DNS pointing to inbound endpoint IPs
  • Outbound for AWS → On-Premises: Create resolver rules for on-premises domains pointing to on-premises DNS IPs
  • High Availability: Deploy endpoints in 2+ AZs; configure on-premises DNS with both endpoint IPs
  • Rule Sharing: Use AWS RAM to share resolver rules across accounts for centralized management
  • DNS Firewall: Enable for security; use AWS Managed Threat List plus custom lists

⚠️ Common Mistakes & Misconceptions:

Mistake 1: Not configuring security groups on resolver endpoint ENIs

  • Why it's wrong: Without security group rules allowing UDP/TCP 53, DNS queries are blocked
  • Correct understanding: Create security group allowing inbound UDP/TCP 53 from on-premises CIDR (for inbound endpoints) or allowing outbound UDP/TCP 53 to on-premises DNS IPs (for outbound endpoints)

Mistake 2: Creating resolver rules for AWS domains

  • Why it's wrong: VPC DNS resolver already handles AWS domains (private hosted zones, AWS service endpoints); resolver rules override this
  • Correct understanding: Only create resolver rules for on-premises or external domains; let VPC DNS handle AWS domains

Mistake 3: Not deploying endpoints in multiple AZs

  • Why it's wrong: Single AZ deployment creates single point of failure; if AZ fails, DNS resolution fails
  • Correct understanding: Deploy endpoints in 2+ AZs for high availability; costs more but prevents outages

Mistake 4: Forgetting to associate VPCs with resolver rules

  • Why it's wrong: Resolver rules don't automatically apply to VPCs; must explicitly associate
  • Correct understanding: After creating resolver rules, associate them with VPCs (or share via AWS RAM and associate in target accounts)

🔗 Connections to Other Topics:

  • Relates to Direct Connect because: Resolver endpoints use Direct Connect or VPN for hybrid connectivity; DNS queries traverse these connections
  • Builds on VPC by: Resolver endpoints are deployed in VPC subnets; use VPC security groups and route tables
  • Often used with AWS RAM to: Share resolver rules across accounts for centralized DNS management

Section 3: Load Balancing Solutions

Introduction

The problem: Applications need to distribute traffic across multiple servers for high availability, scalability, and fault tolerance. Traditional load balancers are hardware appliances that are expensive, difficult to scale, and require manual configuration. In cloud environments, traffic patterns are dynamic, with servers scaling up and down automatically. Additionally, modern applications use different protocols (HTTP/HTTPS, TCP, UDP) and require advanced features like SSL termination, content-based routing, and WebSocket support.

The solution: AWS Elastic Load Balancing (ELB) provides managed load balancing services that automatically distribute traffic across multiple targets (EC2 instances, containers, IP addresses, Lambda functions) in one or more Availability Zones. ELB offers three types of load balancers: Application Load Balancer (ALB) for HTTP/HTTPS (layer 7), Network Load Balancer (NLB) for TCP/UDP/TLS (layer 4), and Gateway Load Balancer (GWLB) for third-party virtual appliances. Each type is designed for specific use cases and provides automatic scaling, health checks, and integration with AWS services.

Why it's tested: The ANS-C01 exam extensively tests load balancer selection, configuration, and integration. You must understand the differences between ALB, NLB, and GWLB, when to use each, how to configure advanced features (target groups, health checks, SSL/TLS, cross-zone load balancing), and how to integrate with other AWS services (Auto Scaling, CloudFront, Global Accelerator, WAF).

Core Concepts

Application Load Balancer (ALB)

What it is: ALB operates at layer 7 (application layer) of the OSI model, routing HTTP and HTTPS traffic based on content (URL paths, hostnames, headers, query strings). ALB supports advanced features like host-based routing, path-based routing, HTTP/2, WebSocket, gRPC, and integration with AWS WAF for security.

Why it exists: Traditional layer 4 load balancers can only route based on IP and port, not application-level content. If you have multiple microservices behind a single load balancer, a layer 4 load balancer cannot route /api/users to the user service and /api/orders to the order service. ALB solves this by inspecting HTTP requests and routing based on content, enabling a single load balancer to serve multiple applications or microservices with different routing rules.

Real-world analogy: Think of ALB like a hotel concierge. When guests arrive (HTTP requests), the concierge reads their reservation details (URL path, headers) and directs them to the appropriate floor and room (target group). Guests going to the restaurant (/restaurant) are directed to the restaurant floor, guests going to the spa (/spa) are directed to the spa floor. The concierge makes intelligent routing decisions based on the request content, not just the entrance they used.

How it works (Detailed step-by-step):

  1. ALB Creation: You create an ALB in a VPC, specifying 2+ subnets in different AZs for high availability. The ALB is assigned a DNS name (e.g., my-alb-1234567890.us-east-1.elb.amazonaws.com) and optionally you can use Route 53 alias records to map a custom domain (e.g., www.example.com) to the ALB.

  2. Listener Configuration: You configure listeners that define the protocol and port the ALB listens on (e.g., HTTP:80, HTTPS:443). For HTTPS listeners, you attach an SSL/TLS certificate from ACM (AWS Certificate Manager) or upload your own. You can configure multiple listeners on different ports.

  3. Target Group Creation: You create target groups that define the backend targets (EC2 instances, IP addresses, Lambda functions, or ALB itself for chaining). Each target group has a protocol (HTTP or HTTPS), port, health check configuration, and target type. You can have multiple target groups for different applications or microservices.

  4. Routing Rules: You configure listener rules that route requests to target groups based on conditions: (1) Host-based routing: route api.example.com to API target group, www.example.com to web target group. (2) Path-based routing: route /api/* to API target group, /images/* to image service target group. (3) Header-based routing: route requests with header X-Custom-Header: mobile to mobile target group. (4) Query string routing: route requests with ?version=2 to v2 target group.

  5. Request Processing: When a client sends an HTTP request to the ALB, the ALB terminates the TCP connection (connection termination). The ALB evaluates listener rules in priority order (lowest number first) to determine which target group to route to. The ALB selects a healthy target from the target group using the configured algorithm (round robin by default, or least outstanding requests). The ALB establishes a new connection to the target and forwards the request, adding headers like X-Forwarded-For (client IP), X-Forwarded-Proto (original protocol), and X-Forwarded-Port (original port).

  6. Health Checks: The ALB performs health checks on targets every 5-300 seconds (configurable). Health checks send HTTP/HTTPS requests to a specified path (e.g., /health) and expect a specific response code (e.g., 200). If a target fails the health check threshold (e.g., 2 consecutive failures), it's marked unhealthy and removed from rotation. When it passes the healthy threshold (e.g., 5 consecutive successes), it's marked healthy and added back.

  7. Response Handling: The target processes the request and returns an HTTP response to the ALB. The ALB forwards the response to the client. If the target is slow or unresponsive, the ALB enforces timeouts (idle timeout: 60 seconds default, can be configured 1-4000 seconds). The ALB maintains connection pooling to targets, reusing connections for multiple requests to reduce latency.
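
As a rough boto3 sketch of steps 2-4 (the ALB ARN, certificate ARN, VPC ID, and the web/image target group ARNs are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Placeholders for resources assumed to already exist.
ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/app/my-alb/abc123"
CERT_ARN = "arn:aws:acm:us-east-1:111111111111:certificate/example"
WEB_TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/web-tg/def456"
IMG_TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/image-tg/ghi789"

# Step 3: target group for the API service, health-checked on /health.
api_tg_arn = elbv2.create_target_group(
    Name="api-tg",
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",
    TargetType="instance",
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=5,
    UnhealthyThresholdCount=2,
)["TargetGroups"][0]["TargetGroupArn"]

# Step 2: HTTPS listener whose default action forwards to the web target group.
listener_arn = elbv2.create_listener(
    LoadBalancerArn=ALB_ARN,
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": CERT_ARN}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": WEB_TG_ARN}],
)["Listeners"][0]["ListenerArn"]

# Step 4: host-based rule for api.example.com (lower priority number = evaluated first).
elbv2.create_rule(
    ListenerArn=listener_arn,
    Priority=10,
    Conditions=[{"Field": "host-header", "HostHeaderConfig": {"Values": ["api.example.com"]}}],
    Actions=[{"Type": "forward", "TargetGroupArn": api_tg_arn}],
)

# Step 4: path-based rule for /images/*.
elbv2.create_rule(
    ListenerArn=listener_arn,
    Priority=20,
    Conditions=[{"Field": "path-pattern", "PathPatternConfig": {"Values": ["/images/*"]}}],
    Actions=[{"Type": "forward", "TargetGroupArn": IMG_TG_ARN}],
)
```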

📊 Application Load Balancer Architecture Diagram:

graph TB
    USERS[Users]
    
    subgraph "Application Load Balancer"
        LISTENER_80[Listener: HTTP:80<br/>Redirect to HTTPS]
        LISTENER_443[Listener: HTTPS:443<br/>Certificate: ACM]
        
        RULES[Routing Rules:<br/>1. api.example.com → API TG<br/>2. /images/* → Image TG<br/>3. /* → Web TG]
    end
    
    subgraph "Target Groups"
        TG_API[API Target Group<br/>Protocol: HTTP:8080<br/>Health: /health]
        TG_IMG[Image Target Group<br/>Protocol: HTTP:8081<br/>Health: /ping]
        TG_WEB[Web Target Group<br/>Protocol: HTTP:80<br/>Health: /]
    end
    
    subgraph "Availability Zone 1a"
        API1[API Server 1<br/>10.0.1.10:8080]
        IMG1[Image Server 1<br/>10.0.1.20:8081]
        WEB1[Web Server 1<br/>10.0.1.30:80]
    end
    
    subgraph "Availability Zone 1b"
        API2[API Server 2<br/>10.0.2.10:8080]
        IMG2[Image Server 2<br/>10.0.2.20:8081]
        WEB2[Web Server 2<br/>10.0.2.30:80]
    end
    
    USERS -->|HTTP/HTTPS| LISTENER_80
    USERS --> LISTENER_443
    
    LISTENER_80 -->|301 Redirect| USERS
    LISTENER_443 --> RULES
    
    RULES -->|api.example.com| TG_API
    RULES -->|/images/*| TG_IMG
    RULES -->|Default| TG_WEB
    
    TG_API --> API1
    TG_API --> API2
    TG_IMG --> IMG1
    TG_IMG --> IMG2
    TG_WEB --> WEB1
    TG_WEB --> WEB2
    
    style LISTENER_80 fill:#e1f5fe
    style LISTENER_443 fill:#c8e6c9
    style RULES fill:#fff3e0
    style TG_API fill:#f3e5f5
    style TG_IMG fill:#f3e5f5
    style TG_WEB fill:#f3e5f5

See: diagrams/02_domain_1_alb_architecture.mmd

⭐ Must Know (Critical ALB Facts):

  • Layer 7: Operates at application layer; routes based on HTTP content (path, host, headers)
  • Protocols: HTTP, HTTPS, HTTP/2, WebSocket, gRPC
  • Target Types: EC2 instances, IP addresses, Lambda functions, ALB (chaining)
  • SSL/TLS Termination: ALB terminates SSL/TLS; can re-encrypt to targets or use HTTP
  • Cross-Zone Load Balancing: Enabled by default; distributes traffic evenly across all AZs
  • Connection Draining: Deregistration delay (0-3600 seconds, default 300); completes in-flight requests before removing target
  • Sticky Sessions: Cookie-based session affinity; routes same client to same target
  • WAF Integration: Attach AWS WAF web ACL for layer 7 protection
  • Pricing: $0.0225/hour + $0.008/LCU (Load Balancer Capacity Unit)

Domain 1 Summary

This chapter covered Network Design (30% of exam), including:

  • āœ… Edge services (CloudFront, Global Accelerator)
  • āœ… DNS solutions (Route 53, Resolver, hybrid DNS)
  • āœ… Load balancing (ALB overview)

Remaining sections (continued below):

  • Load balancing (NLB, GWLB, comparison)
  • Logging and monitoring
  • Hybrid connectivity (Direct Connect, VPN)
  • Multi-account/multi-region architectures

Next Chapter (after the remaining sections below): Domain 2 - Network Implementation (03_domain_2_network_implementation)

Network Load Balancer (NLB)

What it is: Network Load Balancer (NLB) is a Layer 4 (transport layer) load balancer that distributes TCP, UDP, and TLS traffic across targets based on IP protocol data. Unlike ALB which operates at Layer 7 and inspects HTTP content, NLB operates at Layer 4 and makes routing decisions based solely on network-level information (IP addresses, ports, protocols) without examining application-level data.

Why it exists: Many applications require ultra-low latency (sub-millisecond), extreme throughput (millions of requests per second), static IP addresses for whitelisting, or need to preserve the client's source IP address. ALB, while powerful for HTTP/HTTPS applications, adds latency due to connection termination and HTTP parsing. NLB solves these challenges by providing a high-performance, low-latency load balancing solution that operates at the network layer, making it ideal for TCP/UDP applications, gaming servers, IoT devices, financial trading platforms, and any workload requiring predictable static IPs or extreme performance.

Real-world analogy: Think of NLB as a high-speed mail sorting facility that routes packages based only on the destination address label (IP and port) without opening the packages to inspect contents. It's extremely fast because it doesn't need to understand what's inside - it just reads the address and forwards. ALB, in contrast, is like a postal worker who opens each package, reads the contents, and routes based on what's inside - more intelligent but slower.

How it works (Detailed step-by-step):

  1. NLB Creation: You create an NLB in a VPC, specifying 1+ subnets in different AZs. For each enabled AZ, AWS creates a load balancer node with a network interface. For internet-facing NLBs, you can optionally assign one Elastic IP address per subnet, giving you static, predictable IP addresses that never change. This is critical for scenarios where clients need to whitelist IPs or DNS isn't suitable.

  2. Listener Configuration: You configure listeners that define the protocol and port the NLB listens on. Supported protocols: TCP (for general TCP traffic), TLS (for encrypted TCP with SSL/TLS termination at NLB), UDP (for UDP traffic like DNS, gaming, IoT), TCP_UDP (for protocols that use both). Each listener forwards traffic to a target group. Unlike ALB, NLB listeners don't support complex routing rules - they simply forward all traffic on that port to the target group.

  3. Target Group Creation: You create target groups with protocol (TCP, TLS, UDP, TCP_UDP), port, and target type. Target types: (1) Instance: Register EC2 instances by instance ID; NLB uses the instance's primary private IP. (2) IP: Register any IP address, including on-premises servers via Direct Connect/VPN, containers, or instances in peered VPCs. (3) ALB: Register an ALB as a target, enabling you to combine NLB's static IPs with ALB's Layer 7 routing capabilities.

  4. Flow Hash Routing: When a client connects, NLB uses a flow hash algorithm to select a target. For TCP: hash is based on protocol, source IP, source port, destination IP, destination port, and TCP sequence number. For UDP: hash is based on protocol, source IP, source port, destination IP, and destination port. This algorithm ensures that all packets in a flow (a unique combination of these parameters) are routed to the same target for the duration of the flow. Different flows from the same client can go to different targets.

  5. Connection Handling: NLB operates in pass-through mode - it doesn't terminate TCP connections like ALB does. Instead, it forwards packets directly to targets, preserving the client's source IP address (when target type is instance or IP with client IP preservation enabled). The target sees the actual client IP, not the NLB's IP. This is crucial for applications that need to know the client's IP for logging, security, or geolocation. The TCP connection is established directly between the client and the target, with NLB acting as a transparent proxy.

  6. Cross-Zone Load Balancing: By default, each NLB node distributes traffic only to targets in its own AZ. If you enable cross-zone load balancing (disabled by default, unlike ALB), each NLB node distributes traffic evenly across all healthy targets in all enabled AZs. This improves availability but incurs cross-AZ data transfer charges. For example, with 2 targets in AZ-A and 8 targets in AZ-B, without cross-zone, AZ-A's node sends 100% to its 2 targets (50% each), and AZ-B's node sends 100% to its 8 targets (12.5% each). With cross-zone enabled, all 10 targets receive 10% each regardless of AZ.

  7. Health Checks: NLB performs health checks on targets at the target group level. Health check protocols: TCP (establishes TCP connection), HTTP/HTTPS (sends HTTP request to specified path). Health check interval: 10 or 30 seconds. Healthy threshold: 2-10 consecutive successes. Unhealthy threshold: 2-10 consecutive failures. When a target fails health checks, NLB stops routing new flows to it but maintains existing connections until they close naturally or timeout.

  8. TLS Termination: For TLS listeners, NLB can terminate TLS connections, decrypt traffic, and forward unencrypted TCP to targets (similar to ALB's HTTPS termination). You attach an ACM certificate or upload your own. NLB supports SNI (Server Name Indication), allowing multiple TLS certificates on a single listener for different domains. After termination, NLB can re-encrypt traffic to targets using TLS or send plain TCP.

  9. Static IP and DNS: Each NLB node has a static private IP in its subnet. For internet-facing NLBs, you can assign Elastic IPs (one per AZ), giving you static public IPs. The NLB's DNS name resolves to these IPs. Clients can connect directly to the IPs or use the DNS name. Static IPs are essential for firewall whitelisting, compliance requirements, or applications that cache IP addresses.
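
The NLB creation, listener, and target group described in steps 1-3 (with the Elastic IPs from step 9) can be expressed with a few Elastic Load Balancing API calls. The boto3 sketch below is illustrative; all subnet IDs, allocation IDs, names, and the VPC ID are placeholders:

import boto3

elbv2 = boto3.client("elbv2")

# Internet-facing NLB with one Elastic IP per AZ (steps 1 and 9).
nlb = elbv2.create_load_balancer(
    Name="my-nlb",                                               # placeholder
    Type="network",
    Scheme="internet-facing",
    SubnetMappings=[
        {"SubnetId": "subnet-0aaa111", "AllocationId": "eipalloc-0aaa111"},  # placeholders
        {"SubnetId": "subnet-0bbb222", "AllocationId": "eipalloc-0bbb222"},
    ],
)

# TCP target group with a TCP health check (steps 3 and 7).
tg = elbv2.create_target_group(
    Name="my-nlb-targets",                                       # placeholder
    Protocol="TCP",
    Port=8443,
    VpcId="vpc-0abc1234567890def",                               # placeholder
    TargetType="instance",
    HealthCheckProtocol="TCP",
    HealthCheckIntervalSeconds=30,
)

# TCP:443 listener that simply forwards to the target group (step 2);
# unlike ALB, there are no content-based routing rules.
elbv2.create_listener(
    LoadBalancerArn=nlb["LoadBalancers"][0]["LoadBalancerArn"],
    Protocol="TCP",
    Port=443,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroups"][0]["TargetGroupArn"]}],
)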

šŸ“Š Network Load Balancer Architecture Diagram:

graph TB
    INTERNET[Internet Clients]
    
    subgraph "Network Load Balancer"
        subgraph "AZ-1a Node"
            EIP1[Elastic IP: 54.123.45.67<br/>Private IP: 10.0.1.100]
            LISTENER1[Listener: TCP:443<br/>TLS Termination]
        end
        
        subgraph "AZ-1b Node"
            EIP2[Elastic IP: 54.123.45.68<br/>Private IP: 10.0.2.100]
            LISTENER2[Listener: TCP:443<br/>TLS Termination]
        end
    end
    
    subgraph "Target Group: TCP:8443"
        TG[Health Check: TCP:8443<br/>Interval: 30s<br/>Threshold: 3/3]
    end
    
    subgraph "Availability Zone 1a"
        APP1[App Server 1<br/>10.0.1.10:8443<br/>Sees Client IP]
        APP2[App Server 2<br/>10.0.1.20:8443<br/>Sees Client IP]
    end
    
    subgraph "Availability Zone 1b"
        APP3[App Server 3<br/>10.0.2.10:8443<br/>Sees Client IP]
        APP4[App Server 4<br/>10.0.2.20:8443<br/>Sees Client IP]
    end
    
    INTERNET -->|TCP:443| EIP1
    INTERNET -->|TCP:443| EIP2
    
    EIP1 --> LISTENER1
    EIP2 --> LISTENER2
    
    LISTENER1 -->|Flow Hash<br/>Preserves Client IP| TG
    LISTENER2 -->|Flow Hash<br/>Preserves Client IP| TG
    
    TG -->|Default: Same AZ Only| APP1
    TG --> APP2
    TG --> APP3
    TG --> APP4
    
    style EIP1 fill:#c8e6c9
    style EIP2 fill:#c8e6c9
    style LISTENER1 fill:#e1f5fe
    style LISTENER2 fill:#e1f5fe
    style TG fill:#fff3e0

See: diagrams/02_domain_1_nlb_architecture.mmd

Diagram Explanation (detailed):
This diagram illustrates a Network Load Balancer deployed across two Availability Zones with static Elastic IP addresses. Each AZ has an NLB node with its own Elastic IP (54.123.45.67 in AZ-1a, 54.123.45.68 in AZ-1b) and private IP. Internet clients connect to either Elastic IP via DNS resolution. The NLB's DNS name (e.g., my-nlb-abc123.elb.us-east-1.amazonaws.com) resolves to both Elastic IPs, and clients use DNS round-robin or connection-based selection. Each NLB node has a TCP:443 listener configured for TLS termination, decrypting incoming TLS traffic using an ACM certificate. The listener forwards decrypted traffic to the target group on TCP:8443. The target group contains four application servers (two per AZ) and performs TCP health checks every 30 seconds, requiring 3 consecutive successes to mark a target healthy. The NLB uses flow hash routing to select targets, and crucially, preserves the client's source IP address so application servers see the actual client IP (not the NLB's IP). By default, cross-zone load balancing is disabled, meaning the AZ-1a node routes only to APP1 and APP2, while the AZ-1b node routes only to APP3 and APP4. This minimizes cross-AZ data transfer costs but can lead to uneven distribution if AZs have different numbers of targets.

Detailed Example 1: Gaming Server with NLB and Static IPs
You're running a multiplayer gaming platform with game servers on EC2 instances. Players connect via a custom TCP protocol on port 7777. Your requirements: (1) Ultra-low latency (< 5ms added by load balancer), (2) Static IP addresses that players can whitelist in firewalls, (3) Preserve client IPs for anti-cheat systems, (4) Handle 100,000+ concurrent connections. Solution: Deploy an NLB with Elastic IPs in two AZs. Configure a TCP:7777 listener forwarding to a target group with your game server instances (target type: instance). Enable client IP preservation. The NLB provides static Elastic IPs (e.g., 54.123.45.67, 54.123.45.68) that you publish to players. Players configure their game clients to connect to these IPs. When a player connects, the NLB uses flow hash to select a game server and forwards packets with the player's original source IP preserved. The game server sees the player's real IP for geolocation and anti-cheat. The NLB adds < 1ms latency because it operates at Layer 4 without connection termination. Health checks ensure only healthy game servers receive traffic. If a server fails, existing player connections are maintained until they disconnect, while new connections go to healthy servers. The static IPs never change, so players can whitelist them permanently.

Detailed Example 2: Hybrid Architecture with On-Premises Targets
Your company is migrating to AWS but needs to keep some application servers on-premises during the transition. You want a single load balancer endpoint that distributes traffic to both AWS and on-premises servers. Solution: Deploy an NLB in AWS with a Direct Connect or VPN connection to your on-premises data center. Create a target group with target type "IP" and register both AWS instance IPs (e.g., 10.0.1.10, 10.0.1.20) and on-premises server IPs (e.g., 192.168.1.10, 192.168.1.20). Configure a TCP:443 listener. The NLB distributes traffic across all registered IPs regardless of location. Clients connect to the NLB's DNS name or Elastic IP. The NLB routes traffic to AWS targets over the VPC network and to on-premises targets over Direct Connect/VPN. Health checks monitor both AWS and on-premises servers. If on-premises servers fail, traffic automatically shifts to AWS. As you migrate more workloads to AWS, you simply deregister on-premises IPs and register new AWS IPs without changing the client-facing endpoint. This enables seamless, gradual migration with zero downtime.

Detailed Example 3: NLB with ALB Targets for Static IP + Layer 7 Routing
You have a microservices application that needs both static IP addresses (for partner integrations that require IP whitelisting) and advanced Layer 7 routing (path-based routing to different microservices). NLB provides static IPs but doesn't support Layer 7 routing. ALB supports Layer 7 routing but has dynamic IPs. Solution: Deploy an NLB with Elastic IPs as the internet-facing entry point. Create an internal ALB behind the NLB. Register the ALB as a target in the NLB's target group (target type: ALB). Configure the ALB with path-based routing rules: /api/* → API microservice, /auth/* → Auth microservice, /data/* → Data microservice. Clients connect to the NLB's static Elastic IPs. The NLB forwards traffic to the ALB. The ALB performs Layer 7 inspection and routes requests to the appropriate microservice based on the URL path. This architecture combines NLB's static IPs with ALB's intelligent routing. Partners whitelist the NLB's Elastic IPs. The ALB handles SSL/TLS termination, path-based routing, and health checks for each microservice. You get the best of both worlds: predictable IPs for external integrations and flexible routing for internal microservices.

⭐ Must Know (Critical NLB Facts):

  • Layer 4: Operates at transport layer; routes based on IP, port, protocol (no HTTP inspection)
  • Protocols: TCP, TLS, UDP, TCP_UDP
  • Performance: Handles millions of requests per second with ultra-low latency (< 1ms added)
  • Static IPs: Each AZ gets a static private IP; optionally assign Elastic IPs for internet-facing
  • Client IP Preservation: Targets see the client's source IP (when target type is instance or IP)
  • Target Types: EC2 instances, IP addresses (including on-premises), ALB
  • Cross-Zone Load Balancing: Disabled by default (unlike ALB); incurs data transfer charges when enabled
  • Connection Mode: Pass-through (no connection termination); TCP connection is between client and target
  • Flow Hash: Routes flows based on 5-tuple (protocol, source IP/port, dest IP/port) + TCP sequence
  • TLS Termination: Supports TLS termination with ACM certificates and SNI
  • Health Checks: TCP or HTTP/HTTPS; interval 10 or 30 seconds
  • Pricing: $0.0225/hour + $0.006/NLCU (NLB Capacity Unit) - cheaper than ALB per LCU
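
The cross-zone and client IP preservation behaviors listed above are a load balancer attribute and a target group attribute, respectively. A minimal boto3 sketch with placeholder ARNs:

import boto3

elbv2 = boto3.client("elbv2")

# Cross-zone load balancing is a load balancer attribute and is off by default for NLB.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/my-nlb/abc123",  # placeholder
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "true"}],
)

# Client IP preservation is a target group attribute (always on for UDP/TCP_UDP).
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-nlb-targets/def456",  # placeholder
    Attributes=[{"Key": "preserve_client_ip.enabled", "Value": "true"}],
)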

When to use (Comprehensive):

  • āœ… Use when: You need ultra-low latency (< 5ms) for real-time applications (gaming, trading, IoT)
  • āœ… Use when: You need static IP addresses for firewall whitelisting or compliance requirements
  • āœ… Use when: You need to preserve client source IP addresses for logging, security, or geolocation
  • āœ… Use when: You're load balancing non-HTTP protocols (TCP, UDP, custom protocols)
  • āœ… Use when: You need extreme throughput (millions of requests per second)
  • āœ… Use when: You're integrating on-premises servers with AWS (hybrid architecture using IP targets)
  • āœ… Use when: You need to combine static IPs with ALB's Layer 7 routing (NLB → ALB architecture)
  • āŒ Don't use when: You need Layer 7 routing (path-based, host-based, header-based) - use ALB instead
  • āŒ Don't use when: You need WAF integration for web application protection - use ALB instead (NLB doesn't support WAF directly)
  • āŒ Don't use when: You need HTTP-specific features (redirects, fixed responses, authentication) - use ALB instead

Limitations & Constraints:

  • No Layer 7 routing: Cannot route based on HTTP paths, headers, or query strings (use ALB for this)
  • No WAF integration: Cannot attach AWS WAF directly to NLB (workaround: use NLB → ALB architecture)
  • Cross-zone load balancing disabled by default: Must explicitly enable and incurs cross-AZ data transfer charges
  • No connection draining: When deregistering targets, existing connections continue until they close or timeout (no graceful draining like ALB)
  • Limited health check options: Only TCP or HTTP/HTTPS (no custom health check logic)
  • No authentication: Cannot perform OAuth, OIDC, or Cognito authentication (use ALB for this)

šŸ’” Tips for Understanding:

  • Layer 4 vs Layer 7: Remember OSI model - Layer 4 (transport) sees IPs and ports, Layer 7 (application) sees HTTP content. NLB is "dumb but fast," ALB is "smart but slower."
  • Static IPs: Think of Elastic IPs as your "permanent phone number" - they never change, making them perfect for whitelisting.
  • Flow hash: All packets in a TCP connection go to the same target because they have the same 5-tuple. Different connections can go to different targets.
  • Client IP preservation: NLB is like a transparent glass door - the target sees through it to the real client. ALB is like a receptionist - the target sees the receptionist (ALB), not the original visitor.

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Assuming NLB can do path-based routing like ALB
    • Why it's wrong: NLB operates at Layer 4 and doesn't inspect HTTP content, so it can't see URL paths
    • Correct understanding: NLB routes based only on IP, port, and protocol. For path-based routing, use ALB or put ALB behind NLB
  • Mistake 2: Expecting cross-zone load balancing to be enabled by default
    • Why it's wrong: Unlike ALB (where it's enabled by default), NLB has cross-zone disabled by default to minimize data transfer costs
    • Correct understanding: With cross-zone disabled, each NLB node routes only to targets in its own AZ. Enable cross-zone for even distribution across all AZs, but expect cross-AZ data transfer charges
  • Mistake 3: Thinking NLB terminates TCP connections like ALB terminates HTTP connections
    • Why it's wrong: NLB operates in pass-through mode, forwarding packets without terminating connections
    • Correct understanding: The TCP connection is established directly between the client and the target, with NLB acting as a transparent proxy. This is why client IPs are preserved and latency is minimal

šŸ”— Connections to Other Topics:

  • Relates to ALB because: Both are Elastic Load Balancing services, but NLB operates at Layer 4 (transport) while ALB operates at Layer 7 (application). Use NLB for performance and static IPs, ALB for intelligent HTTP routing
  • Builds on VPC networking by: Using VPC subnets, security groups, and network interfaces. NLB nodes are deployed in VPC subnets and use ENIs with static IPs
  • Often used with Global Accelerator to: Provide static anycast IPs that route to NLB endpoints in multiple regions, combining global static IPs with regional load balancing
  • Integrates with Direct Connect/VPN to: Load balance traffic to on-premises servers using IP target type, enabling hybrid architectures
  • Works with ALB in chained architecture: NLB (internet-facing with static IPs) → ALB (internal with Layer 7 routing) → microservices, combining benefits of both

Troubleshooting Common Issues:

  • Issue 1: Targets not receiving traffic despite being healthy
    • Solution: Check security groups on targets allow traffic from NLB's private IPs (or client IPs if client IP preservation is enabled). Verify target group protocol/port matches target's listening port
  • Issue 2: Uneven traffic distribution across AZs
    • Solution: Enable cross-zone load balancing if you want even distribution across all targets regardless of AZ. Without it, each NLB node routes only to targets in its own AZ
  • Issue 3: Client IP not preserved at target
    • Solution: Verify target type is "instance" or "IP" (not "ALB"). Check that client IP preservation is enabled in target group attributes. For UDP/TCP_UDP, client IP preservation is always enabled
  • Issue 4: TLS handshake failures
    • Solution: Verify ACM certificate is valid and covers the domain name clients are using. Check that TLS listener is configured with the correct certificate. Ensure clients support the TLS versions and cipher suites configured on the NLB

Gateway Load Balancer (GWLB)

What it is: Gateway Load Balancer (GWLB) is a Layer 3 (network layer) load balancer specifically designed to deploy, scale, and manage third-party virtual network appliances such as firewalls, intrusion detection/prevention systems (IDS/IPS), deep packet inspection (DPI) systems, and other security or monitoring appliances. Unlike ALB and NLB which distribute traffic to application servers, GWLB distributes traffic to security appliances that inspect, filter, or modify traffic before it reaches (or after it leaves) your applications.

Why it exists: Organizations often need to inspect all traffic entering or leaving their VPCs using specialized security appliances (firewalls, IDS/IPS, DPI). Before GWLB, deploying these appliances at scale was complex: you had to manually configure routing, manage appliance scaling, handle failover, and ensure symmetric traffic flow. GWLB solves this by providing a transparent, scalable insertion point for virtual appliances. It acts as both a gateway (single entry/exit point for traffic) and a load balancer (distributes traffic across multiple appliance instances), enabling centralized security inspection without complex routing or appliance management.

Real-world analogy: Think of GWLB as a security checkpoint at an airport where all passengers (traffic) must pass through. The checkpoint (GWLB) has multiple security lanes (appliance instances), and passengers are distributed across lanes for screening. After screening, passengers continue to their gates (applications). The checkpoint is transparent - passengers don't need to know which lane they'll use, and the checkpoint automatically scales lanes based on passenger volume. If a lane closes (appliance fails), passengers are routed to other lanes.

How it works (Detailed step-by-step):

  1. Architecture Setup: You deploy GWLB in a dedicated "security VPC" (service provider VPC) where your virtual appliances run. Your applications run in separate "application VPCs" (service consumer VPCs). GWLB endpoints (GWLBe) are created in the application VPCs to connect to the GWLB in the security VPC. This separation allows centralized security management - one security VPC with GWLB can serve multiple application VPCs.

  2. GWLB Creation: You create a GWLB in the security VPC, specifying subnets in multiple AZs for high availability. The GWLB operates at Layer 3, listening for all IP packets across all ports and protocols. Unlike ALB/NLB which have specific listeners (HTTP:80, TCP:443), GWLB captures all traffic indiscriminately.

  3. Target Group and Appliances: You create a target group and register your virtual appliance instances (firewalls, IDS/IPS, etc.) as targets. Target type is typically "instance" (EC2 instances running appliance software). The appliances must support the GENEVE protocol (Generic Network Virtualization Encapsulation) on port 6081, which GWLB uses to encapsulate and forward traffic.

  4. GWLB Endpoint Creation: In each application VPC, you create a GWLB endpoint (GWLBe) using AWS PrivateLink. The GWLBe is a VPC endpoint that provides private connectivity between the application VPC and the GWLB in the security VPC. You create the GWLBe in a dedicated subnet (not the same subnet as your applications).

  5. Route Table Configuration: This is the key to GWLB's transparency. You modify route tables in the application VPC to redirect traffic through the GWLBe: (1) For inbound traffic (from internet): In the Internet Gateway's route table (ingress routing), add a route for your application subnet's CIDR pointing to the GWLBe. Traffic from the internet hits the IGW, gets routed to GWLBe, inspected by appliances, then forwarded to applications. (2) For outbound traffic: In the application subnet's route table, add a route for 0.0.0.0/0 pointing to the GWLBe. Traffic from applications goes to GWLBe, gets inspected, then forwarded to IGW/NAT.

  6. Traffic Flow - Inbound: (1) Client sends request to application's public IP. (2) Traffic enters VPC via Internet Gateway. (3) IGW's ingress routing rule sends traffic to GWLBe. (4) GWLBe forwards traffic to GWLB in security VPC via PrivateLink. (5) GWLB encapsulates traffic in GENEVE and sends to a virtual appliance. (6) Appliance decapsulates, inspects traffic (allows/blocks/modifies), re-encapsulates, and returns to GWLB. (7) GWLB forwards traffic back to GWLBe. (8) GWLBe forwards traffic to application. (9) Application processes request and sends response. (10) Response follows reverse path through GWLBe → GWLB → appliance → GWLB → GWLBe → IGW → client.

  7. Flow Stickiness: GWLB maintains flow stickiness, ensuring all packets in a flow (defined by 5-tuple: protocol, source IP, source port, dest IP, dest port) go to the same appliance instance. This is critical for stateful appliances (firewalls, IDS/IPS) that need to see all packets in a connection to maintain state. You can configure stickiness as 5-tuple (default), 3-tuple (protocol, source IP, dest IP), or 2-tuple (source IP, dest IP).

  8. GENEVE Encapsulation: GWLB uses GENEVE protocol (RFC 8926) to encapsulate traffic before sending to appliances. GENEVE adds a header with metadata (source/destination VPC endpoint IDs, flow cookie for stickiness) and encapsulates the original IP packet. Appliances must support GENEVE to decapsulate, inspect the original packet, and re-encapsulate before returning to GWLB. This encapsulation preserves the original packet headers, allowing appliances to see the true source/destination IPs.

  9. Health Checks: GWLB performs health checks on appliance instances. Health check protocols: TCP, HTTP, or HTTPS. If an appliance fails health checks, GWLB stops routing new flows to it. Existing flows may be maintained or redistributed based on configuration. When the appliance recovers, it's added back to rotation.

  10. Scaling: As traffic increases, you can add more appliance instances to the target group. GWLB automatically distributes new flows across all healthy appliances. You can use Auto Scaling groups to automatically scale appliances based on metrics (CPU, network throughput, custom metrics from appliances).
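
As an illustration of steps 1-5, the sketch below wires up a GWLB, a GENEVE target group, an endpoint service, a GWLB endpoint, and the two route table entries using boto3. All IDs, names, and subnet/route-table references are placeholders, and real deployments typically do this through infrastructure as code rather than ad-hoc API calls:

import boto3

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

# 1. Gateway Load Balancer in the security VPC (steps 1-2).
gwlb = elbv2.create_load_balancer(
    Name="security-gwlb",                                # placeholder
    Type="gateway",
    Subnets=["subnet-0sec111", "subnet-0sec222"],        # placeholder appliance subnets
)
gwlb_arn = gwlb["LoadBalancers"][0]["LoadBalancerArn"]

# 2. GENEVE target group for the appliance fleet (step 3); GWLB target groups
#    always use GENEVE on port 6081. GWLB listeners have no protocol or port.
tg = elbv2.create_target_group(
    Name="appliance-fleet", Protocol="GENEVE", Port=6081,
    VpcId="vpc-0security111", TargetType="instance",     # placeholders
    HealthCheckProtocol="TCP", HealthCheckPort="80",
)
elbv2.create_listener(
    LoadBalancerArn=gwlb_arn,
    DefaultActions=[{"Type": "forward",
                     "TargetGroupArn": tg["TargetGroups"][0]["TargetGroupArn"]}],
)

# 3. Expose the GWLB as an endpoint service, then create a GWLB endpoint
#    in the application VPC (step 4).
svc = ec2.create_vpc_endpoint_service_configuration(
    GatewayLoadBalancerArns=[gwlb_arn], AcceptanceRequired=False,
)
gwlbe = ec2.create_vpc_endpoint(
    VpcEndpointType="GatewayLoadBalancer",
    ServiceName=svc["ServiceConfiguration"]["ServiceName"],
    VpcId="vpc-0application111", SubnetIds=["subnet-0gwlbe111"],   # placeholders
)
gwlbe_id = gwlbe["VpcEndpoint"]["VpcEndpointId"]

# 4. Route table changes (step 5): the application subnet sends 0.0.0.0/0 to the
#    GWLBe; the IGW edge route table sends the app subnet CIDR to the GWLBe.
ec2.create_route(RouteTableId="rtb-0app111",            # placeholder app subnet route table
                 DestinationCidrBlock="0.0.0.0/0", VpcEndpointId=gwlbe_id)
ec2.create_route(RouteTableId="rtb-0igw111",            # placeholder IGW ingress route table
                 DestinationCidrBlock="10.0.1.0/24", VpcEndpointId=gwlbe_id)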

šŸ“Š Gateway Load Balancer Architecture Diagram:

graph TB
    INTERNET[Internet]
    
    subgraph "Application VPC (Consumer)"
        IGW[Internet Gateway<br/>Ingress Routing]
        
        subgraph "GWLB Endpoint Subnet"
            GWLBE[GWLB Endpoint<br/>PrivateLink]
        end
        
        subgraph "Application Subnet"
            APP1[Application Server<br/>10.0.1.10]
            APP2[Application Server<br/>10.0.1.20]
        end
        
        RT_IGW[IGW Route Table:<br/>10.0.1.0/24 → GWLBe]
        RT_APP[App Route Table:<br/>0.0.0.0/0 → GWLBe]
    end
    
    subgraph "Security VPC (Provider)"
        subgraph "Gateway Load Balancer"
            GWLB[GWLB<br/>Layer 3<br/>All Ports/Protocols]
        end
        
        subgraph "Target Group"
            TG[GENEVE Port 6081<br/>Flow Stickiness: 5-tuple]
        end
        
        subgraph "Appliance Subnet AZ-A"
            FW1[Firewall 1<br/>GENEVE Support]
            IDS1[IDS/IPS 1<br/>GENEVE Support]
        end
        
        subgraph "Appliance Subnet AZ-B"
            FW2[Firewall 2<br/>GENEVE Support]
            IDS2[IDS/IPS 2<br/>GENEVE Support]
        end
    end
    
    INTERNET -->|1. Request| IGW
    IGW -->|2. Ingress Route| GWLBE
    GWLBE <-->|3. PrivateLink| GWLB
    GWLB -->|4. GENEVE| TG
    TG --> FW1
    TG --> IDS1
    TG --> FW2
    TG --> IDS2
    FW1 -->|5. Inspected| GWLB
    GWLB -->|6. Return| GWLBE
    GWLBE -->|7. Forward| APP1
    GWLBE --> APP2
    APP1 -->|8. Response| GWLBE
    
    style GWLBE fill:#e1f5fe
    style GWLB fill:#c8e6c9
    style TG fill:#fff3e0
    style FW1 fill:#ffebee
    style FW2 fill:#ffebee
    style IDS1 fill:#f3e5f5
    style IDS2 fill:#f3e5f5

See: diagrams/02_domain_1_gwlb_architecture.mmd

Diagram Explanation (detailed):
This diagram shows a Gateway Load Balancer architecture with traffic inspection flow. The Application VPC (consumer) contains the applications that need protection, while the Security VPC (provider) contains the GWLB and security appliances. Traffic flow: (1) Internet client sends request to application's public IP. (2) Traffic enters via Internet Gateway, which has an ingress routing rule directing traffic destined for the application subnet (10.0.1.0/24) to the GWLB Endpoint. (3) The GWLBe forwards traffic over PrivateLink to the GWLB in the Security VPC. (4) GWLB encapsulates traffic in GENEVE protocol and distributes to security appliances (firewalls, IDS/IPS) based on flow hash with 5-tuple stickiness. (5) Appliances decapsulate, inspect traffic (checking for threats, applying firewall rules), and return verdict to GWLB. (6) GWLB forwards approved traffic back to GWLBe. (7) GWLBe delivers traffic to application servers. (8) Application response follows the reverse path through GWLBe → GWLB → appliances → GWLB → GWLBe → IGW → Internet. The application subnet's route table has 0.0.0.0/0 pointing to GWLBe, ensuring outbound traffic is also inspected. This architecture provides transparent, centralized security inspection without requiring applications to be aware of the inspection layer.

Detailed Example 1: Centralized Firewall for Multiple VPCs
Your organization has 10 application VPCs across different business units, and you need to enforce consistent firewall policies for all inbound and outbound traffic. Deploying firewalls in each VPC is expensive and difficult to manage. Solution: Deploy a GWLB in a central security VPC with a fleet of third-party firewall appliances (e.g., Palo Alto, Fortinet, Check Point). Create GWLB endpoints in each of the 10 application VPCs. Configure ingress routing at each VPC's Internet Gateway to send traffic to the local GWLBe. Configure application subnet route tables to send outbound traffic (0.0.0.0/0) to GWLBe. All traffic from all VPCs flows through the central GWLB to the firewall appliances for inspection. Security team manages firewall policies centrally in the security VPC. As traffic grows, add more firewall instances to the GWLB target group - GWLB automatically distributes load. This architecture provides centralized security governance, reduces costs (shared firewall infrastructure), and simplifies management (single pane of glass for all firewall policies).

Detailed Example 2: IDS/IPS for Threat Detection
You need to monitor all traffic in your VPC for security threats (malware, exploits, data exfiltration) using an intrusion detection/prevention system. Solution: Deploy GWLB with IDS/IPS appliances (e.g., Suricata, Snort) in a security VPC. Create GWLBe in your application VPC. Configure routing to send all traffic through GWLBe. The IDS/IPS appliances receive copies of all traffic via GENEVE encapsulation, perform deep packet inspection to detect threats, and can either alert (IDS mode) or block (IPS mode) malicious traffic. GWLB's flow stickiness ensures all packets in a connection go to the same IDS/IPS instance, allowing stateful analysis. The IDS/IPS logs threats to a SIEM system for security monitoring. If a threat is detected, the IPS can drop the connection by not returning the packet to GWLB. This provides comprehensive threat detection without requiring agents on application servers or complex network configuration.

Detailed Example 3: Multi-Region Security with GWLB
You have applications in multiple AWS regions and need consistent security inspection across all regions. Solution: Deploy GWLB with security appliances in each region's security VPC. Use AWS Transit Gateway to connect application VPCs to security VPCs within each region. For cross-region traffic, use Transit Gateway inter-region peering. Configure routing so that: (1) Intra-region traffic flows through the local GWLB for inspection. (2) Inter-region traffic flows through the source region's GWLB for egress inspection, then through the destination region's GWLB for ingress inspection. This ensures all traffic, regardless of source or destination, is inspected by security appliances. You can use AWS Firewall Manager to centrally manage security policies across all regions. This architecture provides defense-in-depth with regional security inspection and centralized policy management.

⭐ Must Know (Critical GWLB Facts):

  • Layer 3: Operates at network layer; inspects all IP packets across all ports and protocols
  • Purpose: Designed specifically for deploying and scaling third-party virtual security appliances
  • GENEVE Protocol: Uses GENEVE (port 6081) to encapsulate traffic between GWLB and appliances
  • GWLB Endpoints: VPC endpoints (PrivateLink) that connect application VPCs to GWLB in security VPC
  • Flow Stickiness: Maintains flow stickiness (5-tuple, 3-tuple, or 2-tuple) to route all packets in a flow to the same appliance
  • Transparent Gateway: Acts as single entry/exit point for traffic; applications are unaware of inspection
  • Ingress Routing: Uses IGW ingress routing to intercept inbound traffic before it reaches applications
  • Symmetric Routing: Ensures traffic flows through the same appliance in both directions (inbound and outbound)
  • Target Types: Typically EC2 instances running virtual appliance software; must support GENEVE
  • Use Cases: Firewalls, IDS/IPS, DPI, data loss prevention (DLP), network monitoring
  • Pricing: $0.0125/hour + $0.004/GLCU (GWLB Capacity Unit)

When to use (Comprehensive):

  • āœ… Use when: You need to inspect all traffic with third-party security appliances (firewalls, IDS/IPS, DPI)
  • āœ… Use when: You need centralized security inspection for multiple VPCs or accounts
  • āœ… Use when: You need to scale security appliances automatically based on traffic volume
  • āœ… Use when: You need transparent traffic inspection without modifying applications
  • āœ… Use when: You need to enforce consistent security policies across multiple environments
  • āœ… Use when: You're using third-party virtual appliances that support GENEVE protocol
  • āœ… Use when: You need to inspect both inbound (from internet) and outbound (to internet) traffic
  • āŒ Don't use when: You only need AWS-native security (use AWS Network Firewall, Security Groups, NACLs instead)
  • āŒ Don't use when: You need Layer 7 application load balancing (use ALB instead)
  • āŒ Don't use when: Your appliances don't support GENEVE protocol (GWLB requires GENEVE)
  • āŒ Don't use when: You need to inspect traffic within a VPC only (use VPC Traffic Mirroring instead)

Limitations & Constraints:

  • GENEVE requirement: Appliances must support GENEVE protocol; not all appliances do (check vendor compatibility)
  • No Layer 7 inspection by GWLB: GWLB itself doesn't inspect application-layer data; it only distributes traffic to appliances that do the inspection
  • Cross-AZ data transfer costs: Traffic between GWLBe and GWLB across AZs incurs data transfer charges
  • Routing complexity: Requires careful route table configuration for ingress and egress routing
  • Appliance licensing: Third-party appliances may require separate licensing (BYOL or marketplace)
  • Performance overhead: GENEVE encapsulation and appliance inspection add latency (typically 5-20ms depending on appliance)

šŸ’” Tips for Understanding:

  • GWLB vs NLB: NLB distributes traffic to application servers; GWLB distributes traffic to security appliances that inspect traffic before it reaches applications
  • Transparent insertion: Think of GWLB as a "bump in the wire" - traffic flows through it transparently without applications knowing
  • GENEVE: It's like putting a letter (original packet) inside an envelope (GENEVE) with instructions for the security guard (appliance) to inspect
  • Flow stickiness: Critical for stateful appliances - imagine a security guard who needs to see all your bags to understand what you're carrying; if bags go to different guards, they can't build complete picture

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Thinking GWLB performs security inspection itself
    • Why it's wrong: GWLB is just a load balancer that distributes traffic to appliances; the appliances do the actual inspection
    • Correct understanding: GWLB is the distribution mechanism; you must deploy and configure third-party security appliances to perform inspection
  • Mistake 2: Forgetting to configure ingress routing at the Internet Gateway
    • Why it's wrong: Without ingress routing, inbound traffic bypasses the GWLBe and goes directly to applications, skipping inspection
    • Correct understanding: Ingress routing at IGW is essential to intercept inbound traffic before it reaches applications; configure IGW route table with application subnet CIDR → GWLBe
  • Mistake 3: Using appliances that don't support GENEVE
    • Why it's wrong: GWLB requires GENEVE for encapsulation; appliances without GENEVE support can't communicate with GWLB
    • Correct understanding: Verify appliance vendor supports GENEVE before deploying with GWLB; check AWS Elastic Load Balancing Partners list for qualified appliances

šŸ”— Connections to Other Topics:

  • Relates to AWS Network Firewall because: Both provide network-level security, but Network Firewall is AWS-managed while GWLB enables third-party appliances. Use Network Firewall for AWS-native solution, GWLB for third-party appliances
  • Builds on VPC PrivateLink by: Using GWLB endpoints (a type of VPC endpoint) to connect application VPCs to security VPC privately without internet exposure
  • Often used with Transit Gateway to: Create hub-and-spoke architectures where Transit Gateway routes traffic between VPCs and GWLB provides centralized security inspection
  • Integrates with AWS Firewall Manager to: Centrally manage security policies across multiple accounts and regions when using GWLB for security inspection
  • Complements VPC Flow Logs by: GWLB provides active inspection and blocking, while Flow Logs provide passive monitoring and forensics

Troubleshooting Common Issues:

  • Issue 1: Traffic not reaching GWLB (bypassing inspection)
    • Solution: Verify route tables - IGW route table must have ingress routing rule for application subnet pointing to GWLBe; application subnet route table must have 0.0.0.0/0 pointing to GWLBe for outbound
  • Issue 2: Appliances not receiving traffic
    • Solution: Check appliance instances are registered in GWLB target group and passing health checks; verify appliances are listening on GENEVE port 6081; check security groups allow GENEVE traffic
  • Issue 3: Asymmetric routing causing connection failures
    • Solution: Ensure both inbound and outbound traffic flows through the same GWLBe and GWLB; verify route tables are configured symmetrically; check flow stickiness is enabled
  • Issue 4: High latency after deploying GWLB
    • Solution: Latency is expected due to GENEVE encapsulation and appliance inspection; optimize appliance performance; consider deploying appliances in same AZ as applications to reduce cross-AZ latency; evaluate if appliance inspection rules can be optimized

Load Balancer Comparison and Selection

Understanding when to use each load balancer type is critical for the ANS-C01 exam. Here's a comprehensive comparison:

| Feature | Application Load Balancer (ALB) | Network Load Balancer (NLB) | Gateway Load Balancer (GWLB) |
|---|---|---|---|
| OSI Layer | Layer 7 (Application) | Layer 4 (Transport) | Layer 3 (Network) |
| Protocols | HTTP, HTTPS, HTTP/2, WebSocket, gRPC | TCP, TLS, UDP, TCP_UDP | All IP protocols (GENEVE encapsulation) |
| Primary Use Case | Web applications, microservices, APIs | High-performance TCP/UDP apps, static IPs | Security appliance insertion (firewalls, IDS/IPS) |
| Routing | Content-based (path, host, header, query) | Connection-based (5-tuple flow hash) | Flow-based (5/3/2-tuple stickiness) |
| Target Types | Instance, IP, Lambda | Instance, IP, ALB | Instance (security appliances) |
| Connection Handling | Terminates connections (proxy) | Pass-through (preserves client IP) | Transparent gateway (GENEVE encapsulation) |
| Client IP Preservation | Via X-Forwarded-For header | Native (target sees client IP) | Native (appliance sees original packet) |
| Static IP | No (dynamic DNS) | Yes (Elastic IP per AZ) | No (uses GWLB endpoints) |
| Performance | Moderate latency (~10-50ms) | Ultra-low latency (<1ms) | Moderate latency (5-20ms, depends on appliance) |
| Throughput | Thousands of requests/sec | Millions of requests/sec | Depends on appliance capacity |
| SSL/TLS Termination | Yes (ACM integration, SNI) | Yes (ACM integration, SNI) | No (appliances handle encryption) |
| Cross-Zone LB | Enabled by default (free) | Disabled by default (incurs charges) | Disabled by default |
| Health Checks | HTTP/HTTPS (advanced) | TCP, HTTP/HTTPS | TCP, HTTP/HTTPS |
| Sticky Sessions | Cookie-based (application or duration) | Flow-based (5-tuple) | Flow-based (5/3/2-tuple) |
| WAF Integration | Yes (attach WAF web ACL) | No (use ALB behind NLB) | No (appliances provide security) |
| Authentication | Yes (Cognito, OIDC, SAML) | No | No |
| Redirects | Yes (HTTP to HTTPS, custom) | No | No |
| Fixed Response | Yes (custom error pages) | No | No |
| WebSocket | Yes (native support) | Yes (TCP-based) | N/A |
| HTTP/2 | Yes (native support) | No | N/A |
| gRPC | Yes (native support) | Yes (TCP-based) | N/A |
| Pricing (hourly) | $0.0225/hour | $0.0225/hour | $0.0125/hour |
| Pricing (capacity) | $0.008/LCU | $0.006/NLCU | $0.004/GLCU |
| šŸŽÆ Exam Tip | Look for: path routing, microservices, HTTP features | Look for: static IP, ultra-low latency, non-HTTP, client IP | Look for: security appliances, centralized inspection, GENEVE |

Decision Framework - Which Load Balancer to Choose:

šŸ“Š Load Balancer Selection Decision Tree:

graph TD
    START[Start: Need Load Balancer] --> Q1{Need security<br/>appliance inspection?}
    
    Q1 -->|Yes| GWLB[Gateway Load Balancer<br/>āœ… Firewalls, IDS/IPS, DPI]
    Q1 -->|No| Q2{HTTP/HTTPS<br/>application?}
    
    Q2 -->|Yes| Q3{Need advanced<br/>routing?}
    Q2 -->|No| Q4{Need static<br/>IP addresses?}
    
    Q3 -->|Yes: path/host/header| ALB1[Application Load Balancer<br/>āœ… Microservices, APIs]
    Q3 -->|No: simple HTTP| Q5{Need ultra-low<br/>latency?}
    
    Q4 -->|Yes| NLB1[Network Load Balancer<br/>āœ… Static IPs, TCP/UDP]
    Q4 -->|No| Q6{Need extreme<br/>throughput?}
    
    Q5 -->|Yes: <5ms| NLB2[Network Load Balancer<br/>āœ… Gaming, trading, IoT]
    Q5 -->|No| ALB2[Application Load Balancer<br/>āœ… Standard web apps]
    
    Q6 -->|Yes: millions req/sec| NLB3[Network Load Balancer<br/>āœ… High-volume TCP/UDP]
    Q6 -->|No| ALB3[Application Load Balancer<br/>āœ… Standard applications]
    
    style GWLB fill:#ffebee
    style ALB1 fill:#e1f5fe
    style ALB2 fill:#e1f5fe
    style ALB3 fill:#e1f5fe
    style NLB1 fill:#c8e6c9
    style NLB2 fill:#c8e6c9
    style NLB3 fill:#c8e6c9

See: diagrams/02_domain_1_lb_decision_tree.mmd

Common Exam Scenarios:

Scenario 1: Microservices with Path-Based Routing

  • Requirement: Route /api/* to API service, /auth/* to Auth service, /data/* to Data service
  • Answer: Application Load Balancer (ALB)
  • Why: ALB supports path-based routing rules; NLB and GWLB do not

Scenario 2: Gaming Server with Ultra-Low Latency

  • Requirement: Custom TCP protocol, <5ms latency, 100K+ concurrent connections
  • Answer: Network Load Balancer (NLB)
  • Why: NLB provides ultra-low latency (<1ms), handles millions of connections, supports custom TCP protocols

Scenario 3: Partner Integration Requiring IP Whitelisting

  • Requirement: Partners need to whitelist your IPs in their firewalls
  • Answer: Network Load Balancer (NLB) with Elastic IPs
  • Why: NLB provides static Elastic IPs that never change; ALB has dynamic IPs

Scenario 4: Centralized Firewall for Multiple VPCs

  • Requirement: Inspect all traffic from 10 VPCs with third-party firewalls
  • Answer: Gateway Load Balancer (GWLB)
  • Why: GWLB is designed for security appliance insertion and centralized inspection

Scenario 5: Web Application with WAF Protection

  • Requirement: HTTP application needing protection from SQL injection, XSS attacks
  • Answer: Application Load Balancer (ALB) with AWS WAF
  • Why: ALB integrates with AWS WAF; NLB and GWLB do not support WAF directly

Scenario 6: Hybrid Architecture with On-Premises Servers

  • Requirement: Load balance between AWS and on-premises servers during migration
  • Answer: Network Load Balancer (NLB) with IP targets
  • Why: NLB supports IP target type for registering on-premises IPs; provides flexibility for hybrid architectures

Scenario 7: Static IPs + Layer 7 Routing

  • Requirement: Need both static IPs (for whitelisting) and path-based routing (for microservices)
  • Answer: NLB (internet-facing) → ALB (internal) architecture
  • Why: Combines NLB's static IPs with ALB's Layer 7 routing capabilities

šŸŽÆ Exam Focus - Load Balancer Selection:

  • Keywords to watch:

    • "Path-based routing", "host-based routing", "HTTP headers" → ALB
    • "Static IP", "Elastic IP", "whitelist IP" → NLB
    • "Ultra-low latency", "millions of requests", "preserve client IP" → NLB
    • "Firewall", "IDS/IPS", "security appliance", "GENEVE" → GWLB
    • "WAF", "authentication", "Cognito" → ALB
    • "TCP/UDP", "non-HTTP protocol", "custom protocol" → NLB
    • "Centralized inspection", "multiple VPCs", "transparent gateway" → GWLB
  • Common traps:

    • Choosing ALB when static IPs are required (ALB doesn't support static IPs)
    • Choosing NLB when path-based routing is needed (NLB doesn't support Layer 7 routing)
    • Choosing ALB or NLB when security appliance inspection is needed (use GWLB)
    • Forgetting that NLB's cross-zone load balancing is disabled by default (unlike ALB)

Section 2: Logging and Monitoring Network Infrastructure

Introduction

The problem: Without visibility into network traffic and performance, you cannot troubleshoot connectivity issues, detect security threats, optimize costs, or ensure compliance. Network problems are invisible until they cause outages or security breaches.

The solution: AWS provides comprehensive logging and monitoring services that capture network traffic, track performance metrics, analyze connectivity, and alert on anomalies. These tools enable proactive monitoring, rapid troubleshooting, security analysis, and compliance auditing.

Why it's tested: Task Statement 1.4 requires you to "Define logging and monitoring requirements across AWS and hybrid networks." The exam tests your ability to select appropriate monitoring tools, configure logging destinations, analyze network data, and design monitoring strategies.

Core Concepts

VPC Flow Logs

What it is: VPC Flow Logs is a feature that captures metadata about IP traffic flowing to and from network interfaces in your VPC, including accepted and rejected traffic. Flow logs record information such as source/destination IPs, ports, protocols, packet/byte counts, and action (ACCEPT/REJECT). They do NOT capture packet payloads - only metadata about the traffic.

Why it exists: Network administrators need visibility into traffic patterns for troubleshooting (why can't my instance connect?), security analysis (is there unusual traffic?), compliance auditing (who accessed what?), and cost optimization (which resources generate most traffic?). Flow logs provide this visibility without impacting network performance.

How it works (Detailed step-by-step):

  1. Flow Log Creation: You create a flow log at the VPC, subnet, or network interface level. Scope determines what traffic is captured: VPC-level captures all ENIs in the VPC, subnet-level captures all ENIs in the subnet, ENI-level captures only that specific interface.

  2. Traffic Capture: AWS captures metadata for every network flow (a unique combination of source IP, destination IP, source port, destination port, and protocol) passing through the monitored interfaces. Capture happens outside the data path, so it doesn't affect performance or latency.

  3. Aggregation: Flow data is aggregated into flow log records over a capture window (default: 10 minutes, can be 1 minute for faster visibility). Each record represents traffic for one flow during that window.

  4. Publishing: Flow log records are published to one of three destinations: (1) CloudWatch Logs: For real-time analysis, alerting, and short-term retention. (2) S3: For long-term storage, compliance, and cost-effective archival. (3) Data Firehose: For streaming to third-party analytics tools or data lakes.

  5. Record Format: Each flow log record contains fields like: version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action (ACCEPT/REJECT), log-status. You can customize fields to include additional data like VPC ID, subnet ID, instance ID, TCP flags, flow direction.
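
A minimal boto3 sketch of steps 1-5 for an S3-delivered, VPC-level flow log with a 1-minute capture window and a custom record format. The VPC ID and bucket ARN are placeholders:

import boto3

ec2 = boto3.client("ec2")

# VPC-level flow log delivered to S3 with a 1-minute aggregation interval.
ec2.create_flow_logs(
    ResourceIds=["vpc-0abc1234567890def"],                # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",                                    # ACCEPT, REJECT, or ALL
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket/vpc-flow-logs/",   # placeholder bucket
    MaxAggregationInterval=60,                            # 60 or 600 seconds
    # Optional custom format; the default fields are used if this is omitted.
    LogFormat="${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} "
              "${srcport} ${dstport} ${protocol} ${packets} ${bytes} "
              "${start} ${end} ${action} ${log-status}",
)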

⭐ Must Know (Critical Flow Logs Facts):

  • Metadata only: Captures traffic metadata, NOT packet payloads or content
  • Scope levels: VPC, subnet, or ENI level
  • Destinations: CloudWatch Logs, S3, or Data Firehose
  • Capture window: Default 10 minutes, can be 1 minute
  • No performance impact: Collected outside data path
  • Accepted and rejected: Captures both allowed and denied traffic
  • Use cases: Troubleshooting connectivity, security analysis, compliance, cost tracking
  • Limitations: Doesn't capture traffic to/from 169.254.169.254 (metadata service), DHCP, Amazon DNS, Windows license activation

Detailed Example 1: Troubleshooting Security Group Rules
Your application can't connect to a database, and you suspect security group misconfiguration. Enable VPC Flow Logs on the database's ENI, publish to CloudWatch Logs. After a few minutes, query logs for traffic from the application's IP to the database's IP on port 3306. You find records with action=REJECT, indicating traffic is being blocked. Check the database's security group - it only allows port 3306 from a different CIDR range. Update the security group to allow the application's IP, and subsequent flow logs show action=ACCEPT. Flow logs helped identify the exact blocking point without needing packet captures.

Detailed Example 2: Detecting Data Exfiltration
You want to detect if compromised instances are sending large amounts of data to external IPs. Enable VPC Flow Logs at VPC level, publish to S3. Use Amazon Athena to query flow logs daily, looking for instances with unusually high outbound bytes to non-AWS IPs. Create a CloudWatch alarm that triggers when any instance sends >10GB to a single external IP in an hour. One day, the alarm fires - an instance is sending 50GB to an unknown IP in a foreign country. Investigation reveals the instance was compromised and exfiltrating data. You terminate the instance and patch the vulnerability. Flow logs enabled early detection of the breach.

Detailed Example 3: Cost Optimization with Flow Logs
Your AWS bill shows high data transfer charges, but you don't know which resources are responsible. Enable VPC Flow Logs at VPC level, publish to S3. Use Athena to aggregate bytes by source instance ID and destination (internet, other VPCs, other regions). You discover one instance is sending 5TB/month to the internet - it's a misconfigured backup job uploading to a public S3 bucket instead of using a VPC endpoint. Reconfigure to use S3 VPC endpoint (free data transfer within region), saving $450/month. Flow logs identified the cost culprit.

šŸ”— Connections to Other Topics:

  • Relates to CloudWatch Logs because: Flow logs can be published to CloudWatch for real-time analysis and alerting
  • Integrates with Athena to: Query flow logs stored in S3 using SQL for analysis and reporting
  • Complements VPC Traffic Mirroring by: Flow logs provide metadata for all traffic; Traffic Mirroring provides full packet captures for deep inspection
  • Works with Security Groups and NACLs to: Flow logs show which rules are blocking traffic, helping troubleshoot security configurations

CloudWatch Metrics and Alarms

What it is: Amazon CloudWatch is a monitoring service that collects metrics (numerical data points over time) from AWS resources, creates alarms that trigger actions when metrics breach thresholds, and provides dashboards for visualization. For networking, CloudWatch tracks metrics like load balancer request counts, VPN tunnel status, Direct Connect connection state, and Transit Gateway bytes transferred.

Why it exists: Reactive monitoring (waiting for users to report problems) leads to prolonged outages and poor user experience. Proactive monitoring with CloudWatch enables you to detect issues before they impact users, automatically respond to problems, track performance trends, and ensure SLAs are met.

How it works (Detailed step-by-step):

  1. Metric Collection: AWS services automatically publish metrics to CloudWatch. For example, ALB publishes RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count every minute. You can also publish custom metrics using the CloudWatch API or agent.

  2. Metric Storage: CloudWatch stores metrics with different resolutions: standard (1-minute or 5-minute intervals) or high-resolution (1-second intervals). Metrics are retained for 15 months, with automatic aggregation over time (1-minute data for 15 days, 5-minute data for 63 days, 1-hour data for 455 days).

  3. Alarm Creation: You create alarms that monitor metrics and trigger actions when thresholds are breached. For example: "Alarm if ALB TargetResponseTime > 1 second for 2 consecutive periods (2 minutes)." Alarms have three states: OK (metric within threshold), ALARM (metric breached threshold), INSUFFICIENT_DATA (not enough data to evaluate).

  4. Alarm Actions: When an alarm enters ALARM state, it can trigger actions: (1) SNS notification: Send email, SMS, or trigger Lambda. (2) Auto Scaling action: Scale EC2 instances up/down. (3) Systems Manager action: Run automation documents. (4) EC2 action: Stop, terminate, reboot, or recover instances.

  5. Dashboards: Create CloudWatch dashboards to visualize multiple metrics in one view. Add line graphs, number widgets, alarms, and logs insights queries. Share dashboards across teams or make them public.
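
The alarm described in step 3 ("TargetResponseTime > 1 second for 2 consecutive periods") can be created with a single API call. A minimal boto3 sketch; the alarm name, load balancer dimension value, and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="alb-slow-responses",                      # placeholder
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    # The dimension value is the load balancer portion of the ALB ARN, e.g. "app/my-alb/abc123".
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}],   # placeholder
    Statistic="Average",
    Period=60,                  # one-minute periods
    EvaluationPeriods=2,        # 2 consecutive breaching periods -> ALARM
    Threshold=1.0,              # seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:network-alerts"],    # placeholder SNS topic
)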

⭐ Must Know (Critical CloudWatch Facts):

  • Automatic metrics: AWS services publish metrics automatically (no configuration needed)
  • Custom metrics: You can publish your own metrics using API or CloudWatch agent
  • Alarm states: OK, ALARM, INSUFFICIENT_DATA
  • Alarm actions: SNS, Auto Scaling, Systems Manager, EC2 actions
  • Metric retention: 15 months with automatic aggregation
  • Resolution: Standard (1-minute) or high-resolution (1-second)
  • Dashboards: Visualize multiple metrics, shareable
  • Key networking metrics: ALB RequestCount/TargetResponseTime, NLB ActiveFlowCount/ProcessedBytes, VPN TunnelState/TunnelDataIn, DX ConnectionState/ConnectionBpsEgress

Detailed Example 1: Auto-Scaling Based on Network Metrics
Your application experiences traffic spikes during business hours. Create a CloudWatch alarm monitoring ALB RequestCount metric. Set threshold: "If RequestCount > 10,000 for 2 consecutive minutes, trigger scale-out." Configure Auto Scaling policy to add 2 instances when alarm triggers. During a traffic spike, RequestCount exceeds 10,000, alarm enters ALARM state, Auto Scaling adds instances, and load is distributed. When traffic decreases, a scale-in alarm (RequestCount < 2,000 for 10 minutes) removes excess instances. This automatic scaling maintains performance while optimizing costs.

Detailed Example 2: VPN Tunnel Monitoring
You have a Site-to-Site VPN with two tunnels for redundancy. Create CloudWatch alarms for each tunnel's TunnelState metric. Set alarm: "If TunnelState = 0 (DOWN) for 1 minute, send SNS notification to network team." One day, Tunnel 1 goes down due to ISP issue. Alarm triggers immediately, SNS sends email/SMS to on-call engineer. Engineer investigates and finds ISP outage. Tunnel 2 is still up, so traffic continues flowing. Engineer coordinates with ISP for repair. Without the alarm, the team wouldn't know Tunnel 1 was down until Tunnel 2 also failed, causing a complete outage.

Detailed Example 3: Direct Connect Monitoring Dashboard
You have multiple Direct Connect connections across regions. Create a CloudWatch dashboard showing: (1) ConnectionState for each connection (1=UP, 0=DOWN). (2) ConnectionBpsEgress/Ingress to track bandwidth utilization. (3) ConnectionPpsEgress/Ingress for packet rates. (4) Alarms for each connection (alert if DOWN or if bandwidth >80% of capacity). The dashboard provides at-a-glance visibility into all Direct Connect links. One day, you notice ConnectionBpsEgress approaching 80% on one link. Proactively order additional Direct Connect capacity before hitting limits. Dashboard enabled proactive capacity planning.

šŸ”— Connections to Other Topics:

  • Integrates with Auto Scaling to: Automatically scale resources based on CloudWatch metrics
  • Works with SNS to: Send notifications when alarms trigger
  • Complements VPC Flow Logs by: CloudWatch provides metrics (aggregated numbers), Flow Logs provide detailed records (individual flows)
  • Connects to Lambda via: CloudWatch Events/EventBridge can trigger Lambda functions based on metrics or alarms

VPC Reachability Analyzer

What it is: VPC Reachability Analyzer is a configuration analysis tool that verifies whether a network path exists between a source and destination in your VPC. It analyzes your VPC configuration (route tables, security groups, NACLs, gateways) to determine if traffic can flow, without sending actual packets. It's like a "dry run" of network connectivity.

Why it exists: Troubleshooting network connectivity issues is time-consuming and error-prone. You must manually check route tables, security groups, NACLs, and gateway configurations across multiple resources. Reachability Analyzer automates this analysis, quickly identifying the exact configuration blocking traffic, saving hours of troubleshooting time.

How it works (Detailed step-by-step):

  1. Analysis Creation: You specify a source (ENI, instance, internet gateway, VPC peering connection, Transit Gateway, VPN gateway) and destination (ENI, instance, internet gateway, etc.), along with protocol and port. For example: "Can instance i-abc123 reach instance i-def456 on TCP port 443?"

  2. Configuration Analysis: Reachability Analyzer examines all network configurations in the path: (1) Source security groups (outbound rules). (2) Source subnet NACL (outbound rules). (3) Route tables (does a route exist to destination?). (4) Intermediate hops (Transit Gateway, VPC peering, etc.). (5) Destination subnet NACL (inbound rules). (6) Destination security groups (inbound rules).

  3. Path Determination: The analyzer builds a logical path from source to destination, checking each hop. If any configuration blocks traffic (missing route, denied by security group, blocked by NACL), the analysis identifies the blocking component.

  4. Results: Analysis returns one of two results: (1) Reachable: A valid path exists; traffic can flow. The result shows the complete path with all hops. (2) Not Reachable: No valid path exists; traffic is blocked. The result shows where and why traffic is blocked (e.g., "Security group sg-abc123 denies inbound TCP 443").

  5. Continuous Monitoring: You can save analyses and re-run them periodically to verify connectivity remains intact after configuration changes. This is useful for compliance and change management.

⭐ Must Know (Critical Reachability Analyzer Facts):

  • Configuration analysis: Analyzes VPC configuration, doesn't send actual packets
  • Identifies blocking point: Shows exactly which configuration blocks traffic
  • Supported sources/destinations: ENI, instance, IGW, VPC peering, Transit Gateway, VPN gateway, NAT gateway
  • Protocol/port specific: Analyzes specific protocol (TCP/UDP/ICMP) and port
  • No performance impact: Analysis is offline; doesn't affect running traffic
  • Use cases: Troubleshooting connectivity, validating changes, compliance verification
  • Limitations: Only analyzes AWS-managed configurations; doesn't analyze OS-level firewalls or application-level issues

Detailed Example 1: Troubleshooting Instance Connectivity
Your web server (instance A) can't connect to your database (instance B) on port 3306. Instead of manually checking security groups, NACLs, and route tables, you run Reachability Analyzer with source=instance A, destination=instance B, protocol=TCP, port=3306. Analysis returns "Not Reachable" with reason: "Security group sg-database denies inbound TCP 3306 from instance A's security group." You update sg-database to allow TCP 3306 from sg-webserver, re-run analysis, and it returns "Reachable." Problem solved in 2 minutes instead of 30 minutes of manual checking.
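
The same check can be scripted. A minimal boto3 sketch, with hypothetical instance IDs, that defines the path, runs the analysis, and prints whether a path was found:

import boto3

ec2 = boto3.client('ec2')

# Define the path to test (instance IDs are hypothetical)
path = ec2.create_network_insights_path(
    Source='i-0abc123def4567890',        # web server
    Destination='i-0def456abc7890123',   # database
    Protocol='tcp',
    DestinationPort=3306,
)
path_id = path['NetworkInsightsPath']['NetworkInsightsPathId']

# Run the analysis (configuration check only; no packets are sent)
analysis = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
analysis_id = analysis['NetworkInsightsAnalysis']['NetworkInsightsAnalysisId']

# Inspect the result (in practice, poll until Status is 'succeeded')
result = ec2.describe_network_insights_analyses(NetworkInsightsAnalysisIds=[analysis_id])
for a in result['NetworkInsightsAnalyses']:
    print(a['Status'], a.get('NetworkPathFound'), a.get('Explanations', []))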

Detailed Example 2: Validating Network Changes
Before deploying a new application, you want to verify connectivity between all components. Create Reachability Analyzer analyses for all required paths: (1) ALB → web servers (TCP 80). (2) Web servers → app servers (TCP 8080). (3) App servers → database (TCP 3306). (4) App servers → internet (TCP 443 for API calls). All analyses return "Reachable," confirming the network is correctly configured. After deployment, you re-run analyses weekly to ensure no configuration drift breaks connectivity. This proactive validation prevents outages caused by accidental misconfigurations.

Detailed Example 3: Compliance Auditing
Your compliance team requires proof that production databases are NOT accessible from the internet. Create Reachability Analyzer analysis with source=Internet Gateway, destination=database instance, protocol=TCP, port=3306. Analysis returns "Not Reachable" with reason: "No route from IGW to database subnet; database is in private subnet with no internet route." Save this analysis and re-run monthly as part of compliance reporting. If someone accidentally makes the database public, the analysis will detect it immediately. Reachability Analyzer provides automated compliance verification.

šŸ”— Connections to Other Topics:

  • Complements VPC Flow Logs by: Reachability Analyzer shows if traffic SHOULD flow (configuration), Flow Logs show if traffic ACTUALLY flows (reality)
  • Works with Security Groups and NACLs to: Analyzes security group and NACL rules to determine if they allow or block traffic
  • Integrates with Route Tables to: Verifies routes exist for traffic to reach destination
  • Useful for Transit Gateway architectures: Analyzes complex multi-VPC connectivity through Transit Gateway

Chapter Summary

What We Covered

This chapter covered Domain 1: Network Design (30% of exam), focusing on:

āœ… Edge Network Services:

  • CloudFront for content delivery with global edge locations
  • Global Accelerator for static anycast IPs and global traffic management
  • Integration patterns and use case selection

āœ… DNS Solutions:

  • Route 53 public and private hosted zones
  • Route 53 Resolver for hybrid DNS
  • Traffic management policies (latency, geolocation, weighted, failover)
  • DNSSEC for DNS security

āœ… Load Balancing:

  • Application Load Balancer (ALB) for Layer 7 HTTP/HTTPS routing
  • Network Load Balancer (NLB) for Layer 4 TCP/UDP with static IPs
  • Gateway Load Balancer (GWLB) for security appliance insertion
  • Load balancer selection criteria and decision frameworks

āœ… Logging and Monitoring:

  • VPC Flow Logs for traffic metadata capture and analysis
  • CloudWatch metrics and alarms for proactive monitoring
  • VPC Reachability Analyzer for configuration validation
  • Monitoring strategies for hybrid networks

Note: This chapter covered the foundational concepts for Domain 1. Additional topics (hybrid connectivity with Direct Connect/VPN, multi-account/multi-region architectures) will be covered in Domain 2 (Network Implementation) as they involve both design and implementation aspects.

Critical Takeaways

  1. Edge Services: CloudFront for content caching, Global Accelerator for network performance - choose based on whether you need caching or just routing
  2. DNS: Route 53 provides public DNS, private DNS, and hybrid DNS via Resolver endpoints
  3. Load Balancers: ALB for HTTP routing, NLB for performance/static IPs, GWLB for security appliances
  4. Monitoring: Flow Logs for traffic analysis, CloudWatch for metrics/alarms, Reachability Analyzer for config validation

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain when to use CloudFront vs Global Accelerator
  • I can design a hybrid DNS solution with Route 53 Resolver
  • I can select the appropriate load balancer (ALB/NLB/GWLB) for different scenarios
  • I understand how to use VPC Flow Logs for troubleshooting
  • I can create CloudWatch alarms for network monitoring
  • I know how to use Reachability Analyzer to validate connectivity

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle: Questions 1-30
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review load balancer comparison table and decision tree
  • Practice identifying blocking points with Reachability Analyzer concepts
  • Study the CloudFront vs Global Accelerator use cases
  • Redraw architecture diagrams from memory

Quick Reference Card

Edge Services:

  • CloudFront: Content caching, HTTPS, custom SSL, Lambda@Edge, origin failover
  • Global Accelerator: Static anycast IPs, TCP/UDP, health checks, traffic dials

DNS:

  • Public Hosted Zone: Internet-facing DNS records
  • Private Hosted Zone: VPC-internal DNS records
  • Resolver Endpoints: Inbound (on-premises → AWS), Outbound (AWS → on-premises)

Load Balancers:

  • ALB: Layer 7, HTTP/HTTPS, path/host routing, WAF integration
  • NLB: Layer 4, TCP/UDP/TLS, static IPs, ultra-low latency, client IP preservation
  • GWLB: Layer 3, security appliances, GENEVE, transparent gateway

Monitoring:

  • Flow Logs: Traffic metadata, troubleshooting, security analysis
  • CloudWatch: Metrics, alarms, dashboards, auto-scaling triggers
  • Reachability Analyzer: Configuration validation, connectivity verification

Next Chapter: Domain 2 - Network Implementation (03_domain_2_network_implementation)

In the next chapter, we'll cover implementing routing and connectivity for hybrid networks (Direct Connect, VPN), multi-account architectures (Transit Gateway, VPC peering, PrivateLink), complex DNS implementations, and network automation with Infrastructure as Code.


Chapter 2: Network Implementation (26% of exam)

Chapter Overview

What you'll learn:

  • Implementing hybrid connectivity with Direct Connect and VPN
  • Configuring multi-account and multi-VPC architectures
  • Implementing complex DNS solutions
  • Automating network infrastructure with IaC
  • Configuring routing protocols (BGP) and traffic engineering

Time to complete: 12-16 hours
Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (Network Design)

Exam Weight: 26% of scored content - This is the second-largest domain, focusing on hands-on implementation of network solutions.


Section 1: Hybrid Connectivity - Direct Connect and VPN

Introduction

The problem: Organizations need reliable, high-bandwidth, low-latency connectivity between their on-premises data centers and AWS. Internet-based VPNs provide connectivity but suffer from unpredictable performance, bandwidth limitations, and security concerns. Migrating workloads, accessing AWS services, and building hybrid architectures require better connectivity solutions.

The solution: AWS provides two primary hybrid connectivity options: (1) AWS Direct Connect - dedicated network connection from on-premises to AWS with consistent performance and high bandwidth. (2) AWS Site-to-Site VPN - encrypted IPsec tunnels over the internet for secure connectivity. These can be used independently or combined for redundancy and failover.

Why it's tested: Task Statement 2.1 requires you to "Implement routing and connectivity between on-premises networks and the AWS Cloud." The exam tests your ability to configure Direct Connect, VPN, BGP routing, failover scenarios, and hybrid DNS.

Core Concepts

AWS Direct Connect

What it is: AWS Direct Connect is a dedicated network connection between your on-premises network and AWS. It provides a private, high-bandwidth (50 Mbps to 100 Gbps), low-latency connection that bypasses the public internet. Direct Connect uses industry-standard 802.1Q VLANs to create virtual interfaces (VIFs) that connect to different AWS services or VPCs.

Why it exists: Internet-based connectivity has several limitations: (1) Unpredictable latency and jitter due to internet routing. (2) Bandwidth constraints and congestion. (3) Security concerns with data traversing public internet. (4) High data transfer costs for large volumes. (5) Compliance requirements for private connectivity. Direct Connect solves these by providing a dedicated, private connection with predictable performance, lower costs for high-volume data transfer, and enhanced security.

Real-world analogy: Think of Direct Connect as a private highway between your office and AWS, while internet VPN is like driving on public roads. The private highway has dedicated lanes (guaranteed bandwidth), no traffic lights or congestion (consistent latency), and is gated (private and secure). Public roads have variable traffic (unpredictable performance), traffic jams (congestion), and anyone can use them (security concerns).

How it works (Detailed step-by-step):

  1. Physical Connection: You establish a physical connection from your on-premises network to an AWS Direct Connect location (a colocation facility or partner data center). Options: (1) Dedicated Connection: You order a dedicated 1 Gbps, 10 Gbps, or 100 Gbps port directly from AWS. AWS provisions a cross-connect from your router to an AWS router in the Direct Connect location. (2) Hosted Connection: You order a connection (50 Mbps to 10 Gbps) from an AWS Direct Connect Partner. The partner provides the physical connection and manages the cross-connect.

  2. Virtual Interface (VIF) Creation: After the physical connection is established, you create virtual interfaces (VIFs) to access AWS services. VIF types: (1) Private VIF: Connects to a single VPC via Virtual Private Gateway (VGW), or to multiple VPCs via a Direct Connect Gateway associated with their VGWs. Used for accessing private IP resources in VPCs. (2) Public VIF: Connects to AWS public services (S3, DynamoDB, etc.) using public IPs. Does NOT traverse the internet - stays on AWS backbone. (3) Transit VIF: Connects to a Direct Connect Gateway associated with a Transit Gateway, enabling access to multiple VPCs across regions and accounts.

  3. BGP Configuration: Direct Connect uses BGP (Border Gateway Protocol) for dynamic routing. You configure BGP sessions between your on-premises router and AWS routers. For each VIF, you specify: (1) BGP ASN (Autonomous System Number): Your ASN (public or private) and the Amazon-side ASN (default 64512 for both VGW and Direct Connect Gateway; a custom Amazon-side ASN can be chosen when the gateway is created). (2) BGP authentication: MD5 password for session security. (3) Prefixes: Routes you advertise to AWS (your on-premises networks) and routes AWS advertises to you (VPC CIDRs or AWS public prefixes).

  4. VLAN Tagging: Each VIF uses a unique VLAN ID (802.1Q tag) to multiplex multiple VIFs over a single physical connection. Your router must support VLAN tagging and be configured to tag traffic for each VIF with the correct VLAN ID. For example: VLAN 100 for Private VIF to Production VPC, VLAN 200 for Private VIF to Development VPC, VLAN 300 for Public VIF.

  5. Traffic Flow - Private VIF: (1) On-premises application sends packet to AWS VPC resource (e.g., 10.0.1.10). (2) On-premises router checks BGP routing table, sees route to 10.0.0.0/16 via Direct Connect. (3) Router tags packet with VLAN ID for Private VIF and sends over Direct Connect. (4) AWS router receives packet, removes VLAN tag, and forwards to VGW. (5) VGW routes packet to VPC based on VPC route tables. (6) Packet reaches destination instance. (7) Response follows reverse path.

  6. Traffic Flow - Public VIF: (1) On-premises application sends packet to AWS public service (e.g., S3 bucket at 52.219.x.x). (2) On-premises router checks BGP routing table, sees route to AWS public IP ranges via Direct Connect. (3) Router tags packet with VLAN ID for Public VIF and sends over Direct Connect. (4) AWS router receives packet and forwards to S3 service over AWS backbone (not internet). (5) S3 processes request and returns response via same path. Note: Traffic never touches the public internet, providing better performance and security.

  7. Redundancy: For high availability, deploy redundant Direct Connect connections: (1) Multiple connections to the same Direct Connect location (protects against connection failure). (2) Connections to different Direct Connect locations (protects against location failure). (3) Connections in different AWS regions (protects against regional failure). Use BGP attributes (AS-PATH prepending, MED, local preference) to control active/passive or active/active traffic distribution.

  8. Direct Connect Gateway: To connect to multiple VPCs across regions, use Direct Connect Gateway (DXGW). DXGW is a global resource that associates with multiple VGWs (up to 10) or Transit Gateways (up to 3) across regions. You create a single Transit VIF to DXGW, and DXGW routes traffic to the appropriate VPC based on destination IP. This eliminates the need for separate VIFs per VPC.

  9. Jumbo Frames: Direct Connect supports jumbo frames (MTU up to 9001 bytes) for improved throughput. Enable jumbo frames on your VIF and ensure your on-premises network supports it. Jumbo frames reduce packet overhead and increase efficiency for large data transfers.

  10. Monitoring: Monitor Direct Connect using CloudWatch metrics: ConnectionState (1=UP, 0=DOWN), ConnectionBpsEgress/Ingress (bandwidth utilization), ConnectionPpsEgress/Ingress (packet rates), ConnectionLightLevelTx/Rx (optical signal strength for fiber connections). Set alarms for connection down or high utilization.

šŸ“Š Direct Connect Architecture Diagram:

graph TB
    subgraph "On-Premises Data Center"
        ONPREM_ROUTER[Customer Router<br/>BGP ASN: 65001<br/>VLAN Support]
        ONPREM_NET[On-Premises Network<br/>192.168.0.0/16]
    end
    
    subgraph "Direct Connect Location"
        DX_PORT[Direct Connect Port<br/>10 Gbps Dedicated<br/>Cross-Connect]
    end
    
    subgraph "AWS Region: us-east-1"
        subgraph "Direct Connect Gateway"
            DXGW[Direct Connect Gateway<br/>Global Resource]
        end
        
        subgraph "VPC 1: Production"
            VGW1[Virtual Private Gateway<br/>BGP ASN: 64512]
            VPC1[VPC CIDR: 10.0.0.0/16]
        end
        
        subgraph "VPC 2: Development"
            VGW2[Virtual Private Gateway<br/>BGP ASN: 64512]
            VPC2[VPC CIDR: 10.1.0.0/16]
        end
        
        PUBLIC_SERVICES[AWS Public Services<br/>S3, DynamoDB, etc.]
    end
    
    ONPREM_NET --> ONPREM_ROUTER
    ONPREM_ROUTER -->|Private VIF<br/>VLAN 100<br/>BGP Session| DX_PORT
    ONPREM_ROUTER -->|Public VIF<br/>VLAN 300<br/>BGP Session| DX_PORT
    
    DX_PORT -->|Private VIF<br/>VLAN 100| DXGW
    DX_PORT -->|Public VIF<br/>VLAN 300| PUBLIC_SERVICES
    
    DXGW --> VGW1
    DXGW --> VGW2
    VGW1 --> VPC1
    VGW2 --> VPC2
    
    style ONPREM_ROUTER fill:#e1f5fe
    style DX_PORT fill:#c8e6c9
    style DXGW fill:#fff3e0
    style VGW1 fill:#f3e5f5
    style VGW2 fill:#f3e5f5

See: diagrams/03_domain_2_direct_connect_architecture.mmd

Diagram Explanation (detailed):
This diagram shows a complete Direct Connect architecture with multiple VIFs. The on-premises data center has a customer router (BGP ASN 65001) with VLAN support connected to a 10 Gbps dedicated Direct Connect port in a colocation facility. The router creates two VIFs: (1) Private VIF on VLAN 100 with BGP session to Direct Connect Gateway, providing access to multiple VPCs. (2) Public VIF on VLAN 300 with BGP session to AWS public services. The Private VIF connects to Direct Connect Gateway (a global resource), which is associated with Virtual Private Gateways in VPC 1 (Production, 10.0.0.0/16) and VPC 2 (Development, 10.1.0.0/16); a Transit VIF would be used instead if the DXGW were associated with Transit Gateways. Traffic from on-premises to VPC resources is tagged with VLAN 100, routed through DXGW to the appropriate VGW, and delivered to the VPC. Traffic to AWS public services (S3, DynamoDB) is tagged with VLAN 300 and routed directly to those services over the AWS backbone without touching the internet. BGP dynamically exchanges routes: on-premises advertises 192.168.0.0/16, AWS advertises 10.0.0.0/16, 10.1.0.0/16, and public IP ranges. This architecture provides private, high-bandwidth connectivity to multiple VPCs and public services over a single physical connection.

⭐ Must Know (Critical Direct Connect Facts):

  • Dedicated vs Hosted: Dedicated (1/10/100 Gbps from AWS), Hosted (50 Mbps-10 Gbps from partners)
  • VIF Types: Private (VPC access), Public (AWS public services), Transit (multiple VPCs via DXGW)
  • BGP Required: Dynamic routing using BGP; must configure BGP sessions for each VIF
  • VLAN Tagging: Each VIF uses unique VLAN ID (802.1Q); router must support VLANs
  • Direct Connect Gateway: Global resource; connects to up to 10 VGWs or 3 Transit Gateways across regions
  • Jumbo Frames: Supports MTU up to 9001 bytes for improved throughput
  • Not Encrypted: Direct Connect is private but not encrypted; use VPN over Direct Connect for encryption
  • Redundancy: Deploy multiple connections for high availability; use BGP for failover
  • Pricing: Port hours + data transfer out (data transfer in is free)
  • Setup Time: 2-4 weeks for dedicated connection provisioning

Detailed Example 1: Hybrid Cloud with Direct Connect
Your company is migrating to AWS but needs to keep some workloads on-premises for compliance. You need low-latency, high-bandwidth connectivity between on-premises and AWS. Solution: Order a 10 Gbps dedicated Direct Connect connection to the nearest Direct Connect location. Establish a cross-connect from your router to AWS. Create a Private VIF to a Direct Connect Gateway. Associate the DXGW with the VGWs in your Production and Development VPCs. Configure BGP on your router, advertising your on-premises CIDR (192.168.0.0/16). AWS advertises VPC CIDRs (10.0.0.0/16, 10.1.0.0/16). Applications on-premises can now access AWS resources at 10 Gbps with <5ms latency. You also create a Public VIF to access S3 for backups, avoiding internet egress charges. This hybrid architecture enables seamless integration between on-premises and AWS.
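
A minimal boto3 sketch of the AWS-side steps in this example; the connection ID, VLAN, ASNs, addresses, and gateway IDs are illustrative placeholders, and the physical cross-connect plus the on-premises router configuration still happen outside the API.

import boto3

dx = boto3.client('directconnect')

# Create the Direct Connect gateway (global resource); 64512 is the Amazon-side ASN
dxgw = dx.create_direct_connect_gateway(
    directConnectGatewayName='corp-dxgw',
    amazonSideAsn=64512,
)
dxgw_id = dxgw['directConnectGateway']['directConnectGatewayId']

# Provision a Private VIF on the existing 10 Gbps connection (IDs/VLAN/ASN are hypothetical)
dx.create_private_virtual_interface(
    connectionId='dxcon-ffabc123',
    newPrivateVirtualInterface={
        'virtualInterfaceName': 'prod-private-vif',
        'vlan': 100,
        'asn': 65001,                       # customer BGP ASN
        'authKey': 'example-bgp-md5-key',
        'amazonAddress': '169.254.100.1/30',
        'customerAddress': '169.254.100.2/30',
        'addressFamily': 'ipv4',
        'directConnectGatewayId': dxgw_id,
    },
)

# Associate the DXGW with the VGW attached to the Production VPC (repeat for Development)
dx.create_direct_connect_gateway_association(
    directConnectGatewayId=dxgw_id,
    virtualGatewayId='vgw-0abc123def4567890',
)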

Detailed Example 2: Multi-Region Direct Connect with Redundancy
Your application runs in us-east-1 and eu-west-1, and you need redundant connectivity from your data center to both regions. Solution: Order two Direct Connect connections: (1) Connection 1 to Direct Connect location in New York (near us-east-1). (2) Connection 2 to Direct Connect location in London (near eu-west-1). Create Transit VIFs from both connections to a single Direct Connect Gateway. Associate DXGW with Transit Gateways in both regions. Configure BGP with AS-PATH prepending to prefer Connection 1 for us-east-1 traffic and Connection 2 for eu-west-1 traffic. If Connection 1 fails, BGP automatically reroutes us-east-1 traffic through Connection 2 (via London to us-east-1). This provides geographic redundancy and optimal routing.

Detailed Example 3: Direct Connect + VPN for Encryption
Your security policy requires all data in transit to be encrypted, but you also need Direct Connect's performance. Solution: Establish a Direct Connect connection with a Private VIF to your VPC's VGW. On the same VGW, configure a Site-to-Site VPN with two IPsec tunnels. Configure BGP on both Direct Connect and VPN, but use BGP attributes to prefer Direct Connect (shorter AS-PATH or higher local preference). Traffic flows over Direct Connect (unencrypted but private) for performance. If Direct Connect fails, BGP automatically fails over to VPN (encrypted but slower). For encryption over Direct Connect, configure IPsec tunnels over the Direct Connect connection (VPN over Direct Connect), combining Direct Connect's performance with VPN's encryption. This satisfies both performance and security requirements.

šŸ”— Connections to Other Topics:

  • Works with Virtual Private Gateway (VGW) to: Connect Direct Connect to a single VPC
  • Integrates with Direct Connect Gateway to: Connect to multiple VPCs across regions
  • Combines with Transit Gateway to: Create hub-and-spoke architectures with centralized routing
  • Uses BGP for: Dynamic routing and automatic failover
  • Complements Site-to-Site VPN for: Redundancy and encrypted backup connectivity

AWS Site-to-Site VPN

What it is: AWS Site-to-Site VPN creates encrypted IPsec tunnels over the internet between your on-premises network and AWS VPCs. Each VPN connection consists of two tunnels (for redundancy) that terminate on a Virtual Private Gateway (VGW) or Transit Gateway in AWS and on a customer gateway device (router/firewall) on-premises.

Why it exists: Not all organizations can justify the cost or setup time of Direct Connect, especially for: (1) Temporary connectivity needs (migrations, testing). (2) Low-bandwidth requirements (<1 Gbps). (3) Quick setup requirements (VPN can be configured in hours vs weeks for Direct Connect). (4) Backup connectivity for Direct Connect. (5) Encrypted connectivity requirements. Site-to-Site VPN provides a cost-effective, quick-to-deploy solution for hybrid connectivity.

How it works (Detailed step-by-step):

  1. Customer Gateway: You create a Customer Gateway (CGW) resource in AWS representing your on-premises VPN device. Specify the device's public IP address and BGP ASN (if using dynamic routing) or static routes (if using static routing).

  2. Virtual Private Gateway or Transit Gateway: Attach a VGW to your VPC or use a Transit Gateway. The VGW/TGW acts as the AWS-side VPN endpoint.

  3. VPN Connection Creation: Create a Site-to-Site VPN connection between the CGW and VGW/TGW. AWS automatically creates two IPsec tunnels (Tunnel 1 and Tunnel 2) for redundancy, each with unique public IP addresses on the AWS side.

  4. Configuration Download: Download the VPN configuration file for your specific customer gateway device (Cisco, Juniper, Palo Alto, etc.). The file contains pre-shared keys, tunnel IPs, and configuration commands.

  5. Customer Gateway Configuration: Configure your on-premises VPN device using the downloaded configuration. Set up: (1) IPsec parameters (encryption algorithms, DH groups, lifetime). (2) Pre-shared keys for authentication. (3) Tunnel IPs and BGP sessions (if using dynamic routing). (4) Static routes (if using static routing).

  6. Tunnel Establishment: Your VPN device initiates IPsec tunnels to AWS. AWS responds, and tunnels are established. Both tunnels should be UP for full redundancy.

  7. Routing - Dynamic (BGP): If using BGP, configure BGP sessions over each tunnel. Your device advertises on-premises routes to AWS, and AWS advertises VPC routes to you. BGP provides automatic failover - if Tunnel 1 fails, traffic automatically switches to Tunnel 2.

  8. Routing - Static: If using static routing, manually configure routes on both sides. On-premises: route VPC CIDR to VPN tunnels. AWS: add static routes in VGW route table for on-premises CIDRs. Static routing requires manual intervention for failover.

  9. Traffic Flow: (1) On-premises application sends packet to AWS VPC resource. (2) On-premises router checks routing table, sees route to VPC CIDR via VPN. (3) Router encrypts packet with IPsec and sends through active tunnel. (4) AWS VGW/TGW receives encrypted packet, decrypts, and forwards to VPC. (5) Packet reaches destination. (6) Response follows reverse path.

  10. Monitoring: Monitor VPN using CloudWatch metrics: TunnelState (1=UP, 0=DOWN), TunnelDataIn/Out (bytes transferred). Set alarms for tunnel down events.

⭐ Must Know (Critical VPN Facts):

  • Two Tunnels: Each VPN connection has two tunnels for redundancy
  • IPsec Encryption: Uses IPsec for encryption (AES-128, AES-256, AES-128-GCM, AES-256-GCM)
  • Routing Options: Dynamic (BGP) or Static routing
  • Bandwidth: Up to 1.25 Gbps per tunnel; aggregating bandwidth across multiple tunnels requires ECMP, which Transit Gateway supports but a Virtual Private Gateway does not
  • Latency: Variable (depends on internet path); typically 50-200ms
  • Cost: $0.05/hour per VPN connection + data transfer charges
  • Setup Time: Hours (vs weeks for Direct Connect)
  • Use Cases: Backup for Direct Connect, temporary connectivity, low-bandwidth needs, encrypted connectivity

Detailed Example 1: VPN as Direct Connect Backup
You have a Direct Connect connection for primary connectivity but need a backup for failover. Solution: Create a Site-to-Site VPN connection to the same VGW as your Direct Connect. Configure BGP on both Direct Connect and VPN. Use BGP attributes to prefer Direct Connect (shorter AS-PATH or higher local preference). Normal traffic flows over Direct Connect. If Direct Connect fails, BGP detects the failure and automatically reroutes traffic through VPN tunnels. When Direct Connect recovers, traffic automatically fails back. This provides automatic failover without manual intervention.

Detailed Example 2: Quick Migration Connectivity
You're migrating workloads to AWS and need connectivity immediately. Direct Connect would take 4 weeks to provision. Solution: Create a Site-to-Site VPN connection in 2 hours. Configure your on-premises firewall with the VPN configuration. Establish tunnels and start migrating workloads. Once migration is complete, you can either keep the VPN (if bandwidth is sufficient) or upgrade to Direct Connect and use VPN as backup. VPN enables immediate migration without waiting for Direct Connect.
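
A minimal boto3 sketch of that quick setup, using hypothetical IPs and IDs; the returned CustomerGatewayConfiguration document is what you load (or adapt) onto the on-premises firewall.

import boto3

ec2 = boto3.client('ec2')

# Register the on-premises device (public IP and ASN are hypothetical)
cgw = ec2.create_customer_gateway(
    Type='ipsec.1',
    PublicIp='203.0.113.10',
    BgpAsn=65001,
)

# Create and attach a virtual private gateway to the landing VPC
vgw = ec2.create_vpn_gateway(Type='ipsec.1', AmazonSideAsn=64512)
ec2.attach_vpn_gateway(
    VpnGatewayId=vgw['VpnGateway']['VpnGatewayId'],
    VpcId='vpc-0abc123def4567890',
)

# Create the VPN connection; AWS provisions two IPsec tunnels automatically
vpn = ec2.create_vpn_connection(
    Type='ipsec.1',
    CustomerGatewayId=cgw['CustomerGateway']['CustomerGatewayId'],
    VpnGatewayId=vgw['VpnGateway']['VpnGatewayId'],
    Options={'StaticRoutesOnly': False},   # use BGP (dynamic routing)
)

# Device configuration (pre-shared keys, tunnel IPs) is returned as an XML document
print(vpn['VpnConnection']['CustomerGatewayConfiguration'][:200])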

Detailed Example 3: Multi-Region VPN with Transit Gateway
You have VPCs in us-east-1 and eu-west-1 and need VPN connectivity from your data center to both regions. Solution: Deploy Transit Gateways in both regions. Create VPN connections from your on-premises router to both Transit Gateways. Configure BGP to advertise your on-premises CIDR to both regions. Use BGP attributes to prefer the geographically closer Transit Gateway for each region's traffic. Attach VPCs in each region to their local Transit Gateway. This provides VPN connectivity to multiple regions with optimal routing.

šŸ”— Connections to Other Topics:

  • Complements Direct Connect for: Backup connectivity and encrypted traffic
  • Works with Virtual Private Gateway to: Connect VPN to a single VPC
  • Integrates with Transit Gateway to: Connect VPN to multiple VPCs
  • Uses BGP for: Dynamic routing and automatic failover
  • Requires Customer Gateway device: On-premises VPN-capable router or firewall

Section 2: Multi-Account and Multi-VPC Architectures

Introduction

The problem: Organizations typically have multiple AWS accounts (for different teams, environments, or business units) and multiple VPCs (for network isolation). Connecting these accounts and VPCs securely and efficiently is complex. Without proper architecture, you end up with a mesh of VPC peering connections that's difficult to manage, doesn't scale, and lacks centralized control.

The solution: AWS provides several services for multi-account/multi-VPC connectivity: (1) Transit Gateway - hub-and-spoke architecture for connecting VPCs, VPNs, and Direct Connect. (2) VPC Peering - direct connection between two VPCs. (3) AWS PrivateLink - private connectivity to services without VPC peering. (4) AWS Resource Access Manager (RAM) - share resources across accounts. These services enable scalable, manageable, and secure multi-account architectures.

Why it's tested: Task Statement 2.2 requires you to "Implement routing and connectivity across multiple AWS accounts, Regions, and VPCs." The exam tests your ability to design and implement complex multi-account architectures, choose appropriate connectivity patterns, and manage routing at scale.

Core Concepts

AWS Transit Gateway

What it is: AWS Transit Gateway is a regional network hub that connects VPCs, VPN connections, and Direct Connect gateways within a region. It acts as a cloud router, enabling you to connect thousands of VPCs and on-premises networks through a single gateway. Transit Gateway simplifies network architecture by replacing complex VPC peering meshes with a hub-and-spoke model.

Why it exists: As organizations grow, they create many VPCs across accounts and regions. Connecting them with VPC peering creates a mesh topology that doesn't scale: (1) N VPCs require N*(N-1)/2 peering connections. (2) Each peering connection requires separate route table entries. (3) No transitive routing - VPC A can't reach VPC C through VPC B. (4) Management complexity increases exponentially. Transit Gateway solves this by providing a central hub that all VPCs connect to, enabling transitive routing, centralized management, and scalability to thousands of VPCs.

How it works (Detailed step-by-step):

  1. Transit Gateway Creation: You create a Transit Gateway in a region. Specify: (1) ASN for BGP routing. (2) Default route table association/propagation settings. (3) DNS support. (4) VPN ECMP support (Equal-Cost Multi-Path for load balancing across VPN tunnels).

  2. Attachments: You attach resources to the Transit Gateway: (1) VPC Attachment: Connects a VPC to TGW; specify subnets in each AZ. (2) VPN Attachment: Connects Site-to-Site VPN to TGW. (3) Direct Connect Gateway Attachment: Connects Direct Connect to TGW. (4) Peering Attachment: Connects to another Transit Gateway in a different region. (5) Connect Attachment: Connects SD-WAN appliances using GRE tunnels.

  3. Route Tables: Transit Gateway has its own route tables (separate from VPC route tables). You create TGW route tables to control routing between attachments. Each attachment is associated with one route table. Routes can be: (1) Static: Manually added routes. (2) Propagated: Automatically learned from attachments (VPC CIDRs, BGP routes from VPN/Direct Connect).

  4. Routing Logic: When a packet arrives at Transit Gateway from an attachment, TGW: (1) Looks up the destination IP in the attachment's associated route table. (2) Finds the matching route (longest prefix match). (3) Forwards the packet to the target attachment specified in the route. (4) The target attachment delivers the packet to its destination (VPC, VPN, Direct Connect).

  5. Route Propagation: Enable route propagation to automatically populate TGW route tables: (1) VPC attachments propagate their VPC CIDR. (2) VPN attachments propagate BGP-learned routes from on-premises. (3) Direct Connect attachments propagate BGP-learned routes. This eliminates manual route management.

  6. Route Table Association: Each attachment is associated with one TGW route table. This determines which routes the attachment can use for outbound traffic. For example, Production VPC attachment associated with Production route table can only route to destinations in that table.

  7. Segmentation: Use multiple TGW route tables to create network segmentation. For example: (1) Production route table: Only Production VPCs and on-premises. (2) Development route table: Only Development VPCs. (3) Shared Services route table: All VPCs can reach shared services (DNS, Active Directory). This prevents Development from accessing Production while allowing both to access Shared Services.

  8. Inter-Region Peering: Connect Transit Gateways in different regions using TGW Peering. Create a peering attachment between TGWs, and add static routes in each TGW's route table pointing to the other region's CIDRs via the peering attachment. This enables cross-region connectivity.

  9. Appliance Mode: Enable appliance mode on VPC attachments when routing traffic through security appliances (firewalls). Appliance mode ensures symmetric routing (both directions of a flow use the same appliance instance), which is required for stateful inspection.

  10. Monitoring: Monitor Transit Gateway using CloudWatch metrics: BytesIn/Out, PacketsIn/Out, PacketDropCountBlackhole (packets dropped due to no route), PacketDropCountNoRoute. Use Transit Gateway Network Manager for topology visualization and monitoring.
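
Before the architecture diagram, here is a minimal boto3 sketch of steps 1-4 above - creating the hub, attaching a VPC, and adding a static route - with hypothetical IDs and the default route table behavior disabled so route tables are managed explicitly.

import boto3

ec2 = boto3.client('ec2')

# Create the Transit Gateway hub (ASN and options are illustrative)
tgw = ec2.create_transit_gateway(
    Description='regional hub',
    Options={
        'AmazonSideAsn': 64512,
        'DefaultRouteTableAssociation': 'disable',  # manage route tables explicitly
        'DefaultRouteTablePropagation': 'disable',
        'DnsSupport': 'enable',
        'VpnEcmpSupport': 'enable',
    },
)
tgw_id = tgw['TransitGateway']['TransitGatewayId']

# Attach a VPC (one subnet per AZ; IDs are hypothetical)
ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_id,
    VpcId='vpc-0abc123def4567890',
    SubnetIds=['subnet-0aaa111', 'subnet-0bbb222'],
)

# Static route in a TGW route table sending on-premises traffic to a VPN attachment
ec2.create_transit_gateway_route(
    DestinationCidrBlock='192.168.0.0/16',
    TransitGatewayRouteTableId='tgw-rtb-0abc123def4567890',
    TransitGatewayAttachmentId='tgw-attach-0vpn1234567890abc',
)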

šŸ“Š Transit Gateway Hub-and-Spoke Architecture:

graph TB
    subgraph "On-Premises"
        ONPREM[On-Premises Network<br/>192.168.0.0/16]
        VPN_DEVICE[VPN Device]
    end
    
    subgraph "AWS Region: us-east-1"
        subgraph "Transit Gateway"
            TGW[Transit Gateway<br/>ASN: 64512<br/>Hub for all connectivity]
            
            TGW_RT_PROD[TGW Route Table: Production<br/>Routes to Prod VPCs + On-Prem]
            TGW_RT_DEV[TGW Route Table: Development<br/>Routes to Dev VPCs only]
            TGW_RT_SHARED[TGW Route Table: Shared<br/>Routes to all VPCs]
        end
        
        subgraph "Production VPCs"
            VPC_PROD1[VPC: Prod-App<br/>10.0.0.0/16]
            VPC_PROD2[VPC: Prod-DB<br/>10.1.0.0/16]
        end
        
        subgraph "Development VPCs"
            VPC_DEV1[VPC: Dev-App<br/>10.10.0.0/16]
            VPC_DEV2[VPC: Dev-DB<br/>10.11.0.0/16]
        end
        
        subgraph "Shared Services VPC"
            VPC_SHARED[VPC: Shared Services<br/>10.100.0.0/16<br/>DNS, AD, Monitoring]
        end
    end
    
    ONPREM --> VPN_DEVICE
    VPN_DEVICE -->|VPN Attachment<br/>BGP| TGW
    
    TGW --> TGW_RT_PROD
    TGW --> TGW_RT_DEV
    TGW --> TGW_RT_SHARED
    
    TGW_RT_PROD --> VPC_PROD1
    TGW_RT_PROD --> VPC_PROD2
    TGW_RT_PROD --> ONPREM
    
    TGW_RT_DEV --> VPC_DEV1
    TGW_RT_DEV --> VPC_DEV2
    
    TGW_RT_SHARED --> VPC_SHARED
    TGW_RT_SHARED --> VPC_PROD1
    TGW_RT_SHARED --> VPC_PROD2
    TGW_RT_SHARED --> VPC_DEV1
    TGW_RT_SHARED --> VPC_DEV2
    
    style TGW fill:#c8e6c9
    style TGW_RT_PROD fill:#ffebee
    style TGW_RT_DEV fill:#e1f5fe
    style TGW_RT_SHARED fill:#fff3e0

See: diagrams/03_domain_2_transit_gateway_architecture.mmd

⭐ Must Know (Critical Transit Gateway Facts):

  • Regional Hub: Transit Gateway is regional; use TGW peering for cross-region
  • Attachments: VPC, VPN, Direct Connect Gateway, Peering, Connect (SD-WAN)
  • Route Tables: TGW has its own route tables; each attachment associates with one
  • Route Propagation: Automatically learn routes from attachments
  • Segmentation: Use multiple route tables to isolate networks (Production vs Development)
  • Transitive Routing: Enabled by default; VPC A can reach VPC C through TGW
  • Appliance Mode: Required for routing through stateful security appliances
  • Limits: 5,000 attachments per TGW, 10,000 routes per route table
  • Pricing: $0.05/hour per attachment + $0.02/GB data processed
  • Use Cases: Hub-and-spoke architectures, centralized egress, shared services, multi-account connectivity

Detailed Example 1: Centralized Egress with Transit Gateway
You have 20 VPCs that need internet access, but you don't want to deploy NAT Gateways in each VPC (with one NAT Gateway per VPC: $0.045/hour Ɨ 20 = $0.90/hour ā‰ˆ $648/month). Solution: Create a centralized Egress VPC with NAT Gateways. Attach all VPCs to Transit Gateway. Configure the TGW route table to route 0.0.0.0/0 to the Egress VPC attachment. In the Egress VPC, route 0.0.0.0/0 to the NAT Gateway, then to the Internet Gateway. All VPCs' outbound internet traffic flows through TGW to the Egress VPC's NAT Gateways. Cost: $0.05/hour Ɨ 20 attachments + $0.045/hour Ɨ 2 NAT Gateways (for redundancy) = $1.09/hour ā‰ˆ $785/month, plus TGW data processing charges - slightly more than the single-NAT-per-VPC baseline, but you gain centralized control, easier monitoring, and simplified management. The comparison flips once redundancy is factored in: two NAT Gateways per VPC would cost $0.09/hour Ɨ 20 = $1.80/hour ā‰ˆ $1,296/month, and the savings from centralization grow further at 50+ VPCs.

Detailed Example 2: Multi-Account Architecture with Transit Gateway
Your organization has 10 AWS accounts (one per team) with multiple VPCs per account. You need to connect all VPCs and provide access to on-premises. Solution: Create a Transit Gateway in a central Networking account. Use AWS Resource Access Manager (RAM) to share the TGW with all other accounts. Each account attaches its VPCs to the shared TGW. Create TGW route tables for segmentation: Production, Development, Shared Services. Associate each VPC attachment with the appropriate route table. Attach VPN connection from on-premises to TGW. All accounts can now communicate through TGW, with segmentation enforced by route tables. This provides centralized network management across all accounts.
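
A minimal boto3 sketch of the segmentation piece of this design, with hypothetical attachment IDs: association decides which route table an attachment uses for its lookups, and propagation decides whose CIDRs appear in that table, so leaving Development out of the Production table is what enforces the isolation.

import boto3

ec2 = boto3.client('ec2')
tgw_id = 'tgw-0abc123def4567890'            # shared via AWS RAM (hypothetical)
prod_attachment = 'tgw-attach-0prod111111'  # Production VPC attachment
shared_attachment = 'tgw-attach-0shared333' # Shared Services VPC attachment

# Dedicated route table for the Production environment
prod_rt = ec2.create_transit_gateway_route_table(TransitGatewayId=tgw_id)
prod_rt_id = prod_rt['TransitGatewayRouteTable']['TransitGatewayRouteTableId']

# The Production attachment uses the Production route table for its route lookups
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=prod_rt_id,
    TransitGatewayAttachmentId=prod_attachment,
)

# Only the routes we propagate appear in that table; Shared Services is allowed,
# Development attachments are deliberately NOT propagated, so Prod cannot reach Dev
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId=prod_rt_id,
    TransitGatewayAttachmentId=shared_attachment,
)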

Detailed Example 3: Multi-Region with Transit Gateway Peering
You have applications in us-east-1 and eu-west-1 that need to communicate. Solution: Create Transit Gateways in both regions. Attach VPCs in each region to their local TGW. Create a TGW peering attachment between us-east-1 TGW and eu-west-1 TGW. In us-east-1 TGW route table, add static route for eu-west-1 VPC CIDRs pointing to peering attachment. In eu-west-1 TGW route table, add static route for us-east-1 VPC CIDRs pointing to peering attachment. Traffic between regions now flows through TGW peering over AWS backbone (not internet), providing low-latency, high-bandwidth cross-region connectivity.

šŸ”— Connections to Other Topics:

  • Replaces VPC Peering for: Scalable multi-VPC connectivity with transitive routing
  • Integrates with Direct Connect via: Direct Connect Gateway attachment for hybrid connectivity
  • Works with Site-to-Site VPN via: VPN attachment for encrypted on-premises connectivity
  • Enables Centralized Egress by: Routing all internet traffic through a central Egress VPC
  • Supports Network Segmentation through: Multiple route tables isolating different environments

Chapter Summary

What We Covered

This chapter covered Domain 2: Network Implementation (26% of exam), focusing on:

āœ… Hybrid Connectivity:

  • AWS Direct Connect for dedicated, high-bandwidth connectivity
  • Virtual Interfaces (Private, Public, Transit VIFs)
  • Direct Connect Gateway for multi-VPC/multi-region access
  • Site-to-Site VPN for encrypted connectivity over internet
  • BGP routing for dynamic failover
  • Combining Direct Connect + VPN for redundancy

āœ… Multi-Account/Multi-VPC Architectures:

  • AWS Transit Gateway for hub-and-spoke connectivity
  • Transit Gateway route tables for network segmentation
  • Transit Gateway attachments (VPC, VPN, Direct Connect, Peering)
  • Multi-region connectivity with TGW peering
  • Centralized egress and shared services patterns

Note: Additional Domain 2 topics (VPC Peering, PrivateLink, complex DNS, network automation) are covered in the sections that follow and in the integration chapter for comprehensive understanding.

Critical Takeaways

  1. Direct Connect: Dedicated connection with VIFs for VPC/public service access; requires BGP; 2-4 weeks setup
  2. Site-to-Site VPN: Encrypted IPsec tunnels; two tunnels per connection; hours to setup; good for backup
  3. Transit Gateway: Regional hub connecting VPCs, VPNs, Direct Connect; enables transitive routing; use route tables for segmentation
  4. Hybrid Redundancy: Combine Direct Connect + VPN for automatic failover using BGP

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between Private, Public, and Transit VIFs
  • I understand how to configure BGP for Direct Connect failover to VPN
  • I can design a Transit Gateway architecture with network segmentation
  • I know when to use Direct Connect vs VPN vs both
  • I can explain how Transit Gateway route tables control traffic flow

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle: Questions 1-30
  • Expected score: 75%+ to proceed

Quick Reference Card

Direct Connect:

  • VIF Types: Private (VPC), Public (AWS services), Transit (multiple VPCs via DXGW)
  • Bandwidth: 50 Mbps - 100 Gbps
  • Setup: 2-4 weeks
  • Pricing: Port hours + data transfer out

Site-to-Site VPN:

  • Tunnels: 2 per connection (redundancy)
  • Bandwidth: Up to 1.25 Gbps per tunnel
  • Routing: BGP (dynamic) or Static
  • Setup: Hours
  • Pricing: $0.05/hour + data transfer

Transit Gateway:

  • Attachments: VPC, VPN, DX Gateway, Peering, Connect
  • Route Tables: Multiple for segmentation
  • Transitive Routing: Enabled
  • Pricing: $0.05/hour per attachment + $0.02/GB processed

Next Chapter: Domain 3 - Network Management and Operation (04_domain_3_network_management). Before moving on, the following reference sections cover the remaining Domain 2 topics: VPC Peering, PrivateLink, and complex DNS architectures.

VPC Peering

What it is: VPC Peering is a networking connection between two VPCs that enables routing traffic between them using private IP addresses. Peered VPCs can be in the same account or different accounts, and in the same region or different regions (inter-region peering). Traffic between peered VPCs stays on the AWS private network and never traverses the public internet.

Why it exists: Organizations often need to connect VPCs for: (1) Sharing resources (databases, file servers) between VPCs. (2) Enabling communication between applications in different VPCs. (3) Connecting VPCs across accounts (multi-account architectures). (4) Disaster recovery (replicating data between regions). VPC Peering provides a simple, direct connection between two VPCs without requiring gateways, VPN connections, or separate network appliances.

How it works (Detailed):

  1. Peering Connection Creation: You create a VPC peering connection from the requester VPC to the accepter VPC. If VPCs are in different accounts, the accepter account must accept the peering request. AWS creates a peering connection resource that represents the connection.

  2. Route Table Updates: After the peering connection is active, you must update route tables in both VPCs to enable traffic flow. In requester VPC: Add route for accepter VPC's CIDR pointing to the peering connection. In accepter VPC: Add route for requester VPC's CIDR pointing to the peering connection. Without these routes, traffic won't flow even though the peering connection exists.

  3. Security Group Rules: Update security groups to allow traffic from the peered VPC. You can reference security groups from the peered VPC (if in same region) or use CIDR blocks. For example, allow inbound TCP 3306 from the peered VPC's security group or CIDR.

  4. Traffic Flow: When an instance in VPC A sends traffic to an IP in VPC B's CIDR, the VPC A route table directs traffic to the peering connection. AWS routes the traffic over the private network to VPC B. VPC B's route table directs traffic to the destination instance. Response traffic follows the reverse path.

  5. No Transitive Routing: VPC Peering does NOT support transitive routing. If VPC A peers with VPC B, and VPC B peers with VPC C, VPC A cannot reach VPC C through VPC B. You must create a direct peering connection between VPC A and VPC C. This is a key limitation that Transit Gateway solves.

  6. Inter-Region Peering: VPC Peering works across regions. Traffic between regions is encrypted automatically and travels over the AWS global network (not the public internet). Inter-region peering has the same limitations as intra-region peering (no transitive routing, no overlapping CIDRs).

  7. DNS Resolution: Enable DNS resolution for the peering connection to allow instances to resolve private DNS hostnames from the peered VPC. Without this, you must use IP addresses instead of DNS names.

⭐ Must Know (Critical VPC Peering Facts):

  • One-to-One: Connects exactly two VPCs
  • No Transitive Routing: VPC A cannot reach VPC C through VPC B
  • No Overlapping CIDRs: Peered VPCs must have non-overlapping IP ranges
  • Route Tables: Must manually update route tables in both VPCs
  • Security Groups: Must allow traffic from peered VPC
  • Inter-Region: Supported; traffic encrypted and over AWS backbone
  • Pricing: No hourly charge for the peering connection; same-AZ data transfer is free, while cross-AZ (same-region) and inter-region traffic is billed at standard data transfer rates (roughly $0.01/GB each direction within a region)
  • Use Cases: Simple two-VPC connections, shared services, cross-account connectivity

Detailed Example 1: Shared Services VPC
You have a Production VPC (10.0.0.0/16) and a Shared Services VPC (10.100.0.0/16) with Active Directory, DNS, and monitoring tools. Applications in Production need to access Shared Services. Solution: Create VPC peering connection from Production to Shared Services. In Production VPC route table, add route: 10.100.0.0/16 → peering connection. In Shared Services route table, add route: 10.0.0.0/16 → peering connection. Update security groups in Shared Services to allow traffic from Production VPC's CIDR. Enable DNS resolution. Production applications can now access Shared Services using private IPs or DNS names.
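
A minimal boto3 sketch of this peering setup with hypothetical VPC and route table IDs; note that routes must be added on both sides, exactly as described above.

import boto3

ec2 = boto3.client('ec2')

# Request peering from Production to Shared Services (same account and region here)
peering = ec2.create_vpc_peering_connection(
    VpcId='vpc-0prod1234567890ab',       # requester: Production, 10.0.0.0/16
    PeerVpcId='vpc-0shared1234567890',   # accepter: Shared Services, 10.100.0.0/16
)
pcx_id = peering['VpcPeeringConnection']['VpcPeeringConnectionId']

# Accept the request (run from the accepter account if cross-account)
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Add routes in BOTH VPCs - traffic does not flow without them
ec2.create_route(RouteTableId='rtb-0prod111111111111',
                 DestinationCidrBlock='10.100.0.0/16',
                 VpcPeeringConnectionId=pcx_id)
ec2.create_route(RouteTableId='rtb-0shared1111111111',
                 DestinationCidrBlock='10.0.0.0/16',
                 VpcPeeringConnectionId=pcx_id)

# Allow private DNS names from the peered VPC to resolve
ec2.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId=pcx_id,
    RequesterPeeringConnectionOptions={'AllowDnsResolutionFromRemoteVpc': True},
    AccepterPeeringConnectionOptions={'AllowDnsResolutionFromRemoteVpc': True},
)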

Detailed Example 2: Cross-Account VPC Peering
Your company has separate AWS accounts for Development (Account A) and Production (Account B). Development team needs to test integration with Production database. Solution: In Account A (Development), create VPC peering request to Account B's Production VPC. In Account B, accept the peering request. Update route tables in both VPCs. Update security groups in Production to allow traffic from Development VPC's CIDR (only for specific test database). Development can now access the test database in Production for integration testing. After testing, delete the peering connection to restore isolation.

Detailed Example 3: Disaster Recovery with Inter-Region Peering
Your primary application runs in us-east-1 (10.0.0.0/16), and you need to replicate data to a DR site in us-west-2 (10.1.0.0/16). Solution: Create inter-region VPC peering connection between us-east-1 and us-west-2 VPCs. Update route tables in both VPCs. Configure database replication from us-east-1 to us-west-2 over the peering connection. Traffic is encrypted automatically and travels over AWS backbone (not internet). In a disaster, fail over to us-west-2. Inter-region peering provides secure, high-bandwidth connectivity for DR replication.

šŸ”— Connections to Other Topics:

  • Replaced by Transit Gateway for: Scalable multi-VPC connectivity with transitive routing
  • Requires Route Table updates: Must manually add routes in both VPCs
  • Works with Security Groups: Can reference security groups from peered VPC (same region)
  • Enables Cross-Account connectivity: Connect VPCs in different AWS accounts

When to use VPC Peering vs Transit Gateway:

  • Use VPC Peering when: Connecting only 2 VPCs, simple architecture, cost-sensitive (peering is cheaper)
  • Use Transit Gateway when: Connecting 3+ VPCs, need transitive routing, need centralized management, complex routing requirements

AWS PrivateLink

What it is: AWS PrivateLink provides private connectivity between VPCs and AWS services or your own services without exposing traffic to the public internet. It uses VPC endpoints powered by PrivateLink to access services privately. PrivateLink enables you to expose your own services (running on NLB) to other VPCs or accounts without VPC peering or internet gateways.

Why it exists: Traditional methods of accessing services across VPCs have limitations: (1) VPC Peering: Exposes entire VPC CIDR, not just specific services. (2) Public internet: Requires internet gateway, NAT, and exposes traffic to internet. (3) VPN: Adds complexity and latency. PrivateLink solves this by providing private, service-level access without exposing entire networks. It's ideal for SaaS providers offering services to customers, or for internal service sharing across accounts.

How it works (Detailed):

  1. Service Provider Setup: You (the service provider) deploy your service behind a Network Load Balancer in your VPC. Create a VPC endpoint service (PrivateLink service) associated with the NLB. Configure acceptance settings (auto-accept or manual approval for connections). Optionally, whitelist specific AWS accounts or IAM principals that can connect.

  2. Service Consumer Setup: The service consumer creates a VPC endpoint (interface endpoint) in their VPC, specifying your endpoint service name. The endpoint creates elastic network interfaces (ENIs) in the consumer's VPC subnets. These ENIs have private IPs from the consumer's VPC CIDR.

  3. Connection Approval: If you configured manual approval, you must approve the connection request from the consumer. Once approved, the endpoint becomes available.

  4. DNS Resolution: PrivateLink creates a private DNS name for the endpoint (e.g., vpce-abc123-xyz.vpce-svc-123456.us-east-1.vpce.amazonaws.com). Optionally, enable private DNS to use your service's custom DNS name (e.g., api.example.com) to resolve to the endpoint's private IPs.

  5. Traffic Flow: When the consumer's application sends traffic to the endpoint's DNS name or IP, traffic is routed to the endpoint ENI in the consumer's VPC. PrivateLink forwards traffic over the AWS private network to your NLB in the provider VPC. The NLB distributes traffic to your service instances. Response traffic follows the reverse path. Traffic never leaves the AWS network.

  6. Security: The consumer's security groups control access to the endpoint ENI. Your NLB's security groups control access to your service instances. You can also use endpoint policies (IAM policies) to control which principals can use the endpoint.

  7. Scalability: PrivateLink scales automatically. You can have thousands of consumers connecting to your service. Each consumer gets their own endpoint ENIs in their VPC, and PrivateLink handles the routing.

⭐ Must Know (Critical PrivateLink Facts):

  • Private Connectivity: Access services without internet gateway, NAT, or VPC peering
  • Service-Level Access: Exposes specific services, not entire VPC CIDR
  • Provider/Consumer Model: Service provider exposes service, consumers connect via endpoints
  • Network Load Balancer: Service must be behind NLB (not ALB or other load balancers)
  • Interface Endpoints: Consumer creates interface endpoints (ENIs) in their VPC
  • Private DNS: Optionally use custom DNS names for endpoints
  • Cross-Account: Supports cross-account and cross-region connectivity
  • Pricing: $0.01/hour per endpoint per AZ + $0.01/GB data processed
  • Use Cases: SaaS services, shared services, third-party integrations, AWS service access

Detailed Example 1: SaaS Provider Offering Service to Customers
You're a SaaS provider offering an API service to customers. You don't want customers to access your service over the public internet (security concerns), and you don't want to peer VPCs with every customer (doesn't scale). Solution: Deploy your API service behind an NLB in your VPC. Create a VPC endpoint service (PrivateLink) associated with the NLB. Share the endpoint service name with customers. Each customer creates a VPC endpoint in their VPC pointing to your service. Customers access your API using the endpoint's private DNS name. Traffic flows privately over AWS network. You can onboard thousands of customers without VPC peering or internet exposure.
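
A minimal boto3 sketch of both sides of this pattern, with hypothetical ARNs and IDs: the provider publishes the endpoint service from its NLB, and each customer creates an interface endpoint against the returned service name.

import boto3

ec2 = boto3.client('ec2')

# Provider side: expose the NLB-fronted API as a PrivateLink endpoint service
service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=[
        'arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/api-nlb/abcdef1234567890'
    ],
    AcceptanceRequired=True,   # provider approves each customer connection
)
service_name = service['ServiceConfiguration']['ServiceName']

# Consumer side (customer account): interface endpoint to the provider's service
ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-0customer123456789',
    ServiceName=service_name,            # e.g. com.amazonaws.vpce.us-east-1.vpce-svc-...
    SubnetIds=['subnet-0aaa111', 'subnet-0bbb222'],
    SecurityGroupIds=['sg-0endpoint1234567'],
    PrivateDnsEnabled=False,             # True only if the provider verified a private DNS name
)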

Detailed Example 2: Shared Services Across Accounts
Your organization has 50 AWS accounts, and you want to provide a centralized logging service that all accounts can use. Solution: Deploy the logging service behind an NLB in a central Logging account. Create a VPC endpoint service. Use AWS Resource Access Manager (RAM) to share the endpoint service with all accounts in your AWS Organization. Each account creates a VPC endpoint to the logging service. Applications in all accounts send logs to the endpoint's private IP. Logs are collected centrally without VPC peering or internet traffic. This provides scalable, private access to shared services.

Detailed Example 3: Third-Party Integration
Your application needs to integrate with a third-party SaaS provider that offers PrivateLink connectivity. The provider gives you their endpoint service name. Solution: Create a VPC endpoint in your VPC using the provider's endpoint service name. The provider approves your connection request. Your application accesses the provider's service using the endpoint's private DNS name. Traffic flows privately over AWS network, avoiding the public internet. This provides secure, low-latency integration with the third-party service.

šŸ”— Connections to Other Topics:

  • Requires Network Load Balancer for: Service provider to expose services
  • Creates Interface Endpoints (ENIs): In consumer VPC for private access
  • Enables Cross-Account access: Without VPC peering or internet exposure
  • Works with Private DNS: Custom DNS names resolve to endpoint private IPs
  • Complements VPC Peering by: Providing service-level access instead of network-level access

When to use PrivateLink vs VPC Peering:

  • Use PrivateLink when: Exposing specific services (not entire VPC), SaaS provider model, need to scale to many consumers
  • Use VPC Peering when: Need full network connectivity between two VPCs, simpler architecture for 1-to-1 connections

Section 3: Complex DNS Architectures

Introduction

The problem: Hybrid environments require DNS resolution between on-premises and AWS, across multiple VPCs, and across accounts. Applications need to resolve private DNS names for resources in other networks. Managing DNS at scale with multiple domains, accounts, and regions is complex.

The solution: AWS Route 53 provides comprehensive DNS solutions: (1) Public hosted zones for internet-facing DNS. (2) Private hosted zones for VPC-internal DNS. (3) Route 53 Resolver for hybrid DNS (forwarding between on-premises and AWS). (4) Resolver rules for conditional forwarding. (5) Resolver endpoints (inbound and outbound) for hybrid connectivity. These enable complex DNS architectures spanning on-premises, AWS, and multi-account environments.

Why it's tested: Task Statement 2.3 requires you to "Implement complex hybrid and multi-account DNS architectures." The exam tests your ability to design DNS solutions for hybrid environments, configure conditional forwarding, and troubleshoot DNS issues.

Core Concepts

Route 53 Resolver Endpoints

What it is: Route 53 Resolver endpoints enable DNS resolution between your VPC and on-premises networks. There are two types: (1) Inbound endpoints: Allow on-premises DNS servers to forward queries to Route 53 Resolver in your VPC. (2) Outbound endpoints: Allow Route 53 Resolver in your VPC to forward queries to on-premises DNS servers. Resolver endpoints are ENIs deployed in your VPC subnets.

Why it exists: By default, Route 53 Resolver in a VPC can only resolve DNS queries for resources within that VPC and public DNS. It cannot resolve on-premises DNS names, and on-premises DNS servers cannot resolve VPC private DNS names. Resolver endpoints bridge this gap, enabling bidirectional DNS resolution between AWS and on-premises.

How it works (Detailed):

  1. Inbound Endpoint Creation: You create an inbound Resolver endpoint in your VPC, specifying subnets in 2+ AZs for high availability. Route 53 creates ENIs in those subnets with private IPs. These IPs are the DNS server addresses that on-premises DNS servers will forward queries to.

  2. On-Premises Configuration for Inbound: Configure your on-premises DNS servers to forward queries for AWS domains (e.g., *.aws.internal, *.compute.internal, or your private hosted zone domains) to the inbound endpoint's IPs. When on-premises applications query AWS DNS names, the on-premises DNS server forwards the query to the inbound endpoint. Route 53 Resolver resolves the query and returns the result.

  3. Outbound Endpoint Creation: You create an outbound Resolver endpoint in your VPC, specifying subnets in 2+ AZs. Route 53 creates ENIs in those subnets. These ENIs are used by Route 53 Resolver to forward queries to on-premises DNS servers.

  4. Resolver Rules for Outbound: You create Resolver rules that specify which domains to forward to on-premises DNS servers. For example, forward queries for *.corp.example.com to on-premises DNS servers at 192.168.1.10 and 192.168.1.11. Associate the rule with VPCs where you want the forwarding to apply.

  5. Query Flow - Inbound: (1) On-premises application queries myapp.aws.internal. (2) On-premises DNS server forwards query to inbound endpoint IP. (3) Route 53 Resolver in VPC resolves the query (checks private hosted zones, VPC DNS). (4) Resolver returns the result to on-premises DNS server. (5) On-premises DNS server returns result to application.

  6. Query Flow - Outbound: (1) EC2 instance in VPC queries fileserver.corp.example.com. (2) Route 53 Resolver checks Resolver rules and finds a rule for *.corp.example.com. (3) Resolver forwards query via outbound endpoint to on-premises DNS servers (192.168.1.10). (4) On-premises DNS server resolves the query and returns the result. (5) Resolver returns result to EC2 instance.

  7. Sharing Resolver Rules: Use AWS Resource Access Manager (RAM) to share Resolver rules across accounts. This enables centralized DNS management - create rules in a central account and share with all other accounts.

⭐ Must Know (Critical Resolver Endpoints Facts):

  • Inbound Endpoint: On-premises → AWS DNS resolution
  • Outbound Endpoint: AWS → on-premises DNS resolution
  • Resolver Rules: Specify which domains to forward (for outbound)
  • ENIs: Endpoints are ENIs in VPC subnets (need 2+ AZs for HA)
  • Conditional Forwarding: Forward specific domains, not all queries
  • Sharing: Use AWS RAM to share rules across accounts
  • Pricing: $0.125/hour per endpoint IP address (ENI) + $0.40 per million queries
  • Use Cases: Hybrid DNS, multi-account DNS, conditional forwarding

Detailed Example 1: Hybrid DNS with Inbound and Outbound Endpoints
Your company has an on-premises data center with DNS servers (192.168.1.10, 192.168.1.11) and AWS VPCs. On-premises applications need to resolve AWS private DNS names, and AWS applications need to resolve on-premises DNS names. Solution: (1) Create inbound Resolver endpoint in AWS VPC. Note the endpoint IPs (e.g., 10.0.1.100, 10.0.2.100). (2) Configure on-premises DNS servers to forward queries for *.aws.internal and your private hosted zone domains to the inbound endpoint IPs. (3) Create outbound Resolver endpoint in AWS VPC. (4) Create Resolver rule to forward queries for *.corp.example.com to on-premises DNS servers (192.168.1.10, 192.168.1.11). (5) Associate the rule with your VPCs. Now, on-premises applications can resolve AWS DNS names, and AWS applications can resolve on-premises DNS names. Bidirectional DNS resolution is established.
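
A minimal AWS CLI sketch of the outbound half of this design (subnet, security group, endpoint, and rule IDs are placeholders; the inbound endpoint is created the same way with --direction INBOUND):

    # Outbound Resolver endpoint across two AZs
    aws route53resolver create-resolver-endpoint \
        --creator-request-id corp-outbound-001 \
        --name corp-outbound \
        --direction OUTBOUND \
        --security-group-ids sg-0123456789abcdef0 \
        --ip-addresses SubnetId=subnet-0aaaaaaaaaaaaaaaa SubnetId=subnet-0bbbbbbbbbbbbbbbb

    # Forward corp.example.com to the on-premises DNS servers
    aws route53resolver create-resolver-rule \
        --creator-request-id corp-rule-001 \
        --name corp-forwarding \
        --rule-type FORWARD \
        --domain-name corp.example.com \
        --resolver-endpoint-id rslvr-out-0123456789abcdef0 \
        --target-ips Ip=192.168.1.10,Port=53 Ip=192.168.1.11,Port=53

    # Apply the rule to a VPC
    aws route53resolver associate-resolver-rule \
        --resolver-rule-id rslvr-rr-0123456789abcdef0 \
        --vpc-id vpc-0123456789abcdef0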

Detailed Example 2: Centralized DNS Management Across Accounts
Your organization has 20 AWS accounts, and you want centralized DNS management. All accounts need to resolve on-premises DNS names. Solution: (1) In a central Networking account, create an outbound Resolver endpoint. (2) Create Resolver rules for on-premises domains (*.corp.example.com → on-premises DNS servers). (3) Use AWS RAM to share the Resolver rules with all accounts in your AWS Organization. (4) In each account, associate the shared rules with VPCs. All accounts can now resolve on-premises DNS names using the centralized rules. You manage DNS forwarding in one place, and changes propagate to all accounts automatically.
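
A minimal AWS CLI sketch of the sharing step (ARNs and the organization ID are placeholders; sharing with an Organization requires RAM sharing to be enabled for the org):

    # Central Networking account: share the Resolver rule with the whole Organization
    aws ram create-resource-share \
        --name shared-dns-rules \
        --resource-arns arn:aws:route53resolver:us-east-1:111122223333:resolver-rule/rslvr-rr-0123456789abcdef0 \
        --principals arn:aws:organizations::111122223333:organization/o-exampleorgid

    # Consumer account: associate the shared rule with its own VPC
    aws route53resolver associate-resolver-rule \
        --resolver-rule-id rslvr-rr-0123456789abcdef0 \
        --vpc-id vpc-0fedcba9876543210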

Detailed Example 3: Multi-Region DNS with Resolver Endpoints
You have VPCs in us-east-1 and eu-west-1, and both need to resolve on-premises DNS names. Solution: (1) Create outbound Resolver endpoints in both regions. (2) Create Resolver rules in both regions to forward on-premises domains to on-premises DNS servers. (3) Ensure on-premises DNS servers are reachable from both regions (via Direct Connect or VPN). Both regions can now resolve on-premises DNS names. Keep in mind that Resolver endpoints and rules are regional resources: RAM can share rules with other accounts in the same region, but each region needs its own outbound endpoint and rules.

šŸ”— Connections to Other Topics:

  • Requires Direct Connect or VPN for: Connectivity between AWS and on-premises for DNS traffic
  • Works with Private Hosted Zones to: Resolve private DNS names in VPCs
  • Uses AWS RAM to: Share Resolver rules across accounts
  • Integrates with Transit Gateway for: Centralized DNS architecture across multiple VPCs

Chapter 3: Network Management and Operation (20% of exam)

Chapter Overview

What you'll learn:

  • Maintaining routing and connectivity for AWS and hybrid networks
  • Monitoring and analyzing network traffic for troubleshooting
  • Optimizing AWS networks for performance, reliability, and cost
  • Using network analysis tools (Flow Logs, Traffic Mirroring, Reachability Analyzer)
  • Troubleshooting common network issues

Time to complete: 10-14 hours
Prerequisites: Chapters 0-2 (Fundamentals, Network Design, Network Implementation)

Exam Weight: 20% of scored content - Focuses on operational tasks, troubleshooting, and optimization.


Section 1: Maintaining Routing and Connectivity

Introduction

The problem: Networks are dynamic - routes change, connections fail, configurations drift, and traffic patterns evolve. Without proper maintenance and monitoring, network issues go undetected until they cause outages. Manual route management doesn't scale and is error-prone.

The solution: AWS provides tools and best practices for maintaining network connectivity: (1) Dynamic routing with BGP for automatic failover. (2) Route propagation for automatic route updates. (3) CloudWatch monitoring for proactive issue detection. (4) Regular validation with Reachability Analyzer. These enable self-healing networks that adapt to failures and changes.

Why it's tested: Task Statement 3.1 requires you to "Maintain routing and connectivity on AWS and hybrid networks." The exam tests your ability to troubleshoot routing issues, optimize BGP configurations, and ensure high availability.

Core Concepts

BGP Routing for Hybrid Networks

What it is: Border Gateway Protocol (BGP) is the routing protocol used for dynamic routing between your on-premises network and AWS (via Direct Connect or VPN). BGP automatically exchanges routes, detects failures, and reroutes traffic without manual intervention.

Why it exists: Static routing requires manual updates whenever routes change, leading to: (1) Downtime during failover (manual intervention needed). (2) Configuration errors (typos, forgotten routes). (3) Scalability issues (managing hundreds of static routes). BGP solves this by automatically learning routes, detecting failures, and converging to new paths within seconds.

How it works (Detailed):

  1. BGP Session Establishment: Your router and AWS router establish a BGP session over the Direct Connect or VPN connection. They first establish a TCP connection on port 179, then exchange OPEN messages with BGP parameters (ASN, hold time, router ID).

  2. Route Advertisement: Each router advertises its routes to the other: (1) Your router advertises on-premises networks (e.g., 192.168.0.0/16). (2) AWS advertises VPC CIDRs (e.g., 10.0.0.0/16) or public IP ranges. Routes include BGP attributes (AS-PATH, MED, local preference, communities) that influence routing decisions.

  3. Route Selection: When multiple paths exist to the same destination, BGP selects the best path using this decision process: (1) Highest local preference (locally configured). (2) Shortest AS-PATH (fewest autonomous systems traversed). (3) Lowest origin type (IGP < EGP < Incomplete). (4) Lowest MED (Multi-Exit Discriminator, advertised by neighbor). (5) eBGP over iBGP. (6) Lowest IGP metric to BGP next-hop. (7) Lowest router ID.

  4. Failover: If a connection fails (Direct Connect down, VPN tunnel down), BGP detects the failure when missed keepalives cause the hold timer to expire (90 seconds by default on Direct Connect) or, much faster, via BFD (Bidirectional Forwarding Detection, sub-second detection). BGP withdraws routes learned from the failed connection and converges to alternate paths. Traffic automatically reroutes to backup connections.

  5. Traffic Engineering with BGP Attributes: You can influence traffic flow using BGP attributes: (1) AS-PATH Prepending: Add your ASN multiple times to make a path less preferred. Example: Prepend AS 65001 three times (65001 65001 65001) to make Direct Connect less preferred than VPN for inbound traffic. (2) MED (Multi-Exit Discriminator): Suggest to AWS which path to prefer when multiple connections exist. Lower MED is preferred. (3) Local Preference: Control outbound traffic preference locally. Higher local preference is preferred. (4) Communities: Tag routes for policy-based routing.

⭐ Must Know (Critical BGP Facts):

  • Dynamic Routing: BGP automatically learns and updates routes
  • Failover: Detects failures and reroutes traffic automatically (typically 90 seconds, faster with BFD)
  • AS-PATH: Shortest AS-PATH preferred; use prepending to make paths less preferred
  • MED: Lower MED preferred; use to influence inbound traffic from AWS
  • Local Preference: Higher local preference preferred; use to influence outbound traffic to AWS
  • BGP Session: Requires TCP port 179; uses keepalives to detect failures
  • Use Cases: Direct Connect, VPN, Transit Gateway, hybrid connectivity

Detailed Example 1: Active/Passive Failover with BGP
You have a primary Direct Connect and backup VPN to the same VGW. You want all traffic to use Direct Connect, failing over to VPN only if Direct Connect fails. Solution: Configure BGP on both connections. On your router, set higher local preference for routes learned from Direct Connect (e.g., 200) than VPN (e.g., 100). Outbound traffic prefers Direct Connect. For inbound traffic, use AS-PATH prepending on VPN - prepend your ASN 3 times on VPN advertisements. AWS sees shorter AS-PATH from Direct Connect and prefers it. If Direct Connect fails, BGP withdraws Direct Connect routes, and traffic automatically uses VPN. When Direct Connect recovers, BGP re-advertises routes, and traffic fails back.

Detailed Example 2: Active/Active Load Sharing with BGP
You have two Direct Connect connections and want to load balance traffic across both. Solution: Configure BGP on both connections with equal AS-PATH length and equal MED. AWS sees two equal-cost paths and load balances traffic across both connections using ECMP (Equal-Cost Multi-Path). On your side, configure equal local preference for both connections. Traffic is distributed roughly 50/50 across both connections. If one connection fails, all traffic automatically shifts to the remaining connection.

Detailed Example 3: Traffic Engineering for Cost Optimization
You have Direct Connect (expensive but fast) and VPN (cheap but slower). You want bulk data transfers to use VPN and latency-sensitive traffic to use Direct Connect. Solution: Use BGP communities to tag routes. Advertise your bulk data subnet (192.168.100.0/24) with a community tag indicating "low priority." Configure your router to set lower local preference for this subnet's traffic, routing it over VPN. Advertise your application subnet (192.168.1.0/24) with normal local preference, routing it over Direct Connect. This optimizes costs by using cheaper VPN for non-critical traffic.

šŸ”— Connections to Other Topics:

  • Used by Direct Connect for: Dynamic routing and automatic failover
  • Used by Site-to-Site VPN for: Dynamic routing instead of static routes
  • Integrates with Transit Gateway for: Dynamic route propagation from VPN/Direct Connect attachments
  • Enables Traffic Engineering through: BGP attributes (AS-PATH, MED, local preference)

Section 2: Monitoring and Analyzing Network Traffic

Introduction

The problem: Network issues are often invisible until they cause outages or performance degradation. Without visibility into traffic patterns, you can't troubleshoot connectivity problems, detect security threats, optimize performance, or understand costs.

The solution: AWS provides comprehensive network monitoring and analysis tools: (1) VPC Flow Logs for traffic metadata. (2) VPC Traffic Mirroring for deep packet inspection. (3) CloudWatch for metrics and alarms. (4) Reachability Analyzer for configuration validation. (5) Transit Gateway Network Manager for topology visualization. These tools provide end-to-end visibility into network behavior.

Why it's tested: Task Statement 3.2 requires you to "Monitor and analyze network traffic to troubleshoot and optimize connectivity patterns." The exam tests your ability to use monitoring tools, analyze traffic data, and troubleshoot network issues.

Core Concepts

VPC Traffic Mirroring

What it is: VPC Traffic Mirroring captures and copies network traffic from elastic network interfaces (ENIs) and sends it to monitoring appliances for deep packet inspection, security analysis, and troubleshooting. Unlike Flow Logs which capture metadata, Traffic Mirroring captures full packet payloads.

Why it exists: Some troubleshooting and security scenarios require examining actual packet contents: (1) Analyzing application-layer protocols (HTTP headers, SQL queries). (2) Detecting malware or data exfiltration in packet payloads. (3) Troubleshooting application-level issues (malformed packets, protocol errors). (4) Compliance requirements for packet-level auditing. Flow Logs provide metadata but not payloads. Traffic Mirroring fills this gap.

How it works (Detailed):

  1. Mirror Source: You specify ENIs to mirror. Traffic Mirroring captures all packets sent to/from these ENIs, including accepted and rejected traffic.

  2. Mirror Target: You specify where to send mirrored traffic. Target options: (1) ENI: Send to an ENI attached to a monitoring appliance (IDS/IPS, packet analyzer). (2) Network Load Balancer: Send to an NLB that distributes mirrored traffic across multiple monitoring appliances for scale.

  3. Mirror Filter: You create filters to specify which traffic to mirror. Filter rules based on: (1) Direction: Inbound, outbound, or both. (2) Protocol: TCP, UDP, ICMP, or all. (3) Source/Destination CIDR: Specific IP ranges. (4) Source/Destination Port: Specific ports. This reduces mirrored traffic volume and focuses on relevant traffic.

  4. Mirror Session: You create a mirror session that ties together source, target, and filter. Specify session number (priority) if multiple sessions exist on the same ENI. Traffic matching the filter is encapsulated and sent to the target.

  5. Encapsulation: Mirrored traffic is encapsulated in VXLAN (Virtual Extensible LAN) with a VXLAN Network Identifier (VNI). The monitoring appliance must support VXLAN decapsulation to extract the original packets.

  6. Analysis: The monitoring appliance receives mirrored traffic, decapsulates VXLAN, and analyzes the original packets. It can perform deep packet inspection, protocol analysis, threat detection, and forensics.

⭐ Must Know (Critical Traffic Mirroring Facts):

  • Full Packet Capture: Captures packet payloads, not just metadata (unlike Flow Logs)
  • Mirror Source: ENIs to monitor
  • Mirror Target: ENI or NLB (for distributing to multiple appliances)
  • Mirror Filter: Specify which traffic to mirror (protocol, port, CIDR)
  • VXLAN Encapsulation: Mirrored traffic encapsulated in VXLAN; appliance must support VXLAN
  • Use Cases: Deep packet inspection, IDS/IPS, troubleshooting, compliance
  • Limitations: Sources must be supported (Nitro-based) instance types; the target can be in the same VPC as the source or in a VPC reachable via peering or Transit Gateway; cross-AZ mirroring incurs data transfer charges; mirrored traffic consumes bandwidth on the source instance
  • Pricing: $0.015/hour per mirror session + data processing charges

Detailed Example 1: Troubleshooting Application Issues
Your application intermittently fails to connect to a database, but Flow Logs show traffic is accepted. You need to see actual packets to diagnose. Solution: Create a Traffic Mirroring session with source=application server's ENI, target=monitoring instance's ENI, filter=TCP port 3306 (MySQL). Deploy a packet analyzer (Wireshark, tcpdump) on the monitoring instance. Capture mirrored traffic and analyze. You discover the application is sending malformed SQL queries that the database rejects at the application layer (not network layer). Flow Logs showed ACCEPT because the TCP connection succeeded, but the application-layer query failed. Traffic Mirroring revealed the root cause.
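
A minimal AWS CLI sketch of this mirroring setup (ENI and resource IDs are placeholders; the egress rule below captures queries from the application toward MySQL - add a matching ingress rule to capture the responses):

    # Filter that matches MySQL traffic only
    aws ec2 create-traffic-mirror-filter --description "mysql-only"
    aws ec2 create-traffic-mirror-filter-rule \
        --traffic-mirror-filter-id tmf-0123456789abcdef0 \
        --traffic-direction egress \
        --rule-number 100 \
        --rule-action accept \
        --protocol 6 \
        --destination-port-range FromPort=3306,ToPort=3306 \
        --source-cidr-block 0.0.0.0/0 \
        --destination-cidr-block 0.0.0.0/0

    # Target: the monitoring instance's ENI
    aws ec2 create-traffic-mirror-target --network-interface-id eni-0fedcba9876543210

    # Session: tie the source ENI (application server), target, and filter together
    aws ec2 create-traffic-mirror-session \
        --network-interface-id eni-0123456789abcdef0 \
        --traffic-mirror-target-id tmt-0123456789abcdef0 \
        --traffic-mirror-filter-id tmf-0123456789abcdef0 \
        --session-number 1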

Detailed Example 2: Security Threat Detection
You want to detect malware command-and-control (C2) traffic in your VPC. Solution: Deploy an IDS/IPS appliance (Suricata, Snort) behind an NLB. Create Traffic Mirroring sessions for all production instances, with target=NLB. Configure filter to mirror all outbound traffic. The IDS analyzes mirrored traffic for known C2 signatures, suspicious DNS queries, and data exfiltration patterns. When a compromised instance attempts to contact a C2 server, the IDS detects it and alerts your security team. Traffic Mirroring enabled real-time threat detection without impacting production traffic.

Detailed Example 3: Compliance Auditing
Your compliance requirements mandate packet-level auditing of all traffic to/from sensitive databases. Solution: Create Traffic Mirroring sessions for all database ENIs. Send mirrored traffic to a monitoring appliance that logs all packets to S3 for long-term retention. Configure filters to capture all traffic (no filtering). The appliance decapsulates VXLAN, extracts packets, and stores them in S3 with timestamps and metadata. During audits, you can retrieve and analyze packets to prove compliance. Traffic Mirroring provides the packet-level visibility required by compliance frameworks.

šŸ”— Connections to Other Topics:

  • Complements VPC Flow Logs by: Flow Logs provide metadata, Traffic Mirroring provides full packets
  • Works with IDS/IPS appliances: Sends mirrored traffic to security appliances for threat detection
  • Uses Network Load Balancer to: Distribute mirrored traffic across multiple monitoring appliances
  • Requires VXLAN support: Monitoring appliances must decapsulate VXLAN to extract packets

Section 3: Optimizing AWS Networks

Introduction

The problem: Networks often run suboptimally due to: (1) Inefficient routing (traffic taking longer paths). (2) Underutilized or overutilized resources. (3) Unnecessary costs (paying for unused bandwidth, inefficient architectures). (4) Performance bottlenecks (wrong network interface types, MTU mismatches). Optimization requires understanding performance characteristics, cost drivers, and architectural patterns.

The solution: AWS provides multiple optimization opportunities: (1) Choosing the right network interface (ENA, EFA). (2) Enabling jumbo frames for throughput. (3) Using VPC endpoints to avoid NAT Gateway costs. (4) Implementing centralized egress for cost savings. (5) Optimizing routing with Transit Gateway. These optimizations improve performance and reduce costs.

Why it's tested: Task Statement 3.3 requires you to "Optimize AWS networks for performance, reliability, and cost-effectiveness." The exam tests your ability to identify optimization opportunities and implement solutions.

Core Concepts

Network Interface Optimization

What it is: AWS offers different network interface types with varying performance characteristics: (1) Elastic Network Interface (ENI): The standard virtual network interface; available bandwidth depends on the instance type. (2) Elastic Network Adapter (ENA): Enhanced networking, up to 100 Gbps with lower latency and higher PPS. (3) Elastic Fabric Adapter (EFA): An ENA with added OS-bypass capabilities for HPC workloads, providing ultra-low latency.

Why it exists: Different workloads have different network requirements. Standard ENI is sufficient for most workloads, but high-performance applications (databases, analytics, HPC) need enhanced networking. ENA provides better performance at no additional cost. EFA enables HPC applications to achieve near-bare-metal performance.

How to optimize:

  1. Enable ENA: Most modern instance types support ENA by default. Verify ENA is enabled: aws ec2 describe-instances --instance-ids i-xxx --query 'Reservations[].Instances[].EnaSupport'. If false, enable ENA on the instance. ENA provides up to 100 Gbps bandwidth, lower latency, and higher packets per second (PPS) compared to standard ENI.

  2. Use EFA for HPC: For tightly-coupled HPC workloads (MPI applications, computational fluid dynamics, weather modeling), use EFA. EFA supports OS-bypass, allowing applications to communicate directly with the network interface without kernel involvement, reducing latency to microseconds. Deploy instances in a cluster placement group for lowest latency.

  3. Jumbo Frames: Enable jumbo frames (MTU 9001) for improved throughput on large data transfers. Configure jumbo frames on: (1) EC2 instances (set MTU 9001 on network interface). (2) Direct Connect VIFs (enable jumbo frames on VIF). (3) VPN connections (not supported - VPN uses MTU 1500). Jumbo frames reduce packet overhead, increasing effective throughput by 10-20% for large transfers.

  4. Placement Groups: Use placement groups to optimize network performance: (1) Cluster: Instances packed close together in the same AZ for the lowest latency and highest throughput (up to 100 Gbps on supported instances). (2) Partition: Instances spread across partitions (racks), reduces correlated failures. (3) Spread: Instances on distinct hardware, maximum isolation. For network-intensive workloads, use cluster placement groups.

⭐ Must Know (Network Interface Optimization):

  • ENA: Enhanced networking, up to 100 Gbps, lower latency, higher PPS - enable on all modern instances
  • EFA: For HPC, supports OS-bypass, ultra-low latency (microseconds) - use with cluster placement groups
  • Jumbo Frames: MTU 9001, improves throughput by 10-20% - enable on instances and Direct Connect; Transit Gateway supports MTU up to 8500 (VPN is limited to 1500)
  • Placement Groups: Cluster for low latency, Partition for fault isolation, Spread for maximum isolation
  • Cost: ENA and EFA have no additional cost beyond instance pricing

Detailed Example 1: Database Performance Optimization
Your database instances are experiencing high network latency and low throughput. Solution: (1) Verify instances support ENA - if not, migrate to ENA-supported instance types (e.g., m5, r5, c5). (2) Enable jumbo frames (MTU 9001) on database instances and application instances. (3) Deploy database instances in a cluster placement group for lowest latency. (4) Use EBS-optimized instances to ensure network bandwidth isn't shared with EBS traffic. After optimization, latency drops from 5ms to <1ms, and throughput increases from 5 Gbps to 25 Gbps.
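
A minimal sketch of the ENA and MTU steps (instance ID and interface name are placeholders; enabling ENA requires the instance to be stopped and an AMI with the ENA driver installed):

    # Check and, if needed, enable ENA support on the instance
    aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
        --query 'Reservations[].Instances[].EnaSupport'
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --ena-support

    # On the instance (Linux): raise the interface MTU for jumbo frames
    sudo ip link set dev eth0 mtu 9001
    ip link show eth0    # verify the new MTU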

Detailed Example 2: HPC Workload with EFA
Your computational fluid dynamics (CFD) simulation requires ultra-low latency inter-node communication. Solution: Deploy instances with EFA support (e.g., c5n.18xlarge) in a cluster placement group. Install EFA drivers and configure MPI to use EFA's OS-bypass feature. Run CFD simulation with MPI across 100 nodes. EFA provides <10 microsecond latency between nodes, enabling near-linear scaling. Without EFA, latency would be 50-100 microseconds, significantly degrading performance.

Detailed Example 3: Cost Optimization with VPC Endpoints
Your application transfers 10 TB/month to S3, incurring NAT Gateway data processing charges ($0.045/GB = $450/month). Solution: Create a Gateway VPC Endpoint for S3. Update route tables to route S3 traffic to the VPC endpoint instead of NAT Gateway. Traffic to S3 now flows directly through the VPC endpoint (free) instead of through NAT Gateway. Savings: $450/month. Additionally, throughput improves because VPC endpoints don't have bandwidth limits like NAT Gateways.
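
A minimal AWS CLI sketch of the change (VPC and route table IDs are placeholders; gateway endpoints add the required S3 prefix-list routes to the listed route tables automatically):

    aws ec2 create-vpc-endpoint \
        --vpc-id vpc-0123456789abcdef0 \
        --vpc-endpoint-type Gateway \
        --service-name com.amazonaws.us-east-1.s3 \
        --route-table-ids rtb-0123456789abcdef0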

šŸ”— Connections to Other Topics:

  • ENA is default on: Modern instance types (m5, r5, c5, etc.)
  • EFA is used with: HPC applications, MPI, cluster placement groups
  • Jumbo Frames work with: Direct Connect, Transit Gateway, VPC traffic (not VPN)
  • VPC Endpoints eliminate: NAT Gateway costs for AWS service access

Chapter Summary

What We Covered

This chapter covered Domain 3: Network Management and Operation (20% of exam), focusing on:

āœ… Maintaining Routing and Connectivity:

  • BGP routing for hybrid networks
  • Traffic engineering with BGP attributes (AS-PATH, MED, local preference)
  • Active/passive and active/active failover scenarios
  • Route propagation and dynamic routing

āœ… Monitoring and Analyzing Traffic:

  • VPC Traffic Mirroring for deep packet inspection
  • Using monitoring appliances for security and troubleshooting
  • VXLAN encapsulation and decapsulation
  • Complementing Flow Logs with Traffic Mirroring

āœ… Optimizing Networks:

  • Network interface types (ENI, ENA, EFA)
  • Jumbo frames for improved throughput
  • Placement groups for low latency
  • VPC endpoints for cost optimization

Critical Takeaways

  1. BGP: Use AS-PATH prepending and MED for traffic engineering; BGP provides automatic failover
  2. Traffic Mirroring: Captures full packets (not just metadata); requires VXLAN support on monitoring appliances
  3. ENA: Enable on all modern instances for better performance at no extra cost
  4. Jumbo Frames: MTU 9001 improves throughput by 10-20% for large transfers
  5. VPC Endpoints: Eliminate NAT Gateway costs for AWS service access

Self-Assessment Checklist

  • I can configure BGP for active/passive failover
  • I understand how to use AS-PATH prepending for traffic engineering
  • I can set up VPC Traffic Mirroring for packet analysis
  • I know when to use ENA vs EFA
  • I can optimize network costs with VPC endpoints

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle: Questions 1-25
  • Expected score: 75%+ to proceed

Quick Reference Card

BGP Attributes:

  • AS-PATH: Shortest preferred; prepend to make less preferred
  • MED: Lower preferred; use for inbound traffic control
  • Local Preference: Higher preferred; use for outbound traffic control

Traffic Mirroring:

  • Source: ENIs to monitor
  • Target: ENI or NLB
  • Filter: Protocol, port, CIDR
  • Encapsulation: VXLAN

Network Interfaces:

  • ENI: Standard virtual interface; bandwidth depends on instance type
  • ENA: Enhanced, lower latency, higher PPS
  • EFA: HPC, OS-bypass, microsecond latency

Next Chapter: Domain 4 - Network Security, Compliance, and Governance (05_domain_4_security_compliance)


Chapter 4: Network Security, Compliance, and Governance (24% of exam)

Chapter Overview

What you'll learn:

  • Implementing network security features (AWS WAF, Shield, Network Firewall)
  • Securing inbound and outbound traffic flows
  • Validating and auditing security with monitoring services
  • Implementing encryption for data in transit
  • Ensuring compliance with security requirements

Time to complete: 12-16 hours
Prerequisites: Chapters 0-3 (Fundamentals through Network Management)

Exam Weight: 24% of scored content - Critical domain focusing on security, compliance, and governance.


Section 1: Network Security Features

Introduction

The problem: Networks face constant security threats: DDoS attacks, web application exploits (SQL injection, XSS), malware, data exfiltration, and unauthorized access. Traditional perimeter security (firewalls at the edge) is insufficient for cloud environments where traffic flows are complex and dynamic.

The solution: AWS provides multiple layers of network security: (1) AWS Shield for DDoS protection. (2) AWS WAF for web application protection. (3) AWS Network Firewall for stateful inspection. (4) Security Groups and NACLs for instance-level security. (5) VPC Flow Logs for security monitoring. These services provide defense-in-depth, protecting against various threat vectors.

Why it's tested: Task Statement 4.1 requires you to "Implement and maintain network features to meet security and compliance needs." The exam tests your ability to design secure network architectures, implement security controls, and respond to threats.

Core Concepts

AWS WAF (Web Application Firewall)

What it is: AWS WAF is a web application firewall that protects web applications from common web exploits and bots. It filters HTTP/HTTPS requests based on rules you define, blocking malicious traffic before it reaches your application. WAF integrates with CloudFront, Application Load Balancer, API Gateway, and AppSync.

Why it exists: Web applications face constant attacks: (1) SQL injection: Attackers inject malicious SQL to access databases. (2) Cross-site scripting (XSS): Attackers inject malicious scripts into web pages. (3) Bots: Automated scrapers, credential stuffing, inventory hoarding. (4) DDoS: Application-layer attacks overwhelming resources. Traditional firewalls operate at network layer and can't inspect HTTP content. WAF operates at application layer (Layer 7), inspecting HTTP requests and blocking attacks.

How it works (Detailed):

  1. Web ACL Creation: You create a Web Access Control List (Web ACL) that contains rules for filtering traffic. Associate the Web ACL with resources (CloudFront distribution, ALB, API Gateway).

  2. Rules: You add rules to the Web ACL. Rule types: (1) Managed Rules: Pre-configured rule groups from AWS or AWS Marketplace vendors (e.g., Core Rule Set, Known Bad Inputs, SQL Injection, Linux/Windows exploits). (2) Custom Rules: Rules you create based on conditions (IP addresses, HTTP headers, URI paths, query strings, request body). (3) Rate-based Rules: Block IPs exceeding request rate thresholds (e.g., >2000 requests per 5 minutes).

  3. Rule Evaluation: When a request arrives, WAF evaluates rules in priority order (lowest number first). For each rule, WAF checks if the request matches the rule's conditions. If matched, WAF takes the rule's action: (1) Allow: Pass request to application. (2) Block: Return 403 Forbidden. (3) Count: Count matches but don't block (for testing). (4) CAPTCHA: Challenge user with CAPTCHA. (5) Challenge: Challenge user with silent challenge (JavaScript/cookie validation).

  4. Default Action: If no rules match, WAF applies the default action (Allow or Block). Best practice: Set default to Allow and explicitly block malicious traffic with rules.

  5. Logging: Enable WAF logging to send request logs to CloudWatch Logs, S3, or Kinesis Data Firehose. Logs include request details, matched rules, and actions taken. Use logs for security analysis, compliance, and tuning rules.

  6. Bot Control: Use AWS WAF Bot Control managed rule group to detect and block bots. Bot Control categorizes bots (verified bots like Googlebot, unverified bots, scrapers) and allows you to block or challenge based on category.

  7. IP Reputation Lists: Use AWS Managed Rules IP Reputation List to block requests from known malicious IPs (botnets, scanners, Tor exit nodes).

⭐ Must Know (Critical WAF Facts):

  • Layer 7 Protection: Inspects HTTP/HTTPS requests at application layer
  • Integration: CloudFront, ALB, API Gateway, AppSync
  • Rule Types: Managed (AWS/Marketplace), Custom, Rate-based
  • Actions: Allow, Block, Count, CAPTCHA, Challenge
  • Managed Rules: Pre-configured rule groups for common threats (SQL injection, XSS, etc.)
  • Bot Control: Detect and block bots, scrapers, credential stuffing
  • Pricing: $5/month per Web ACL + $1/month per rule + $0.60 per million requests
  • Use Cases: Protect web apps from OWASP Top 10, block bots, rate limiting, geo-blocking

Detailed Example 1: Protecting Against SQL Injection
Your web application has a login form vulnerable to SQL injection. Solution: Create a Web ACL and attach it to your ALB. Add the AWS Managed Rules "SQL Database" rule group, which contains rules detecting SQL injection patterns in query strings, request bodies, and headers. Enable logging to CloudWatch. An attacker attempts SQL injection: username=admin' OR '1'='1. WAF detects the SQL injection pattern, blocks the request (403 Forbidden), and logs the attempt. Your application is protected without code changes.

Detailed Example 2: Rate Limiting to Prevent DDoS
Your API is being overwhelmed by a DDoS attack - a single IP is sending 10,000 requests per minute. Solution: Create a rate-based rule in your Web ACL: "Block IPs sending >2000 requests per 5 minutes." WAF tracks request rates per IP. When the attacker's IP exceeds 2000 requests in 5 minutes, WAF blocks all subsequent requests from that IP for the next 5 minutes. The attack is mitigated, and legitimate users can still access your API.
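
A minimal AWS CLI sketch of such a rate-based rule (names, ARNs, and the region are placeholders; the Web ACL is then attached to the ALB or API Gateway stage):

    aws wafv2 create-web-acl \
        --name api-protection \
        --scope REGIONAL \
        --default-action Allow={} \
        --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=api-protection \
        --rules '[{"Name":"rate-limit","Priority":1,
                   "Statement":{"RateBasedStatement":{"Limit":2000,"AggregateKeyType":"IP"}},
                   "Action":{"Block":{}},
                   "VisibilityConfig":{"SampledRequestsEnabled":true,"CloudWatchMetricsEnabled":true,"MetricName":"rate-limit"}}]'

    # Attach the Web ACL to the load balancer
    aws wafv2 associate-web-acl \
        --web-acl-arn arn:aws:wafv2:us-east-1:111122223333:regional/webacl/api-protection/EXAMPLE-ID \
        --resource-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/api-alb/1234567890abcdef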

Detailed Example 3: Bot Management
Your e-commerce site is being scraped by bots, stealing product data and pricing. Solution: Add AWS WAF Bot Control managed rule group to your Web ACL. Configure Bot Control to: (1) Allow verified bots (Googlebot, Bingbot) for SEO. (2) Challenge unverified bots with CAPTCHA. (3) Block scrapers and malicious bots. Bot Control analyzes request patterns, JavaScript execution, and browser fingerprints to identify bots. Scrapers are blocked, verified bots are allowed, and legitimate users pass through. Your product data is protected.

šŸ”— Connections to Other Topics:

  • Integrates with Application Load Balancer to: Protect web applications behind ALB
  • Works with CloudFront to: Protect content at edge locations globally
  • Complements AWS Shield by: WAF protects Layer 7, Shield protects Layer 3/4
  • Uses CloudWatch for: Logging and monitoring WAF activity

AWS Shield (DDoS Protection)

What it is: AWS Shield is a managed DDoS (Distributed Denial of Service) protection service that safeguards applications running on AWS. Shield comes in two tiers: (1) Shield Standard: Automatic protection against common Layer 3/4 DDoS attacks, included free with all AWS accounts. (2) Shield Advanced: Enhanced protection, DDoS cost protection, 24/7 DDoS Response Team (DRT), and advanced monitoring.

Why it exists: DDoS attacks overwhelm applications with massive traffic volumes, making them unavailable to legitimate users. Attacks can be: (1) Volumetric: Flood network with traffic (UDP floods, DNS amplification). (2) Protocol: Exploit protocol weaknesses (SYN floods, fragmented packets). (3) Application-layer: Target application resources (HTTP floods). Shield protects against these attacks automatically.

How it works (Detailed):

  1. Shield Standard: Automatically enabled on all AWS accounts at no cost. Protects against common Layer 3/4 DDoS attacks: (1) SYN floods: Shield detects and drops malicious SYN packets. (2) UDP reflection attacks: Shield filters reflected traffic. (3) DNS query floods: Shield absorbs DNS floods at edge locations. Protection is always-on and transparent - no configuration needed.

  2. Shield Advanced: Provides enhanced protection for CloudFront, Route 53, ALB, NLB, Elastic IP, and Global Accelerator. Features: (1) Advanced attack detection: Machine learning detects sophisticated attacks. (2) DDoS cost protection: Credits for scaling charges during attacks. (3) DDoS Response Team (DRT): 24/7 access to AWS experts during attacks. (4) Real-time attack notifications: CloudWatch alarms for attacks. (5) Attack forensics: Detailed reports on attacks.

  3. Attack Detection: Shield continuously monitors traffic patterns using machine learning. When an attack is detected, Shield automatically applies mitigations: (1) Traffic scrubbing: Filters malicious traffic while allowing legitimate traffic. (2) Rate limiting: Limits request rates from attacking sources. (3) Geo-blocking: Blocks traffic from attack source regions (Shield Advanced).

  4. DDoS Response Team (DRT): With Shield Advanced, you can engage DRT during attacks. DRT analyzes the attack, applies custom mitigations, and helps optimize your architecture for DDoS resilience. DRT can also create WAF rules on your behalf to block application-layer attacks.

  5. Health-Based Detection: Shield Advanced monitors application health metrics (ALB target health, CloudFront error rates). If health degrades during high traffic, Shield assumes a DDoS attack and applies mitigations even if traffic patterns don't match known attack signatures.

⭐ Must Know (Critical Shield Facts):

  • Shield Standard: Free, automatic, protects against common Layer 3/4 DDoS attacks
  • Shield Advanced: $3,000/month, enhanced protection, DDoS cost protection, DRT access
  • Protected Resources: CloudFront, Route 53, ALB, NLB, Elastic IP, Global Accelerator
  • DDoS Cost Protection: Shield Advanced credits for scaling charges during attacks
  • DRT: 24/7 access to AWS DDoS experts (Shield Advanced only)
  • Integration: Works with WAF for application-layer protection
  • Use Cases: Protect public-facing applications from DDoS attacks

Detailed Example 1: Volumetric DDoS Attack
Your website is hit by a 100 Gbps UDP flood attack. Shield Standard automatically detects the attack and applies mitigations at AWS edge locations. The attack traffic is absorbed and filtered before reaching your application. Your website remains available to legitimate users. No action required from you - Shield Standard handled it automatically.

Detailed Example 2: Application-Layer DDoS with Shield Advanced
Your API behind an ALB is hit by an HTTP flood - 1 million requests per second from a botnet. Shield Advanced detects the attack based on abnormal traffic patterns and degraded ALB target health. Shield Advanced engages DRT, who analyzes the attack and creates WAF rules to block the botnet's traffic patterns. The attack is mitigated within minutes. Shield Advanced also provides DDoS cost protection, crediting any scaling charges incurred during the attack.

Detailed Example 3: Multi-Vector DDoS Attack
Your application faces a sophisticated multi-vector attack: (1) Volumetric: 50 Gbps SYN flood. (2) Protocol: Fragmented packet attack. (3) Application-layer: HTTP flood targeting login page. Shield Advanced detects all three vectors. Shield mitigates the volumetric and protocol attacks automatically. DRT creates WAF rules to block the HTTP flood. The attack is fully mitigated, and your application remains available. Post-attack, DRT provides a forensics report and recommendations for improving DDoS resilience.

šŸ”— Connections to Other Topics:

  • Complements AWS WAF by: Shield protects Layer 3/4, WAF protects Layer 7
  • Protects CloudFront distributions: DDoS protection at edge locations
  • Works with Route 53 to: Protect DNS from query floods
  • Integrates with Application Load Balancer for: DDoS protection for web applications

AWS Network Firewall

What it is: AWS Network Firewall is a managed, stateful network firewall service that provides filtering for VPC traffic. It inspects traffic at Layer 3-7, supports intrusion prevention (IPS), and allows custom rules using Suricata-compatible rule syntax. Network Firewall is deployed at the VPC level and can filter traffic between VPCs, to/from the internet, and to/from on-premises networks.

Why it exists: Security Groups and NACLs provide basic filtering but have limitations: (1) No stateful inspection of application protocols. (2) No intrusion prevention. (3) No deep packet inspection. (4) Limited rule expressiveness. Organizations need more advanced filtering capabilities, especially for: (1) Centralized egress filtering (blocking malicious domains). (2) Intrusion prevention (detecting and blocking exploits). (3) Protocol-aware filtering (inspecting HTTP, TLS, DNS). Network Firewall provides these capabilities.

How it works (Detailed):

  1. Firewall Deployment: You create a Network Firewall in a VPC. Specify subnets in each AZ where firewall endpoints will be deployed. Network Firewall creates an endpoint (ENI) in each subnet.

  2. Firewall Policy: You create a firewall policy that defines filtering rules. The policy contains: (1) Stateless rule groups: Fast, simple rules for basic filtering (allow/drop based on 5-tuple). (2) Stateful rule groups: Advanced rules for deep inspection (domain filtering, IPS, protocol detection). (3) Default actions: What to do with traffic that doesn't match any rules (allow or drop).

  3. Stateless Rules: Evaluated first for performance. Rules match on: (1) Source/destination IP and port. (2) Protocol. (3) TCP flags. Actions: Pass (to stateful engine), Drop, Forward to stateful engine. Use stateless rules for high-volume, simple filtering (e.g., block all traffic from specific IPs).

  4. Stateful Rules: Evaluated after stateless rules. Rule types: (1) Domain list: Allow/deny traffic to specific domains (e.g., block *.malware.com). (2) Suricata-compatible rules: Custom IPS rules using Suricata syntax (detect exploits, malware, data exfiltration). (3) Standard stateful rules: 5-tuple rules with protocol awareness. Stateful engine maintains connection state and inspects application-layer protocols.

  5. Intrusion Prevention (IPS): Use AWS Managed Threat Signatures (Suricata rules maintained by AWS) to detect and block known threats: (1) Malware C2 communication. (2) Exploit attempts. (3) Suspicious DNS queries. (4) Data exfiltration patterns. IPS rules are updated automatically by AWS.

  6. Traffic Flow: (1) Traffic enters VPC via Internet Gateway or VPN. (2) Route table directs traffic to Network Firewall endpoint. (3) Firewall inspects traffic using stateless and stateful rules. (4) Allowed traffic is forwarded to destination. (5) Blocked traffic is dropped and logged. (6) Return traffic follows reverse path through firewall.

  7. Logging: Network Firewall logs to CloudWatch Logs, S3, or Kinesis Data Firehose. Log types: (1) Flow logs: Metadata for all flows (similar to VPC Flow Logs). (2) Alert logs: Logs for traffic matching IPS rules. Use logs for security analysis, compliance, and incident response.

⭐ Must Know (Critical Network Firewall Facts):

  • Stateful Firewall: Maintains connection state, inspects application protocols
  • IPS: Intrusion prevention with Suricata-compatible rules
  • Domain Filtering: Block traffic to malicious domains
  • Deployment: VPC-level, with endpoints in subnets
  • Rule Types: Stateless (fast, simple), Stateful (deep inspection), IPS (threat detection)
  • Managed Threat Signatures: AWS-maintained Suricata rules for known threats
  • Logging: Flow logs and alert logs to CloudWatch, S3, or Kinesis
  • Pricing: $0.395/hour per firewall endpoint + $0.065/GB processed
  • Use Cases: Centralized egress filtering, IPS, domain blocking, compliance

Detailed Example 1: Centralized Egress Filtering
You have 20 VPCs that need internet access, but you want to block traffic to known malicious domains. Solution: Create a centralized Egress VPC with Network Firewall. Attach all VPCs to Transit Gateway. Route all internet-bound traffic (0.0.0.0/0) through TGW to Egress VPC. In Egress VPC, route traffic to Network Firewall endpoint, then to NAT Gateway, then to Internet Gateway. Configure Network Firewall with: (1) Domain list rule group blocking known malicious domains. (2) AWS Managed Threat Signatures for IPS. All outbound traffic is inspected by Network Firewall. Malicious traffic is blocked and logged. This provides centralized security control for all VPCs.

Detailed Example 2: Intrusion Prevention
Your application is vulnerable to known exploits, and you need to block exploit attempts. Solution: Deploy Network Firewall in your VPC. Configure a stateful rule group with AWS Managed Threat Signatures. Enable IPS mode (drop traffic matching signatures). An attacker attempts to exploit a vulnerability (e.g., Log4Shell). Network Firewall's IPS detects the exploit pattern in the HTTP request and drops the traffic. The attack is blocked, and an alert is logged to CloudWatch. Your application is protected without patching (though you should still patch!).

Detailed Example 3: Compliance with Domain Filtering
Your compliance requirements mandate blocking access to social media and file-sharing sites. Solution: Deploy Network Firewall in your VPC. Create a domain list rule group with domains to block: *.facebook.com, *.twitter.com, *.dropbox.com, *.wetransfer.com. Configure the rule group to deny traffic to these domains. Users attempting to access blocked sites receive connection refused. Network Firewall logs all blocked attempts to S3 for compliance auditing. This enforces acceptable use policies at the network level.
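
A minimal AWS CLI sketch of the domain deny list (the name and capacity are placeholders; the rule group still has to be referenced from the firewall policy attached to the firewall):

    aws network-firewall create-rule-group \
        --rule-group-name blocked-domains \
        --type STATEFUL \
        --capacity 100 \
        --rule-group '{
            "RulesSource": {
                "RulesSourceList": {
                    "Targets": [".facebook.com", ".twitter.com", ".dropbox.com", ".wetransfer.com"],
                    "TargetTypes": ["HTTP_HOST", "TLS_SNI"],
                    "GeneratedRulesType": "DENYLIST"
                }
            }
        }'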

šŸ”— Connections to Other Topics:

  • Complements Security Groups and NACLs by: Providing stateful inspection and IPS
  • Works with Transit Gateway for: Centralized egress filtering across multiple VPCs
  • Integrates with CloudWatch for: Logging and monitoring firewall activity
  • Uses Suricata rules for: IPS and custom threat detection

Section 2: Encryption and Data Protection

Introduction

The problem: Data in transit over networks can be intercepted, eavesdropped, or tampered with. Compliance requirements (HIPAA, PCI-DSS, GDPR) mandate encryption for sensitive data. Without encryption, data is vulnerable to man-in-the-middle attacks, packet sniffing, and unauthorized access.

The solution: AWS provides multiple encryption options for data in transit: (1) TLS/SSL for HTTPS traffic. (2) IPsec for VPN connections. (3) MACsec for Direct Connect. (4) AWS Certificate Manager (ACM) for certificate management. These ensure data confidentiality and integrity during transmission.

Why it's tested: Task Statement 4.3 requires you to "Implement and maintain confidentiality of data and communications of the network." The exam tests your ability to implement encryption, manage certificates, and ensure secure communications.

Core Concepts

TLS/SSL Encryption

What it is: Transport Layer Security (TLS) and its predecessor Secure Sockets Layer (SSL) are cryptographic protocols that encrypt data in transit between clients and servers. TLS is used for HTTPS (HTTP over TLS), ensuring web traffic confidentiality and integrity.

Why it exists: HTTP traffic is unencrypted - anyone on the network path can read the data. This is unacceptable for sensitive data (passwords, credit cards, personal information). TLS encrypts traffic, preventing eavesdropping and tampering. It also provides authentication, ensuring clients connect to the legitimate server (not an imposter).

How it works (Detailed):

  1. TLS Handshake: Client and server negotiate encryption parameters: (1) Client sends ClientHello with supported TLS versions and cipher suites. (2) Server responds with ServerHello, selecting TLS version and cipher suite. (3) Server sends its certificate (public key). (4) Client verifies certificate against trusted Certificate Authorities (CAs). (5) Client and server exchange keys and establish encrypted session.

  2. Certificate Management with ACM: AWS Certificate Manager (ACM) provides free SSL/TLS certificates for use with AWS services (CloudFront, ALB, API Gateway). ACM handles certificate provisioning, renewal, and deployment. You request a certificate for your domain (e.g., www.example.com), validate domain ownership (DNS or email validation), and ACM issues the certificate. ACM automatically renews certificates before expiration.

  3. TLS Termination at Load Balancer: Deploy TLS termination at ALB or NLB. The load balancer decrypts incoming HTTPS traffic, inspects it (for WAF, routing decisions), and forwards to targets. Options: (1) Terminate TLS at load balancer, forward HTTP to targets (offloads encryption from targets). (2) Terminate TLS at load balancer, re-encrypt with TLS to targets (end-to-end encryption). (3) TLS passthrough (NLB only): Forward encrypted traffic to targets without decryption (targets handle TLS).

  4. TLS Versions and Cipher Suites: Configure TLS policies on load balancers to enforce strong encryption: (1) TLS 1.2 or 1.3 only (disable TLS 1.0/1.1 - vulnerable). (2) Strong cipher suites (AES-GCM, ChaCha20-Poly1305). (3) Perfect Forward Secrecy (PFS) cipher suites (ECDHE). Use AWS predefined security policies (e.g., ELBSecurityPolicy-TLS-1-2-2017-01) or create custom policies.

⭐ Must Know (Critical TLS Facts):

  • TLS Versions: Use TLS 1.2 or 1.3; disable TLS 1.0/1.1 (vulnerable)
  • ACM: Free SSL/TLS certificates for AWS services; automatic renewal
  • TLS Termination: Decrypt at load balancer; options: HTTP to targets, re-encrypt to targets, or passthrough
  • Cipher Suites: Use strong ciphers (AES-GCM, ChaCha20); enable PFS (ECDHE)
  • Certificate Validation: DNS validation (preferred) or email validation
  • SNI: Server Name Indication allows multiple certificates on one load balancer
  • Use Cases: HTTPS for web applications, API encryption, compliance requirements

Detailed Example 1: HTTPS for Web Application
Your web application handles credit card payments and must use HTTPS. Solution: Request an ACM certificate for your domain (www.example.com). Validate domain ownership via DNS (add CNAME record). ACM issues certificate. Create an ALB with HTTPS listener (port 443) and attach the ACM certificate. Configure HTTP listener (port 80) to redirect to HTTPS. Configure TLS policy to use TLS 1.2+ and strong cipher suites. All traffic to your application is now encrypted, meeting PCI-DSS requirements.
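
A minimal AWS CLI sketch of the certificate and listener steps (the domain and ARNs are placeholders; ELBSecurityPolicy-TLS13-1-2-2021-06 is one of the predefined TLS 1.3/1.2 security policies):

    # Request a public certificate and validate ownership via DNS
    aws acm request-certificate \
        --domain-name www.example.com \
        --validation-method DNS

    # HTTPS listener on the ALB with the issued certificate and a TLS 1.2+/1.3 policy
    aws elbv2 create-listener \
        --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/web-alb/1234567890abcdef \
        --protocol HTTPS --port 443 \
        --certificates CertificateArn=arn:aws:acm:us-east-1:111122223333:certificate/EXAMPLE-CERT-ID \
        --ssl-policy ELBSecurityPolicy-TLS13-1-2-2021-06 \
        --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/1234567890abcdef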

Detailed Example 2: End-to-End Encryption
Your security policy requires encryption from client to application servers (not just to load balancer). Solution: Deploy ALB with HTTPS listener and ACM certificate for TLS termination. Configure target group with HTTPS protocol (port 443). Install certificates on application servers (can use ACM Private CA for internal certificates). ALB terminates client TLS, inspects traffic (for WAF, routing), re-encrypts with TLS, and forwards to application servers. Application servers decrypt traffic. This provides end-to-end encryption while allowing ALB to inspect traffic.

Detailed Example 3: Multi-Domain Certificate with SNI
You host multiple domains (www.example.com, api.example.com, admin.example.com) on one ALB. Solution: Request ACM certificates for each domain. Add all certificates to the ALB's HTTPS listener. Enable SNI (Server Name Indication). When a client connects, it includes the domain name in the TLS handshake (SNI extension). ALB selects the appropriate certificate based on the domain name. This allows one ALB to serve multiple domains with different certificates.

šŸ”— Connections to Other Topics:

  • Used by Application Load Balancer for: HTTPS termination and re-encryption
  • Managed by AWS Certificate Manager for: Certificate provisioning and renewal
  • Required for PCI-DSS compliance when: Handling credit card data
  • Works with CloudFront for: HTTPS at edge locations

Chapter Summary

What We Covered

This chapter covered Domain 4: Network Security, Compliance, and Governance (24% of exam), focusing on:

āœ… Network Security Features:

  • AWS WAF for web application protection (SQL injection, XSS, bots)
  • AWS Shield for DDoS protection (Standard and Advanced)
  • AWS Network Firewall for stateful inspection and IPS
  • Managed rules and threat signatures

āœ… Encryption and Data Protection:

  • TLS/SSL for HTTPS encryption
  • AWS Certificate Manager for certificate management
  • TLS termination and end-to-end encryption
  • Strong cipher suites and TLS policies

Critical Takeaways

  1. WAF: Layer 7 protection; use managed rules for common threats; integrate with ALB/CloudFront
  2. Shield: Standard (free, automatic), Advanced ($3K/month, DRT access, cost protection)
  3. Network Firewall: Stateful firewall with IPS; use for centralized egress filtering and domain blocking
  4. TLS: Use TLS 1.2+; ACM provides free certificates; terminate at load balancer or end-to-end

Self-Assessment Checklist

  • I can configure AWS WAF to protect against SQL injection and XSS
  • I understand the difference between Shield Standard and Shield Advanced
  • I can deploy Network Firewall for centralized egress filtering
  • I know how to implement TLS termination at ALB with ACM certificates
  • I can configure strong TLS policies (TLS 1.2+, strong ciphers)

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle: Questions 1-30
  • Expected score: 75%+ to proceed

Quick Reference Card

AWS WAF:

  • Integration: CloudFront, ALB, API Gateway
  • Rules: Managed, Custom, Rate-based
  • Actions: Allow, Block, Count, CAPTCHA

AWS Shield:

  • Standard: Free, automatic, Layer 3/4
  • Advanced: $3K/month, DRT, cost protection

Network Firewall:

  • Stateful: Deep inspection, IPS
  • Rules: Stateless, Stateful, Domain lists
  • Pricing: $0.395/hour + $0.065/GB

TLS/SSL:

  • Versions: TLS 1.2, TLS 1.3 (disable 1.0/1.1)
  • ACM: Free certificates, automatic renewal
  • Termination: At load balancer or end-to-end

Next Chapter: Integration & Advanced Topics (06_integration)


Integration & Advanced Topics: Putting It All Together

Cross-Domain Scenarios

This chapter integrates concepts from all four domains to solve complex, real-world scenarios that span multiple topics.

Scenario Type 1: Global Multi-Region Architecture

What it tests: Understanding of edge services (Domain 1), hybrid connectivity (Domain 2), monitoring (Domain 3), and security (Domain 4).

How to approach:

  1. Identify global requirements (latency, availability, disaster recovery)
  2. Design edge layer (CloudFront, Global Accelerator)
  3. Design regional connectivity (Transit Gateway, Direct Connect)
  4. Implement security (WAF, Shield, Network Firewall)
  5. Configure monitoring (Flow Logs, CloudWatch, Reachability Analyzer)

Example Question Pattern:
"A global company needs to serve users in North America, Europe, and Asia with <50ms latency, protect against DDoS attacks, and provide failover between regions. Design the architecture."

Solution Approach:

  • Use Global Accelerator for static anycast IPs and global routing
  • Deploy ALBs in us-east-1, eu-west-1, ap-southeast-1
  • Use Shield Advanced for DDoS protection
  • Configure health checks for automatic regional failover
  • Use Route 53 for DNS-based failover as backup
  • Enable VPC Flow Logs and CloudWatch alarms for monitoring

Scenario Type 2: Hybrid Cloud with Centralized Security

What it tests: Direct Connect/VPN (Domain 2), Transit Gateway (Domain 2), Network Firewall (Domain 4), monitoring (Domain 3).

How to approach:

  1. Design hybrid connectivity (Direct Connect + VPN backup)
  2. Implement hub-and-spoke with Transit Gateway
  3. Deploy centralized security (Network Firewall in Egress VPC)
  4. Configure BGP for failover
  5. Implement monitoring and logging

Example Question Pattern:
"An enterprise has 50 VPCs across 3 regions and needs centralized egress filtering, hybrid connectivity to on-premises, and automatic failover. Design the solution."

Solution Approach:

  • Deploy Transit Gateways in each region
  • Connect TGWs with inter-region peering
  • Deploy Direct Connect with Transit VIF to each TGW
  • Configure Site-to-Site VPN as backup
  • Create Egress VPC with Network Firewall in each region
  • Route all internet traffic through Network Firewall
  • Use BGP for automatic failover between Direct Connect and VPN

Scenario Type 3: Compliance and Auditing

What it tests: Encryption (Domain 4), logging (Domain 3), DNS security (Domain 1), network segmentation (Domain 2).

How to approach:

  1. Identify compliance requirements (HIPAA, PCI-DSS, GDPR)
  2. Implement encryption (TLS, IPsec, MACsec)
  3. Configure comprehensive logging (Flow Logs, WAF logs, Network Firewall logs)
  4. Implement network segmentation (Transit Gateway route tables)
  5. Enable DNSSEC for DNS security

Example Question Pattern:
"A healthcare company must comply with HIPAA, requiring encryption in transit, comprehensive logging, and network segmentation between production and development. Design the solution."

Solution Approach:

  • Use TLS 1.2+ for all HTTPS traffic (ACM certificates)
  • Use IPsec VPN for on-premises connectivity
  • Enable VPC Flow Logs, WAF logs, Network Firewall logs to S3 for long-term retention
  • Use Transit Gateway with separate route tables for production and development
  • Enable DNSSEC on Route 53 hosted zones
  • Use Reachability Analyzer to validate segmentation
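
Two of the audit controls above (Flow Logs to S3 and DNSSEC signing) look roughly like this in boto3. Bucket, hosted zone, and KMS ARNs are placeholders, and the DNSSEC calls assume an asymmetric KMS key already exists.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
route53 = boto3.client("route53")

# Capture all accepted and rejected traffic for long-term retention in S3.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],                    # placeholder VPC
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::hipaa-network-audit-logs",   # placeholder bucket
)

# Enable DNSSEC signing on a public hosted zone (requires an asymmetric KMS key).
route53.create_key_signing_key(
    CallerReference="ksk-2024-01",
    HostedZoneId="Z0123456789EXAMPLE",                        # placeholder zone
    KeyManagementServiceArn="arn:aws:kms:us-east-1:111111111111:key/placeholder",
    Name="primary-ksk",
    Status="ACTIVE",
)
route53.enable_hosted_zone_dnssec(HostedZoneId="Z0123456789EXAMPLE")
```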

Advanced Topics

Multi-Account Networking with AWS Organizations

Prerequisites: Understanding of Transit Gateway, VPC sharing, AWS RAM

Builds on: Domain 2 (Multi-account architectures)

Why it's advanced: Requires coordinating networking across organizational units, managing shared resources, and implementing governance at scale.

Key Concepts:

  • AWS Organizations for account management
  • AWS Resource Access Manager (RAM) for sharing Transit Gateway, subnets, Route 53 Resolver rules
  • Service Control Policies (SCPs) for network governance
  • Centralized network account pattern

Implementation:

  1. Create network account in AWS Organizations
  2. Deploy Transit Gateway in network account
  3. Share Transit Gateway with all accounts using AWS RAM
  4. Each account attaches VPCs to shared Transit Gateway
  5. Network account manages Transit Gateway route tables
  6. Use SCPs to enforce network policies (e.g., prevent Direct Connect in non-network accounts)
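
Steps 3 and 6 can be sketched as follows (placeholder ARNs and account IDs): the RAM share makes the network account's Transit Gateway visible to the whole organization, and the SCP denies Direct Connect actions in the accounts it is attached to. The policy would then be attached to the non-network OUs with organizations.attach_policy.

```python
import json
import boto3

ram = boto3.client("ram", region_name="us-east-1")
org = boto3.client("organizations")

# Share the network account's Transit Gateway with the whole organization.
ram.create_resource_share(
    name="shared-transit-gateway",
    resourceArns=["arn:aws:ec2:us-east-1:111111111111:transit-gateway/tgw-0abc"],   # placeholder
    principals=["arn:aws:organizations::111111111111:organization/o-exampleorg"],   # placeholder org ARN
    allowExternalPrincipals=False,
)

# SCP that blocks Direct Connect management outside the network account.
deny_dx = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Deny", "Action": ["directconnect:*"], "Resource": "*"}],
}
org.create_policy(
    Name="deny-direct-connect",
    Description="Only the network account may manage Direct Connect",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(deny_dx),
)
```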

SD-WAN Integration with Transit Gateway Connect

Prerequisites: Understanding of Transit Gateway, BGP, GRE tunnels

Builds on: Domain 2 (Hybrid connectivity)

Why it's advanced: Requires understanding of overlay networks, GRE encapsulation, and SD-WAN architectures.

Key Concepts:

  • Transit Gateway Connect attachment
  • GRE tunnels over Direct Connect or VPN
  • SD-WAN appliances (Cisco Viptela, VMware SD-WAN, etc.)
  • BGP over GRE for dynamic routing

Implementation:

  1. Deploy SD-WAN appliances in VPC
  2. Create Transit Gateway Connect attachment
  3. Configure GRE tunnels from SD-WAN appliances to Transit Gateway
  4. Establish BGP sessions over GRE tunnels
  5. SD-WAN appliances advertise on-premises routes via BGP
  6. Transit Gateway routes traffic to SD-WAN appliances for on-premises destinations
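
Steps 2-3 map to these boto3 calls (placeholder IDs). The Connect attachment is layered over an existing transport attachment, and each Connect peer defines a GRE tunnel plus a BGP session; the inside CIDR must be a /29 from the 169.254.0.0/16 link-local range.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Layer the Connect attachment on top of an existing VPC (or DX) transport attachment.
connect = ec2.create_transit_gateway_connect(
    TransportTransitGatewayAttachmentId="tgw-attach-0transport0000000",  # placeholder
    Options={"Protocol": "gre"},
)

# Create a Connect peer: GRE tunnel + BGP session to the SD-WAN appliance.
ec2.create_transit_gateway_connect_peer(
    TransitGatewayAttachmentId=connect["TransitGatewayConnect"]["TransitGatewayAttachmentId"],
    PeerAddress="10.0.1.10",                 # SD-WAN appliance IP in the transport VPC (placeholder)
    InsideCidrBlocks=["169.254.200.0/29"],   # inside addresses used for GRE/BGP
    BgpOptions={"PeerAsn": 65010},           # appliance ASN (placeholder)
)
```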

IPv6 Networking

Prerequisites: Understanding of VPC, routing, security groups

Builds on: Domain 1 (Network design), Domain 2 (Implementation)

Why it's advanced: Requires understanding IPv6 addressing, dual-stack configurations, and IPv6-specific security considerations.

Key Concepts:

  • IPv6 CIDR blocks (/56 for VPC, /64 for subnets)
  • Dual-stack (IPv4 + IPv6) vs IPv6-only
  • Egress-only Internet Gateway for IPv6
  • IPv6 in Direct Connect and VPN
  • Security Groups and NACLs for IPv6

Implementation:

  1. Associate IPv6 CIDR block with VPC (AWS-provided or BYOIP)
  2. Assign IPv6 CIDR blocks to subnets
  3. Update route tables (add ::/0 route to IGW or Egress-only IGW)
  4. Update Security Groups and NACLs for IPv6 rules
  5. Enable IPv6 on EC2 instances
  6. Configure DNS for AAAA records
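
A minimal boto3 sketch of steps 1-4 for a dual-stack private subnet (placeholder IDs; the subnet /64 must come from the /56 assigned to the VPC):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
VPC, SUBNET, RTB, SG = "vpc-0abc", "subnet-0abc", "rtb-0abc", "sg-0abc"  # placeholders

# 1. Attach an Amazon-provided IPv6 /56 to the VPC.
ec2.associate_vpc_cidr_block(VpcId=VPC, AmazonProvidedIpv6CidrBlock=True)

# 2. Carve a /64 for the subnet (value must come from the VPC's /56).
ec2.associate_subnet_cidr_block(SubnetId=SUBNET, Ipv6CidrBlock="2600:1f18:1234:5600::/64")

# 3. Outbound-only IPv6 internet access for a private subnet.
eigw = ec2.create_egress_only_internet_gateway(VpcId=VPC)
ec2.create_route(
    RouteTableId=RTB,
    DestinationIpv6CidrBlock="::/0",
    EgressOnlyInternetGatewayId=eigw["EgressOnlyInternetGateway"]["EgressOnlyInternetGatewayId"],
)

# 4. Security group rule allowing inbound HTTPS over IPv6.
ec2.authorize_security_group_ingress(
    GroupId=SG,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "Ipv6Ranges": [{"CidrIpv6": "::/0"}],
    }],
)
```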

Common Question Patterns

Pattern 1: "Choose the most cost-effective solution"

How to recognize:

  • Question mentions cost optimization or budget constraints
  • Multiple solutions work technically, but differ in cost

What they're testing:

  • Understanding of AWS pricing models
  • Ability to optimize architectures for cost

How to answer:

  1. Eliminate solutions that don't meet requirements
  2. Compare costs of remaining solutions:
    • NAT Gateway vs VPC Endpoint (Gateway Endpoints for S3/DynamoDB are free)
    • Direct Connect vs VPN (VPN cheaper for <1 Gbps)
    • Centralized egress vs per-VPC NAT Gateways (centralized cheaper at scale)
  3. Choose the lowest-cost solution that meets all requirements

Example: "Which solution provides internet access for 50 VPCs most cost-effectively?"

  • Answer: Centralized egress VPC with Transit Gateway (shared NAT Gateways)
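
As a rough sanity check on why centralized egress wins at scale, here is an illustrative back-of-the-envelope comparison in Python. The prices are assumed, approximate us-east-1 style figures (NAT Gateway ~$0.045/hour and ~$0.045/GB; Transit Gateway ~$0.05/hour per attachment and ~$0.02/GB), so verify current pricing; the takeaway is that the hourly NAT charges dominate once they are multiplied across many VPCs.

```python
HOURS = 730                        # hours in a month
NAT_HR, NAT_GB = 0.045, 0.045      # assumed NAT Gateway prices
TGW_HR, TGW_GB = 0.05, 0.02        # assumed Transit Gateway prices
VPCS, AZS, GB = 50, 2, 1_000       # 50 VPCs, 2 AZs each, ~1 TB of egress per month

# Option A: a NAT Gateway in every AZ of every VPC
per_vpc_nat = VPCS * AZS * NAT_HR * HOURS + GB * NAT_GB

# Option B: centralized egress VPC (2 shared NAT Gateways) behind a Transit Gateway
centralized = (2 * NAT_HR * HOURS               # shared NAT Gateways
               + (VPCS + 1) * TGW_HR * HOURS    # TGW attachments (spokes + egress VPC)
               + GB * (TGW_GB + NAT_GB))        # data crosses both the TGW and NAT

print(f"Per-VPC NAT Gateways:  ~${per_vpc_nat:,.0f}/month")   # ~ $3,330
print(f"Centralized egress:    ~${centralized:,.0f}/month")   # ~ $1,992
```

Note that the per-GB cost is actually higher through the hub (TGW plus NAT processing), so at very high data volumes the comparison should be re-run with real numbers.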

Pattern 2: "Ensure high availability and automatic failover"

How to recognize:

  • Question mentions HA, failover, redundancy, or disaster recovery
  • Scenario involves critical applications

What they're testing:

  • Understanding of redundancy patterns
  • Knowledge of automatic failover mechanisms

How to answer:

  1. Identify single points of failure
  2. Implement redundancy:
    • Multiple AZs for resources
    • Multiple connections for hybrid connectivity (Direct Connect + VPN)
    • Multiple regions for disaster recovery
  3. Configure automatic failover:
    • BGP for network failover
    • Route 53 health checks for DNS failover
    • Multi-AZ for managed services
  4. Verify solution has no single points of failure

Example: "Ensure on-premises connectivity has automatic failover."

  • Answer: Direct Connect (primary) + Site-to-Site VPN (backup) with BGP
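
A sketch of the backup VPN leg in boto3 (placeholder IDs). With BGP on both paths and the same prefixes advertised over each, AWS generally prefers the Direct Connect route, so traffic shifts to the VPN only when Direct Connect fails.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Represent the on-premises router (public IP + BGP ASN) as a customer gateway.
cgw = ec2.create_customer_gateway(
    BgpAsn=65000,                   # on-premises ASN (placeholder)
    PublicIp="203.0.113.12",        # on-premises public IP (placeholder)
    Type="ipsec.1",
)

# Dynamic (BGP) Site-to-Site VPN terminating on the same Transit Gateway
# that the Direct Connect transit VIF reaches via the DX gateway.
ec2.create_vpn_connection(
    CustomerGatewayId=cgw["CustomerGateway"]["CustomerGatewayId"],
    Type="ipsec.1",
    TransitGatewayId="tgw-0abc0000000000000",     # placeholder
    Options={"StaticRoutesOnly": False},          # BGP, so failover is automatic
)
```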

Pattern 3: "Troubleshoot connectivity issue"

How to recognize:

  • Question describes a connectivity problem
  • Asks to identify the root cause or solution

What they're testing:

  • Systematic troubleshooting approach
  • Understanding of network layers and dependencies

How to answer:

  1. Check Layer 3 (routing):
    • Route tables have correct routes?
    • BGP sessions established?
  2. Check Layer 4 (security):
    • Security Groups allow traffic?
    • NACLs allow traffic (both inbound and outbound)?
  3. Check Layer 7 (application):
    • Application listening on correct port?
    • Load balancer health checks passing?
  4. Use tools:
    • VPC Flow Logs to see if traffic is accepted or rejected
    • Reachability Analyzer to validate configuration
  5. Identify the blocking point and fix it

Example: "Instance can't connect to database. Flow Logs show REJECT."

  • Answer: Check Security Group on database - likely missing inbound rule for application's IP/port
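
Step 4 (Reachability Analyzer) can be automated; a minimal boto3 sketch with placeholder instance IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Define the path to test: application instance -> database on TCP/3306.
path = ec2.create_network_insights_path(
    Source="i-0app0000000000000",           # placeholder app instance
    Destination="i-0db00000000000000",      # placeholder DB instance (or ENI)
    Protocol="tcp",
    DestinationPort=3306,
)

analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"],
)

# The analysis runs asynchronously; poll until Status is no longer "running".
# Explanations pinpoint the blocking component (e.g., a security group or
# NACL missing the required rule).
result = ec2.describe_network_insights_analyses(
    NetworkInsightsAnalysisIds=[analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]],
)["NetworkInsightsAnalyses"][0]
print(result["Status"], result.get("NetworkPathFound"), result.get("Explanations"))
```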

Quick Reference: Decision Frameworks

When to use CloudFront vs Global Accelerator

Requirement Solution
Content caching needed CloudFront
Static anycast IPs needed Global Accelerator
HTTP/HTTPS only CloudFront
TCP/UDP support needed Global Accelerator
Custom SSL certificates CloudFront
Health-based failover Global Accelerator

When to use Direct Connect vs VPN

Requirement Solution
>1 Gbps bandwidth Direct Connect
<1 Gbps bandwidth VPN (more cost-effective)
Consistent latency required Direct Connect
Quick setup (hours) VPN
Long-term connectivity Direct Connect
Temporary/test connectivity VPN
Encryption required VPN or VPN over Direct Connect

When to use ALB vs NLB vs GWLB

Requirement Solution
HTTP routing (path/host) ALB
Static IP addresses NLB
Ultra-low latency (<5ms) NLB
Security appliance insertion GWLB
WAF integration ALB
Preserve client IP NLB or GWLB

Next Chapter: Study Strategies & Test-Taking Techniques (07_study_strategies)


Study Strategies & Test-Taking Techniques

Effective Study Techniques

The 3-Pass Method

Pass 1: Understanding (Weeks 1-6)

  • Read each chapter thoroughly
  • Take detailed notes on ⭐ items
  • Complete practice exercises
  • Draw diagrams from memory
  • Focus on understanding WHY, not just WHAT

Pass 2: Application (Weeks 7-8)

  • Review chapter summaries only
  • Focus on decision frameworks and comparison tables
  • Practice full-length tests (aim for 70%+)
  • Analyze wrong answers to identify weak areas
  • Review weak areas in detail

Pass 3: Reinforcement (Weeks 9-10)

  • Review flagged items and weak areas
  • Memorize key facts (limits, pricing, features)
  • Practice tests (aim for 80%+)
  • Final review of cheat sheet
  • Simulate exam conditions

Active Learning Techniques

  1. Teach Someone: Explain concepts out loud as if teaching a colleague
  2. Draw Diagrams: Recreate architecture diagrams from memory
  3. Write Scenarios: Create your own exam questions
  4. Compare Options: Use comparison tables to understand trade-offs
  5. Hands-On Practice: Build architectures in AWS (use Free Tier where possible)

Memory Aids

Mnemonics for BGP Path Selection:
"Lazy Administrators Often Make Extra Income Regularly"

  • Local preference (highest)
  • AS-PATH (shortest)
  • Origin type (IGP < EGP < Incomplete)
  • MED (lowest)
  • EBGP over iBGP
  • IGP metric (lowest)
  • Router ID (lowest)
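
If it helps to see the order as an algorithm, here is a simplified Python sketch of the comparison above as a sort key (decision logic only; real BGP implementations include additional tie-breakers such as vendor-specific weight).

```python
# Rank routes by the best-path order listed above; the smallest key wins.
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

def best_path_key(route):
    return (
        -route["local_pref"],                    # 1. highest local preference
        len(route["as_path"]),                   # 2. shortest AS-PATH
        ORIGIN_RANK[route["origin"]],            # 3. origin: IGP < EGP < Incomplete
        route["med"],                            # 4. lowest MED
        0 if route["type"] == "eBGP" else 1,     # 5. eBGP preferred over iBGP
        route["igp_metric"],                     # 6. lowest IGP metric to next hop
        route["router_id"],                      # 7. lowest router ID
    )

routes = [
    {"local_pref": 100, "as_path": [65001, 65010], "origin": "IGP",
     "med": 50, "type": "eBGP", "igp_metric": 10, "router_id": "10.0.0.2"},
    {"local_pref": 200, "as_path": [65002], "origin": "IGP",
     "med": 0, "type": "iBGP", "igp_metric": 5, "router_id": "10.0.0.1"},
]
best = min(routes, key=best_path_key)   # higher local preference (200) wins despite iBGP
```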

Load Balancer Selection:
"Application needs ALB, Network needs NLB, Gateway needs GWLB"

VIF Types:
"Private for Private IPs, Public for Public services, Transit for Transit Gateway"

Test-Taking Strategies

Time Management

  • Total time: 170 minutes (2 hours 50 minutes)
  • Total questions: 65 (50 scored + 15 unscored)
  • Time per question: ~2.6 minutes
  • Buffer time: 20 minutes for review

Strategy:

  • First pass (120 min): Answer all questions, flag difficult ones
  • Second pass (30 min): Review flagged questions
  • Final pass (20 min): Review marked answers, check for mistakes

Question Analysis Method

Step 1: Read the scenario (30 seconds)

  • Identify: Company type, current architecture, problem statement
  • Note: Key requirements (performance, cost, security, compliance)
  • Highlight: Constraint keywords (must, cannot, minimize, maximize)

Step 2: Identify constraints (15 seconds)

  • Cost requirements (most cost-effective, minimize cost)
  • Performance needs (low latency, high throughput, high availability)
  • Security requirements (encryption, compliance, isolation)
  • Operational overhead (minimize management, automatic)

Step 3: Eliminate wrong answers (30 seconds)

  • Remove options that violate hard constraints
  • Eliminate technically incorrect options
  • Remove options that don't meet all requirements

Step 4: Choose best answer (45 seconds)

  • If one option remains, select it
  • If multiple options remain, choose the one that best meets the primary requirement (usually stated first)
  • For "most cost-effective", choose the cheapest option that works
  • For "high availability", choose the option with most redundancy

Handling Difficult Questions

When stuck:

  1. Eliminate obviously wrong answers (usually 1-2 options)
  2. Look for constraint keywords in the question
  3. Choose the most commonly recommended AWS solution
  4. Flag and move on if still unsure (don't spend >3 minutes)
  5. Return during second pass with fresh perspective

āš ļø Never: Spend more than 3 minutes on one question initially

Common Traps to Avoid

Trap 1: Overcomplicating the solution

  • AWS exams favor simple, managed solutions over complex custom solutions
  • If an option seems overly complex, it's probably wrong

Trap 2: Ignoring cost constraints

  • "Most cost-effective" means choose the cheapest option that works
  • Don't choose expensive solutions when cheaper ones meet requirements

Trap 3: Missing the word "NOT"

  • Questions like "Which is NOT a benefit of..." require opposite thinking
  • Highlight "NOT" in the question to avoid mistakes

Trap 4: Assuming features that don't exist

  • Don't assume services have features they don't
  • If unsure, choose the option that uses well-known features

Trap 5: Choosing the first "correct" answer

  • All options may be technically correct, but one is BEST
  • Read all options before selecting

Exam Day Preparation

Day Before Exam

Final Review (2-3 hours max):

  1. Review cheat sheet (1 hour)
  2. Skim chapter summaries (1 hour)
  3. Review flagged items from practice tests (30 min)
  4. Relax and get 8 hours sleep

Don't: Try to learn new topics or cram

Morning Routine

  • Light review of cheat sheet (30 min)
  • Eat a good breakfast
  • Arrive 30 minutes early (for online exams, test your system and check in 30 minutes early)
  • Bring ID and confirmation email

Brain Dump Strategy

When exam starts, immediately write down (on provided whiteboard or scratch paper):

  • BGP path selection order
  • Load balancer comparison (ALB vs NLB vs GWLB)
  • VIF types (Private, Public, Transit)
  • Key service limits (if you've memorized them)
  • Any formulas or mnemonics

During Exam

  • Follow time management strategy
  • Use scratch paper for complex scenarios
  • Flag questions for review (don't leave any blank)
  • Trust your preparation - first instinct is usually correct
  • Don't second-guess yourself excessively

After Exam

  • Results available immediately (pass/fail)
  • Detailed score report available within 5 business days
  • If you don't pass, review score report to identify weak domains
  • Focus study on weak areas and retake

Next Chapter: Final Week Checklist (08_final_checklist)


Final Week Checklist

7 Days Before Exam

Knowledge Audit

Go through this comprehensive checklist:

Domain 1: Network Design (30%)

  • I can explain when to use CloudFront vs Global Accelerator
  • I understand Route 53 Resolver endpoints (inbound vs outbound)
  • I can select the appropriate load balancer (ALB/NLB/GWLB) for any scenario
  • I know how to design DNS solutions for hybrid environments
  • I understand VPC Flow Logs, CloudWatch metrics, and Reachability Analyzer

Domain 2: Network Implementation (26%)

  • I can configure Direct Connect with Private, Public, and Transit VIFs
  • I understand BGP routing and how to configure failover
  • I can design Transit Gateway architectures with route table segmentation
  • I know when to use VPC Peering vs Transit Gateway vs PrivateLink
  • I can implement Site-to-Site VPN with dynamic or static routing

Domain 3: Network Management and Operation (20%)

  • I can use BGP attributes (AS-PATH, MED, local preference) for traffic engineering
  • I understand VPC Traffic Mirroring and when to use it
  • I can optimize networks with ENA, EFA, and jumbo frames
  • I know how to troubleshoot connectivity issues systematically
  • I can use Reachability Analyzer to validate configurations

Domain 4: Network Security, Compliance, and Governance (24%)

  • I can configure AWS WAF to protect against common web exploits
  • I understand the difference between Shield Standard and Shield Advanced
  • I can deploy AWS Network Firewall for centralized egress filtering
  • I know how to implement TLS/SSL with ACM certificates
  • I understand encryption options for data in transit (TLS, IPsec, MACsec)

If you checked fewer than 80%: Review those specific chapters and topics

Practice Test Marathon

  • Day 7: Full Practice Test 1 (target: 65%+)
  • Day 6: Review mistakes, study weak areas (spend 3-4 hours)
  • Day 5: Full Practice Test 2 (target: 75%+)
  • Day 4: Review mistakes, focus on patterns (spend 2-3 hours)
  • Day 3: Domain-focused tests for weak domains (target: 80%+)
  • Day 2: Full Practice Test 3 (target: 80%+)
  • Day 1: Review cheat sheet, relax, prepare for exam day

Day Before Exam

Final Review (2-3 hours max)

  1. Review cheat sheet (1 hour)

    • Focus on comparison tables
    • Review decision frameworks
    • Memorize key facts and limits
  2. Skim chapter summaries (1 hour)

    • Read "Critical Takeaways" from each chapter
    • Review "Quick Reference Cards"
    • Don't try to re-learn concepts
  3. Review flagged items (30 min)

    • Items you marked during study
    • Common mistakes from practice tests
    • Weak areas identified in practice tests

Don't: Try to learn new topics, cram, or study late into the night

Mental Preparation

  • Get 8 hours sleep (critical for cognitive performance)
  • Prepare exam day materials (ID, confirmation email)
  • Review testing center policies (or online proctoring requirements)
  • Set multiple alarms (don't oversleep!)
  • Plan route to testing center (or test internet connection for online)

Exam Day

Morning Routine

  • Light review of cheat sheet (30 min max)
  • Eat a good breakfast (avoid heavy meals that cause drowsiness)
  • Arrive 30 minutes early (or log in 30 min early for online)
  • Bring required ID and confirmation email
  • Use restroom before exam starts

Brain Dump Strategy

When exam starts, immediately write down on scratch paper:

BGP Path Selection:

  1. Local preference (highest)
  2. AS-PATH (shortest)
  3. Origin type (IGP < EGP < Incomplete)
  4. MED (lowest)
  5. eBGP over iBGP
  6. IGP metric (lowest)
  7. Router ID (lowest)

Load Balancer Comparison:

  • ALB: Layer 7, HTTP routing, WAF integration
  • NLB: Layer 4, static IPs, ultra-low latency, client IP preservation
  • GWLB: Layer 3, security appliances, GENEVE, transparent gateway

VIF Types:

  • Private VIF: VPC access via VGW or Direct Connect Gateway
  • Public VIF: AWS public services (S3, DynamoDB)
  • Transit VIF: Multiple VPCs via Direct Connect Gateway + Transit Gateway

Key Service Limits (if memorized):

  • Transit Gateway: 5,000 attachments, 10,000 routes per route table
  • Direct Connect: 50 VIFs per connection
  • VPC: 200 subnets per VPC, 200 route tables per VPC, 50 routes per route table (default)

Decision Frameworks:

  • Cost-effective: Choose cheapest option that meets requirements
  • High availability: Multiple AZs, multiple connections, BGP failover
  • Low latency: NLB, Direct Connect, cluster placement groups, ENA/EFA

During Exam

Time Management:

  • 170 minutes total, 65 questions
  • ~2.6 minutes per question
  • First pass: 120 minutes (answer all, flag difficult)
  • Second pass: 30 minutes (review flagged)
  • Final pass: 20 minutes (check for mistakes)

Question Strategy:

  1. Read scenario carefully, identify requirements and constraints
  2. Eliminate obviously wrong answers
  3. Choose best answer from remaining options
  4. Flag if unsure, move on (don't spend >3 minutes)
  5. Return to flagged questions in second pass

Common Patterns:

  • "Most cost-effective" → Choose cheapest option that works
  • "High availability" → Multiple AZs, automatic failover
  • "Low latency" → NLB, Direct Connect, ENA/EFA
  • "Centralized" → Transit Gateway, centralized egress VPC
  • "Security" → WAF, Shield, Network Firewall, encryption

Traps to Avoid:

  • Missing the word "NOT" in questions
  • Overcomplicating solutions (AWS favors simple, managed solutions)
  • Ignoring cost constraints when "cost-effective" is mentioned
  • Choosing first "correct" answer without reading all options
  • Second-guessing yourself excessively (trust your preparation)

After Exam

  • Results available immediately (pass/fail)
  • Detailed score report within 5 business days
  • If you pass: Celebrate! Update LinkedIn, resume
  • If you don't pass: Review score report, identify weak domains, study those areas, retake

Final Words

You're Ready When...

  • You score 80%+ on all practice tests
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You feel confident (not anxious) about the exam

Remember

  • Trust your preparation: You've studied hard, you know the material
  • Manage your time: Don't spend too long on any one question
  • Read carefully: Pay attention to requirements and constraints
  • Don't overthink: First instinct is usually correct
  • Stay calm: Take deep breaths if you feel anxious

Exam Day Mindset

  • This is an advanced certification - it's supposed to be challenging
  • You don't need 100% to pass - a scaled score of 750/1000 is passing
  • Every question you answer correctly gets you closer to passing
  • If you don't know an answer, eliminate wrong options and make your best guess
  • You've prepared thoroughly - trust yourself

Good luck on your AWS Certified Advanced Networking - Specialty exam!

You've got this! šŸŽÆ


Final File: Appendices (99_appendices)


Appendices

Appendix A: Quick Reference Tables

A.1 Service Comparison Matrix

VPN Solutions Comparison

Feature AWS Site-to-Site VPN AWS Client VPN AWS VPN CloudHub Third-Party VPN
Use Case Connect on-premises to VPC Remote user access Multiple sites to AWS Advanced features needed
Connection Type IPsec tunnel OpenVPN-based IPsec hub-and-spoke Varies by vendor
Bandwidth Up to 1.25 Gbps per tunnel Up to 10 Gbps aggregate Up to 1.25 Gbps per tunnel Depends on instance
High Availability 2 tunnels (active/passive) Multi-AZ by default Multiple VPN connections Configure manually
Routing Static or BGP Route table based BGP required BGP or static
Cost Model Per connection + data transfer Per association + connection hours Per connection + data transfer EC2 + licensing
Setup Complexity Medium Low Medium-High High
šŸŽÆ Exam Tip Default for hybrid connectivity For remote workers Multiple sites, same region When need advanced features

Direct Connect vs VPN Comparison

Aspect AWS Direct Connect Site-to-Site VPN Direct Connect + VPN
Connection Type Dedicated fiber Internet-based IPsec Hybrid approach
Bandwidth 50 Mbps - 100 Gbps Up to 1.25 Gbps per tunnel Combined capacity
Latency Low, consistent Variable (internet) Low with failover
Security Private connection Encrypted tunnel Private + encrypted backup
Setup Time Weeks to months Minutes to hours Weeks (DX) + hours (VPN)
Cost Port hours + data transfer Connection + data transfer Both combined
Reliability 99.9% SLA No SLA High (DX primary, VPN backup)
Use When High bandwidth, consistent performance Quick setup, encrypted Mission-critical with failover
šŸŽÆ Exam Tip Production workloads Dev/test or backup Best practice for production

Transit Gateway vs VPC Peering vs PrivateLink

Feature Transit Gateway VPC Peering AWS PrivateLink
Connectivity Model Hub-and-spoke Point-to-point Service-oriented
Max Connections 5,000 attachments 125 per VPC Unlimited endpoints
Transitive Routing āœ… Yes āŒ No N/A
Cross-Region āœ… Yes (peering) āœ… Yes āœ… Yes
Cross-Account āœ… Yes āœ… Yes āœ… Yes
IP Overlap āŒ No āŒ No āœ… Yes (via endpoints)
Bandwidth Up to 50 Gbps per AZ No limit Up to 10 Gbps per endpoint
Routing Route tables Route tables DNS-based
Cost Per attachment + data Free (data transfer only) Per endpoint + data
Use Case Complex multi-VPC Simple VPC-to-VPC Service exposure
šŸŽÆ Exam Tip Scalable hub architecture Simple, low-cost peering Private service access

Load Balancer Comparison

Feature Application LB (ALB) Network LB (NLB) Gateway LB (GWLB) Classic LB (CLB)
OSI Layer Layer 7 (HTTP/HTTPS) Layer 4 (TCP/UDP) Layer 3 (IP) Layer 4 & 7
Protocol Support HTTP, HTTPS, gRPC, WebSocket TCP, UDP, TLS IP packets TCP, SSL, HTTP, HTTPS
Performance High Ultra-high (millions RPS) High throughput Moderate
Static IP āŒ No (use Global Accelerator) āœ… Yes (Elastic IP) āŒ No āŒ No
Target Types Instance, IP, Lambda Instance, IP, ALB Instance, IP Instance only
Path-Based Routing āœ… Yes āŒ No āŒ No āŒ No
Host-Based Routing āœ… Yes āŒ No āŒ No āŒ No
WebSocket āœ… Yes āœ… Yes āŒ No āŒ No
Cross-Zone LB Always enabled Optional (free) Always enabled Optional (free)
Preserve Source IP Via X-Forwarded-For āœ… Yes āœ… Yes Via X-Forwarded-For
Use Case Web apps, microservices High performance, static IP Third-party appliances Legacy (deprecated)
šŸŽÆ Exam Tip Content-based routing Extreme performance Security appliances Don't choose for new

Route 53 Routing Policies Comparison

Policy Use Case Health Checks How It Works Exam Scenario
Simple Single resource āŒ No Returns all values randomly Basic DNS, no failover
Weighted A/B testing, gradual migration āœ… Yes Distributes by weight % "Route 20% to new version"
Latency Global apps, best performance āœ… Yes Routes to lowest latency "Users in multiple regions"
Failover Active-passive DR āœ… Yes (required) Primary → Secondary on failure "Automatic failover to DR"
Geolocation Content localization, compliance āœ… Yes Routes by user location "EU users to EU resources"
Geoproximity Traffic flow management āœ… Yes Routes by geographic distance + bias "Shift traffic between regions"
Multivalue Simple load balancing āœ… Yes Returns multiple healthy values "Distribute across multiple IPs"
IP-based Route by client IP ranges āœ… Yes Routes based on source IP "Corporate users to specific endpoint"

CloudFront vs Global Accelerator

Feature Amazon CloudFront AWS Global Accelerator
Purpose Content delivery (CDN) Application acceleration
Protocol HTTP/HTTPS, WebSocket TCP, UDP
Caching āœ… Yes (edge caching) āŒ No (proxying only)
Static IP āŒ No āœ… Yes (2 Anycast IPs)
Use Case Static/dynamic content, video Non-HTTP apps, gaming, VoIP
Edge Locations 400+ PoPs AWS edge network
Origin Types S3, HTTP servers, MediaStore ALB, NLB, EC2, Elastic IP
DDoS Protection AWS Shield Standard AWS Shield Standard
SSL/TLS Terminates at edge End-to-end or termination
Health Checks Origin health Endpoint health with failover
Traffic Management Cache behaviors Traffic dials, endpoint weights
šŸŽÆ Exam Tip Web content delivery TCP/UDP acceleration, static IP

DNS Record Types Quick Reference

Record Type Purpose Example Exam Relevance
A IPv4 address example.com → 192.0.2.1 ⭐ Most common
AAAA IPv6 address example.com → 2001:0db8::1 IPv6 scenarios
CNAME Alias to another name www → example.com āš ļø Can't use for apex
Alias AWS resource pointer example.com → ELB ⭐ Use for AWS resources
MX Mail server example.com → mail.example.com Email routing
TXT Text information Domain verification SPF, DKIM, verification
NS Name server Delegation Subdomain delegation
SOA Zone authority Zone metadata Zone management
PTR Reverse DNS IP → hostname Reverse lookups
SRV Service location _service._proto.name Service discovery
CAA Certificate authority Restrict CA Certificate control

A.2 Service Limits & Quotas

VPC Limits (Default)

Resource Default Limit Hard Limit Notes
VPCs per region 5 Increasable Soft limit
Subnets per VPC 200 Increasable Soft limit
IPv4 CIDR blocks per VPC 5 5 Hard limit
IPv6 CIDR blocks per VPC 1 1 Hard limit
Route tables per VPC 200 Increasable Includes main
Routes per route table 50 1,000 Non-propagated
BGP advertised routes 100 100 Hard limit
Elastic IPs per region 5 Increasable Soft limit
Internet gateways per region 5 Matches VPC limit One per VPC
NAT gateways per AZ 5 Increasable Soft limit
VPC peering connections per VPC 50 125 Active + pending
Security groups per VPC 2,500 10,000 Soft limit
Rules per security group 60 60 Inbound + outbound
Security groups per ENI 5 16 Soft limit
Network ACLs per VPC 200 Increasable Soft limit
Rules per network ACL 20 40 Inbound + outbound

Direct Connect Limits

Resource Limit Notes
Dedicated connections per region 10 Increasable
Hosted connections per region No limit Via partner
Virtual interfaces per connection 50 (private or public) + 1 (transit) Hard limit
Routes advertised from AWS 100 Per BGP session
Routes advertised to AWS 100 Per BGP session
Direct Connect gateways per account 200 Increasable
VGWs per Direct Connect gateway 10 Hard limit
Transit gateways per DX gateway 3 Hard limit
VIFs per Direct Connect gateway 30 Hard limit

Transit Gateway Limits

Resource Default Limit Notes
Transit gateways per region 5 Increasable
Attachments per transit gateway 5,000 Includes VPC, VPN, DX
VPC attachments per TGW 5,000 Soft limit
VPN attachments per TGW 5,000 Soft limit
Peering attachments per TGW 50 Increasable
Route tables per TGW 20 Increasable
Routes per route table 10,000 Static + propagated
Bandwidth per VPC attachment 50 Gbps Per AZ
Bandwidth per VPN attachment 1.25 Gbps Per tunnel
MTU 8,500 bytes Within same region

Route 53 Limits

Resource Limit Notes
Hosted zones per account 500 Increasable
Records per hosted zone 10,000 Increasable
Traffic policies per account 50 Increasable
Health checks per account 200 Increasable
Query rate No limit Pay per query
Alias queries Free To AWS resources
Geolocation locations 233 countries Plus continents
Weighted routing weights 0-255 Per record

CloudFront Limits

Resource Default Limit Notes
Distributions per account 200 Increasable
Origins per distribution 25 Increasable
Cache behaviors per distribution 25 Increasable
Custom headers per origin 10 Increasable
Cookies to forward 10 Whitelist mode
Query strings to forward 10 Whitelist mode
SSL certificates per distribution 1 Per distribution
Alternate domain names (CNAMEs) 100 Per distribution
File size 30 GB Maximum object size
Request rate No limit Scales automatically

Load Balancer Limits

Resource ALB NLB GWLB
Load balancers per region 50 50 50
Target groups per region 3,000 3,000 300
Targets per LB 1,000 3,000 300
Listeners per LB 50 50 1
Rules per listener 100 N/A N/A
Certificates per LB 25 25 N/A
Targets per target group 1,000 3,000 300
Target groups per action 5 N/A N/A
Connections per target Unlimited 55,000 Varies

A.3 Important Port Numbers

Port Protocol Service Exam Relevance
20 TCP FTP Data ⭐ Active FTP
21 TCP FTP Control ⭐ FTP connections
22 TCP SSH ⭐ Secure shell, SFTP
23 TCP Telnet Legacy (insecure)
25 TCP SMTP Email sending
53 TCP/UDP DNS ⭐ Name resolution
80 TCP HTTP ⭐ Web traffic
110 TCP POP3 Email retrieval
123 UDP NTP Time synchronization
143 TCP IMAP Email access
161/162 UDP SNMP Network monitoring
389 TCP LDAP Directory services
443 TCP HTTPS ⭐ Secure web traffic
445 TCP SMB Windows file sharing
465 TCP SMTPS Secure email
514 UDP Syslog Log forwarding
636 TCP LDAPS Secure LDAP
993 TCP IMAPS Secure IMAP
995 TCP POP3S Secure POP3
1433 TCP MS SQL SQL Server
1521 TCP Oracle Oracle database
3306 TCP MySQL ⭐ MySQL/MariaDB
3389 TCP RDP ⭐ Windows Remote Desktop
5432 TCP PostgreSQL PostgreSQL database
5439 TCP Redshift Amazon Redshift
8080 TCP HTTP Alt Alternative HTTP
8443 TCP HTTPS Alt Alternative HTTPS

A.4 CIDR Block Quick Reference

Common CIDR Blocks and Usable IPs

CIDR Subnet Mask Total IPs Usable IPs AWS Usable Use Case
/32 255.255.255.255 1 1 0 Single host
/31 255.255.255.254 2 2 0 Point-to-point
/30 255.255.255.252 4 2 0 Too small for AWS
/29 255.255.255.248 8 6 3 Too small for AWS
/28 255.255.255.240 16 14 11 ⭐ Minimum for AWS
/27 255.255.255.224 32 30 27 Small subnet
/26 255.255.255.192 64 62 59 Small subnet
/25 255.255.255.128 128 126 123 Medium subnet
/24 255.255.255.0 256 254 251 ⭐ Common subnet
/23 255.255.254.0 512 510 507 Large subnet
/22 255.255.252.0 1,024 1,022 1,019 Large subnet
/21 255.255.248.0 2,048 2,046 2,043 Very large
/20 255.255.240.0 4,096 4,094 4,091 Very large
/19 255.255.224.0 8,192 8,190 8,187 Huge subnet
/18 255.255.192.0 16,384 16,382 16,379 Huge subnet
/17 255.255.128.0 32,768 32,766 32,763 Massive
/16 255.255.0.0 65,536 65,534 65,531 ⭐ Maximum VPC

AWS Reserved IPs (per subnet):

  • First IP: Network address
  • Second IP: VPC router
  • Third IP: DNS server
  • Fourth IP: Reserved for future use
  • Last IP: Broadcast address (not used in VPC but reserved)
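
The "AWS Usable" column above is simply the subnet size minus the five reserved addresses; a quick check using Python's standard ipaddress module:

```python
import ipaddress

AWS_RESERVED = 5   # network, VPC router, DNS, future use, broadcast

for prefix in (28, 27, 26, 24, 16):
    net = ipaddress.ip_network(f"10.0.0.0/{prefix}")
    print(f"/{prefix}: total={net.num_addresses}, aws_usable={net.num_addresses - AWS_RESERVED}")

# /28: total=16, aws_usable=11
# /24: total=256, aws_usable=251
# /16: total=65536, aws_usable=65531
```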

RFC 1918 Private IP Ranges

CIDR Block IP Range Total IPs Common Use
10.0.0.0/8 10.0.0.0 - 10.255.255.255 16,777,216 ⭐ Large enterprises, AWS VPCs
172.16.0.0/12 172.16.0.0 - 172.31.255.255 1,048,576 ⭐ Medium networks, default VPC
192.168.0.0/16 192.168.0.0 - 192.168.255.255 65,536 Home/small office networks

A.5 BGP AS Numbers

AS Number Range Type Usage
1 - 64,495 Public Internet routing
64,496 - 64,511 Reserved Documentation
64,512 - 65,534 Private ⭐ Internal use, AWS VPN
65,535 Reserved Reserved
4,200,000,000 - 4,294,967,294 Private 32-bit private ASNs

AWS Defaults:

  • Virtual Private Gateway: 64,512 (default, configurable)
  • Direct Connect Gateway: 64,512 (default, configurable)
  • Transit Gateway: 64,512 (default, configurable)
  • Customer Gateway: You specify (typically private ASN)

A.6 MTU Sizes

Connection Type MTU Size Notes
Standard Ethernet 1,500 bytes Default for most networks
Jumbo Frames 9,001 bytes ⭐ Within VPC (enhanced networking)
VPN Connection 1,500 bytes Cannot use jumbo frames
Direct Connect 1,500 or 9,001 ⭐ Jumbo frames supported
Transit Gateway (same region) 8,500 bytes Between VPCs in same region
Transit Gateway (cross-region) 1,500 bytes Peering connections
VPC Peering (same region) 9,001 bytes Jumbo frames supported
VPC Peering (cross-region) 1,500 bytes Standard MTU only
Internet Gateway 1,500 bytes Standard MTU

šŸŽÆ Exam Tip: Jumbo frames (9,001 bytes) improve performance for large data transfers within AWS, but VPN connections always use 1,500 bytes.

Appendix B: Glossary

A

Anycast IP: An IP address that routes to the nearest location in a network. AWS Global Accelerator uses two static Anycast IPs to route traffic to the optimal AWS endpoint.

AS (Autonomous System): A collection of IP networks under control of a single organization that presents a common routing policy to the internet. Identified by an AS Number (ASN).

ASN (Autonomous System Number): A unique identifier assigned to an autonomous system for use in BGP routing. AWS uses 64,512 by default for VGWs and Transit Gateways.

Availability Zone (AZ): One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs are physically separated for fault isolation.

B

BGP (Border Gateway Protocol): A standardized exterior gateway protocol used to exchange routing information between autonomous systems. Used in Direct Connect and dynamic VPN connections.

BGP ASN: See ASN.

BGP Community Tag: A 32-bit attribute attached to BGP routes to group destinations and apply routing policies. AWS uses community tags for route control.

Blackhole Route: A route that drops traffic destined for a specific CIDR block. Used for security or to prevent routing loops.

C

Cache Behavior: A CloudFront configuration that defines how content is cached and served based on path patterns, headers, cookies, and query strings.

CIDR (Classless Inter-Domain Routing): A method for allocating IP addresses and routing that replaces the old class-based system. Written as IP/prefix (e.g., 10.0.0.0/16).

CIDR Block: A range of IP addresses defined by a CIDR notation. VPCs and subnets are defined by CIDR blocks.

CloudHub: AWS VPN CloudHub enables multiple customer sites to connect to AWS and communicate with each other through VPN connections in a hub-and-spoke model.

Customer Gateway (CGW): A physical device or software application on the customer side of a VPN connection. Represents the customer's side of the connection.

Customer Gateway Device: The physical or virtual appliance on the customer's network that terminates the VPN connection.

D

DDoS (Distributed Denial of Service): An attack that attempts to make a service unavailable by overwhelming it with traffic from multiple sources. AWS Shield protects against DDoS.

Direct Connect (DX): A dedicated network connection from your premises to AWS. Provides consistent network performance and reduced bandwidth costs.

Direct Connect Gateway (DXGW): A globally available resource that enables you to connect your Direct Connect connection to VPCs in any AWS Region (except China).

DNS (Domain Name System): A hierarchical naming system that translates human-readable domain names (like example.com) into IP addresses.

DNS Failover: Route 53 feature that automatically routes traffic away from unhealthy resources to healthy ones based on health checks.

DNS Query: A request to resolve a domain name to an IP address or other DNS record.

DNS Resolution: The process of translating a domain name into an IP address using DNS.

E

Edge Location: A site that CloudFront uses to cache copies of content for faster delivery to users. AWS has 400+ edge locations globally.

Egress: Outbound traffic leaving a network, VPC, or subnet.

Egress-Only Internet Gateway: A VPC component that allows outbound IPv6 traffic to the internet while preventing inbound IPv6 connections.

Elastic IP (EIP): A static, public IPv4 address that you can allocate to your AWS account and associate with EC2 instances or network interfaces.

Elastic Network Interface (ENI): A virtual network interface that you can attach to an EC2 instance. Can have security groups, private IPs, and Elastic IPs.

Endpoint: A connection point for AWS services. VPC endpoints enable private connections to AWS services without using the internet.

F

Failover: The process of automatically switching to a standby system when the primary system fails. Route 53 supports DNS failover.

Flow Logs: A feature that captures information about IP traffic going to and from network interfaces in your VPC. Used for monitoring and troubleshooting.

G

Gateway Load Balancer (GWLB): A load balancer that operates at Layer 3 (IP) and is designed to deploy, scale, and manage third-party virtual appliances.

Gateway Load Balancer Endpoint (GWLBE): A VPC endpoint that intercepts traffic and routes it to a Gateway Load Balancer for inspection by security appliances.

Geolocation Routing: Route 53 routing policy that routes traffic based on the geographic location of the user making the DNS query.

Geoproximity Routing: Route 53 routing policy that routes traffic based on the geographic location of resources and users, with optional bias to shift traffic.

Global Accelerator: An AWS service that uses the AWS global network to optimize the path from users to applications, providing static IP addresses and improved performance.

H

Health Check: A monitoring mechanism that determines whether a resource is healthy and able to receive traffic. Used by Route 53, load balancers, and Global Accelerator.

Hosted Connection: A Direct Connect connection provisioned by an AWS Direct Connect Partner with capacities from 50 Mbps to 10 Gbps.

Hosted Zone: A container for DNS records that defines how to route traffic for a domain and its subdomains in Route 53.

Hub-and-Spoke: A network topology where multiple sites (spokes) connect to a central location (hub). Transit Gateway and VPN CloudHub use this model.

I

IGW: See Internet Gateway.

Ingress: Inbound traffic entering a network, VPC, or subnet.

Internet Gateway (IGW): A VPC component that enables communication between instances in your VPC and the internet. Horizontally scaled, redundant, and highly available.

IP Address: A numerical label assigned to each device connected to a network. Can be IPv4 (32-bit) or IPv6 (128-bit).

IPsec (Internet Protocol Security): A protocol suite for securing IP communications by authenticating and encrypting each IP packet. Used in VPN connections.

IPv4: Internet Protocol version 4, using 32-bit addresses (e.g., 192.0.2.1). Provides about 4.3 billion unique addresses.

IPv6: Internet Protocol version 6, using 128-bit addresses (e.g., 2001:0db8::1). Provides virtually unlimited addresses.

J

Jumbo Frames: Ethernet frames with more than 1,500 bytes of payload, up to 9,001 bytes. Supported within VPCs and over Direct Connect for improved performance.

L

Latency: The time delay between sending a request and receiving a response. Lower latency means faster response times.

Latency-Based Routing: Route 53 routing policy that routes traffic to the AWS Region that provides the lowest latency to the user.

Load Balancer: A service that distributes incoming traffic across multiple targets (EC2 instances, containers, IP addresses) to improve availability and fault tolerance.

Local Zone: An AWS infrastructure deployment that places compute, storage, and database services closer to end users for low-latency applications.

M

MTU (Maximum Transmission Unit): The largest packet size that can be transmitted over a network. Standard is 1,500 bytes; jumbo frames are 9,001 bytes.

Multivalue Answer Routing: Route 53 routing policy that returns multiple IP addresses for a DNS query, with health checking to return only healthy resources.

N

NAT (Network Address Translation): A method of remapping one IP address space into another by modifying network address information in packet headers.

NAT Gateway: A managed AWS service that enables instances in a private subnet to connect to the internet or other AWS services while preventing inbound connections.

NAT Instance: An EC2 instance configured to perform NAT. Less common now that NAT Gateway is available, but offers more control.

Network ACL (NACL): A stateless firewall that controls traffic in and out of subnets. Rules are evaluated in number order.

Network Interface: See Elastic Network Interface (ENI).

Network Load Balancer (NLB): A load balancer that operates at Layer 4 (TCP/UDP) and is designed for ultra-high performance and static IP addresses.

O

Origin: The source of content for CloudFront. Can be an S3 bucket, HTTP server, MediaStore container, or MediaPackage channel.

Origin Access Control (OAC): A CloudFront feature that restricts access to S3 bucket content, ensuring users can only access through CloudFront.

Origin Access Identity (OAI): Legacy method for restricting S3 access to CloudFront. Being replaced by Origin Access Control (OAC).

P

Peering Connection: See VPC Peering.

PoP (Point of Presence): A physical location where CloudFront caches content. AWS has 400+ PoPs globally.

Prefix List: A set of CIDR blocks that can be referenced in security group rules and route tables. Managed prefix lists are maintained by AWS.

Private IP: An IP address from RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) used for internal network communication.

Private Subnet: A subnet whose route table doesn't have a route to an Internet Gateway. Instances cannot directly access the internet.

PrivateLink: See VPC Endpoint.

Public IP: An IP address that is routable on the internet. Can be an Elastic IP or an auto-assigned public IP.

Public Subnet: A subnet whose route table has a route to an Internet Gateway. Instances can directly access the internet if they have public IPs.

R

Region: A physical location around the world where AWS clusters data centers. Each Region consists of multiple Availability Zones.

Regional Edge Cache: A CloudFront cache layer between origin servers and edge locations, providing additional caching capacity.

Resource Record: A DNS record that provides information about a domain, such as IP addresses (A records) or mail servers (MX records).

Route: An entry in a route table that specifies where network traffic should be directed based on the destination IP address.

Route Propagation: Automatic addition of routes to a route table from a Virtual Private Gateway or Transit Gateway attachment.

Route Table: A set of rules (routes) that determine where network traffic from your subnet or gateway is directed.

Route 53: AWS's scalable DNS web service that translates domain names into IP addresses and provides traffic routing.

Routing Policy: A Route 53 configuration that determines how DNS queries are answered (simple, weighted, latency, failover, geolocation, geoproximity, multivalue, IP-based).

S

Security Group: A stateful virtual firewall that controls inbound and outbound traffic for EC2 instances and other resources. Rules allow traffic; there are no deny rules.

Site-to-Site VPN: An IPsec VPN connection between your on-premises network and AWS VPC through a Virtual Private Gateway or Transit Gateway.

Split-Horizon DNS: A DNS configuration that returns different answers based on the source of the query. Used to route internal and external users differently.

Stateful: A firewall or connection tracking mechanism that remembers the state of connections. Security groups are stateful.

Stateless: A firewall that treats each packet independently without tracking connection state. Network ACLs are stateless.

Subnet: A range of IP addresses in your VPC. Can be public (with internet access) or private (without direct internet access).

Subnet Mask: A 32-bit number that divides an IP address into network and host portions. Used with CIDR notation.

T

Target: A destination for traffic from a load balancer. Can be an EC2 instance, IP address, Lambda function, or another load balancer.

Target Group: A logical grouping of targets for a load balancer. Health checks and routing rules are configured at the target group level.

TGW: See Transit Gateway.

Traffic Dial: A Global Accelerator feature that controls the percentage of traffic directed to an endpoint group in a specific Region.

Transit Gateway (TGW): A network transit hub that connects VPCs, VPN connections, and Direct Connect gateways in a hub-and-spoke architecture.

Transit Gateway Attachment: A connection between a Transit Gateway and a VPC, VPN, Direct Connect gateway, or another Transit Gateway.

Transit Gateway Route Table: A route table associated with a Transit Gateway that controls routing between attachments.

Transit VIF: A virtual interface type for Direct Connect that connects to a Direct Connect Gateway, enabling access to multiple VPCs across Regions.

TTL (Time to Live): The duration (in seconds) that a DNS record is cached by DNS resolvers before querying again. Lower TTL enables faster changes.

V

VGW: See Virtual Private Gateway.

VIF (Virtual Interface): A logical connection over a Direct Connect physical connection. Types include private, public, and transit VIFs.

Virtual Private Cloud (VPC): A logically isolated section of the AWS cloud where you can launch AWS resources in a virtual network you define.

Virtual Private Gateway (VGW): The AWS side of a VPN connection or Direct Connect private VIF. Attached to a VPC to enable hybrid connectivity.

VPC Endpoint: A private connection between your VPC and AWS services without requiring internet access. Types include Interface and Gateway endpoints.

VPC Endpoint Service: A service you create to enable other AWS accounts to access your service through a VPC endpoint (powered by PrivateLink).

VPC Peering: A networking connection between two VPCs that enables routing traffic between them using private IP addresses.

VPN (Virtual Private Network): An encrypted connection over the internet between your network and AWS. Types include Site-to-Site VPN and Client VPN.

VPN CloudHub: See CloudHub.

VPN Connection: An IPsec tunnel between your network and AWS. Each connection has two tunnels for high availability.

VPN Tunnel: An encrypted connection that forms part of a VPN connection. Site-to-Site VPN connections have two tunnels.

W

Weighted Routing: Route 53 routing policy that routes traffic to multiple resources based on assigned weights (percentages).

Z

Zone Apex: The root domain without any subdomain (e.g., example.com). Also called the naked domain or root domain.

Appendix C: Decision Trees

C.1 Hybrid Connectivity Decision Tree

Question: How should I connect my on-premises network to AWS?

START: Analyze Requirements
│
ā”œā”€ Need dedicated, consistent bandwidth? ────────────────┐
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ Bandwidth > 10 Gbps? ──────────────┐              │
│  │  │                                   │              │
│  │  YES                                NO              │
│  │  │                                   │              │
│  │  └─ Multiple Direct Connect         │              │
│  │     connections (LAG)                │              │
│  │     āœ… 10-100 Gbps capacity          │              │
│  │                                      │              │
│  └─ Mission-critical? ──────────────────┤              │
│     │                                   │              │
│     YES                                NO              │
│     │                                   │              │
│     └─ Direct Connect + VPN backup     └─ Single      │
│        āœ… High availability                Direct      │
│        āœ… Automatic failover               Connect     │
│                                            āœ… Cost-     │
│                                               effective│
│                                                         │
└─ Quick setup needed? ──────────────────────────────────┤
   │                                                      │
   YES                                                   NO
   │                                                      │
   ā”œā”€ Need encryption? ──────────────┐                   │
   │  │                               │                   │
   │  YES                            NO                  │
   │  │                               │                   │
   │  └─ Site-to-Site VPN            └─ Consider         │
   │     āœ… Hours to deploy              Direct Connect  │
   │     āœ… Encrypted                    (weeks setup)    │
   │     āœ… Up to 1.25 Gbps/tunnel                       │
   │                                                      │
   └─ Multiple sites, same region? ─────────────────────┤
      │                                                   │
      YES                                                NO
      │                                                   │
      └─ VPN CloudHub                                    │
         āœ… Hub-and-spoke                                │
         āœ… Site-to-site communication                   │
         āœ… Simple management                            │
                                                          │
      └─ Evaluate based on bandwidth,                    │
         latency, and cost requirements                  │

Key Decision Factors:

  • Bandwidth: VPN (1.25 Gbps/tunnel) vs DX (50 Mbps - 100 Gbps)
  • Latency: DX provides consistent, low latency
  • Setup Time: VPN (hours) vs DX (weeks to months)
  • Cost: VPN (lower) vs DX (higher but predictable)
  • Security: VPN (encrypted) vs DX (private but not encrypted by default)

C.2 VPC Connectivity Decision Tree

Question: How should I connect multiple VPCs?

START: Analyze VPC Connectivity Needs
│
ā”œā”€ How many VPCs? ───────────────────────────────────────┐
│  │                                                      │
│  > 10 VPCs                                          ≤ 10 VPCs
│  │                                                      │
│  └─ Transit Gateway                                    │
│     āœ… Hub-and-spoke                                   │
│     āœ… Scales to 5,000 attachments                     │
│     āœ… Transitive routing                              │
│     āœ… Centralized management                          │
│                                                         │
ā”œā”€ Need transitive routing? ─────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Transit Gateway                                    │
│     āœ… VPC A → TGW → VPC B → TGW → VPC C              │
│     āœ… Simplifies complex topologies                   │
│                                                         │
ā”œā”€ Simple point-to-point? ───────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ Overlapping IP addresses? ──────┐                 │
│  │  │                                │                 │
│  │  YES                              NO                │
│  │  │                                │                 │
│  │  └─ Cannot use VPC Peering        └─ VPC Peering   │
│  │     Consider PrivateLink             āœ… Free        │
│  │     or Transit Gateway               āœ… Simple      │
│  │                                      āœ… Low latency │
│  │                                                      │
│  └─ Cross-region? ──────────────────────────────────┐  │
│     │                                                │  │
│     YES                                             NO │
│     │                                                │  │
│     ā”œā”€ Need transitive? ──────────┐                 │  │
│     │  │                           │                 │  │
│     │  YES                        NO                 │  │
│     │  │                           │                 │  │
│     │  └─ Transit Gateway          └─ VPC Peering   │  │
│     │     Peering                     āœ… Simple      │  │
│     │     āœ… Transitive routing       āœ… Free        │  │
│     │                                                │  │
│     └─ Same region VPC Peering ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜  │
│        āœ… Jumbo frames (9,001 MTU)                     │
│                                                         │
└─ Exposing service to other accounts? ──────────────────┤
   │                                                      │
   YES                                                   NO
   │                                                      │
   └─ AWS PrivateLink                                    │
      āœ… Service-oriented                                │
      āœ… No VPC peering needed                           │
      āœ… Supports overlapping IPs                        │
      āœ… Scalable to many consumers                      │
                                                          │
   └─ Re-evaluate requirements                           │

Key Decision Factors:

  • Scale: VPC Peering (125 max) vs Transit Gateway (5,000 attachments)
  • Transitive Routing: Only Transit Gateway supports it
  • Cost: VPC Peering (free, data transfer only) vs TGW (per attachment + data)
  • IP Overlap: PrivateLink supports it, VPC Peering doesn't
  • Management: TGW centralizes routing, VPC Peering is distributed

C.3 Load Balancer Selection Decision Tree

Question: Which load balancer should I use?

START: Analyze Application Requirements
│
ā”œā”€ What protocol? ───────────────────────────────────────┐
│  │                                                      │
│  HTTP/HTTPS                                         TCP/UDP
│  │                                                      │
│  ā”œā”€ Need content-based routing? ──────┐               │
│  │  │                                  │               │
│  │  YES                               NO               │
│  │  │                                  │               │
│  │  └─ Application Load Balancer      │               │
│  │     āœ… Path-based routing           │               │
│  │     āœ… Host-based routing           │               │
│  │     āœ… Lambda targets               │               │
│  │     āœ… WebSocket support            │               │
│  │                                     │               │
│  └─ Need extreme performance? ─────────┤               │
│     │                                  │               │
│     YES                               NO               │
│     │                                  │               │
│     └─ Network Load Balancer          └─ Application   │
│        āœ… Millions of RPS                 Load Balancer│
│        āœ… Ultra-low latency               āœ… Feature-   │
│        āœ… Static IP support                  rich      │
│                                                         │
ā”œā”€ Need static IP addresses? ────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Network Load Balancer                              │
│     āœ… Elastic IP support                              │
│     āœ… One static IP per AZ                            │
│                                                         │
ā”œā”€ Deploying third-party appliances? ────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Gateway Load Balancer                              │
│     āœ… Layer 3 (IP packets)                            │
│     āœ… Transparent inspection                          │
│     āœ… Scales security appliances                      │
│     āœ… Firewall, IDS/IPS integration                   │
│                                                         │
ā”œā”€ Need to preserve source IP? ──────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ HTTP/HTTPS? ──────────────────────┐               │
│  │  │                                  │               │
│  │  YES                               NO               │
│  │  │                                  │               │
│  │  └─ ALB with X-Forwarded-For       └─ Network      │
│  │     āœ… Header-based                    Load Balancer│
│  │                                        āœ… Native    │
│  │                                           support   │
│  │                                                      │
│  └─ Any load balancer works ────────────────────────┐  │
│     (source IP not critical)                        │  │
│                                                      │  │
└─ Legacy application? ──────────────────────────────────┤
   │                                                      │
   YES                                                   NO
   │                                                      │
   └─ Consider migration to ALB/NLB                      │
      āš ļø Classic LB is legacy                            │
      āš ļø Limited features                                │
      āš ļø Not recommended for new deployments             │
                                                          │
   └─ Choose based on protocol and features              │

Key Decision Factors:

  • Protocol: HTTP/HTTPS → ALB, TCP/UDP → NLB, IP packets → GWLB
  • Performance: Extreme performance → NLB (millions RPS)
  • Routing: Content-based routing → ALB
  • Static IP: Required → NLB (Elastic IP support)
  • Source IP: Preserve natively → NLB or GWLB
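
To tie the ALB branch of this tree to practice, the sketch below shows how a content-based (path-based) routing rule might be attached to an existing ALB listener with boto3. This is a minimal sketch, not a full deployment; the listener and target group ARNs are hypothetical placeholders you would replace with values from your own account.

  import boto3

  # Hypothetical ARNs - substitute values from your own environment.
  LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/demo/abc/def"
  API_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api/xyz"

  elbv2 = boto3.client("elbv2", region_name="us-east-1")

  # Content-based routing: forward /api/* requests to a dedicated target group.
  elbv2.create_rule(
      ListenerArn=LISTENER_ARN,
      Priority=10,
      Conditions=[{"Field": "path-pattern", "Values": ["/api/*"]}],
      Actions=[{"Type": "forward", "TargetGroupArn": API_TARGET_GROUP_ARN}],
  )

A host-based rule works the same way with "Field": "host-header". NLB and GWLB listeners do not support these Layer 7 conditions, which is why content-based routing points you to the ALB.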

C.4 Route 53 Routing Policy Decision Tree

Question: Which Route 53 routing policy should I use?

START: Analyze Traffic Routing Needs
│
ā”œā”€ Need automatic failover? ─────────────────────────────┐
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Failover Routing                                   │
│     āœ… Active-passive DR                               │
│     āœ… Health check required                           │
│     āœ… Automatic switchover                            │
│                                                         │
ā”œā”€ Testing new version? ─────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Weighted Routing                                   │
│     āœ… A/B testing                                     │
│     āœ… Gradual migration                               │
│     āœ… Percentage-based distribution                   │
│     Example: 10% new, 90% old                          │
│                                                         │
ā”œā”€ Global application? ───────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ Optimize for performance? ──────────┐             │
│  │  │                                    │             │
│  │  YES                                 NO             │
│  │  │                                    │             │
│  │  └─ Latency-Based Routing            │             │
│  │     āœ… Routes to lowest latency      │             │
│  │     āœ… Best user experience          │             │
│  │                                       │             │
│  └─ Need geographic control? ────────────┤             │
│     │                                    │             │
│     YES                                 NO             │
│     │                                    │             │
│     ā”œā”€ Compliance requirements? ────┐   │             │
│     │  │                             │   │             │
│     │  YES                          NO   │             │
│     │  │                             │   │             │
│     │  └─ Geolocation Routing        │   │             │
│     │     āœ… Route by user location  │   │             │
│     │     āœ… Content localization    │   │             │
│     │     āœ… GDPR compliance          │   │             │
│     │                                 │   │             │
│     └─ Need traffic shifting? ────────┤   │             │
│        │                              │   │             │
│        YES                           NO   │             │
│        │                              │   │             │
│        └─ Geoproximity Routing       │   │             │
│           āœ… Geographic + bias        │   │             │
│           āœ… Shift traffic between    │   │             │
│              regions                  │   │             │
│                                       │   │             │
│        └─ Latency-Based Routing ā”€ā”€ā”€ā”€ā”€ā”€ā”˜   │             │
│                                            │             │
ā”œā”€ Route by client IP? ──────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ IP-Based Routing                                   │
│     āœ… Route specific IP ranges                        │
│     āœ… Corporate users to specific endpoint            │
│     āœ… ISP-based routing                               │
│                                                         │
ā”œā”€ Simple load distribution? ─────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ Need health checks? ──────────────┐               │
│  │  │                                  │               │
│  │  YES                               NO               │
│  │  │                                  │               │
│  │  └─ Multivalue Answer Routing      └─ Simple       │
│  │     āœ… Returns multiple IPs           Routing       │
│  │     āœ… Health checking                āœ… Single     │
│  │     āœ… Simple load balancing             resource   │
│  │                                       āš ļø No health  │
│  │                                          checks     │
│  │                                                      │
│  └─ Single resource? ────────────────────────────────┐ │
│     │                                                 │ │
│     YES                                              NO│
│     │                                                 │ │
│     └─ Simple Routing                                │ │
│        āœ… One resource                                │ │
│        āš ļø No health checks                           │ │
│        āš ļø No failover                                │ │
│                                                       │ │
│     └─ Re-evaluate requirements ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜ │
│                                                          │
└─ Complex requirements? ─────────────────────────────────┤
   │                                                       │
   YES                                                    NO
   │                                                       │
   └─ Combine Multiple Policies                           │
      āœ… Nested records                                   │
      āœ… Failover + Weighted                              │
      āœ… Geolocation + Latency                            │
      Example: Geolocation primary,                       │
               Latency for each region                    │
                                                           │
   └─ Start with Simple Routing                           │

Key Decision Factors:

  • Failover: Need automatic DR → Failover routing
  • Testing: A/B testing or gradual migration → Weighted routing
  • Performance: Global users, optimize latency → Latency-based routing
  • Compliance: Geographic restrictions → Geolocation routing
  • Traffic Control: Shift traffic between regions → Geoproximity routing
  • Client-Based: Route by source IP → IP-based routing
  • Simple: Single resource → Simple routing
  • Load Distribution: Multiple IPs with health checks → Multivalue routing
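
As a concrete illustration of the Weighted branch above (10% new, 90% old), the snippet below sketches two weighted A records created with boto3. The hosted zone ID, record name, and IP addresses are hypothetical placeholders.

  import boto3

  route53 = boto3.client("route53")
  HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # hypothetical hosted zone ID

  def weighted_record(identifier, ip, weight):
      # One weighted A record; Route 53 distributes traffic in proportion to Weight.
      return {
          "Action": "UPSERT",
          "ResourceRecordSet": {
              "Name": "app.example.com",
              "Type": "A",
              "SetIdentifier": identifier,
              "Weight": weight,
              "TTL": 60,
              "ResourceRecords": [{"Value": ip}],
          },
      }

  route53.change_resource_record_sets(
      HostedZoneId=HOSTED_ZONE_ID,
      ChangeBatch={"Changes": [
          weighted_record("old-stack", "203.0.113.10", 90),  # ~90% of responses
          weighted_record("new-stack", "203.0.113.20", 10),  # ~10% of responses
      ]},
  )

Shifting the weights over time (90/10 → 50/50 → 0/100) is the usual pattern for a gradual migration.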

C.5 Content Delivery Decision Tree

Question: Should I use CloudFront or Global Accelerator?

START: Analyze Content Delivery Needs
│
ā”œā”€ What type of content? ────────────────────────────────┐
│  │                                                      │
│  HTTP/HTTPS                                         TCP/UDP
│  │                                                      │
│  ā”œā”€ Cacheable content? ──────────────┐                │
│  │  │                                 │                │
│  │  YES                              NO                │
│  │  │                                 │                │
│  │  └─ Amazon CloudFront             │                │
│  │     āœ… Edge caching                │                │
│  │     āœ… 400+ PoPs                   │                │
│  │     āœ… Reduces origin load         │                │
│  │     āœ… Lower latency               │                │
│  │                                    │                │
│  └─ Dynamic content only? ────────────┤                │
│     │                                 │                │
│     YES                              NO                │
│     │                                 │                │
│     ā”œā”€ Need static IP? ──────────┐   │                │
│     │  │                          │   │                │
│     │  YES                       NO   │                │
│     │  │                          │   │                │
│     │  └─ Global Accelerator     │   │                │
│     │     āœ… 2 static Anycast IPs │   │                │
│     │     āœ… AWS network routing  │   │                │
│     │                             │   │                │
│     └─ CloudFront with dynamic ───┤   │                │
│        content optimization       │   │                │
│        āœ… Still benefits from     │   │                │
│           AWS network             │   │                │
│                                   │   │                │
│     └─ Both can work ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜   │                │
│                                        │                │
ā”œā”€ Non-HTTP protocols? ──────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ AWS Global Accelerator                             │
│     āœ… TCP/UDP support                                 │
│     āœ… Gaming, VoIP, IoT                               │
│     āœ… Static IP addresses                             │
│     āœ… Automatic failover                              │
│                                                         │
ā”œā”€ Need DDoS protection? ─────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Both include AWS Shield Standard                   │
│     CloudFront: Layer 7 protection                     │
│     Global Accelerator: Layer 3/4 protection           │
│                                                         │
ā”œā”€ Need instant failover? ────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ AWS Global Accelerator                             │
│     āœ… Health-based routing                            │
│     āœ… Instant failover (30 seconds)                   │
│     āœ… No DNS caching issues                           │
│                                                         │
ā”œā”€ Video streaming? ──────────────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Amazon CloudFront                                  │
│     āœ… Optimized for media                             │
│     āœ… Supports HLS, DASH                              │
│     āœ… Integration with MediaStore                     │
│                                                         │
└─ Can I use both? ───────────────────────────────────────┤
   │                                                       │
   YES                                                    NO
   │                                                       │
   └─ CloudFront + Global Accelerator                     │
      āœ… CloudFront for caching                           │
      āœ… Global Accelerator for static IP                 │
      āœ… Best of both worlds                              │
      Use case: CloudFront origin behind                  │
                Global Accelerator                        │
                                                           │
   └─ Choose based on primary requirement                 │

Key Decision Factors:

  • Protocol: HTTP/HTTPS → CloudFront, TCP/UDP → Global Accelerator
  • Caching: Need caching → CloudFront
  • Static IP: Required → Global Accelerator
  • Failover: Instant failover → Global Accelerator (no DNS caching)
  • Content Type: Video/media → CloudFront
  • Use Case: Web content → CloudFront, Gaming/VoIP → Global Accelerator
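
For the static-IP and TCP/UDP branches, the sketch below shows roughly how an accelerator could be created with boto3; the two static Anycast IP addresses come back in the response. The Global Accelerator control-plane API is served from us-west-2 regardless of where your endpoints run. The accelerator name is a hypothetical placeholder, and listeners plus endpoint groups would still need to be added before any traffic flows.

  import boto3

  # Global Accelerator's API endpoint lives in us-west-2.
  ga = boto3.client("globalaccelerator", region_name="us-west-2")

  resp = ga.create_accelerator(
      Name="demo-accelerator",   # hypothetical name
      IpAddressType="IPV4",
      Enabled=True,
  )

  # The accelerator is fronted by two static Anycast IP addresses.
  for ip_set in resp["Accelerator"]["IpSets"]:
      print(ip_set["IpAddresses"])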

C.6 Security Decision Tree

Question: How should I secure my network traffic?

START: Analyze Security Requirements
│
ā”œā”€ What layer needs protection? ─────────────────────────┐
│  │                                                      │
│  Network Layer (L3/L4)                              Application Layer (L7)
│  │                                                      │
│  ā”œā”€ Subnet-level control? ──────────┐                 │
│  │  │                                │                 │
│  │  YES                             NO                 │
│  │  │                                │                 │
│  │  └─ Network ACLs                 │                 │
│  │     āœ… Stateless firewall         │                 │
│  │     āœ… Subnet boundary            │                 │
│  │     āœ… Allow and deny rules       │                 │
│  │     āœ… Numbered rule order        │                 │
│  │                                   │                 │
│  └─ Instance-level control? ─────────┤                 │
│     │                                │                 │
│     YES                             NO                 │
│     │                                │                 │
│     └─ Security Groups              │                 │
│        āœ… Stateful firewall          │                 │
│        āœ… Instance/ENI level         │                 │
│        āœ… Allow rules only           │                 │
│        āœ… Automatic return traffic   │                 │
│                                      │                 │
│     └─ Combine both for defense      │                 │
│        in depth                      │                 │
│                                                         │
ā”œā”€ Need deep packet inspection? ─────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ Third-Party Appliances + GWLB                      │
│     āœ… IDS/IPS integration                             │
│     āœ… Advanced threat detection                       │
│     āœ… Transparent inspection                          │
│     āœ… Scales with traffic                             │
│                                                         │
ā”œā”€ Web application protection? ───────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  └─ AWS WAF                                            │
│     āœ… Layer 7 filtering                               │
│     āœ… SQL injection protection                        │
│     āœ… XSS protection                                  │
│     āœ… Rate limiting                                   │
│     āœ… Integrates with ALB, CloudFront                 │
│                                                         │
ā”œā”€ DDoS protection needed? ───────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ Basic protection? ──────────────┐                 │
│  │  │                                │                 │
│  │  YES                             NO                 │
│  │  │                                │                 │
│  │  └─ AWS Shield Standard          │                 │
│  │     āœ… Free                        │                 │
│  │     āœ… Automatic                   │                 │
│  │     āœ… Layer 3/4 protection       │                 │
│  │                                   │                 │
│  └─ Advanced protection? ─────────────┤                 │
│     │                                │                 │
│     YES                             NO                 │
│     │                                │                 │
│     └─ AWS Shield Advanced          │                 │
│        āœ… $3,000/month               │                 │
│        āœ… 24/7 DRT support           │                 │
│        āœ… Cost protection            │                 │
│        āœ… Advanced detection         │                 │
│                                      │                 │
│     └─ Shield Standard sufficient ā”€ā”€ā”€ā”˜                 │
│                                                         │
ā”œā”€ Encrypt data in transit? ──────────────────────────────┤
│  │                                                      │
│  YES                                                   NO
│  │                                                      │
│  ā”œā”€ VPN connection? ────────────────┐                 │
│  │  │                                │                 │
│  │  YES                             NO                 │
│  │  │                                │                 │
│  │  └─ Site-to-Site VPN             │                 │
│  │     āœ… IPsec encryption           │                 │
│  │     āœ… Automatic                   │                 │
│  │                                   │                 │
│  └─ HTTPS/TLS? ────────────────────┤                 │
│     │                                │                 │
│     YES                             NO                 │
│     │                                │                 │
│     └─ ACM Certificates             │                 │
│        āœ… Free SSL/TLS certs         │                 │
│        āœ… Auto-renewal               │                 │
│        āœ… Integrates with ELB,       │                 │
│           CloudFront                 │                 │
│                                      │                 │
│     └─ Consider encryption needs ā”€ā”€ā”€ā”€ā”˜                 │
│                                                         │
└─ Centralized security management? ──────────────────────┤
   │                                                       │
   YES                                                    NO
   │                                                       │
   └─ AWS Firewall Manager                                │
      āœ… Centralized WAF rules                            │
      āœ… Security group policies                          │
      āœ… Multi-account management                         │
      āœ… Compliance enforcement                           │
                                                           │
   └─ Use appropriate security controls                   │

Key Decision Factors:

  • Layer: Network (L3/L4) → NACLs/SGs, Application (L7) → WAF
  • Scope: Subnet → NACLs, Instance → Security Groups
  • Inspection: Deep packet inspection → Third-party appliances + GWLB
  • Web Apps: SQL injection, XSS protection → AWS WAF
  • DDoS: Basic (free) → Shield Standard, Advanced → Shield Advanced
  • Encryption: VPN → IPsec, Web → HTTPS/TLS with ACM
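
To make the stateful-vs-stateless distinction concrete, the sketch below adds an allow rule to a security group (return traffic is then permitted automatically) and an explicit deny entry to a network ACL (which, being stateless, would also need matching allow rules in both directions for legitimate traffic). The security group and NACL IDs are hypothetical placeholders.

  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  # Security group: stateful, allow rules only, evaluated at the instance/ENI level.
  ec2.authorize_security_group_ingress(
      GroupId="sg-0123456789abcdef0",   # hypothetical security group ID
      IpPermissions=[{
          "IpProtocol": "tcp",
          "FromPort": 443,
          "ToPort": 443,
          "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS from anywhere"}],
      }],
  )

  # Network ACL: stateless, numbered rules, supports explicit deny at the subnet boundary.
  ec2.create_network_acl_entry(
      NetworkAclId="acl-0123456789abcdef0",  # hypothetical NACL ID
      RuleNumber=90,                         # evaluated before higher-numbered rules
      Protocol="6",                          # TCP
      RuleAction="deny",
      Egress=False,                          # inbound rule
      CidrBlock="198.51.100.0/24",           # block a specific source range
      PortRange={"From": 22, "To": 22},
  )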

Appendix D: Additional Resources

D.1 Official AWS Documentation

Core Networking Services:

Load Balancing & Traffic Management:

Security & Monitoring:

Hybrid & Edge Networking:

D.2 AWS Whitepapers

Essential Reading:

  1. AWS Well-Architected Framework - Reliability Pillar

    • High availability and fault tolerance
    • Disaster recovery strategies
    • Network design best practices
  2. Building a Scalable and Secure Multi-VPC AWS Network Infrastructure

    • Multi-VPC architectures
    • Transit Gateway patterns
    • Security considerations
  3. Hybrid Cloud DNS Solutions for Amazon VPC

    • DNS resolution patterns
    • Route 53 Resolver
    • Hybrid DNS architectures
  4. AWS Direct Connect Best Practices

    • Connection types and sizing
    • Redundancy and resilience
    • Cost optimization
  5. AWS Security Best Practices

    • Network security layers
    • Encryption in transit
    • Security group strategies

Access Whitepapers: https://aws.amazon.com/whitepapers/

D.3 AWS Training & Certification

Recommended Courses:

  • Advanced Networking - Specialty (ANS-C01) Exam Prep

    • Official AWS exam preparation course
    • Covers all exam domains
    • Practice questions and scenarios
  • AWS Networking Fundamentals

    • VPC basics
    • Hybrid connectivity
    • Load balancing fundamentals
  • Deep Dive into Amazon VPC

    • Advanced VPC concepts
    • Complex architectures
    • Troubleshooting

AWS Skill Builder: https://skillbuilder.aws/

  • Free and paid courses
  • Hands-on labs
  • Learning paths

D.4 Practice Resources

Practice Test Bundles (Included in This Package):

  • Domain 1: Network Design
  • Domain 2: Network Implementation
  • Domain 3: Network Management and Operations
  • Domain 4: Network Security, Compliance, and Governance

Additional Practice:

  • AWS Practice Exam (Official): Available through AWS Training
  • Review practice test explanations thoroughly
  • Focus on understanding WHY answers are correct/incorrect

D.5 Hands-On Labs (Optional but Recommended)

Essential Labs to Build Practical Skills:

  1. VPC Fundamentals Lab (see the scripted sketch after this list)

    • Create VPC with public and private subnets
    • Configure route tables and internet gateway
    • Set up NAT gateway
    • Test connectivity
  2. Hybrid Connectivity Lab

    • Simulate Site-to-Site VPN connection
    • Configure customer gateway
    • Test VPN tunnels
    • Implement redundancy
  3. Transit Gateway Lab

    • Create Transit Gateway
    • Attach multiple VPCs
    • Configure route tables
    • Test transitive routing
  4. Load Balancer Lab

    • Deploy Application Load Balancer
    • Configure target groups
    • Set up path-based routing
    • Test health checks
  5. Route 53 Lab

    • Create hosted zone
    • Configure routing policies (weighted, latency, failover)
    • Set up health checks
    • Test DNS resolution
  6. CloudFront Lab

    • Create distribution
    • Configure S3 origin
    • Set up cache behaviors
    • Test content delivery
  7. Security Lab

    • Configure security groups and NACLs
    • Set up VPC Flow Logs
    • Analyze traffic patterns
    • Implement WAF rules
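
As a starting point for Lab 1, the sketch below scripts the public-subnet half of the build with boto3 (VPC, subnet, internet gateway, and a default route). It is a minimal, unhardened example with assumed CIDR ranges and AZ; the private subnet, NAT gateway, and connectivity tests from the lab are left for you to complete.

  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  # 1. VPC with an assumed /16 CIDR.
  vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

  # 2. Public subnet in a single AZ.
  subnet_id = ec2.create_subnet(
      VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
  )["Subnet"]["SubnetId"]

  # 3. Internet gateway, attached to the VPC.
  igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
  ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

  # 4. Route table with a default route to the internet gateway,
  #    associated with the public subnet.
  rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
  ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
  ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)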

Lab Platforms:

  • AWS Free Tier: Practice with real AWS services (watch costs)
  • AWS Workshops: https://workshops.aws/ (guided tutorials)
  • Qwiklabs: Hands-on labs with temporary AWS accounts
  • A Cloud Guru: Sandbox environments for practice

D.6 Community Resources

AWS Forums & Communities:

  • AWS re:Post: https://repost.aws/

    • Official AWS community forum
    • Ask questions, share knowledge
    • AWS experts and community members
  • AWS Subreddit: r/aws

    • Community discussions
    • Certification advice
    • Real-world scenarios
  • AWS Discord/Slack Communities

    • Real-time discussions
    • Study groups
    • Certification channels

Blogs & Articles:

D.7 Tools & Utilities

Network Planning & Design:

  • AWS VPC CIDR Calculator: Plan IP address ranges
  • Draw.io / Lucidchart: Create network diagrams
  • CloudFormation / Terraform: Infrastructure as Code

Monitoring & Troubleshooting:

  • VPC Reachability Analyzer: Test network paths
  • AWS Network Manager: Visualize global networks
  • CloudWatch: Monitor metrics and logs
  • VPC Flow Logs: Analyze traffic patterns

Cost Management:

  • AWS Pricing Calculator: https://calculator.aws/

    • Estimate networking costs
    • Compare service options
    • Plan budgets
  • AWS Cost Explorer: Analyze actual spending

  • AWS Budgets: Set cost alerts

D.8 Exam-Specific Resources

Official Exam Resources:

Exam Day Preparation:

  • Pearson VUE: https://home.pearsonvue.com/aws

    • Schedule exam
    • Testing center locations
    • Online proctoring option
  • Note: AWS certification exams are now delivered exclusively through Pearson VUE; PSI is no longer a testing option.

After Passing:

  • AWS Certification Benefits: Digital badge, certification logo
  • Recertification: Required every 3 years
  • Continuing Education: Stay current with AWS updates

D.9 Staying Current

AWS Updates:

  • AWS What's New: https://aws.amazon.com/new/

    • Latest service announcements
    • Feature releases
    • Regional expansions
  • AWS re:Invent: Annual conference

    • New service announcements
    • Technical sessions
    • Networking opportunities
  • AWS Summits: Regional events

    • Local AWS community
    • Technical workshops
    • Certification lounges

Newsletter Subscriptions:

  • AWS Week in Review
  • AWS Architecture Monthly
  • Service-specific newsletters

D.10 Study Tips & Best Practices

Effective Study Strategies:

  1. Hands-On Practice: Build real architectures, don't just read
  2. Understand WHY: Don't memorize, understand the reasoning
  3. Use Multiple Resources: Combine official docs, courses, and practice
  4. Join Study Groups: Learn from others, share knowledge
  5. Take Notes: Summarize key concepts in your own words
  6. Practice Questions: Use practice tests to identify weak areas
  7. Review Mistakes: Understand why you got questions wrong
  8. Time Management: Simulate exam conditions during practice

Common Pitfalls to Avoid:

  • āŒ Relying only on practice tests (understand concepts)
  • āŒ Skipping hands-on labs (practical experience is crucial)
  • āŒ Ignoring official documentation (most accurate source)
  • āŒ Cramming the night before (consistent study is better)
  • āŒ Not reviewing mistakes (learn from errors)

Exam Day Tips:

  • āœ… Get good sleep the night before
  • āœ… Arrive early (or test tech for online proctoring)
  • āœ… Read questions carefully (watch for keywords)
  • āœ… Eliminate obviously wrong answers first
  • āœ… Flag difficult questions and return later
  • āœ… Manage your time (don't spend too long on one question)
  • āœ… Trust your preparation

Final Words

You're Ready When...

Knowledge Indicators

  • You can explain concepts without notes: Close the study guide and explain VPC, Transit Gateway, Direct Connect, Route 53 routing policies, and load balancers to someone else
  • You recognize question patterns instantly: When you read a scenario, you immediately know which services and configurations are being tested
  • You make decisions quickly using frameworks: You can apply the decision trees in this appendix without hesitation
  • You understand the WHY, not just the WHAT: You know why certain solutions are better than others in specific scenarios
  • You can troubleshoot scenarios: Given a problem, you can identify the root cause and solution

Practice Test Indicators

  • You score 75%+ on all practice tests consistently: Not just once, but repeatedly across different test bundles
  • You understand ALL explanations: Even for questions you got right, you understand why other options were wrong
  • You can predict the correct answer: Before looking at options, you know what the solution should be
  • You finish practice tests with time to spare: You're not rushing through questions
  • You identify trap answers easily: You recognize distractors and understand why they're incorrect

Practical Indicators

  • You've built the architectures: You've created VPCs, configured Transit Gateway, set up VPN connections, deployed load balancers
  • You've troubleshot real issues: You've used VPC Flow Logs, Reachability Analyzer, and CloudWatch to diagnose problems
  • You've compared services hands-on: You've used both ALB and NLB, VPC Peering and Transit Gateway, CloudFront and Global Accelerator
  • You understand the cost implications: You know which solutions are cost-effective and which are expensive

Mental Readiness

  • You feel confident, not anxious: Nervousness is normal, but you trust your preparation
  • You've reviewed the cheat sheet: Key facts and limits are fresh in your mind
  • You've practiced time management: You know how to pace yourself during the exam
  • You're well-rested: You've gotten good sleep and are mentally sharp

Remember on Exam Day

Read Carefully

  • Keywords matter: "most cost-effective", "highest performance", "least operational overhead", "most secure"
  • Requirements are constraints: If it says "must support IPv6", eliminate options that don't
  • Scenario details are clues: Company size, traffic volume, geographic distribution all point to specific solutions

Think Like an Architect

  • Consider all requirements: Performance, cost, security, scalability, operational overhead
  • Eliminate wrong answers first: Often easier than finding the right answer immediately
  • Choose the BEST answer: All options might work, but one is optimal for the scenario
  • Don't overthink: Your first instinct is usually correct if you've studied well

Time Management

  • Don't get stuck: If a question takes more than 2 minutes, flag it and move on
  • Answer all questions: There's no penalty for wrong answers
  • Review flagged questions: Use remaining time to revisit difficult questions
  • Trust your preparation: You've studied hard, trust your knowledge

Common Exam Patterns

  • Hybrid connectivity: Direct Connect vs VPN, redundancy, failover
  • Multi-VPC architectures: Transit Gateway vs VPC Peering, transitive routing
  • Load balancing: ALB vs NLB vs GWLB, choosing the right type
  • DNS routing: Route 53 policies, health checks, failover
  • Content delivery: CloudFront vs Global Accelerator, caching strategies
  • Security: Security groups vs NACLs, WAF, Shield, encryption
  • Troubleshooting: VPC Flow Logs, Reachability Analyzer, common issues

After the Exam

If You Pass

Congratulations! šŸŽ‰

  • You've earned the AWS Certified Advanced Networking - Specialty certification
  • Update your resume and LinkedIn with your certification
  • Download your digital badge from AWS Certification
  • Consider the next certification in your path
  • Share your knowledge with others preparing for the exam
  • Stay current with AWS networking updates (recertification in 3 years)

If You Don't Pass

Don't be discouraged! This is a challenging exam.

  • Review your score report to identify weak domains
  • Focus your study on those specific areas
  • Take more practice tests in weak domains
  • Build more hands-on labs for concepts you struggled with
  • Join study groups or forums to discuss difficult topics
  • Schedule a retake when you're ready (wait period applies)
  • Many successful candidates pass on their second attempt

Final Thoughts

The AWS Certified Advanced Networking - Specialty certification demonstrates your expertise in designing, implementing, and managing complex AWS network architectures. It's a challenging certification that requires both theoretical knowledge and practical experience.

You've put in the work:

  • You've studied the fundamentals and advanced concepts
  • You've learned about VPCs, hybrid connectivity, load balancing, DNS, content delivery, and security
  • You've practiced with hundreds of questions
  • You've built hands-on experience with AWS networking services
  • You've reviewed decision frameworks and best practices

Trust your preparation. You're ready for this.

Key Principles to Remember:

  1. Understand the WHY: AWS wants architects who understand reasoning, not just memorization
  2. Think holistically: Consider performance, cost, security, scalability, and operational overhead
  3. Choose the right tool: AWS has many services; pick the best one for each scenario
  4. Design for resilience: High availability, fault tolerance, and disaster recovery are critical
  5. Security is paramount: Always consider security implications in your designs
  6. Optimize costs: Balance performance with cost-effectiveness
  7. Automate when possible: Reduce operational overhead through automation

You've got this! šŸ’Ŗ

Go into that exam with confidence. You've prepared thoroughly, you understand the concepts, and you're ready to demonstrate your expertise.

Good luck on your AWS Certified Advanced Networking - Specialty exam!


Appendix E: Quick Facts Cheat Sheet

Must-Memorize Numbers

VPC Limits:

  • VPCs per region: 5 (default)
  • Subnets per VPC: 200
  • IPv4 CIDR blocks per VPC: 5 (default), 50 (max)
  • VPC Peering per VPC: 50 (default), 125 (max)
  • Security groups per VPC: 2,500 (default)
  • Rules per security group: 60 inbound and 60 outbound (default)
  • Security groups per ENI: 5 (default), 16 (max)

Direct Connect:

  • VIFs per dedicated connection: 50 private/public + 1 transit
  • BGP routes advertised to AWS: 100 per private/transit VIF (1,000 per public VIF)
  • VGWs per DX Gateway: 10
  • Transit Gateways per DX Gateway: 3

Transit Gateway:

  • Attachments per TGW: 5,000
  • Bandwidth per VPC attachment: 50 Gbps per AZ
  • MTU same region: 8,500 bytes
  • MTU cross-region: 1,500 bytes

Route 53:

  • Hosted zones per account: 500
  • Records per hosted zone: 10,000
  • Health checks per account: 200

Load Balancers:

  • Load balancers per region: 50 (each type)
  • Targets per ALB: 1,000
  • Targets per NLB: 3,000
  • Rules per ALB listener: 100

MTU Sizes:

  • Standard: 1,500 bytes
  • Jumbo frames (VPC): 9,001 bytes
  • VPN: 1,500 bytes (always)
  • Transit Gateway (same region): 8,500 bytes

BGP ASN:

  • Private range: 64,512 - 65,534
  • AWS default: 64,512

AWS Reserved IPs per Subnet: 5

  • Network address (.0)
  • VPC router (.1)
  • DNS server (.2)
  • Reserved (.3)
  • Last address in the subnet (network broadcast; broadcast is not supported in a VPC)
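
Because these 5 addresses are always reserved, usable capacity is the total address count minus 5. A quick sanity check of subnet sizing (standard CIDR arithmetic, IPv4 only):

  # Usable IPv4 addresses in an AWS subnet = 2^(32 - prefix) - 5 reserved addresses.
  for prefix in (28, 24, 20, 16):
      usable = 2 ** (32 - prefix) - 5
      print(f"/{prefix}: {usable} usable addresses")

  # /28: 11 usable    (smallest subnet AWS allows)
  # /24: 251 usable
  # /20: 4091 usable
  # /16: 65531 usable (largest subnet AWS allows)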

Critical Concepts

VPC Peering:

  • āŒ No transitive routing
  • āŒ No overlapping CIDR blocks
  • āœ… Cross-region supported
  • āœ… Cross-account supported
  • āœ… Free (data transfer charges apply)
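
A quick way to pre-check the overlapping-CIDR constraint before requesting a peering connection is Python's standard ipaddress module (the CIDRs below are examples):

  from ipaddress import ip_network

  # VPC peering is rejected if the two VPCs have overlapping IPv4 CIDR blocks.
  vpc_a = ip_network("10.0.0.0/16")
  vpc_b = ip_network("10.0.128.0/17")   # falls inside vpc_a
  vpc_c = ip_network("10.1.0.0/16")     # does not overlap

  print(vpc_a.overlaps(vpc_b))  # True  -> peering would fail
  print(vpc_a.overlaps(vpc_c))  # False -> peering is possible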

Transit Gateway:

  • āœ… Transitive routing
  • āœ… Hub-and-spoke model
  • āœ… 5,000 attachments
  • āœ… Cross-region peering
  • āŒ Costs per attachment + data

Direct Connect:

  • Dedicated: 1, 10, 100 Gbps
  • Hosted: 50 Mbps - 10 Gbps
  • Setup time: Weeks to months
  • āœ… Consistent performance
  • āŒ Not encrypted by default

VPN:

  • Bandwidth: 1.25 Gbps per tunnel
  • 2 tunnels per connection (HA)
  • Setup time: Minutes to hours
  • āœ… Encrypted (IPsec)
  • āŒ Variable latency (internet)

Load Balancer Selection:

  • HTTP/HTTPS + routing → ALB
  • TCP/UDP + performance → NLB
  • Static IP needed → NLB
  • Third-party appliances → GWLB

Route 53 Routing:

  • Failover → Active-passive DR
  • Weighted → A/B testing, gradual migration
  • Latency → Best performance globally
  • Geolocation → Compliance, localization
  • Geoproximity → Traffic shifting with bias

CloudFront vs Global Accelerator:

  • HTTP/HTTPS + caching → CloudFront
  • TCP/UDP + static IP → Global Accelerator
  • Video streaming → CloudFront
  • Gaming/VoIP → Global Accelerator

Security Layers:

  • Subnet level → Network ACLs (stateless)
  • Instance level → Security Groups (stateful)
  • Application level → AWS WAF
  • DDoS → AWS Shield

You're ready. Trust your preparation. Good luck! šŸš€