How to Set Up a Scalable and Secure VPC on AWS
A Virtual Private Cloud (VPC) is the foundational network boundary for your resources within Amazon Web Services. A poorly conceived VPC architecture is a primary source of technical debt, creating critical security vulnerabilities and severe bottlenecks for scalability. Conversely, a well-architected VPC, built on first principles, enables a "secure by default" posture, simplifies operations, and scales seamlessly with your workloads.
This article is a technical, implementation-focused guide for CTOs and senior engineers. We will move beyond the "default VPC" and construct a production-grade network fabric, detailing the precise architectural decisions and configurations required for a secure, scalable, and multi-Availability Zone (AZ) deployment.
Core Architectural Decision: VPC and Subnet CIDR Planning
The most critical and irreversible decision is your VPC's primary Classless Inter-Domain Routing (CIDR) block. Once a VPC is created, its primary CIDR cannot be changed; you can only associate additional secondary CIDR blocks later.
Problem: Choosing a common CIDR (e.g., 172.16.0.0/16 or 192.168.0.0/16) creates a high probability of IP conflicts when you inevitably need to connect to an on-premises network (via VPN or Direct Connect) or peer with another VPC (e.g., a partner or a SaaS provider).
Solution:
- Use a Large Block: Always start with a /16 block. This provides 65,536 private IP addresses, which is more than sufficient for growth and subnetting. IP addresses are free; IP exhaustion is a catastrophic failure.
- Avoid Common Blocks: Select an uncommon block from the 10.0.0.0/8 range. For example, choose something specific like 10.100.0.0/16. This simple decision will prevent countless future networking conflicts.
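Before committing to a block, it is worth enumerating what is already allocated in every account and region you may one day need to connect. A minimal sketch (the profile name is a placeholder for whatever credentials you use):
# List CIDRs already in use by VPCs in the current account/region
aws ec2 describe-vpcs --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock' --output text
# Repeat for other accounts/regions you might peer or VPN with, e.g.:
aws ec2 describe-vpcs --profile shared-services --region eu-west-1 \
  --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock' --output text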
Subnetting Strategy: The Multi-AZ, Multi-Tier Model
Your subnetting strategy must be driven by two principles: High Availability (HA) and Security Isolation.
- HA: Your application must survive the failure of an entire Availability Zone. This means you must have a presence in at least two, preferably three, AZs.
- Isolation: Resources should be segmented by their function and security posture. The primary division is Public vs. Private.
- Public Subnets: Contain resources that must have a direct route to the Internet Gateway (IGW). This tier is for internet-facing Elastic Load Balancers (ELBs) and NAT Gateways.
- Private Subnets: Contain your protected resources (application servers, container tasks, databases) that must never be directly accessible from the internet.
Combining these, a robust model involves creating paired subnets for each functional tier in each AZ.
Example VPC Plan:
- VPC CIDR: 10.100.0.0/16
- Region: us-east-1 (with AZs us-east-1a, us-east-1b, us-east-1c)
We will use /24 blocks for our subnets (256 addresses each, of which AWS reserves five, leaving 251 usable), which is a common and flexible size.
| Subnet Name | AZ | CIDR Block | Type | Purpose |
| --- | --- | --- | --- | --- |
| public-a | us-east-1a | 10.100.10.0/24 | Public | ELB, NAT Gateway A |
| public-b | us-east-1b | 10.100.11.0/24 | Public | ELB, NAT Gateway B |
| public-c | us-east-1c | 10.100.12.0/24 | Public | ELB, NAT Gateway C |
| private-app-a | us-east-1a | 10.100.20.0/24 | Private | Application Tier A |
| private-app-b | us-east-1b | 10.100.21.0/24 | Private | Application Tier B |
| private-app-c | us-east-1c | 10.100.22.0/24 | Private | Application Tier C |
| private-db-a | us-east-1a | 10.100.30.0/24 | Private | Database Tier A |
| private-db-b | us-east-1b | 10.100.31.0/24 | Private | Database Tier B |
| private-db-c | us-east-1c | 10.100.32.0/24 | Private | Database Tier C |
This structure gives us clear separation of concerns, HA for all tiers, and ample room for future expansion (e.g., adding private-data or private-mgmt tiers).
Implementation: Routing, Gateways, and Egress
With the plan defined, the implementation involves creating the networking components and "wiring" them together with Route Tables.
Step 1: Create VPC, Subnets, and Internet Gateway (IGW)
First, create the core components. We'll use the AWS CLI for precision.
# 1. Create the VPC
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.100.0.0/16 \
--query 'Vpc.VpcId' --output text)
aws ec2 create-tags --resources $VPC_ID --tags Key=Name,Value=prod-vpc
# 2. Create the Internet Gateway and attach it
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 create-tags --resources $IGW_ID --tags Key=Name,Value=prod-igw
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID
# 3. Create public subnets (example for AZ-a)
# (Enable auto-assign public IP for convenience in this subnet)
SUBNET_PUB_A=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.100.10.0/24 \
--availability-zone us-east-1a --query 'Subnet.SubnetId' --output text)
aws ec2 modify-subnet-attribute --subnet-id $SUBNET_PUB_A --map-public-ip-on-launch
aws ec2 create-tags --resources $SUBNET_PUB_A --tags Key=Name,Value=public-a
# 4. Create private subnets (example for AZ-a)
SUBNET_APP_A=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.100.20.0/24 \
--availability-zone us-east-1a --query 'Subnet.SubnetId' --output text)
aws ec2 create-tags --resources $SUBNET_APP_A --tags Key=Name,Value=private-app-a
# ... repeat for all other subnets in our plan ...
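Rather than typing the remaining seven create-subnet calls by hand, a short loop over the plan keeps names, CIDRs, and AZs consistent. A minimal sketch, using the same commands as above driven by the table:
# Create the remaining subnets from the plan (name:cidr:az triples)
for spec in \
  public-b:10.100.11.0/24:us-east-1b \
  public-c:10.100.12.0/24:us-east-1c \
  private-app-b:10.100.21.0/24:us-east-1b \
  private-app-c:10.100.22.0/24:us-east-1c \
  private-db-a:10.100.30.0/24:us-east-1a \
  private-db-b:10.100.31.0/24:us-east-1b \
  private-db-c:10.100.32.0/24:us-east-1c
do
  IFS=: read -r NAME CIDR AZ <<< "$spec"
  SUBNET_ID=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block $CIDR \
    --availability-zone $AZ --query 'Subnet.SubnetId' --output text)
  aws ec2 create-tags --resources $SUBNET_ID --tags Key=Name,Value=$NAME
done
# Remember to enable map-public-ip-on-launch for public-b and public-c, as above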
Step 2: Configure Public vs. Private Routing
Routing is what defines a subnet as "public" or "private".
Private Route Tables (HA Egress): Private subnets must not have a route to the IGW. For outbound internet access (e.g., for software patches or calling external APIs), they must use a NAT Gateway (NGW). For HA, we must provision one NGW in each Availability Zone (e.g., in public-a, public-b, public-c). We then create a separate private route table for each AZ that points to its local NGW. This prevents a single NGW failure from taking down outbound connectivity for all AZs.
# --- Configuration for AZ-A ---
# 1. Create Elastic IP for NGW-A
EIP_A=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
# 2. Create NGW-A in the public-a subnet
NGW_A=$(aws ec2 create-nat-gateway --subnet-id $SUBNET_PUB_A --allocation-id $EIP_A \
--query 'NatGateway.NatGatewayId' --output text)
aws ec2 create-tags --resources $NGW_A --tags Key=Name,Value=nat-gateway-a
# Wait for the NGW to become available before creating routes through it
aws ec2 wait nat-gateway-available --nat-gateway-ids $NGW_A
# 3. Create a private route table for AZ-A
RTB_PRIVATE_A=$(aws ec2 create-route-table --vpc-id $VPC_ID \
--query 'RouteTable.RouteTableId' --output text)
aws ec2 create-tags --resources $RTB_PRIVATE_A --tags Key=Name,Value=rtb-private-a
# 4. Add default route via NGW-A
aws ec2 create-route --route-table-id $RTB_PRIVATE_A \
--destination-cidr-block 0.0.0.0/0 --nat-gateway-id $NGW_A
# 5. Associate with ALL private subnets in AZ-A
aws ec2 associate-route-table --subnet-id $SUBNET_APP_A --route-table-id $RTB_PRIVATE_A
# ... associate with private-db-a ...
# --- Repeat steps 1-5 for AZ-B and AZ-C ---
# (Create EIP-B, NGW-B in public-b, RTB_PRIVATE_B, route 0.0.0.0/0 to NGW-B,
# and associate with private-app-b, private-db-b)
Public Route Table: Create a single route table for all public subnets. Its defining feature is a default route (0.0.0.0/0) pointing to the Internet Gateway.
# Create public route table
RTB_PUBLIC=$(aws ec2 create-route-table --vpc-id $VPC_ID \
--query 'RouteTable.RouteTableId' --output text)
aws ec2 create-tags --resources $RTB_PUBLIC --tags Key=Name,Value=rtb-public
# Add the "public" route to the IGW
aws ec2 create-route --route-table-id $RTB_PUBLIC \
--destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
# Associate with our public subnets
aws ec2 associate-route-table --subnet-id $SUBNET_PUB_A --route-table-id $RTB_PUBLIC
# ... associate with public-b, public-c ...
At this point, any EC2 instance launched in private-app-a has no public IP and is not reachable from the internet, but it can initiate outbound connections via nat-gateway-a.
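Before moving on, a quick sanity check that the wiring matches intent; a minimal sketch reusing the IDs captured above:
# The private route table's default route should target the NAT gateway
aws ec2 describe-route-tables --route-table-ids $RTB_PRIVATE_A \
  --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`].NatGatewayId'
# The public route table's default route should target the IGW
aws ec2 describe-route-tables --route-table-ids $RTB_PUBLIC \
  --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`].GatewayId'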
Layered Security: Security Groups vs. Network ACLs
A common point of failure is misunderstanding the two layers of VPC firewalls.
Security Groups (SGs)
- What they are: A stateful, instance-level firewall.
- Stateful: If you allow inbound traffic (e.g., port 443), the return outbound traffic is automatically allowed, regardless of outbound rules.
- Scope: Applied to an Elastic Network Interface (ENI), effectively an instance.
- Rules: Allow-only. You cannot create "deny" rules.
- Best Practice: Use SGs as your primary, granular firewall. A key technique is referencing other SGs in your rules: the rule then tracks group membership automatically as instances come and go, which is far superior to hard-coding CIDR blocks.
Terraform (HCL) Example: This pattern is ideal.
resource "aws_security_group" "lb_sg" {
name = "prod-lb-sg"
vpc_id = aws_vpc.prod.id
# Allow public web traffic
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "app_sg" {
name = "prod-app-sg"
vpc_id = aws_vpc.prod.id
# ONLY allow traffic from our load balancer
ingress {
from_port = 8080 # App port
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.lb_sg.id] # Source is the LB SG
}
}
resource "aws_security_group" "db_sg" {
name = "prod-db-sg"
vpc_id = aws_vpc.prod.id
# ONLY allow traffic from our application tier
ingress {
from_port = 5432 # PostgreSQL port
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app_sg.id] # Source is the App SG
}
}
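One Terraform-specific caveat with the example above: when aws_security_group uses inline rules, Terraform removes AWS's default allow-all egress rule, so these groups as written would block all outbound-initiated traffic (stateful return traffic still flows). Re-declare egress explicitly in each resource, for example:
  # Terraform strips the default allow-all egress when inline rules are used;
  # restore it explicitly (or write narrower per-tier egress rules)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # all protocols
    cidr_blocks = ["0.0.0.0/0"]
  }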
Network Access Control Lists (NACLs)
- What they are: A stateless, subnet-level firewall.
- Stateless: If you allow inbound traffic, you must also explicitly allow the return outbound traffic.
- Scope: Applied to one or more subnets.
- Rules: Allow and Deny rules. Rules are processed by number, in order.
- Recommendation: Leave the default NACL as "ALLOW ALL" (which it is by default). Use SGs for 99% of your security. NACLs are a blunt instrument, best reserved for broad, explicit "deny" rules (e.g., "deny all traffic from known-malicious IP range 1.2.3.0/24"), as sketched below. Overly complex NACLs are a common cause of network-connectivity debugging nightmares.
Securing Internal Traffic: VPC Endpoints
A major security hole in many VPCs is that communication from a private subnet to an AWS service (like S3 or DynamoDB) is routed out through the NAT Gateway to the service's public endpoints by default. This adds latency, incurs NAT Gateway data-processing charges, and increases your attack surface.
Solution: VPC Endpoints keep this traffic on the AWS private network.
Gateway Endpoints
- Services: S3 and DynamoDB.
- How they work: You create the endpoint and associate it with your private route tables. AWS automatically adds a route for the service's public IP range to the endpoint.
- Cost: Free.
Implementation (S3 Endpoint):
# Create the gateway endpoint for S3
aws ec2 create-vpc-endpoint --vpc-id $VPC_ID \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids $RTB_PRIVATE_A $RTB_PRIVATE_B $RTB_PRIVATE_C
# Best Practice: Attach a policy to restrict access
# This example policy only allows Get/PutObject to a specific bucket
# from a specific IAM role within your instances.
# (Policy JSON omitted for brevity)
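For illustration only, a minimal sketch of such a policy; the bucket name, account ID, and role name are placeholders, and $S3_ENDPOINT_ID is assumed to have been captured from the create-vpc-endpoint call above:
cat > s3-endpoint-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-app-bucket/*",
    "Condition": {
      "ArnEquals": {"aws:PrincipalArn": "arn:aws:iam::123456789012:role/my-app-role"}
    }
  }]
}
EOF
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $S3_ENDPOINT_ID \
  --policy-document file://s3-endpoint-policy.json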
Now, any S3 SDK call from an instance in a private subnet automatically and transparently routes via the private endpoint, not the NAT Gateway.
Interface Endpoints (AWS PrivateLink)
- Services: Most other AWS services (SQS, SNS, Kinesis, CodeCommit, etc.) and your own services.
- How they work: Creates an Elastic Network Interface (ENI) with a private IP inside your private subnets. You access the service via a private DNS name.
- Cost: Billed per hour and per-GB of data processed.
- Implementation: You must specify which private subnets (one per AZ for HA) the endpoint ENIs should live in.
# Example for SQS
aws ec2 create-vpc-endpoint --vpc-id $VPC_ID \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.sqs \
--subnet-ids $SUBNET_APP_A $SUBNET_APP_B $SUBNET_APP_C \
--security-group-ids $YOUR_ENDPOINT_SG_ID
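Provided private DNS is enabled on the endpoint (the --private-dns-enabled option at creation), the regional SQS hostname resolves to the endpoint ENIs' private IPs from inside the VPC, so SDK clients need no code changes. A quick check from an instance:
# Should return private 10.100.2x.x addresses, not public IPs
dig +short sqs.us-east-1.amazonaws.com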
Using endpoints is a non-negotiable component of a secure VPC design.
Scaling Beyond One VPC: AWS Transit Gateway
As your organization grows, you will have multiple VPCs (e.g., prod, dev, shared-services, data-science). The legacy solution, VPC Peering, is non-transitive: full connectivity between N VPCs requires N(N-1)/2 peering connections plus route-table entries in every VPC, a mesh that quickly becomes unmanageable.
Solution: AWS Transit Gateway (TGW).
The TGW acts as a central cloud router in a hub-and-spoke model.
- You create one TGW.
- All your VPCs (the "spokes") attach to this TGW (the "hub").
- Your on-premises network (via VPN/Direct Connect) also attaches to the TGW.
This model dramatically simplifies routing:
- Each VPC only needs one route (e.g., 10.0.0.0/8) pointing to the TGW attachment.
- The TGW's route tables control all inter-VPC and VPC-to-on-premises traffic.
- It enables a "centralized inspection" model, where all traffic can be routed through a dedicated "security" VPC (with 3rd-party firewalls) before reaching its destination.
For any architecture expected to grow beyond two or three VPCs, start with a Transit Gateway. Do not build a peer-to-peer mesh that you will be forced to refactor.
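For reference, a minimal sketch of the hub-and-spoke wiring in the same CLI style; it reuses variables from earlier steps, and the AZ-B/C subnet variables are assumed to have been captured when those subnets were created:
# 1. Create the hub
TGW_ID=$(aws ec2 create-transit-gateway \
  --query 'TransitGateway.TransitGatewayId' --output text)
# (wait for the TGW to become available before attaching)
# 2. Attach the prod VPC, one subnet per AZ for HA
aws ec2 create-transit-gateway-vpc-attachment --transit-gateway-id $TGW_ID \
  --vpc-id $VPC_ID --subnet-ids $SUBNET_APP_A $SUBNET_APP_B $SUBNET_APP_C
# 3. In each VPC route table, send the rest of the 10/8 space to the TGW
# (the VPC's own, more specific local route still wins for 10.100.0.0/16)
aws ec2 create-route --route-table-id $RTB_PRIVATE_A \
  --destination-cidr-block 10.0.0.0/8 --transit-gateway-id $TGW_ID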
Conclusion
A production-grade VPC is not a "set it and forget it" component. It is a living design that must be based on intentional architectural decisions. By focusing on a logical CIDR plan, a multi-AZ subnetting strategy, HA-aware routing, and layered security via SGs and VPC Endpoints, you create a foundation that enables, rather than hinders, your application's security and growth.
Building this foundation correctly is a core competency of expert cloud engineering services. Whether your remote teams are building new platforms or migrating existing ones, mastering this network layer is a prerequisite for success on AWS.