Right-Sizing EC2: Tools and Strategies
Optimize EC2 costs through right-sizing. Learn to analyze utilization metrics, use AWS tools, and implement automated recommendations for significant cost savings.
Right-sizing is the lowest-hanging fruit in cloud cost optimization. Industry estimates regularly put 30-40% of EC2 instances as oversized: money burned on unused CPU and memory. This guide covers how to identify oversized instances and resize them safely.
Understanding Utilization
What Metrics Matter
key_metrics:
cpu:
metric: CPUUtilization
healthy_range: "40-80% average"
warning_sign: "< 20% average over 14 days"
memory:
metric: mem_used_percent (CloudWatch Agent)
healthy_range: "50-80% average"
warning_sign: "< 40% average"
network:
metrics:
- NetworkIn
- NetworkOut
consideration: "Network-optimized instances if > 5 Gbps"
disk:
metrics:
- DiskReadOps
- DiskWriteOps
- DiskReadBytes
- DiskWriteBytes
consideration: "IO-optimized instances if high IOPS"
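As a quick first pass, the thresholds in the table can be encoded as a triage function. A minimal sketch (the function name and tier labels are illustrative; the threshold values are taken from the table above):

```python
from typing import Optional

def triage(cpu_avg: float, mem_avg: Optional[float] = None) -> str:
    """Classify an instance against the utilization thresholds above."""
    if cpu_avg < 20 and (mem_avg is None or mem_avg < 40):
        return "downsize_candidate"  # warning-sign territory on both axes
    if 40 <= cpu_avg <= 80:
        return "healthy"
    return "review"  # in between, or possibly undersized

print(triage(12.0, 35.0))  # downsize_candidate
```

Instances without memory data still qualify on CPU alone here, which is looser than the analysis script later in this guide; tighten to taste.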
CloudWatch Agent for Memory Metrics
{
"agent": {
"metrics_collection_interval": 60
},
"metrics": {
"namespace": "CWAgent",
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent", "mem_available"],
"metrics_collection_interval": 60
},
"disk": {
"measurement": ["disk_used_percent"],
"resources": ["/"],
"metrics_collection_interval": 60
},
"cpu": {
"measurement": ["cpu_usage_active"],
"metrics_collection_interval": 60
}
},
"append_dimensions": {
"InstanceId": "${aws:InstanceId}",
"AutoScalingGroupName": "${aws:AutoScalingGroupName}"
}
}
}
Terraform for CloudWatch Agent
resource "aws_ssm_parameter" "cloudwatch_agent_config" {
name = "/cloudwatch-agent/config"
type = "String"
value = file("${path.module}/cloudwatch-agent-config.json")
}
resource "aws_ssm_association" "cloudwatch_agent" {
name = "AmazonCloudWatch-ManageAgent"
targets {
key = "tag:Environment"
values = [var.environment]
}
parameters = {
action = "configure"
mode = "ec2"
optionalConfigurationSource = "ssm"
optionalConfigurationLocation = aws_ssm_parameter.cloudwatch_agent_config.name
}
}
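For a one-off rollout outside Terraform, the same SSM document can be invoked directly with Run Command. A hedged sketch using boto3 (the parameter names mirror the association above; the parameter-store path is the one created by the Terraform resource):

```python
def configure_cloudwatch_agent(instance_ids: list[str]) -> str:
    """Push the CloudWatch Agent config to instances via SSM Run Command."""
    import boto3  # imported lazily so the snippet loads without AWS deps

    ssm = boto3.client("ssm")
    response = ssm.send_command(
        InstanceIds=instance_ids,
        DocumentName="AmazonCloudWatch-ManageAgent",
        Parameters={
            "action": ["configure"],
            "mode": ["ec2"],
            "optionalConfigurationSource": ["ssm"],
            "optionalConfigurationLocation": ["/cloudwatch-agent/config"],
        },
    )
    return response["Command"]["CommandId"]
```

The returned command ID can be polled with `get_command_invocation` to confirm the agent picked up the config.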
AWS Tools for Right-Sizing
Compute Optimizer
# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status --status Active
# Get recommendations
aws compute-optimizer get-ec2-instance-recommendations \
--filters name=Finding,values=Overprovisioned \
  --query 'instanceRecommendations[*].{
    Arn:instanceArn,
    Current:currentInstanceType,
    Recommended:recommendationOptions[0].instanceType,
    ProjectedCPU:recommendationOptions[0].projectedUtilizationMetrics[0].value
  }'
Cost Explorer Right-Sizing
# Get right-sizing recommendations
aws ce get-rightsizing-recommendation \
--service "AmazonEC2" \
--configuration '{
"RecommendationTarget": "SAME_INSTANCE_FAMILY",
"BenefitsConsidered": true
}'
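The JSON that comes back can be rolled up into a single savings number. A sketch of the parsing side (field names follow my reading of the GetRightsizingRecommendation response shape; verify against real output from your account before relying on them):

```python
def total_monthly_savings(response: dict) -> float:
    """Sum estimated monthly savings across modify recommendations."""
    total = 0.0
    for rec in response.get("RightsizingRecommendations", []):
        detail = rec.get("ModifyRecommendationDetail", {})
        for target in detail.get("TargetInstances", []):
            # EstimatedMonthlySavings is returned as a string
            total += float(target.get("EstimatedMonthlySavings", 0))
    return total

sample = {
    "RightsizingRecommendations": [
        {"ModifyRecommendationDetail": {"TargetInstances": [
            {"EstimatedMonthlySavings": "41.50"}]}},
        {"ModifyRecommendationDetail": {"TargetInstances": [
            {"EstimatedMonthlySavings": "12.25"}]}},
    ]
}
print(f"${total_monthly_savings(sample):.2f}")  # $53.75
```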
Custom Analysis Script
import boto3
from datetime import datetime, timedelta
from typing import List, Dict
import json
cloudwatch = boto3.client('cloudwatch')
ec2 = boto3.client('ec2')
pricing = boto3.client('pricing', region_name='us-east-1')
# Instance type hierarchy for downsizing
INSTANCE_FAMILIES = {
'm6i': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge', '32xlarge'],
'm5': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge'],
'c6i': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge', '32xlarge'],
'r6i': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge', '32xlarge'],
}
def get_instance_metrics(instance_id: str, days: int = 14) -> Dict:
"""Get utilization metrics for an instance."""
end_time = datetime.utcnow()
start_time = end_time - timedelta(days=days)
metrics = {}
# CPU Utilization
cpu_response = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum', 'Minimum']
)
if cpu_response['Datapoints']:
metrics['cpu'] = {
'avg': sum(p['Average'] for p in cpu_response['Datapoints']) / len(cpu_response['Datapoints']),
'max': max(p['Maximum'] for p in cpu_response['Datapoints']),
'min': min(p['Minimum'] for p in cpu_response['Datapoints'])
}
# Memory (if CloudWatch Agent installed)
try:
mem_response = cloudwatch.get_metric_statistics(
Namespace='CWAgent',
MetricName='mem_used_percent',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum']
)
if mem_response['Datapoints']:
metrics['memory'] = {
'avg': sum(p['Average'] for p in mem_response['Datapoints']) / len(mem_response['Datapoints']),
'max': max(p['Maximum'] for p in mem_response['Datapoints'])
}
    except Exception:
        # CloudWatch Agent not installed or metric unavailable; skip memory
        pass
# Network
for metric_name in ['NetworkIn', 'NetworkOut']:
response = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName=metric_name,
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum']
)
if response['Datapoints']:
metrics[metric_name.lower()] = {
'avg': sum(p['Average'] for p in response['Datapoints']) / len(response['Datapoints']),
'max': max(p['Maximum'] for p in response['Datapoints'])
}
return metrics
def suggest_downsize(current_type: str, metrics: Dict) -> Dict:
"""Suggest a smaller instance type based on metrics."""
family = current_type.rsplit('.', 1)[0]
size = current_type.rsplit('.', 1)[1]
if family not in INSTANCE_FAMILIES:
return {'recommendation': 'unknown_family', 'suggested': None}
sizes = INSTANCE_FAMILIES[family]
    if size not in sizes:
        return {'recommendation': 'unknown_size', 'suggested': None}
    current_idx = sizes.index(size)
    if current_idx == 0:
        return {'recommendation': 'already_smallest', 'suggested': None}
# Determine how many sizes to downgrade
cpu_avg = metrics.get('cpu', {}).get('avg', 100)
cpu_max = metrics.get('cpu', {}).get('max', 100)
    # Without the CloudWatch Agent there are no memory metrics; defaulting to
    # 100% keeps the check conservative (no downsize without memory data)
    mem_avg = metrics.get('memory', {}).get('avg', 100)
    mem_max = metrics.get('memory', {}).get('max', 100)
# Conservative: only downsize if both CPU and memory are low
if cpu_avg < 10 and cpu_max < 30 and mem_avg < 30:
steps = 2 # Aggressive downsize
elif cpu_avg < 20 and cpu_max < 50 and mem_avg < 50:
steps = 1 # Moderate downsize
else:
return {'recommendation': 'appropriately_sized', 'suggested': None}
new_idx = max(0, current_idx - steps)
suggested_type = f"{family}.{sizes[new_idx]}"
return {
'recommendation': 'downsize',
'current': current_type,
'suggested': suggested_type,
'reason': f"CPU avg: {cpu_avg:.1f}%, max: {cpu_max:.1f}%"
}
def analyze_all_instances(tag_filter: Dict = None) -> List[Dict]:
"""Analyze all instances and generate recommendations."""
filters = []
if tag_filter:
for key, value in tag_filter.items():
filters.append({'Name': f'tag:{key}', 'Values': [value]})
filters.append({'Name': 'instance-state-name', 'Values': ['running']})
instances = ec2.describe_instances(Filters=filters)
recommendations = []
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_id = instance['InstanceId']
instance_type = instance['InstanceType']
metrics = get_instance_metrics(instance_id)
if not metrics.get('cpu'):
continue
suggestion = suggest_downsize(instance_type, metrics)
recommendations.append({
'instance_id': instance_id,
'name': next((t['Value'] for t in instance.get('Tags', []) if t['Key'] == 'Name'), 'N/A'),
'current_type': instance_type,
'metrics': metrics,
**suggestion
})
return recommendations
def generate_report(recommendations: List[Dict]) -> str:
"""Generate a markdown report of recommendations."""
downsize = [r for r in recommendations if r.get('recommendation') == 'downsize']
appropriate = [r for r in recommendations if r.get('recommendation') == 'appropriately_sized']
smallest = [r for r in recommendations if r.get('recommendation') == 'already_smallest']
report = f"""# EC2 Right-Sizing Report
Generated: {datetime.utcnow().isoformat()}
## Summary
- **Total instances analyzed:** {len(recommendations)}
- **Downsize candidates:** {len(downsize)}
- **Appropriately sized:** {len(appropriate)}
- **Already smallest:** {len(smallest)}
## Downsize Recommendations
| Instance ID | Name | Current | Suggested | CPU Avg | CPU Max |
|-------------|------|---------|-----------|---------|---------|
"""
for r in sorted(downsize, key=lambda x: x['metrics']['cpu']['avg']):
report += f"| {r['instance_id']} | {r['name'][:20]} | {r['current_type']} | {r['suggested']} | {r['metrics']['cpu']['avg']:.1f}% | {r['metrics']['cpu']['max']:.1f}% |\n"
return report
if __name__ == '__main__':
recommendations = analyze_all_instances({'Environment': 'production'})
report = generate_report(recommendations)
print(report)
# Save detailed JSON
with open('rightsizing-report.json', 'w') as f:
json.dump(recommendations, f, indent=2, default=str)
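To prioritize which downsizes to do first, a back-of-the-envelope savings estimate helps: within these families, the on-demand price roughly doubles with each size step, so dropping n steps cuts the bill by roughly a factor of 2^n. A rough sketch (the $0.384/hr figure is the approximate us-east-1 on-demand price for m6i.2xlarge; check current pricing before acting):

```python
def estimated_savings(current_hourly: float, steps_down: int) -> float:
    """Rough monthly savings from dropping size steps within one family.

    Assumes on-demand price roughly halves per size step, which holds for
    most general-purpose families but is only an approximation.
    """
    HOURS_PER_MONTH = 730
    new_hourly = current_hourly / (2 ** steps_down)
    return (current_hourly - new_hourly) * HOURS_PER_MONTH

# m6i.2xlarge down one step to m6i.xlarge:
print(round(estimated_savings(0.384, 1), 2))  # ~$140/month
```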
Automated Right-Sizing Pipeline
Lambda for Weekly Analysis
import boto3
import json
import os
from datetime import datetime
def lambda_handler(event, context):
    """Weekly right-sizing analysis."""
    # analyze_all_instances() comes from the analysis script above and
    # must be packaged alongside this handler
    recommendations = analyze_all_instances()
# Filter actionable recommendations
actionable = [r for r in recommendations if r.get('recommendation') == 'downsize']
if not actionable:
return {'status': 'no_action_needed'}
# Store report in S3
s3 = boto3.client('s3')
report_key = f"rightsizing/{datetime.utcnow().strftime('%Y/%m/%d')}/report.json"
s3.put_object(
Bucket=os.environ['REPORTS_BUCKET'],
Key=report_key,
Body=json.dumps(recommendations, default=str),
ContentType='application/json'
)
# Send notification
sns = boto3.client('sns')
sns.publish(
TopicArn=os.environ['ALERTS_TOPIC'],
Subject=f"EC2 Right-Sizing: {len(actionable)} instances to review",
Message=f"""
Right-sizing analysis complete.
Found {len(actionable)} instances that may be oversized.
Top candidates:
{chr(10).join(f"- {r['instance_id']} ({r['name']}): {r['current_type']} → {r['suggested']}" for r in actionable[:5])}
Full report: s3://{os.environ['REPORTS_BUCKET']}/{report_key}
"""
)
return {
'analyzed': len(recommendations),
'actionable': len(actionable),
'report_location': f"s3://{os.environ['REPORTS_BUCKET']}/{report_key}"
}
Safe Resize Procedure
#!/bin/bash
# safe-resize.sh - Resize an instance with safeguards
INSTANCE_ID=$1
NEW_TYPE=$2
DRY_RUN=${3:-true}
if [ -z "$INSTANCE_ID" ] || [ -z "$NEW_TYPE" ]; then
echo "Usage: $0 <instance-id> <new-type> [dry-run=true]"
exit 1
fi
# Get current state
CURRENT_TYPE=$(aws ec2 describe-instances \
--instance-ids $INSTANCE_ID \
--query 'Reservations[0].Instances[0].InstanceType' \
--output text)
CURRENT_STATE=$(aws ec2 describe-instances \
--instance-ids $INSTANCE_ID \
--query 'Reservations[0].Instances[0].State.Name' \
--output text)
echo "Instance: $INSTANCE_ID"
echo "Current type: $CURRENT_TYPE"
echo "New type: $NEW_TYPE"
echo "Current state: $CURRENT_STATE"
if [ "$DRY_RUN" == "true" ]; then
echo ""
echo "DRY RUN - No changes made"
echo "Run with 'false' as third argument to execute"
exit 0
fi
# Create AMI backup first
echo "Creating backup AMI..."
AMI_ID=$(aws ec2 create-image \
--instance-id $INSTANCE_ID \
--name "pre-resize-$INSTANCE_ID-$(date +%Y%m%d%H%M)" \
--no-reboot \
--query 'ImageId' \
--output text)
echo "Backup AMI: $AMI_ID"
# Wait for AMI
echo "Waiting for AMI to be available..."
aws ec2 wait image-available --image-ids $AMI_ID
# Stop instance
echo "Stopping instance..."
aws ec2 stop-instances --instance-ids $INSTANCE_ID
aws ec2 wait instance-stopped --instance-ids $INSTANCE_ID
# Modify instance type
echo "Modifying instance type..."
aws ec2 modify-instance-attribute \
--instance-id $INSTANCE_ID \
--instance-type "{\"Value\": \"$NEW_TYPE\"}"
# Start instance
echo "Starting instance..."
aws ec2 start-instances --instance-ids $INSTANCE_ID
aws ec2 wait instance-running --instance-ids $INSTANCE_ID
# Verify
NEW_CURRENT_TYPE=$(aws ec2 describe-instances \
--instance-ids $INSTANCE_ID \
--query 'Reservations[0].Instances[0].InstanceType' \
--output text)
echo ""
echo "Resize complete!"
echo "Old type: $CURRENT_TYPE"
echo "New type: $NEW_CURRENT_TYPE"
echo "Backup AMI: $AMI_ID (delete after verification)"
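The script waits for instance-running, but an instance can take several more minutes to pass its status checks, and a too-small type may boot yet struggle under load. A small post-resize check worth running before declaring victory, sketched with boto3's built-in waiter (the function name is illustrative):

```python
def wait_until_healthy(instance_id: str, region: str = "us-east-1") -> None:
    """Block until the instance passes both EC2 status checks post-resize."""
    import boto3  # imported lazily so the snippet loads without AWS deps

    ec2 = boto3.client("ec2", region_name=region)
    # Waits for system AND instance status checks; polls every 15s,
    # up to 40 attempts by default
    waiter = ec2.get_waiter("instance_status_ok")
    waiter.wait(InstanceIds=[instance_id])
    print(f"{instance_id} passed both status checks")
```

Only deregister the backup AMI once this passes and application-level health checks look normal.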
Graviton Migration
Why Graviton
Benefits:
- Up to 20% lower cost than comparable x86 instances
- Up to 40% better price/performance
- Same or better performance for most workloads
Compatible workloads:
- Web servers
- Containerized apps
- Java applications
- Python applications
- Most Linux workloads
Requires testing:
- Applications with x86 assembly
- Windows workloads (not supported)
- License-locked software
Graviton Instance Types
graviton_equivalents:
# x86 -> Graviton
m5.large: m6g.large
m5.xlarge: m6g.xlarge
c5.large: c6g.large
c5.xlarge: c6g.xlarge
r5.large: r6g.large
r5.xlarge: r6g.xlarge
t3.micro: t4g.micro
t3.small: t4g.small
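The table above is a simple family substitution, so the mapping can be generated rather than maintained by hand. A sketch (covers only the families listed above; anything else returns None for manual review):

```python
from typing import Optional

GRAVITON_FAMILY = {"m5": "m6g", "c5": "c6g", "r5": "r6g", "t3": "t4g"}

def graviton_equivalent(instance_type: str) -> Optional[str]:
    """Return the Graviton equivalent for a listed x86 type, else None."""
    family, _, size = instance_type.partition(".")
    target = GRAVITON_FAMILY.get(family)
    return f"{target}.{size}" if target else None

print(graviton_equivalent("m5.xlarge"))  # m6g.xlarge
```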
Terraform for Graviton ASG
resource "aws_launch_template" "graviton" {
name_prefix = "app-graviton-"
image_id = data.aws_ami.amazon_linux_arm64.id
instance_type = "m6g.large"
# ... rest of configuration
}
data "aws_ami" "amazon_linux_arm64" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-hvm-*-arm64-gp2"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
Key Takeaways
- Measure first — install CloudWatch Agent for memory metrics
- Use AWS tools — Compute Optimizer and Cost Explorer are free
- Automate analysis — weekly reports catch drift
- Downsize conservatively — leave headroom for spikes
- Consider Graviton — 20% savings with a rebuild
“The best instance size is one that runs at 60-70% utilization. Higher wastes performance, lower wastes money.”