Right-sizing is often the lowest-hanging fruit in cloud cost optimization. Commonly cited estimates put 30-40% of EC2 instances as oversized, which is money burned on unused CPU and memory. This guide covers how to identify oversized instances and resize them safely.

Understanding Utilization

What Metrics Matter

key_metrics:
  cpu:
    metric: CPUUtilization
    healthy_range: "40-80% average"
    warning_sign: "< 20% average over 14 days"

  memory:
    metric: mem_used_percent (CloudWatch Agent)
    healthy_range: "50-80% average"
    warning_sign: "< 40% average"

  network:
    metrics:
      - NetworkIn
      - NetworkOut
    consideration: "Network-optimized instances if > 5 Gbps"

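  disk:
    # The AWS/EC2 Disk* metrics cover instance store volumes only; for EBS
    # volumes on Nitro instances, use EBSReadOps, EBSWriteOps, EBSReadBytes, EBSWriteBytes
    metrics: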
      - DiskReadOps
      - DiskWriteOps
      - DiskReadBytes
      - DiskWriteBytes
    consideration: "IO-optimized instances if high IOPS"
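
The warning thresholds above can be turned into a quick triage check. The following is a minimal sketch, assuming the averages were already computed over the 14-day window; the helper name `classify_utilization` and the three result labels are hypothetical, and only the two threshold values come from the table.

```python
from typing import Optional

# Thresholds taken from the key_metrics table above
CPU_WARNING_AVG = 20.0  # "< 20% average over 14 days"
MEM_WARNING_AVG = 40.0  # "< 40% average"


def classify_utilization(cpu_avg: float, mem_avg: Optional[float]) -> str:
    """Classify an instance from its average utilization percentages.

    mem_avg is None when the CloudWatch Agent is not installed; in that
    case no verdict is given rather than guessing from CPU alone.
    """
    if mem_avg is None:
        return "needs_memory_metrics"
    if cpu_avg < CPU_WARNING_AVG and mem_avg < MEM_WARNING_AVG:
        return "likely_oversized"
    if cpu_avg < CPU_WARNING_AVG or mem_avg < MEM_WARNING_AVG:
        return "review"
    return "ok"


if __name__ == "__main__":
    print(classify_utilization(12.0, 25.0))
    print(classify_utilization(12.0, 70.0))
    print(classify_utilization(55.0, None))
```

Treating missing memory data as its own outcome keeps a low-CPU, memory-bound instance from being flagged by mistake.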

CloudWatch Agent for Memory Metrics

{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"],
        "metrics_collection_interval": 60
      },
      "cpu": {
        "measurement": ["cpu_usage_active"],
        "metrics_collection_interval": 60
      }
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [["InstanceId"]]
  }
}
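
Before rolling a config like this out, it can help to confirm that it actually collects the measurements the later analysis depends on. The following is a small sketch under that assumption; `missing_measurements` is a hypothetical helper, and it only inspects the `metrics.metrics_collected` section shown above.

```python
import json
from typing import List


def missing_measurements(config_text: str, required: List[str]) -> List[str]:
    """Return the required measurement names the agent config does not collect."""
    config = json.loads(config_text)
    collected = set()
    sections = config.get("metrics", {}).get("metrics_collected", {})
    for section in sections.values():
        for item in section.get("measurement", []):
            # Measurements may be plain strings or objects with a "name" key
            collected.add(item["name"] if isinstance(item, dict) else item)
    return [name for name in required if name not in collected]


if __name__ == "__main__":
    sample = json.dumps({
        "metrics": {
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent", "mem_available"]}
            }
        }
    })
    print(missing_measurements(sample, ["mem_used_percent", "disk_used_percent"]))
```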

Terraform for CloudWatch Agent

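# Note: the AWS-managed CloudWatchAgentServerPolicy only allows ssm:GetParameter on
# parameters named AmazonCloudWatch-*; with the name below, grant the instance role
# read access to this parameter explicitly.
resource "aws_ssm_parameter" "cloudwatch_agent_config" {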
  name  = "/cloudwatch-agent/config"
  type  = "String"
  value = file("${path.module}/cloudwatch-agent-config.json")
}

resource "aws_ssm_association" "cloudwatch_agent" {
  name = "AmazonCloudWatch-ManageAgent"

  targets {
    key    = "tag:Environment"
    values = [var.environment]
  }

  parameters = {
    action                        = "configure"
    mode                          = "ec2"
    optionalConfigurationSource   = "ssm"
    optionalConfigurationLocation = aws_ssm_parameter.cloudwatch_agent_config.name
  }
}

AWS Tools for Right-Sizing

Compute Optimizer

# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status --status Active

# Get recommendations for over-provisioned instances
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned \
  --query 'instanceRecommendations[*].{
    InstanceArn:instanceArn,
    Current:currentInstanceType,
    Recommended:recommendationOptions[0].instanceType,
    MonthlySavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }'
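
The same response can be post-processed in Python, for example to rank candidates by estimated savings. This is a minimal sketch that works on an already-fetched response dictionary; `summarize_recommendations` is a hypothetical helper, the sample payload is hand-written for illustration, and taking the first recommendation option as the preferred one is an assumption.

```python
from typing import Dict, List


def summarize_recommendations(response: Dict) -> List[Dict]:
    """Flatten a get-ec2-instance-recommendations response, largest savings first."""
    rows = []
    for rec in response.get("instanceRecommendations", []):
        options = rec.get("recommendationOptions", [])
        if not options:
            continue
        best = options[0]  # assumption: treat the first option as the preferred one
        savings = (
            best.get("savingsOpportunity", {})
            .get("estimatedMonthlySavings", {})
            .get("value", 0.0)
        )
        rows.append({
            "instance_id": rec["instanceArn"].rsplit("/", 1)[-1],
            "current": rec["currentInstanceType"],
            "recommended": best["instanceType"],
            "monthly_savings": savings,
        })
    return sorted(rows, key=lambda r: r["monthly_savings"], reverse=True)


if __name__ == "__main__":
    # Hand-written sample data, not real output
    sample = {"instanceRecommendations": [{
        "instanceArn": "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0",
        "currentInstanceType": "m5.2xlarge",
        "recommendationOptions": [{
            "instanceType": "m5.xlarge",
            "savingsOpportunity": {"estimatedMonthlySavings": {"currency": "USD", "value": 70.0}},
        }],
    }]}
    for row in summarize_recommendations(sample):
        print(row)
```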

Cost Explorer Right-Sizing

# Get right-sizing recommendations
aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{
    "RecommendationTarget": "SAME_INSTANCE_FAMILY",
    "BenefitsConsidered": true
  }'

Custom Analysis Script

import boto3
from datetime import datetime, timedelta, timezone
from typing import List, Dict
import json

cloudwatch = boto3.client('cloudwatch')
ec2 = boto3.client('ec2')

# Instance type hierarchy for downsizing
INSTANCE_FAMILIES = {
    'm6i': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge', '32xlarge'],
    'm5': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge'],
    'c6i': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge', '32xlarge'],
    'r6i': ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge', '16xlarge', '24xlarge', '32xlarge'],
}


def get_instance_metrics(instance_id: str, days: int = 14) -> Dict:
    """Get utilization metrics for an instance."""
    # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)

    metrics = {}

    # CPU Utilization
    cpu_response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,
        Statistics=['Average', 'Maximum', 'Minimum']
    )

    if cpu_response['Datapoints']:
        metrics['cpu'] = {
            'avg': sum(p['Average'] for p in cpu_response['Datapoints']) / len(cpu_response['Datapoints']),
            'max': max(p['Maximum'] for p in cpu_response['Datapoints']),
            'min': min(p['Minimum'] for p in cpu_response['Datapoints'])
        }

    # Memory (if CloudWatch Agent installed)
    try:
        mem_response = cloudwatch.get_metric_statistics(
            Namespace='CWAgent',
            MetricName='mem_used_percent',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            StartTime=start_time,
            EndTime=end_time,
            Period=3600,
            Statistics=['Average', 'Maximum']
        )
        if mem_response['Datapoints']:
            metrics['memory'] = {
                'avg': sum(p['Average'] for p in mem_response['Datapoints']) / len(mem_response['Datapoints']),
                'max': max(p['Maximum'] for p in mem_response['Datapoints'])
            }
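    except Exception as exc:
        # A missing agent yields empty Datapoints rather than an error, so anything
        # caught here is a real API failure (throttling, permissions); report it
        print(f"Memory metrics unavailable for {instance_id}: {exc}")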

    # Network
    for metric_name in ['NetworkIn', 'NetworkOut']:
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/EC2',
            MetricName=metric_name,
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            StartTime=start_time,
            EndTime=end_time,
            Period=3600,
            Statistics=['Average', 'Maximum']
        )
        if response['Datapoints']:
            metrics[metric_name.lower()] = {
                'avg': sum(p['Average'] for p in response['Datapoints']) / len(response['Datapoints']),
                'max': max(p['Maximum'] for p in response['Datapoints'])
            }

    return metrics


def suggest_downsize(current_type: str, metrics: Dict) -> Dict:
    """Suggest a smaller instance type based on metrics."""
    family = current_type.rsplit('.', 1)[0]
    size = current_type.rsplit('.', 1)[1]

    if family not in INSTANCE_FAMILIES:
        return {'recommendation': 'unknown_family', 'suggested': None}

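    sizes = INSTANCE_FAMILIES[family]
    if size not in sizes:
        # e.g. "metal" or any size not listed in INSTANCE_FAMILIES
        return {'recommendation': 'unknown_size', 'suggested': None}

    current_idx = sizes.index(size)
    if current_idx == 0:
        return {'recommendation': 'already_smallest', 'suggested': None}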

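    # Determine how many sizes to downgrade.
    # Missing metrics default to 100, so an instance without memory data
    # (no CloudWatch Agent) is never flagged for downsizing.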
    cpu_avg = metrics.get('cpu', {}).get('avg', 100)
    cpu_max = metrics.get('cpu', {}).get('max', 100)
    mem_avg = metrics.get('memory', {}).get('avg', 100)
    mem_max = metrics.get('memory', {}).get('max', 100)

    # Conservative: only downsize if both CPU and memory are low
    if cpu_avg < 10 and cpu_max < 30 and mem_avg < 30:
        steps = 2  # Aggressive downsize
    elif cpu_avg < 20 and cpu_max < 50 and mem_avg < 50:
        steps = 1  # Moderate downsize
    else:
        return {'recommendation': 'appropriately_sized', 'suggested': None}

    new_idx = max(0, current_idx - steps)
    suggested_type = f"{family}.{sizes[new_idx]}"

    return {
        'recommendation': 'downsize',
        'current': current_type,
        'suggested': suggested_type,
        'reason': (f"CPU avg: {cpu_avg:.1f}%, max: {cpu_max:.1f}%; "
                   f"memory avg: {mem_avg:.1f}%, max: {mem_max:.1f}%")
    }


def analyze_all_instances(tag_filter: Dict = None) -> List[Dict]:
    """Analyze all instances and generate recommendations."""
    filters = []
    if tag_filter:
        for key, value in tag_filter.items():
            filters.append({'Name': f'tag:{key}', 'Values': [value]})

    filters.append({'Name': 'instance-state-name', 'Values': ['running']})

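    # Paginate: describe_instances returns results in pages for larger fleets
    paginator = ec2.get_paginator('describe_instances')
    reservations = [
        r for page in paginator.paginate(Filters=filters)
        for r in page['Reservations']
    ]

    recommendations = []

    for reservation in reservations: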
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            instance_type = instance['InstanceType']

            metrics = get_instance_metrics(instance_id)

            if not metrics.get('cpu'):
                continue

            suggestion = suggest_downsize(instance_type, metrics)

            recommendations.append({
                'instance_id': instance_id,
                'name': next((t['Value'] for t in instance.get('Tags', []) if t['Key'] == 'Name'), 'N/A'),
                'current_type': instance_type,
                'metrics': metrics,
                **suggestion
            })

    return recommendations


def generate_report(recommendations: List[Dict]) -> str:
    """Generate a markdown report of recommendations."""
    downsize = [r for r in recommendations if r.get('recommendation') == 'downsize']
    appropriate = [r for r in recommendations if r.get('recommendation') == 'appropriately_sized']
    smallest = [r for r in recommendations if r.get('recommendation') == 'already_smallest']

    report = f"""# EC2 Right-Sizing Report
Generated: {datetime.utcnow().isoformat()}

## Summary
- **Total instances analyzed:** {len(recommendations)}
- **Downsize candidates:** {len(downsize)}
- **Appropriately sized:** {len(appropriate)}
- **Already smallest:** {len(smallest)}

## Downsize Recommendations

| Instance ID | Name | Current | Suggested | CPU Avg | CPU Max |
|-------------|------|---------|-----------|---------|---------|
"""

    for r in sorted(downsize, key=lambda x: x['metrics']['cpu']['avg']):
        report += f"| {r['instance_id']} | {r['name'][:20]} | {r['current_type']} | {r['suggested']} | {r['metrics']['cpu']['avg']:.1f}% | {r['metrics']['cpu']['max']:.1f}% |\n"

    return report


if __name__ == '__main__':
    recommendations = analyze_all_instances({'Environment': 'production'})
    report = generate_report(recommendations)
    print(report)

    # Save detailed JSON
    with open('rightsizing-report.json', 'w') as f:
        json.dump(recommendations, f, indent=2, default=str)
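
A rough way to sanity-check a suggested size is to project the observed peak onto the smaller instance. The sketch below assumes the workload's absolute CPU demand stays the same and scales linearly with vCPU count, which is a simplification; `projected_cpu`, `fits_after_downsize`, and the 80% ceiling (borrowed from the healthy range listed earlier) are illustrative choices, not part of the script above.

```python
# vCPU counts per size for the m5, m6i, c6i and r6i families used above
VCPUS_BY_SIZE = {
    "large": 2, "xlarge": 4, "2xlarge": 8, "4xlarge": 16, "8xlarge": 32,
    "12xlarge": 48, "16xlarge": 64, "24xlarge": 96, "32xlarge": 128,
}


def projected_cpu(current_size: str, target_size: str, cpu_pct: float) -> float:
    """Estimate CPU % on the target size, assuming the same absolute load."""
    ratio = VCPUS_BY_SIZE[current_size] / VCPUS_BY_SIZE[target_size]
    return cpu_pct * ratio


def fits_after_downsize(current_size: str, target_size: str,
                        cpu_max_pct: float, ceiling_pct: float = 80.0) -> bool:
    """True if the projected peak stays at or below the chosen ceiling."""
    return projected_cpu(current_size, target_size, cpu_max_pct) <= ceiling_pct


if __name__ == "__main__":
    # Two steps down from 4xlarge to xlarge quarters the vCPU count
    print(projected_cpu("4xlarge", "xlarge", 29.0))
    print(fits_after_downsize("4xlarge", "xlarge", 29.0))
    print(fits_after_downsize("2xlarge", "xlarge", 35.0))
```

Under these assumptions, a 29% peak on a 4xlarge passes the two-step rule in suggest_downsize but projects to more than 100% on an xlarge, so it may be worth running a check like this before accepting an aggressive recommendation.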

Automated Right-Sizing Pipeline

Lambda for Weekly Analysis

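import json
import os
from datetime import datetime

import boto3

# Assumption: the analysis script above is bundled with the function as
# rightsizing.py (hypothetical module name)
from rightsizing import analyze_all_instances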

def lambda_handler(event, context):
    """Weekly right-sizing analysis."""
    # Run analysis
    recommendations = analyze_all_instances()

    # Filter actionable recommendations
    actionable = [r for r in recommendations if r.get('recommendation') == 'downsize']

    if not actionable:
        return {'status': 'no_action_needed'}

    # Store report in S3
    s3 = boto3.client('s3')
    report_key = f"rightsizing/{datetime.utcnow().strftime('%Y/%m/%d')}/report.json"
    s3.put_object(
        Bucket=os.environ['REPORTS_BUCKET'],
        Key=report_key,
        Body=json.dumps(recommendations, default=str),
        ContentType='application/json'
    )

    # Send notification
    sns = boto3.client('sns')
    sns.publish(
        TopicArn=os.environ['ALERTS_TOPIC'],
        Subject=f"EC2 Right-Sizing: {len(actionable)} instances to review",
        Message=f"""
Right-sizing analysis complete.

Found {len(actionable)} instances that may be oversized.

Top candidates:
{chr(10).join(f"- {r['instance_id']} ({r['name']}): {r['current_type']} -> {r['suggested']}" for r in actionable[:5])}

Full report: s3://{os.environ['REPORTS_BUCKET']}/{report_key}
        """
    )

    return {
        'analyzed': len(recommendations),
        'actionable': len(actionable),
        'report_location': f"s3://{os.environ['REPORTS_BUCKET']}/{report_key}"
    }
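
Building the notification body in a separate function makes it testable without any AWS calls. The following is a sketch of that idea; `format_candidates` is a hypothetical helper, and it expects the same record keys (`instance_id`, `name`, `current_type`, `suggested`) that the analysis script produces.

```python
from typing import Dict, List


def format_candidates(actionable: List[Dict], limit: int = 5) -> str:
    """Render the top downsize candidates as one line each."""
    lines = [
        f"- {r['instance_id']} ({r['name']}): {r['current_type']} -> {r['suggested']}"
        for r in actionable[:limit]
    ]
    remaining = len(actionable) - limit
    if remaining > 0:
        # Tell the reader the list was truncated
        lines.append(f"... and {remaining} more")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written sample records for illustration
    sample = [
        {"instance_id": "i-0aaa", "name": "api-1", "current_type": "m5.2xlarge", "suggested": "m5.xlarge"},
        {"instance_id": "i-0bbb", "name": "worker-3", "current_type": "c6i.4xlarge", "suggested": "c6i.xlarge"},
    ]
    print(format_candidates(sample, limit=1))
```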

Safe Resize Procedure

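#!/bin/bash
# safe-resize.sh - Resize an instance with safeguards
# Note: stop/start causes downtime, and instance store data is lost on stop.
# Abort on the first failed command so a failed backup never leads to a stop.
set -euo pipefail

INSTANCE_ID=${1:-}
NEW_TYPE=${2:-}
DRY_RUN=${3:-true}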

if [ -z "$INSTANCE_ID" ] || [ -z "$NEW_TYPE" ]; then
    echo "Usage: $0 <instance-id> <new-type> [dry-run=true]"
    exit 1
fi

# Get current state
CURRENT_TYPE=$(aws ec2 describe-instances \
    --instance-ids $INSTANCE_ID \
    --query 'Reservations[0].Instances[0].InstanceType' \
    --output text)

CURRENT_STATE=$(aws ec2 describe-instances \
    --instance-ids $INSTANCE_ID \
    --query 'Reservations[0].Instances[0].State.Name' \
    --output text)

echo "Instance: $INSTANCE_ID"
echo "Current type: $CURRENT_TYPE"
echo "New type: $NEW_TYPE"
echo "Current state: $CURRENT_STATE"

if [ "$DRY_RUN" == "true" ]; then
    echo ""
    echo "DRY RUN - No changes made"
    echo "Run with 'false' as third argument to execute"
    exit 0
fi

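# Create AMI backup first (--no-reboot avoids downtime, but the image is
# only crash-consistent because the file system is not quiesced)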
echo "Creating backup AMI..."
AMI_ID=$(aws ec2 create-image \
    --instance-id $INSTANCE_ID \
    --name "pre-resize-$INSTANCE_ID-$(date +%Y%m%d%H%M)" \
    --no-reboot \
    --query 'ImageId' \
    --output text)

echo "Backup AMI: $AMI_ID"

# Wait for AMI
echo "Waiting for AMI to be available..."
aws ec2 wait image-available --image-ids $AMI_ID

# Stop instance
echo "Stopping instance..."
aws ec2 stop-instances --instance-ids $INSTANCE_ID
aws ec2 wait instance-stopped --instance-ids $INSTANCE_ID

# Modify instance type
echo "Modifying instance type..."
aws ec2 modify-instance-attribute \
    --instance-id $INSTANCE_ID \
    --instance-type "{\"Value\": \"$NEW_TYPE\"}"

# Start instance
echo "Starting instance..."
aws ec2 start-instances --instance-ids $INSTANCE_ID
aws ec2 wait instance-running --instance-ids $INSTANCE_ID

# Verify
NEW_CURRENT_TYPE=$(aws ec2 describe-instances \
    --instance-ids $INSTANCE_ID \
    --query 'Reservations[0].Instances[0].InstanceType' \
    --output text)

echo ""
echo "Resize complete!"
echo "Old type: $CURRENT_TYPE"
echo "New type: $NEW_CURRENT_TYPE"
echo "Backup AMI: $AMI_ID (delete after verification)"
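
Some pre-flight checks are easier to express and test outside the shell script. The sketch below shows one possible set of checks; `preflight_issues` and the specific rules are assumptions for illustration, and the inputs correspond to the values the script reads into CURRENT_TYPE, NEW_TYPE, and CURRENT_STATE.

```python
from typing import List


def preflight_issues(current_type: str, new_type: str, state: str) -> List[str]:
    """Return reasons not to proceed with a resize; an empty list means go ahead."""
    issues = []
    if new_type == current_type:
        issues.append("instance is already the requested type")
    if state not in ("running", "stopped"):
        issues.append(f"instance state '{state}' is not safe to resize from")
    if current_type.split(".")[0] != new_type.split(".")[0]:
        # A family change can mean a different architecture or driver set
        issues.append("instance family changes; check architecture, drivers, and AMI compatibility first")
    return issues


if __name__ == "__main__":
    for problem in preflight_issues("m5.2xlarge", "m6g.xlarge", "stopping"):
        print(problem)
```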

Graviton Migration

Why Graviton

Benefits:
- Up to 20% lower cost than comparable x86 instances
- Up to 40% better price/performance
- Same or better performance for most workloads

Compatible workloads:
- Web servers
- Containerized apps
- Java applications
- Python applications
- Most Linux workloads

Requires testing or is unsupported:
- Applications with x86 assembly or x86-only native dependencies
- Windows workloads (not supported on Graviton)
- License-locked software

Graviton Instance Types

graviton_equivalents:
  # x86 type: Graviton equivalent
  m5.large: m6g.large
  m5.xlarge: m6g.xlarge
  c5.large: c6g.large
  c5.xlarge: c6g.xlarge
  r5.large: r6g.large
  r5.xlarge: r6g.xlarge
  t3.micro: t4g.micro
  t3.small: t4g.small
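
The table follows a family-to-family pattern, so the lookup can be sketched as a small function. Generalizing from the listed pairs to whole families is an assumption: not every size exists in every Graviton family, so the result should be verified against the instance types available in the target region. `graviton_equivalent` is a hypothetical helper name.

```python
from typing import Optional

# Family pairs taken from the table above
GRAVITON_FAMILIES = {"m5": "m6g", "c5": "c6g", "r5": "r6g", "t3": "t4g"}


def graviton_equivalent(instance_type: str) -> Optional[str]:
    """Map an x86 instance type to the same size in the paired Graviton family.

    Returns None for families without a listed pair. The returned type is a
    candidate only; it is not checked for availability.
    """
    family, _, size = instance_type.partition(".")
    target = GRAVITON_FAMILIES.get(family)
    if target is None or not size:
        return None
    return f"{target}.{size}"


if __name__ == "__main__":
    print(graviton_equivalent("m5.large"))
    print(graviton_equivalent("t3.micro"))
    print(graviton_equivalent("m6i.large"))
```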

Terraform for Graviton ASG

resource "aws_launch_template" "graviton" {
  name_prefix   = "app-graviton-"
  image_id      = data.aws_ami.amazon_linux_arm64.id
  instance_type = "m6g.large"

  # ... rest of configuration
}

data "aws_ami" "amazon_linux_arm64" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-arm64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

Key Takeaways

  1. Measure first: install CloudWatch Agent for memory metrics
  2. Use AWS tools: Compute Optimizer's default recommendations are free, and Cost Explorer is free in the console (API calls are billed per request)
  3. Automate analysis: weekly reports catch drift
  4. Downsize conservatively: leave headroom for spikes
  5. Consider Graviton: up to 20% savings with a rebuild

“The best instance size is one that runs at 60-70% utilization. Higher risks performance, lower wastes money.”