AWS Troubleshooting Tips: Common Issues Every DevOps Engineer Must Know

Introduction 

In this post, we’ll explore common AWS issues, their likely root causes, and practical troubleshooting tips to help you resolve problems quickly. Whether you’re dealing with EC2 instance errors, S3 bucket access problems, or Lambda function failures, these tips can help keep your cloud environment running smoothly and efficiently.

10 Common Troubleshooting Scenarios and How to Fix Them

1. EC2 Instance Not Starting
Error: Instance stuck in pending or stopped state.
Problem: EC2 instance fails to boot or restart.
Possible Cause: Insufficient instance quota, corrupted AMI, or invalid networking setup.
Solution:

  • Check EC2 service quotas in the AWS console.
  • Confirm the AMI is in the available state and supports the chosen instance type.
  • Verify subnet, VPC, and security group settings.
  • Review the instance’s system log for kernel or boot errors.
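
When an instance ends up stopped, EC2 usually records a state reason alongside the state itself. The sketch below pulls both out of a dictionary shaped like the response of boto3's `describe_instances` call. The sample data and instance IDs are made up for illustration, and the function name is an assumption, not part of any AWS SDK.

```python
def summarize_instance_states(response):
    """Return (instance_id, state, reason) tuples from a describe_instances-shaped dict."""
    rows = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name", "unknown")
            # StateReason is only present when EC2 recorded why the state changed.
            reason = instance.get("StateReason", {}).get("Message", "")
            rows.append((instance["InstanceId"], state, reason))
    return rows


# Illustrative sample data (not real output from an AWS account).
sample_response = {
    "Reservations": [{
        "Instances": [
            {"InstanceId": "i-0aaa", "State": {"Name": "running"}},
            {"InstanceId": "i-0bbb", "State": {"Name": "stopped"},
             "StateReason": {"Code": "Client.InstanceInitiatedShutdown",
                             "Message": "Instance initiated shutdown"}},
        ]
    }]
}

for row in summarize_instance_states(sample_response):
    print(row)
```

An empty reason on a stopped instance suggests looking at the system log next, as in the last bullet above.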

2. SSH Connection Failure to EC2
Error: “Connection timed out” or “Permission denied (publickey)”.
Problem: Unable to connect via SSH.
Possible Cause: Port 22 blocked, wrong key pair or login user, incorrect public IP, or private subnet restriction.
Solution:

  • Allow inbound TCP port 22 in the security group, ideally only from your own IP range.
  • Use the correct .pem key file with permissions set via chmod 400, and the AMI’s default login user (for example, ec2-user or ubuntu).
  • Confirm the Elastic IP or public IP is correct.
  • Use a bastion host or VPN for instances in private subnets.
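
A "Connection timed out" error often comes down to whether any inbound rule actually covers your source address. As a rough sketch, the function below checks a dictionary shaped like one entry of boto3's `describe_security_groups` response. It only looks at IPv4 `IpRanges`; IPv6 ranges, prefix lists, and rules that reference other security groups are ignored here, and the function name is hypothetical.

```python
import ipaddress


def allows_ssh_from(security_group, source_ip, port=22):
    """Return True if an inbound IPv4 rule permits TCP `port` from `source_ip`."""
    source = ipaddress.ip_address(source_ip)
    for rule in security_group.get("IpPermissions", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            port_ok = True
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            port_ok = False
        if not port_ok:
            continue
        for ip_range in rule.get("IpRanges", []):
            if source in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False


# Illustrative security group using documentation-only address ranges.
example_group = {
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
    ]
}

print(allows_ssh_from(example_group, "203.0.113.7"))
print(allows_ssh_from(example_group, "198.51.100.1"))
```

If the check passes but the connection still times out, the route table, NACLs, or the instance's public IP are the next places to look.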

3. S3 Access Denied
Error: “AccessDenied” when accessing S3 bucket/object.
Problem: Cannot upload, download, or list S3 objects.
Possible Cause: IAM role lacks permissions, bucket policy restrictions, or Block Public Access enabled.
Solution:

  • Add s3:GetObject and s3:PutObject (on the objects) and s3:ListBucket (on the bucket itself) to the IAM policy.
  • Update the bucket policy for the required access.
  • Check the cross-account access setup.
  • Block Public Access only restricts public ACLs and public bucket policies, so relax it only if the bucket genuinely needs to be public.
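
A common cause of AccessDenied is granting object actions on the bucket ARN, or the list action on the object ARN. The sketch below builds a minimal identity policy that keeps the two apart. The bucket name `example-bucket` and the function name are placeholders chosen for illustration.

```python
import json


def build_s3_rw_policy(bucket_name):
    """Build a minimal IAM policy allowing list, read, and write on one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Get/Put are object-level actions, so they target keys under the bucket.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


print(json.dumps(build_s3_rw_policy("example-bucket"), indent=2))
```

The printed JSON can be pasted into an IAM policy for review; a bucket policy or SCP can still deny the request even when this identity policy allows it.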

4. IAM Permission Issues
Error: “User is not authorized to perform this action.”
Problem: AWS resources not accessible.
Possible Cause: Incorrect IAM policy, missing role trust policy, or SCP restrictions from AWS Organizations.
Solution:

  • Test access in the IAM Policy Simulator.
  • Attach the correct IAM role/policy.
  • Update trust relationships for assumed roles.
  • Verify that SCPs don’t block the request.
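
To see why an SCP can block a request that the IAM policy allows, it can help to model the decision. The sketch below is a deliberately simplified model, not the real IAM evaluation engine: it matches only on `Action`, assumes `Statement` is a list, and ignores `Resource`, `Condition`, `NotAction`, permission boundaries, and resource-based policies. All function names are made up for this example.

```python
from fnmatch import fnmatchcase


def _matches(statement, action):
    """True if the statement's Action patterns cover `action` (case-insensitive)."""
    patterns = statement["Action"]
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def _decision(policy, action):
    """Return 'Deny', 'Allow', or None (no matching statement) for one policy."""
    result = None
    for statement in policy.get("Statement", []):
        if _matches(statement, action):
            if statement["Effect"] == "Deny":
                return "Deny"  # an explicit deny always wins
            result = "Allow"
    return result


def is_authorized(action, identity_policy, scp=None):
    """Simplified rule: no explicit deny, and an allow from every policy layer."""
    decisions = [_decision(identity_policy, action)]
    if scp is not None:
        decisions.append(_decision(scp, action))
    if "Deny" in decisions:
        return False
    return all(d == "Allow" for d in decisions)


identity = {"Statement": [{"Effect": "Allow", "Action": "s3:*"}]}
scp = {"Statement": [
    {"Effect": "Allow", "Action": "*"},
    {"Effect": "Deny", "Action": ["s3:DeleteBucket"]},
]}

for action in ("s3:GetObject", "s3:DeleteBucket", "ec2:RunInstances"):
    print(action, is_authorized(action, identity, scp))
```

In this model, `s3:DeleteBucket` fails because of the explicit deny in the SCP, while `ec2:RunInstances` fails because the identity policy never allows it. The IAM Policy Simulator remains the authoritative check.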

5. High AWS Billing
Error: Unexpectedly high bill in AWS console.
Problem: Running costs exceed budget.
Possible Cause: Unused EC2/RDS instances, orphaned EBS volumes, or high data transfer usage.
Solution:

  • Analyze costs with AWS Cost Explorer, grouped by service and usage type.
  • Stop or delete unused resources; note that a stopped EC2 instance still incurs charges for its attached EBS volumes.
  • Set up budgets and alerts.
  • Use Trusted Advisor for optimization suggestions.
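
Orphaned EBS volumes are easy to miss because they keep billing after the instance is gone. As a minimal sketch, the function below filters a dictionary shaped like boto3's `describe_volumes` response for volumes in the `available` state with no attachments. The volume IDs and sizes are invented sample data.

```python
def find_unattached_volumes(response):
    """Return (volume_id, size_gib) for volumes that are not attached to anything."""
    unattached = []
    for volume in response.get("Volumes", []):
        # "available" is the EBS state for a volume that exists but is not attached.
        if volume.get("State") == "available" and not volume.get("Attachments"):
            unattached.append((volume["VolumeId"], volume.get("Size", 0)))
    return unattached


# Illustrative sample data (not real output from an AWS account).
sample_volumes = {
    "Volumes": [
        {"VolumeId": "vol-0aaa", "State": "in-use", "Size": 100,
         "Attachments": [{"InstanceId": "i-0aaa"}]},
        {"VolumeId": "vol-0bbb", "State": "available", "Size": 500,
         "Attachments": []},
    ]
}

orphans = find_unattached_volumes(sample_volumes)
total_gib = sum(size for _, size in orphans)
print(f"{len(orphans)} unattached volume(s), {total_gib} GiB total")
```

Review each candidate before deleting it; taking a snapshot first is a cheap safety net.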

6. Lambda Function Timeout
Error: “Task timed out after X seconds”.
Problem: Function is terminated before execution completes.
Possible Cause: Timeout set too low (the default is 3 seconds), inefficient code, or latency in a dependent service.
Solution:

  • Increase the timeout in the Lambda configuration (the maximum is 900 seconds).
  • Optimize function code (remove unnecessary loops).
  • Check DynamoDB, API, or S3 latency.
  • Consider asynchronous invocation so callers do not wait on long-running work; the function’s own timeout still applies.
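
Besides raising the timeout, a handler can watch its own remaining time and stop cleanly instead of being killed mid-task. The sketch below uses `get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object. The `process` function, the event shape with an `items` list, and the 2-second safety margin are all assumptions made for this example.

```python
SAFETY_MARGIN_MS = 2000  # assumed buffer; tune to how long one item takes


def process(item):
    """Placeholder for the real per-item work."""
    return item * 2


def handler(event, context):
    """Process items until the remaining execution time drops below the margin."""
    items = event.get("items", [])
    results = []
    for index, item in enumerate(items):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Stop early and report what is left so the caller can retry it.
            return {"results": results, "unprocessed": items[index:]}
        results.append(process(item))
    return {"results": results, "unprocessed": []}
```

Returning the unprocessed items lets the caller re-invoke the function with the remainder rather than losing the whole batch to a timeout.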

7. RDS Database Connection Failure
Error: “Could not connect to server: Connection timed out”.
Problem: Application cannot connect to RDS.
Possible Cause: Wrong DB endpoint/port, security group rules blocking traffic, or private subnet restriction.
Solution:

  • Use the correct RDS endpoint and port.
  • Update the DB security group’s inbound rules to allow the application’s security group or IP range on the DB port.
  • Ensure the DB is reachable from the application’s VPC.
  • Enable Enhanced Monitoring for deeper insights.
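
Before digging into database settings, a plain TCP check from the application host can tell you what kind of failure you have. In this sketch, a timeout usually points to traffic being silently dropped (security group, NACL, or routing), while a refusal means the host was reached but nothing is listening on that port. The function name is hypothetical and only the standard library is used.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Try a TCP connection and describe the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are most likely being dropped on the way.
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except OSError as exc:
        # Covers DNS resolution failures and other network errors.
        return f"error: {exc}"
```

For example, calling `check_tcp` with your RDS endpoint and 5432 (PostgreSQL) or 3306 (MySQL) from the application host shows whether the network path is open before any credentials are involved.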

8. VPC Internet Access Issues
Error: Instances cannot access the internet.
Problem: Outbound connections fail.
Possible Cause: Missing Internet Gateway, wrong route table, or NACLs blocking traffic.
Solution:

  • Attach an Internet Gateway to the VPC.
  • Add a 0.0.0.0/0 route to the Internet Gateway in the public subnet’s route table.
  • Use a NAT Gateway for private subnets.
  • Verify NACL and security group rules.
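
The route table check above can be sketched as a small classifier over a dictionary shaped like one route table from boto3's `describe_route_tables` response. It only inspects the IPv4 default route, and the short return labels are invented for this example.

```python
def classify_default_route(route_table):
    """Classify where the 0.0.0.0/0 route of a route table points."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") != "active":
            # A blackhole route means the target was deleted or detached.
            return "blackhole"
        if (route.get("GatewayId") or "").startswith("igw-"):
            return "igw"  # public subnet: traffic goes to an Internet Gateway
        if route.get("NatGatewayId"):
            return "nat"  # private subnet: traffic goes out through a NAT Gateway
        return "other"
    return "none"  # no default route, so no path to the internet


# Illustrative route tables with made-up IDs.
public_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc", "State": "active"},
]}
isolated_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]}

print(classify_default_route(public_table))
print(classify_default_route(isolated_table))
```

A result of "igw" is not sufficient on its own: the instance also needs a public IP, and the NACL and security group rules must allow the traffic.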

9. CloudWatch Logs Missing
Error: Logs not visible in CloudWatch.
Problem: No application logs are showing.
Possible Cause: IAM role missing permissions, agent misconfigured, or wrong log group name.
Solution:

  • Add logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents to the IAM role.
  • Restart the CloudWatch agent on the instance.
  • Verify log group/stream names.
  • Check that the region you are viewing matches the region the logs are sent to.

10. Auto Scaling Not Working
Error: Auto Scaling Group does not launch or terminate instances.
Problem: Scaling policies not triggering.
Possible Cause: Misconfigured CloudWatch alarms, invalid launch template, or service limits.
Solution:

  • Review scaling policies and their CloudWatch alarms.
  • Validate the Launch Template/Configuration.
  • Check EC2 limits/quotas in the region.
  • Ensure health checks (ELB/EC2) are configured correctly.
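
The Auto Scaling group's activity history usually states why a launch failed. As a rough sketch, the function below filters a dictionary shaped like boto3's `describe_scaling_activities` response for failed activities. The sample entries, including the status message text, are invented for illustration and are not real AWS output.

```python
def failed_activities(response):
    """Return (description, detail) for each failed scaling activity."""
    failures = []
    for activity in response.get("Activities", []):
        if activity.get("StatusCode") == "Failed":
            # StatusMessage normally carries the specific failure reason.
            detail = activity.get("StatusMessage") or activity.get("Cause", "")
            failures.append((activity.get("Description", ""), detail))
    return failures


# Illustrative sample data with made-up messages.
sample_activities = {
    "Activities": [
        {"Description": "Launching a new EC2 instance", "StatusCode": "Failed",
         "StatusMessage": "example: instance quota exceeded"},
        {"Description": "Launching a new EC2 instance", "StatusCode": "Successful"},
    ]
}

for description, detail in failed_activities(sample_activities):
    print(f"{description}: {detail}")
```

If there are no failed activities at all, the group was probably never asked to scale, which points back to the alarms and scaling policies.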


Fri Sep 19, 2025

About the Author

"DevOps is the union of people, processes, and products to enable continuous delivery of value to our end users." - Donovan Brown

Ayushman Sen is a DevOps Engineer at CloudDevOpsHub with a passion for cloud technologies and automation. He enjoys writing blogs to share his DevOps knowledge and insights with the community. A true DevOps enthusiast, Ayushman is also passionate about traveling, listening to music, and playing musical instruments.
