Most AWS bills have 30-50% fat. Not because teams are careless, but because cloud costs grow incrementally — a new service here, a size-up there, a forgotten dev environment — and nobody ever does the full audit.
This is the checklist we work through on infrastructure engagements. It's ordered by impact: items near the top typically yield more savings than items near the bottom.
Before you start
Pull these three things from your AWS account:
- Cost Explorer, grouped by service, last 90 days.
- Cost and Usage Reports (CUR) into Athena or QuickSight — more granular than Cost Explorer.
- Trusted Advisor (if you have Business/Enterprise Support) for baseline rightsizing recommendations.
Now work through the items below in order.
1. EC2 rightsizing (usually the biggest ROI)
Pull CloudWatch metrics for every EC2 instance for the last 30 days. Look at:
- CPU utilization p95 — if consistently under 40%, you're oversized
- Memory utilization p95 (requires CloudWatch agent) — same threshold
- Network throughput — if under 10% of the instance's max, networking isn't the constraint
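A minimal sketch of that metrics pull with boto3, assuming credentials and a region are already configured; the 40% threshold mirrors the rule of thumb above:

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

# Walk every running instance and pull hourly p95 CPU for the last 30 days.
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            stats = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=3600,                 # hourly datapoints
                ExtendedStatistics=["p95"],  # percentiles, not just Average
            )
            p95s = [dp["ExtendedStatistics"]["p95"] for dp in stats["Datapoints"]]
            # "Consistently under 40%" means even the worst hour stayed below it.
            if p95s and max(p95s) < 40:
                print(f"{instance['InstanceId']} ({instance['InstanceType']}): "
                      f"p95 CPU peaked at {max(p95s):.1f}%, rightsizing candidate")
```

Memory is checked the same way against the CloudWatch agent's namespace (typically `CWAgent`).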
The 3-step rightsizing process
1. Downsize where possible. If an `m5.2xlarge` averages 15% CPU and 30% memory, drop to `m5.xlarge`. If it still averages 30% CPU, drop to `m5.large`. Be aggressive. It's easier to size back up than to notice over-provisioning.
2. Switch generations. Graviton-based instances (`m6g`, `m7g`, `c7g`) are typically 20% cheaper than x86 equivalents with similar or better performance. If your code runs on Linux ARM64, migrate. Most modern stacks (Java, Python, Go, Node) work on Graviton with zero code changes.
3. Consider Spot for stateless workloads. CI/CD runners, batch jobs, and stateless API servers with proper graceful-shutdown handling can run on Spot for a 70-90% discount.
Also audit your Auto Scaling groups. Many ASGs are configured to scale up aggressively and down conservatively (or never). Check your scale-in policies: often the simplest cost win is a properly tuned target-tracking scaling policy.
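As a sketch, a target-tracking policy that explicitly allows scale-in might look like this; the ASG name and 50% CPU target are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking holds average CPU near the target and, crucially,
# scales in as well as out. DisableScaleIn=False is the default,
# but it's worth being explicit: scale-in is where the savings are.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-api-asg",  # hypothetical ASG name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,      # illustrative; tune per workload
        "DisableScaleIn": False,  # keep scale-in enabled
    },
)
```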
2. Savings Plans and Reserved Instances
If you have stable, predictable compute usage and you're paying on-demand, you're leaving 40-60% on the table.
The commitment ladder
- Compute Savings Plans (1-year, no upfront) — most flexible, ~30% discount. Safe for most workloads.
- Compute Savings Plans (3-year, all upfront) — biggest discount (~55%), commit only what you're certain will persist.
- EC2 Instance Savings Plans — lock to specific instance family, larger discount than Compute Savings Plans but less flexible.
- Reserved Instances — most rigid, with discounts comparable to EC2 Instance Savings Plans. For compute, mostly superseded by Savings Plans.
The strategy
- Start with Compute Savings Plans (1-year, no upfront) covering 60-70% of your baseline compute.
- Leave the top 30-40% on-demand to handle variance.
- Review coverage and utilization monthly. Adjust next purchase accordingly.
- Don't commit more than 80% of predicted usage — waste is worse than no discount.
Use AWS Cost Explorer's Savings Plans recommendations as a starting point, but verify — recommendations can over-commit.
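Coverage and utilization can be pulled programmatically for the monthly review. A sketch, assuming the Cost Explorer API is enabled for the account (note that Cost Explorer returns amounts as strings):

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today()
start = end - timedelta(days=90)

# Coverage = share of eligible compute spend that ran on a Savings Plan
# rather than at on-demand rates.
resp = ce.get_savings_plans_coverage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
)
for period in resp["SavingsPlansCoverages"]:
    coverage = period["Coverage"]
    print(period["TimePeriod"]["Start"],
          f"coverage: {coverage['CoveragePercentage']}%",
          f"on-demand spend: ${coverage['OnDemandCost']}")
```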
3. Forgotten resources
The easiest wins are things nobody's using but still paying for.
Run through this list:
- Unattached EBS volumes. Filter EBS volumes by state `available`. Delete after snapshotting if you're paranoid.
- Old EBS snapshots. Anything over 90 days old is probably forgotten. Keep your backup retention policy in mind.
- Idle RDS instances. CPU < 5% for 30 days. Probably a dev/test leftover. Stop or snapshot-and-delete.
- Idle load balancers. ELBs with zero traffic for 30+ days. ~$20-40/month each.
- Unused Elastic IPs. Unassociated EIPs cost $3.60/month each. Usually cheap individually, but we've seen accounts with 200+.
- Old AMIs and ECR images. Set lifecycle policies (keep last 10 tagged, delete untagged > 30 days).
- CloudWatch log groups without retention. Default is infinite retention. Set to 30-90 days unless you have compliance reasons otherwise.
Write a weekly Lambda that reports forgotten resources to Slack. Makes cleanup a habit, not a project.
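A minimal sketch of that Lambda, covering just the unattached-volume check; `SLACK_WEBHOOK_URL` is a hypothetical environment variable, and the other checks above would slot into the same sweep:

```python
import json
import os
import urllib.request

import boto3

def handler(event, context):
    """Weekly sweep for unattached EBS volumes; posts findings to Slack."""
    ec2 = boto3.client("ec2")
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    if not volumes:
        return {"orphaned_volumes": 0}

    lines = [
        f"{v['VolumeId']} ({v['Size']} GiB, created {v['CreateTime']:%Y-%m-%d})"
        for v in volumes
    ]
    payload = {"text": "Unattached EBS volumes:\n" + "\n".join(lines)}

    # SLACK_WEBHOOK_URL points at a Slack incoming webhook (hypothetical).
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
    return {"orphaned_volumes": len(volumes)}
```

Schedule it with a weekly EventBridge rule so the report shows up without anyone remembering to run it.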
4. S3 storage classes
S3 is one of the most-wasted services. Data lands in Standard and stays forever at full price.
Lifecycle rules to set today
- Move logs and historical exports to `STANDARD_IA` after 30 days.
- Move archival data to `GLACIER_IR` or `DEEP_ARCHIVE` based on retrieval needs.
- Expire temporary artifacts aggressively (7-30 days).
- Keep explicit exceptions for legal/compliance retention sets.
Start with one bucket category at a time (app logs, data lake raw, backups) and validate retrieval behavior before broad rollout.
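A sketch of a lifecycle configuration implementing the first two rules for a hypothetical logs bucket, plus an expiration backstop:

```python
import boto3

s3 = boto3.client("s3")

# Tier app logs to STANDARD_IA at 30 days and GLACIER_IR at 90,
# then expire them at 365. Bucket name and prefix are hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-app-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-app-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```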
5. RDS and Aurora efficiency
Databases are usually the second-largest line item after compute.
Audit:
- CPU, memory, and IOPS utilization over 30 days.
- replica utilization and lag behavior.
- storage allocation vs actual data footprint.
- expensive long-running queries causing oversized instances.
Common wins:
- rightsize writer/reader nodes independently.
- remove stale replicas created for one-off events.
- tune autovacuum and indexing before scaling hardware.
- use Aurora Serverless v2 only where burst patterns justify it.
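For the storage-allocation audit, a sketch that compares what's allocated against worst-case free space over 30 days; it assumes standard RDS instances (Aurora reports storage differently):

```python
import boto3
from datetime import datetime, timedelta, timezone

rds = boto3.client("rds")
cw = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

for db in rds.describe_db_instances()["DBInstances"]:
    allocated_gib = db["AllocatedStorage"]
    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",  # reported in bytes
        Dimensions=[{"Name": "DBInstanceIdentifier",
                     "Value": db["DBInstanceIdentifier"]}],
        StartTime=start,
        EndTime=end,
        Period=86400,            # daily datapoints
        Statistics=["Minimum"],  # worst case over the window
    )
    if not stats["Datapoints"]:
        continue  # Aurora and some configs don't emit this metric
    min_free_gib = min(dp["Minimum"] for dp in stats["Datapoints"]) / 1024**3
    if min_free_gib / allocated_gib > 0.5:
        print(f"{db['DBInstanceIdentifier']}: {allocated_gib} GiB allocated, "
              f"peak usage only {allocated_gib - min_free_gib:.0f} GiB")
```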
6. EKS/Kubernetes cost controls
Kubernetes cost sprawl is easy because waste hides at multiple layers.
Checklist:
- enforce CPU/memory requests and limits on all workloads.
- remove zombie namespaces and stale preview environments.
- run cluster autoscaler with sane scale-down policies.
- adopt Karpenter or equivalent for better bin-packing.
- use spot node groups for resilient stateless workloads.
If your pods have no requests, you are flying blind on both reliability and cost.
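A sketch of the requests-and-limits check using the official Kubernetes Python client, assuming an existing kubeconfig:

```python
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig credentials
v1 = client.CoreV1Api()

# Containers without CPU/memory requests are invisible to the scheduler's
# bin-packing and to per-workload cost attribution. Flag them all.
for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        resources = container.resources
        requests = (resources.requests or {}) if resources else {}
        if "cpu" not in requests or "memory" not in requests:
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                  f"container '{container.name}' has no resource requests")
```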
7. NAT gateway and data transfer traps
Many teams underestimate transfer and egress.
Review:
- NAT gateway cost by AZ and service path.
- cross-AZ traffic patterns from chatty services.
- unnecessary internet egress for internal service calls.
- CloudFront opportunities for cacheable content/API responses.
Architectural fixes here often produce disproportionately high savings.
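Sizing the problem comes first. A sketch that pulls NAT gateway line items out of Cost Explorer, relying on the convention that their usage types contain "NatGateway" (e.g. `USE1-NatGateway-Bytes`):

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today()
start = end - timedelta(days=30)

# Group last month's spend by usage type, then keep NAT gateway line items.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)
for result in resp["ResultsByTime"]:
    for group in result["Groups"]:
        usage_type = group["Keys"][0]
        if "NatGateway" in usage_type:
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            print(f"{usage_type}: ${cost:,.2f}")
```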
8. Logging and observability spend
CloudWatch and third-party observability bills grow silently.
Do this:
- set log retention by class (dev, staging, prod).
- sample high-volume logs once debugging is complete.
- avoid shipping duplicate logs to multiple sinks.
- reserve high-cardinality metrics for truly actionable use cases.
You need visibility. You do not need infinite raw telemetry.
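A sketch that enforces retention account-wide on log groups still carrying the never-expire default; the 90-day cap is illustrative and should vary by environment class:

```python
import boto3

logs = boto3.client("logs")

# Log groups with no retentionInDays key keep data forever. Cap them.
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=90,  # illustrative; use 30 for dev
            )
            print(f"set 90-day retention on {group['logGroupName']}")
```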
9. Environment governance
Cost waste is frequently organizational, not technical.
Set baseline governance:
- required tags: `owner`, `team`, `env`, `cost_center`, `expiry_date`.
- automatic shutdown policies for non-prod during off-hours.
- monthly review per team with top spend deltas.
- explicit owner for every major service line item.
No owner means no optimization.
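A sketch of a tag-compliance sweep with the Resource Groups Tagging API, using the required-tag set above:

```python
import boto3

REQUIRED_TAGS = {"owner", "team", "env", "cost_center", "expiry_date"}

tagging = boto3.client("resourcegroupstaggingapi")

# Walk every taggable resource in the region and report missing tags.
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate():
    for resource in page["ResourceTagMappingList"]:
        present = {tag["Key"] for tag in resource.get("Tags", [])}
        missing = REQUIRED_TAGS - present
        if missing:
            print(f"{resource['ResourceARN']}: missing {sorted(missing)}")
```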
10. Commitment strategy review cadence
Savings Plans are not "set and forget."
Run a monthly FinOps review:
- commitment coverage (% of baseline on discounted pricing).
- commitment utilization (how much committed spend is actually used).
- instance family drift and seasonal trends.
- upcoming workload changes that alter commitment assumptions.
Bad commitments create hidden waste just as surely as on-demand usage.
11. Cost anomaly detection and alerts
You need early warning before month-end surprises.
Minimum setup:
- daily spend anomaly alerts by service and account.
- budget thresholds (50%, 80%, 100%) routed to owners.
- sudden egress or NAT spikes flagged separately.
- deployment-to-cost correlation in change logs.
Fast feedback loops are what make cost discipline sustainable.
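A sketch of the budget thresholds as code; the $10,000 monthly limit and the owner address are hypothetical:

```python
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# One monthly cost budget with actual-spend alerts at 50/80/100%.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "monthly-aws-spend",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,  # percent of the limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
        for threshold in (50.0, 80.0, 100.0)
    ],
)
```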
12. Build a 90-day optimization roadmap
Treat this as a delivery program, not a one-time cleanup.
Break actions into:
- Week 1-2: quick cleanup and rightsizing.
- Week 3-6: deeper service-level tuning.
- Week 7-12: architectural changes with larger ROI.
Each item should have:
- owner
- expected savings range
- implementation risk
- verification method
A practical prioritization formula
Use this to rank actions:
Priority score = (estimated monthly savings x confidence) / implementation effort
This prevents teams from spending weeks on low-impact optimizations while large obvious savings remain untouched.
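The same formula as a small helper, with hypothetical numbers; note how a one-day cleanup can outrank a much larger migration:

```python
def priority_score(monthly_savings: float, confidence: float, effort_days: float) -> float:
    """Savings weighted by confidence, per unit of implementation effort.

    confidence is 0.0-1.0; effort can be any consistent unit (days here).
    """
    return (monthly_savings * confidence) / effort_days

# Hypothetical backlog, ranked highest-priority first.
actions = [
    ("delete unattached EBS volumes", priority_score(800, 0.95, 1)),
    ("Graviton migration for API fleet", priority_score(4000, 0.70, 15)),
    ("S3 lifecycle rules on log buckets", priority_score(1500, 0.90, 2)),
]
for name, score in sorted(actions, key=lambda a: a[1], reverse=True):
    print(f"{score:7.1f}  {name}")
```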
Closing
Most AWS cost reduction comes from disciplined execution of boring fundamentals. Do the high-leverage checks consistently, automate the recurring controls, and reserve architecture changes for where they create real ROI.