Infrastructure

Optimizing Cloud Infrastructure Costs

January 21, 2021

Practical strategies for reducing AWS bills at scale — instance generation upgrades, spot instances, storage tiering, and Lambda cold start optimization.

Vedantu is one of India’s largest live learning platforms. Whatever we build creates a tremendous impact on students. To deliver this impact, we consistently improve our services.

With the surge in scale and the growing number of live sessions on our platform, we realised it was time to revisit the services used by the team. Our team is responsible for the ins and outs of WAVE, the platform that powers our live classes. We primarily use AWS to power our infrastructure, so we took on the task of reducing the AWS bill.

EC2

Price Comparison

[Image: price-comparison]

Performance Comparison

[Image: performance]

[Images: aws-bills, hls, aws-bill-1]

S3

⚠️ Before moving to Glacier, remember this: for each object archived to Amazon Glacier, Amazon S3 uses 8 KB of storage for the name of the object and other metadata. Amazon S3 stores this metadata so that you can get a real-time list of your archived objects by using the Amazon S3 API. You are charged standard Amazon S3 rates for this additional storage.

For each archived object, Amazon Glacier also adds 32 KB of storage for index and related metadata. This extra data is necessary to identify and restore your object. Hence, if we store objects smaller than 32 KB, the metadata overhead outweighs the data itself, and the total cost can end up higher than simply storing the data in the Standard storage class.
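The overhead arithmetic above can be sketched in a few lines. This is a minimal illustration, not a billing calculator: the 8 KB and 32 KB overheads come from the text, while the per-GB prices passed in are placeholders you would replace with the current rates for your region. It also ignores request, transition, and retrieval fees, so the real break-even size may differ.

```python
KB_PER_GB = 1024 * 1024

# Overheads described in the text above.
S3_METADATA_KB = 8        # billed at S3 Standard rates
GLACIER_METADATA_KB = 32  # billed at Glacier rates


def monthly_cost_standard(size_kb, standard_price_per_gb):
    """Monthly storage cost of one object kept in S3 Standard."""
    return size_kb / KB_PER_GB * standard_price_per_gb


def monthly_cost_glacier(size_kb, standard_price_per_gb, glacier_price_per_gb):
    """Monthly storage cost of one archived object, including both overheads."""
    glacier_part = (size_kb + GLACIER_METADATA_KB) / KB_PER_GB * glacier_price_per_gb
    standard_part = S3_METADATA_KB / KB_PER_GB * standard_price_per_gb
    return glacier_part + standard_part


def break_even_kb(standard_price_per_gb, glacier_price_per_gb):
    """Object size (KB) at which archiving costs the same as staying in Standard."""
    overhead = (GLACIER_METADATA_KB * glacier_price_per_gb
                + S3_METADATA_KB * standard_price_per_gb)
    return overhead / (standard_price_per_gb - glacier_price_per_gb)


if __name__ == "__main__":
    # Placeholder prices in USD per GB-month (assumptions, not quoted rates).
    standard, glacier = 0.023, 0.004
    print(f"break-even object size: {break_even_kb(standard, glacier):.1f} KB")
```

Objects below the break-even size are cheaper to leave in Standard, so it can be worth filtering lifecycle rules by object size or bundling small objects into larger archives before transitioning them.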

EKS

One of our services is backed by Amazon’s managed Kubernetes service, EKS. It launches more than 6k containers on EKS daily. It was initially backed by on-demand worker nodes. To reduce the cost of on-demand nodes, we categorised our jobs into namespaces. Then we assigned an SLA and priority to each namespace based on our internal business use cases. Because we launch the jobs from code, we could make Spot worker nodes the first preference for two of the three namespaces by adding nodeAffinity expressions that target spot nodes. This resulted in savings of up to 40%.

[Image: node-affinity]
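As a rough sketch of what such a preference might look like when jobs are built in code, the snippet below assembles a pod spec as plain Python dictionaries. The namespace names are hypothetical, and the node label is an assumption: EKS managed node groups set `eks.amazonaws.com/capacityType` to `SPOT` or `ON_DEMAND`, but self-managed nodes would need an equivalent custom label. It uses a "preferred" (soft) affinity so that jobs can still fall back to on-demand nodes when no Spot capacity is available.

```python
import json

# Assumption: nodes carry this capacity-type label (set by EKS managed node groups).
CAPACITY_LABEL = "eks.amazonaws.com/capacityType"

# Hypothetical namespace names; the article only says two of three prefer Spot.
SPOT_PREFERRED_NAMESPACES = {"batch-low", "batch-medium"}


def spot_preference(weight=100):
    """Build a soft nodeAffinity that prefers Spot nodes but allows fallback."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # valid range is 1-100
                    "preference": {
                        "matchExpressions": [
                            {"key": CAPACITY_LABEL, "operator": "In", "values": ["SPOT"]}
                        ]
                    },
                }
            ]
        }
    }


def pod_spec_for(namespace, image):
    """Return a minimal pod spec, adding the Spot preference where configured."""
    spec = {
        "containers": [{"name": "job", "image": image}],
        "restartPolicy": "Never",
    }
    if namespace in SPOT_PREFERRED_NAMESPACES:
        spec["affinity"] = spot_preference()
    return spec


if __name__ == "__main__":
    print(json.dumps(pod_spec_for("batch-low", "example/job:latest"), indent=2))
```

Keeping the namespace-to-preference mapping in one place makes it easy to move another namespace onto Spot later, or to switch a namespace back if its SLA changes.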

ElastiCache Auto Scaling

Amazon ElastiCache for Redis supports auto-scaling to automatically adjust capacity to maintain steady, predictable performance at the lowest possible cost. You can automatically scale your cluster horizontally by adding or removing shards or replica nodes. ElastiCache for Redis uses AWS Application Auto Scaling to manage scaling and Amazon CloudWatch metrics to determine when it is time to scale up or down.

[Image: auto-scaling]

Previously, autoscaling was not enabled on any of our ElastiCache services. That meant we were not evaluating whether our resources were underutilised or overutilised, and we were paying for unused capacity. For the first version, we started with scheduled autoscaling based on the SLA and peak hours: we scale down when there is no live traffic to deal with and scale up before the peak hours begin.
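A minimal sketch of such a schedule is shown below. It only builds the request parameters for the Application Auto Scaling `put_scheduled_action` call, so it runs without AWS credentials. The replication group name, the capacities, and the cron times are all hypothetical, and the replication group is assumed to be already registered as a scalable target.

```python
# Hypothetical replication group; replace with your own.
RESOURCE_ID = "replication-group/my-redis"
REPLICAS_DIMENSION = "elasticache:replication-group:Replicas"


def scheduled_action(name, cron_utc, min_capacity, max_capacity):
    """Build kwargs for application-autoscaling put_scheduled_action."""
    return {
        "ServiceNamespace": "elasticache",
        "ScheduledActionName": name,
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": REPLICAS_DIMENSION,
        # AWS cron format: minutes hours day-of-month month day-of-week year
        "Schedule": f"cron({cron_utc})",
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    }


def build_daily_schedule():
    """Scale out before an assumed peak window and back in afterwards (UTC)."""
    return [
        scheduled_action("scale-out-before-peak", "30 9 * * ? *", 3, 5),
        scheduled_action("scale-in-after-peak", "30 17 * * ? *", 1, 2),
    ]


if __name__ == "__main__":
    for action in build_daily_schedule():
        print(action["ScheduledActionName"], action["Schedule"])
    # With boto3 installed and credentials configured, the actions could be applied as:
    #   client = boto3.client("application-autoscaling")
    #   for action in build_daily_schedule():
    #       client.put_scheduled_action(**action)
```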

We then added target tracking scaling policies, which increase or decrease the number of shards or replicas that your service runs based on a target value for a specific metric (in our case, database memory utilization and engine CPU). This is similar to the way a thermostat maintains the temperature of your home: you select a temperature and the thermostat does the rest.
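To make the target tracking idea concrete, here is a hedged sketch that builds the parameters for the Application Auto Scaling `put_scaling_policy` call. The replication group name, target values, and cooldowns are illustrative assumptions rather than the values we run in production. The two predefined metric types shown are the ones ElastiCache exposes for shard-level scaling on memory and primary engine CPU.

```python
# Predefined ElastiCache metrics usable for shard (node group) scaling.
MEMORY_METRIC = "ElastiCacheDatabaseMemoryUsageCountedForEvictPercentage"
CPU_METRIC = "ElastiCachePrimaryEngineCPUUtilization"

SHARDS_DIMENSION = "elasticache:replication-group:NodeGroups"


def target_tracking_policy(resource_id, metric_type, target_value,
                           scale_out_cooldown=300, scale_in_cooldown=600):
    """Build kwargs for application-autoscaling put_scaling_policy."""
    return {
        "PolicyName": f"{metric_type}-target-tracking",
        "ServiceNamespace": "elasticache",
        "ResourceId": resource_id,
        "ScalableDimension": SHARDS_DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # The "thermostat setting": the metric value to hold steady.
            "TargetValue": float(target_value),
            "PredefinedMetricSpecification": {"PredefinedMetricType": metric_type},
            "ScaleOutCooldown": scale_out_cooldown,  # seconds
            "ScaleInCooldown": scale_in_cooldown,    # seconds
        },
    }


if __name__ == "__main__":
    # Hypothetical replication group and target values.
    group = "replication-group/my-redis"
    for policy in (target_tracking_policy(group, MEMORY_METRIC, 60),
                   target_tracking_policy(group, CPU_METRIC, 70)):
        print(policy["PolicyName"])
    # With boto3 and credentials, each policy could be applied as:
    #   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

A longer scale-in cooldown than scale-out cooldown is a common, conservative choice: it lets the cluster add capacity quickly but remove it slowly.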

EndNote

I realised this is not going to be a one-time exercise. To do it properly, we need to make cost review a regular routine and treat understanding the cloud provider's billing model as a best practice. We are continuously working on designing our architecture to keep it cost-efficient. We also plan to audit and optimise the costs of the other services we use on our cloud platform.
