A tale of investigating rising costs in a high velocity startup

Anghami is the leading music & entertainment platform in the Middle East. We’ve been dubbed the Spotify of the Middle East, though we prefer just Anghami. Millions of users use our services daily to enjoy Music, Music Videos, or Expressions.

We have been on the Cloud since Day 0, which was 4 years ago. I joined two and a half years ago (and my oh my does time fly), and we mostly love the scale and flexibility we get. We rely on AWS SQS (Simple Queue Service) to move data asynchronously back and forth through some of our systems.

For those who don’t know how SQS, or Message Queues in general, work: they are systems (or a service, in the case of SQS) to which “Producers” push data we call “Messages”. “Consumers” then pull those “Messages” and process them, so that tasks get handled asynchronously. “Consumers” usually run the longer-running tasks, like sending out emails to welcome a New User.

The wrapper and core processing implementation around SQS was written before I joined and was working pretty well at the scale we were running it back then. Then something happened: we needed to process more data, so we started adding more “Consumers” latching on to SQS to fetch and process “Messages”. That’s when we started to see our SQS costs go up disproportionately to the increase in the number of “Messages” we were processing! Queue panic mode and investigations.

[Figure: Total SQS Costs with a BIG jump in July]

I opened our AWS Billing Dashboard (it was my first time back then) to check what was going on. I found that it provides a tab called “Cost Explorer”, which contains a link to “Daily Spend View”. There I chose the date range I was interested in (before and after the jump in costs and the addition of “Consumers”), then added a filter for the Service SQS, shown above.
Then I noticed the Grouping option and looked through the different options it offered. The magical one for our case was grouping by API Operation, which makes sense, as SQS charges by the number of API Operations, not “Messages”.

[Figure: SQS Costs grouped by API Operation]

The chart gave me all the insight I needed to make sense of things. Most of our spend was going towards the GetQueueAttributes call (in blue above). Interesting!

I opened up the old code for our SQS Wrapper core and looked for calls to the method that invokes that API. It was being called every time before we tried to fetch messages, to check whether the queue had any contents and sleep if it didn’t. This was weird, but without it we would keep calling the ReceiveMessage API in a tight loop (at the time this code was initially written, SQS didn’t have long polling).

By the time I had adopted the code SQS had long polling, so first things first: I updated all our Queues to set the Receive Message Wait Time to the max (20 seconds). This tells SQS to hold a ReceiveMessage request open for up to 20 seconds when the queue is empty and to respond as soon as messages are available, with a limit of 10 messages per call.

Then I updated the code to utilize long polling. This removed all our reliance on GetQueueAttributes API calls, so I removed them from the code. That was the first sweep, and it cut down the bulk of the cost, which was basically waste at that point.

After the rush of victory from my first investigation, I looked into how I could take this further and tune costs down. We know we will always be receiving “Messages” in batches of up to 10, and we process those on the same node. When a message is processed successfully, you must call the DeleteMessage API to tell SQS to delete it from its pool of messages (there is a safety in SQS that sets a timeout on the Message, after which it is released back into the pool of available messages).
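The consumer loop after that first sweep could be sketched roughly as follows. This is a minimal illustration, not our actual wrapper: it assumes a boto3-style client (the `receive_message` and `delete_message` calls and their parameters are boto3's), while `poll_once`, `handle`, and the constant names are hypothetical.

```python
from typing import Any, Callable, Dict

MAX_MESSAGES_PER_CALL = 10  # SQS returns at most 10 messages per ReceiveMessage
MAX_WAIT_SECONDS = 20       # longest long-polling wait SQS allows


def poll_once(sqs: Any, queue_url: str, handle: Callable[[Dict], None]) -> int:
    """Long-poll the queue once, process what arrives, delete each success.

    `sqs` is any object exposing boto3-style receive_message/delete_message
    (an assumption; the original wrapper's library is not stated).
    Returns the number of messages processed successfully.
    """
    # The request itself waits up to 20 s for messages, so there is no
    # GetQueueAttributes check and no sleep before it.
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=MAX_MESSAGES_PER_CALL,
        WaitTimeSeconds=MAX_WAIT_SECONDS,
    )
    processed = 0
    # boto3 omits the "Messages" key entirely when nothing was received.
    for message in response.get("Messages", []):
        try:
            handle(message)
        except Exception:
            # Not deleted: SQS releases it back to the pool after the timeout.
            continue
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        processed += 1
    return processed
```

With boto3 installed, `sqs` would be `boto3.client("sqs")` and a “Consumer” would call `poll_once` in a loop. Note that this version still issues one DeleteMessage request per processed message.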
What came next was batching those calls into DeleteMessageBatch instead of individual DeleteMessage calls, taking care not to cross the Visibility Timeout of the Queues in question. After I changed that code we saw another reduction in cost, since we were basically calling DeleteMessageBatch once for every 10 messages (in reality it’s lower, at around one call for every 7 Messages, since we don’t always get 10 messages, or some fail to process, etc.).

I then continued by batching some of the SendMessage calls that could be batched, which also resulted in some reduction there. Due to the nature of these systems, though, not much can be batched when sending, as “Messages” are generated one by one when our API instances receive calls.

We also added autoscaling logic to our “Consumer” tier. With the above changes we no longer needed to worry, and we reduced costs further by running only the number of instances needed to process the load of messages.

Thanks for reading!
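For reference, the DeleteMessageBatch change described above could be sketched roughly like this. It is an illustration under assumptions rather than our production code: it assumes a boto3-style client (`delete_message_batch` with `QueueUrl` and `Entries` is boto3's call), and `delete_processed` and `MAX_BATCH_ENTRIES` are hypothetical names.

```python
from typing import Any, List

MAX_BATCH_ENTRIES = 10  # DeleteMessageBatch accepts at most 10 entries per call


def delete_processed(sqs: Any, queue_url: str, receipt_handles: List[str]) -> List[str]:
    """Delete processed messages in batches of up to 10.

    `sqs` is any object exposing a boto3-style delete_message_batch (an
    assumption). Returns the receipt handles SQS reported as failed.
    """
    failed: List[str] = []
    for start in range(0, len(receipt_handles), MAX_BATCH_ENTRIES):
        chunk = receipt_handles[start:start + MAX_BATCH_ENTRIES]
        # Entry Ids only need to be unique within a single batch request.
        entries = [
            {"Id": str(index), "ReceiptHandle": handle}
            for index, handle in enumerate(chunk)
        ]
        response = sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
        # Map each failed Id back to the receipt handle it stood for.
        failed.extend(chunk[int(item["Id"])] for item in response.get("Failed", []))
    return failed
```

A “Consumer” would collect the receipt handles of successfully processed messages and pass them in one go, making sure that happens before the queue’s Visibility Timeout elapses. Otherwise the messages are released back into the pool and processed again.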
