We're running an e-commerce platform, where people publish products and other people purchase those products. Our backend has some highly scalable microservices running on well-designed Lambdas, and there's a lot of caching involved. Our order processing microservice writes to a DynamoDB table. We're using DynamoDB provisioned capacity mode with auto-scaling. We did a great job and everything runs smoothly.
Suddenly, someone's product goes viral, and a lot of people rush in to buy it at the same time. Our cache and CDN don't even blink at the traffic, and our well-designed Lambdas scale amazingly fast, but our DynamoDB table is suddenly bombarded with writes and the auto-scaling can't keep up. Our order processing Lambda receives ProvisionedThroughputExceededException
, and when it retries it just makes everything worse. Things crash. Sales are lost. We eventually recover, but those customers are gone. How do we make sure it doesn't happen again?
Let's explore option 2.
AWS Services involved:
DynamoDB: Our database. All you need to know for this post is how DynamoDB scales.
SQS: A fully managed message queuing service that enables you to decouple components. Producers like our order processing microservice post to the queue, the queue stores these messages until they're read, and consumers read from the queue in their own time.
SES: An email platform, more similar to services like MailChimp than to an AWS service. If you're already on AWS and you just need to send emails programmatically, it's easy to set up. If you're not on AWS, need more control, or need to send so many emails that price is a factor, you'll need to do some research. For this post, SES is good enough.
SQS is a fully managed message queuing service. A messaging queue is a data structure where items can be read in the same order as they were written: First-In, First-Out (FIFO).
Queues allow us to decouple components by making the consumer (that's the component that reads from the queue) unaware of who wrote the item (the writer is called producer).
Additionally, in software architecture we usually focus on another characteristic of queues: The reading of the item can happen sometime after the writing. This lets us decouple producers and consumers in time: Consumers don't need to be available when Producers write to the queue.
The queue stores the messages for a certain amount of time, and when Consumers are ready, they poll the queue for messages and receive the oldest message.
For our solution, we're going to use a queue so that our order processing microservice can send a message with the order, the queue stores the message, and a consumer can read it at its own rhythm (i.e. at our DynamoDB table's rhythm).
There are two types of queues in SQS:
Standard queues are the default type of queue. They're cheaper than FIFO queues and nearly infinitely scalable. The tradeoff is that they only guarantee at-least-once delivery (meaning you might get duplicates), and the order of the messages is mostly respected but not guaranteed.
FIFO queues are more expensive than Standard queues, and they don't scale infinitely, but they guarantee ordered exactly-once delivery. You need to set the MessageGroupId
property in the message since FIFO queues only deliver the next message in a MessageGroup after the previous message has been successfully processed. For example, if you set the value of MessageGroupId
to the customer ID and a customer makes two orders at the same time, the second one to come in won't be processed until the first one is finished processing. It's also important to set MessageDeduplicationId
, to ensure that if the message gets duplicated upstream, it will be deduplicated at the queue. A FIFO queue will only keep one message per unique value of MessageDeduplicationId.
Most people who think of queues are thinking of guaranteed FIFO orders and exactly-once delivery. The only way to actually get those guarantees is with FIFO queues.
Follow these step-by-step instructions to implement an SQS Queue to throttle writes to a DynamoDB table. Replace YOUR_ACCOUNT_ID
and YOUR_REGION
with the appropriate values for your account and region.
We need to update the code of the Orders service so that it sends the new Order to the Orders Queue, instead of writing to the Orders table.
This is what the code looks like:
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const queueUrl = 'https://sqs.YOUR_REGION.amazonaws.com/YOUR_ACCOUNT_ID/OrdersQueue';
async function processOrder(order) {
const params = {
MessageBody: JSON.stringify(order),
QueueUrl: queueUrl,
MessageGroupId: order.customerId,
MessageDeduplicationId: order.id
};
try {
const result = await sqs.sendMessage(params).promise();
console.log('Order sent to SQS:', result.MessageId);
} catch (error) {
console.error('Error sending order to SQS:', error);
}
}
Also, add this policy to the IAM Role of the function, so it can access SQS. Don't forget to delete the permissions to access DynamoDB!
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue"
}
]
}
Go to the Lambda console and create a new Lambda function.
Add the following code:
const AWS = require('aws-sdk');
const dynamoDB = new AWS.DynamoDB.DocumentClient();
const ses = new AWS.SES();
exports.handler = async (event) => {
for (const record of event.Records) {
const order = JSON.parse(record.body);
await saveOrderToDynamoDB(order);
await sendEmailNotification(order);
}
};
async function saveOrderToDynamoDB(order) {
const params = {
TableName: 'orders',
Item: order
};
try {
await dynamoDB.put(params).promise();
console.log(`Order saved: ${order.orderId}`);
} catch (error) {
console.error(`Error saving order: ${order.orderId}`, error);
}
}
async function sendEmailNotification(order) {
const emailParams = {
Source: '[email protected]',
Destination: {
ToAddresses: [order.customerEmail]
},
Message: {
Subject: {
Data: 'Your order is ready'
},
Body: {
Text: {
Data: `Thank you for your order, ${order.customerName}! Your order #${order.orderId} is now ready.`
}
}
}
};
try {
await ses.sendEmail(emailParams).promise();
console.log(`Email sent: ${order.orderId}`);
} catch (error) {
console.error(`Error sending email for order: ${order.orderId}`, error);
}
}
Also, add the following IAM Policy to the IAM Role of the function, so it can be triggered by SQS and access DynamoDB and SES:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes"
],
"Resource": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue"
},
{
"Effect": "Allow",
"Action": [
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:DeleteItem"
],
"Resource": "arn:aws:dynamodb:YOUR_REGION:YOUR_ACCOUNT_ID:table/Orders"
},
{
"Effect": "Allow",
"Action": "ses:SendEmail",
"Resource": "*"
}
]
}
Architecture-wise, there's one big change in our solution: We've made our workflow async! Let me bring the diagram here.
Before, our Orders service would return the result of the order. From the user's perspective, they wait until the order is processed, and they see the result on the website. From the system's perspective, we're constrained to either succeed or fail processing the order in the timeout limit of API Gateway (29 seconds). In more practical terms, we're limited by what the user is expecting: we can't just show a "loading" icon for 29 seconds!
After the change, the website just shows something like "We're processing your order, we'll email you when it's ready". That sets a different expectation to the user. That's important for the system, because now we could actually have our Lambda function take 15 minutes, without hitting the 29 seconds limit of API Gateway, or without the user getting angry. It's not just that though, if the Order Processing lambda crashes mid-execution, the SQS queue will make the order available again as a message after the visibility timeout expires, and the Lambda service will invoke our function again with the same order. When the maxReceiveCount limit is reached, the order can be sent to another queue called Dead Letters Queue (DLQ), where we can store failed orders for future reference. We didn't set up a DLQ here, but it's easy enough, and for small and medium-sized systems you can easily set up SNS to send you an email and resolve the issue manually, since the volume shouldn't be particularly large.
Once the order went through all the steps, failed some, retried, succeeded, etc, then we notify the user that their order is "ready". This can look different for different systems, some are just a "we got the money", some ship physical products, some onboard the user to a complex SaaS. For this solution I chose to do it via email because it's easy and common enough, but you could use a webhook and still keep the process async.
Also published here.