But before we get into our explanation, let me tell you a bit about myself, the journey I took, and how S3 is related to it.
At heart I am a software developer. Over the past 12 years in the software industry I have dealt mainly with backend and mobile development. Until last year, most of my work was done in a "server-more" environment: a bit of S3 and RDS, and the only time I used lambda was as part of my Python code.
I joined a startup called Coneuron a year ago as head of engineering. As a father of twins, I found the startup's mission very compelling: reduce negativity online. These are three simple words, but they pose a huge problem both technically and product-wise.
As a small startup, we decided right from the start to go serverless, for the sole reason that we prefer to concentrate on our product and not on provisioning a Linux machine. Going serverless is more than using Lambdas or FaaS; it's a state of mind where you "outsource" everything that is not core to your business. (But what serverless means to me is a different post. Let's move forward.)
How does S3 fit into the picture? Before answering that, you need to understand how Coneuron works.
We collect data (a lot of it) that our users share on various social networks. Data is collected on a mobile device, and one of our earliest problems was the question, "How do we upload this data into our system in order to analyze it?" We wanted the solution to support the following features:
We use S3 as a simple queue where no order is guaranteed. Everything starts with a single device. Here’s how it works:
We like the architecture above. It proved to work very well, so we decided to stretch it a bit and use S3 + Lambda as a messaging gateway for our internal services. S3 acts as a non-ordered queue: each service in the system creates a message and saves it into a predefined folder with a specific suffix. A Lambda catches the event and, according to the suffix of the file, forwards it to the correct service. Saving all API interactions in S3 greatly improves debugging, because there's no need to dig through logs to see the arguments and parameters. Just pick the JSON file containing the interaction, view it in your preferred editor, or replay it on your machine with breakpoints. In addition, we can leverage our knowledge of S3 without introducing any new service into the system.
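To make the suffix-based forwarding concrete, here is a minimal sketch of what such a gateway Lambda could look like. It is not our production code: the suffixes (.sentiment.json, .profile.json), the service functions, and the route decorator are hypothetical names chosen for illustration. The only part taken from AWS is the shape of the S3 event record.

```python
from urllib.parse import unquote_plus

# Hypothetical suffix-to-service table; the real suffixes are not shown in this post.
ROUTES = {}


def route(suffix):
    """Register a handler for object keys that end with `suffix`."""
    def decorator(func):
        ROUTES[suffix] = func
        return func
    return decorator


@route(".sentiment.json")
def sentiment_service(bucket, key):
    # Placeholder: a real service would read and process the message here.
    return f"sentiment <- s3://{bucket}/{key}"


@route(".profile.json")
def profile_service(bucket, key):
    return f"profile <- s3://{bucket}/{key}"


def lambda_handler(event, context=None):
    """Forward every S3 record in the event to the service matching its suffix."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        for suffix, handler in ROUTES.items():
            if key.endswith(suffix):
                results.append(handler(bucket, key))
                break
        else:
            # No service claimed this suffix; a real system might alert here.
            results.append(f"unrouted <- s3://{bucket}/{key}")
    return results
```

Because the handler is a plain function over a dictionary, you can replay a saved event locally by loading the JSON and calling lambda_handler on it under a debugger.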
All messages arrive in a single folder named incoming, and multiple Lambda functions are configured to listen to different file names. (You can configure this under the properties tab.) Each file with a different suffix represents a different service. When a message is processed successfully, it is moved to an analyzed folder. As I wrote earlier, we keep all messages forever.
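If you prefer code over the console, the same suffix filters can be set through the S3 API. The sketch below builds the notification configuration as a plain dictionary and then applies it with boto3's put_bucket_notification_configuration. The suffixes and ARNs you pass in are your own; the function names are hypothetical, and I assume here that the incoming folder maps to the key prefix incoming/.

```python
def build_notification_config(routes, prefix="incoming/"):
    """Build an S3 notification configuration.

    `routes` maps a key suffix to the ARN of the Lambda that should handle it.
    """
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": f"route-{index}",
                "LambdaFunctionArn": arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
            for index, (suffix, arn) in enumerate(sorted(routes.items()))
        ]
    }


def apply_notification_config(bucket, routes):
    """Push the configuration to S3 (needs AWS credentials and boto3)."""
    import boto3  # imported lazily so the builder above works without AWS access

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(routes),
    )
```

Keep in mind that this call replaces the bucket's whole notification configuration, and that each Lambda also needs a permission allowing S3 to invoke it.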
All of our messages are POPOs (plain old Python objects) that implement two main methods:
"date": "2018-07-01 10:41:58.925570+00:00"
Saving a message triggers the Lambda that is attached to the S3 bucket, and the message is read. All services share the same basic flow:
details <- get file details from event
actual_file <- download file from S3 to local storage
do_magic() # implemented by the service itself
move file to done folder # save forever
in case of an error move to dead letter bucket
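The pseudocode above could be sketched in Python roughly as follows. This is an illustration under assumptions rather than our actual code: the bucket name example-dead-letter, the done/ prefix, and the function names are hypothetical, and the S3 client is passed in as a parameter so the flow can be exercised without AWS. The client methods used (download_file, copy_object, delete_object) are the standard boto3 S3 client calls.

```python
import os
import tempfile
from urllib.parse import unquote_plus

DEAD_LETTER_BUCKET = "example-dead-letter"  # hypothetical bucket name
DONE_PREFIX = "done/"  # hypothetical prefix for processed messages


def move_object(s3, src_bucket, key, dst_bucket, dst_key):
    """S3 has no rename, so a move is a copy followed by a delete."""
    s3.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": key},
    )
    s3.delete_object(Bucket=src_bucket, Key=key)


def handle_event(event, s3, do_magic):
    """Download each message, run the service logic, then archive or dead-letter it."""
    for record in event["Records"]:
        # details <- get file details from event
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        try:
            with tempfile.TemporaryDirectory() as tmp:
                # actual_file <- download file from S3 to local storage
                local_path = os.path.join(tmp, os.path.basename(key))
                s3.download_file(bucket, key, local_path)
                do_magic(local_path)  # implemented by the service itself
            # move file to done folder (kept forever)
            move_object(s3, bucket, key, bucket, DONE_PREFIX + os.path.basename(key))
        except Exception:
            # in case of an error, move to the dead letter bucket under the same key
            move_object(s3, bucket, key, DEAD_LETTER_BUCKET, key)
```

In a real Lambda you would create the client once with boto3.client("s3") and call handle_event(event, s3, do_magic) from the entry point. Catching every exception is deliberate in this sketch, since any failure should send the message to the dead letter bucket for a later retry.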
Like any messaging system, we have a dead letter bucket. The bucket contains all messages that, for some reason, failed to be analyzed successfully. A scheduled Lambda task that runs once per hour moves all of these messages back to the incoming folder so we can try to reanalyze them. In distributed systems things fail once in a while, but retrying the same message usually succeeds. We track the number of retried messages with internal metrics in order to pinpoint real bugs.
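A retry task like this can be quite small. The sketch below is one possible shape, with some assumptions spelled out: the function name is hypothetical, the client is passed in, and I assume that dead-lettered objects keep their original key (including the incoming/ prefix), so copying them back under the same key puts them in the incoming folder again. The copy fires a new object-created event, which is what triggers the reanalysis.

```python
def requeue_dead_letters(s3, dead_letter_bucket, incoming_bucket):
    """Move every object from the dead letter bucket back to the incoming bucket.

    Returns the number of messages that were requeued, which can feed a retry metric.
    """
    moved = 0
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=dead_letter_bucket):
        # "Contents" is absent when a page (or the whole bucket) is empty.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=incoming_bucket,
                Key=key,
                CopySource={"Bucket": dead_letter_bucket, "Key": key},
            )
            s3.delete_object(Bucket=dead_letter_bucket, Key=key)
            moved += 1
    return moved
```

The returned count is handy for the metric mentioned above: a message that keeps reappearing in the dead letter bucket hour after hour is more likely a real bug than a transient failure.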
S3 is a versatile tool. Its elastic storage, combined with a powerful API and Lambda integration, gives it superpowers. I've demonstrated two possible usages:
You are more than welcome to share your experience with S3 and how you extended it beyond its original purpose.