S3 the best of 2 worlds

Simple math

Here at Coneuron, we’ve decided to put S3 at the center of our serverless solution because it’s such a versatile component in the AWS world and we want to tell you a bit about how we use it.

But before we get into our explanation, let me tell you a bit about myself, the journey I took, and how S3 is related to it.

Deep inside I am a software developer. Over the past 12 years in the software industry, I have dealt mainly with backend and mobile development. Up until last year, most of my work was done in a “server-more” environment: a bit of S3, some RDS, and the only time I used lambda was as part of my Python code.

I joined a startup called Coneuron a year ago as head of engineering. As a father of twins, I found the startup’s mission very compelling: reduce negativity online. These are three simple words, but they pose a huge problem, both technically and product-wise.

As a small startup, we decided right from the start to go serverless, for the sole reason that we prefer to concentrate on our product and not on provisioning a Linux machine. Going serverless is more than using Lambdas or FaaS; it’s a state of mind where you “outsource” everything that is not core to your business. (What serverless means to me is a topic for a different post. Let’s move forward.)

How does S3 fit into the picture? Before answering that, you need to understand how Coneuron works.

We collect data (a lot of it) that our users share on various social networks. Data is collected on a mobile device, and one of our earliest problems was the question, “How do we upload this data into our system in order to analyze it?” We wanted the solution to support the following features:

  • Uploaded data can weigh from 1 KB up to several hundred KB (and in some cases even a couple of megabytes), and each device can upload content hundreds of times a day.
  • We live in a FaaS world, and we want the solution to play nicely with what we already have, e.g. Lambda.
  • From a security perspective, we want to try to minimize the attack surface of the interface.
  • We want to be able to save content for future use (debugging, replaying, reanalyzing, etc.)

High level view

We use S3 as a simple queue where no order is guaranteed. Everything starts with a single device. Here’s how it works:

  1. On the device itself we compress all data; reducing size is mandatory both for saving space on S3 and for reducing the bandwidth used on the device. These compressed files are saved in our S3 bucket forever, enabling us to better debug, fine-tune our ML algorithms, and duplicate our production environment data in one click.
  2. Instead of a single API that authenticates and enables the endpoint to upload a file, we’ve separated the process into two steps:
     - Coneuron uses an internal authentication service (more on that in future posts), which, given a token, produces a pre-signed POST URL. This pre-signed URL is time-restricted and expires after X amount of time. In addition, it allows us to control what the resulting file name will look like. The file name itself represents which user uploaded the file. We use [unique_id]_[human_readable_id]_[time]_[uuid] to represent a file that a user uploaded.
     - The device uses this URL to make a simple POST request and upload the file to an S3 bucket.
  3. Note that Coneuron’s service does not handle the actual upload; it only produces the URL for the upload. This technique has another benefit: it makes A/B testing quite easy. There’s no need to push any flags to the device in order to use a different URL. All you have to do is hand out a different pre-signed URL pointing to a different bucket, and a different piece of code is run.
  4. The last part in the process is a Lambda function attached to the S3 bucket. Using S3 prefix/suffix triggers, we are able to run the right Lambda depending on file type (more on that later).
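The URL-producing step above can be sketched in Python. This is a minimal illustration, not Coneuron’s actual service: the function names, the timestamp format, the 5-minute expiry, and the size limit are all assumptions. The only real API used is `generate_presigned_post` from the boto3 S3 client, which is passed in so the sketch stays independent of credentials.

```python
import uuid
from datetime import datetime, timezone


def build_object_key(unique_id, human_readable_id, now=None, file_uuid=None):
    """Build a key in the [unique_id]_[human_readable_id]_[time]_[uuid] format."""
    now = now or datetime.now(timezone.utc)
    file_uuid = file_uuid or uuid.uuid4()
    # The timestamp format is an assumption; the post does not specify one.
    return f"{unique_id}_{human_readable_id}_{now:%Y%m%dT%H%M%SZ}_{file_uuid}"


def create_upload_url(s3_client, bucket, key, expires_in=300,
                      max_bytes=5 * 1024 * 1024):
    """Ask S3 for a time-limited pre-signed POST restricted to a single key."""
    return s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        # Reject uploads larger than max_bytes (limit chosen for illustration).
        Conditions=[["content-length-range", 1, max_bytes]],
        ExpiresIn=expires_in,  # seconds until the pre-signed POST expires
    )
```

With boto3, `s3_client` would be `boto3.client("s3")`. The response is a dictionary with a `url` and a set of `fields`; the device sends those fields together with the file as a multipart form POST to that URL.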

It was good × 2

We like the architecture above. It proved to work very well, so we decided to stretch it a bit and use S3 + Lambda as a messaging gateway for our internal services. S3 acts as a non-ordered queue: each service in the system creates a message and saves it into a predefined folder with a specific suffix. A Lambda catches the event and, according to the suffix of the file, forwards it to the correct service. Saving all API interactions in S3 greatly improves debugging, since there’s no need to dig through logs to see the arguments and parameters. Just pick the JSON file containing the interaction and view it in your preferred editor, or replay it on your machine with breakpoints. In addition, we can leverage our knowledge of S3 without introducing any new service into the system.

Implementation details

Figure: Folder structure for the S3-based queue
Figure: S3 bucket event configuration

All messages arrive in a single folder named incoming, and multiple Lambda functions are configured to listen for different file names. (You can configure this under Events in the bucket’s Properties tab.) Each file suffix represents a different service. When a message is processed successfully, it is moved to an analyzed folder. As I wrote earlier, we keep all messages forever.
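The same suffix-based routing can also be expressed in code rather than in the console. The sketch below only builds the notification structure that S3 expects; the suffixes and Lambda ARNs are made up for illustration, and the helper name is hypothetical.

```python
def build_notification_config(routes, prefix="incoming/"):
    """Map file suffixes to Lambda ARNs for objects created under `prefix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [
                    {"Name": "prefix", "Value": prefix},
                    {"Name": "suffix", "Value": suffix},
                ]}},
            }
            for suffix, lambda_arn in routes.items()
        ]
    }


# Hypothetical suffixes and ARNs, one per service.
ROUTES = {
    ".sent.json": "arn:aws:lambda:us-east-1:123456789012:function:sent-service",
    ".report.json": "arn:aws:lambda:us-east-1:123456789012:function:report-service",
}
config = build_notification_config(ROUTES)
```

With boto3, a structure like this could be applied through `put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=config)`. Using distinct, non-overlapping suffixes keeps it unambiguous which Lambda receives each file.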

All of our messages are POPOs (plain old Python objects) that implement two main methods:

  • Save: This serializes the message into JSON and saves it into an S3 folder with a specific suffix. Note that each serialized message contains, in addition to the content itself, a version and a date. For example:
    {
      "meta": {
        "version": 1,
        "date": "2018-07-01 10:41:58.925570+00:00"
      },
      "user_id": 944,
      "sent_to_id": 365
    }
  • from_stream: This reads the actual serialized file, verifies its validity, and converts it to a POPO.
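To make the two methods concrete, here is a minimal sketch of what such a message class might look like. The class name, the suffix, and the validation rules are assumptions; only the JSON layout follows the example above. The S3 client is passed in, and `put_object` is the standard boto3 call for writing an object.

```python
import json
from datetime import datetime, timezone

SUPPORTED_VERSION = 1


class SentMessage:
    """Hypothetical message type; field names follow the JSON example."""

    suffix = ".sent.json"  # assumed suffix used for routing to a service

    def __init__(self, user_id, sent_to_id, date=None):
        self.user_id = user_id
        self.sent_to_id = sent_to_id
        self.date = date or datetime.now(timezone.utc)

    def to_json(self):
        return json.dumps({
            "meta": {"version": SUPPORTED_VERSION, "date": str(self.date)},
            "user_id": self.user_id,
            "sent_to_id": self.sent_to_id,
        })

    def save(self, s3_client, bucket, name, folder="incoming"):
        """Serialize the message and write it under folder/ with the suffix."""
        key = f"{folder}/{name}{self.suffix}"
        s3_client.put_object(Bucket=bucket, Key=key,
                             Body=self.to_json().encode("utf-8"))
        return key

    @classmethod
    def from_stream(cls, stream):
        """Parse a serialized message; a missing field raises KeyError."""
        data = json.load(stream)
        if data.get("meta", {}).get("version") != SUPPORTED_VERSION:
            raise ValueError("unsupported or missing message version")
        date = datetime.fromisoformat(data["meta"]["date"])
        return cls(data["user_id"], data["sent_to_id"], date)
```

Keeping the version inside `meta` lets `from_stream` reject, or later migrate, messages written by older code, which matters when every message is stored forever.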

Saving a message triggers the Lambda that is attached to the S3 bucket, and the message is read. All services share the same basic flow:

details <- get file details from event
actual_file <- download file from S3 to local storage
do_magic()  # implemented by the service itself
move file to analyzed folder  # save forever
on error: move file to dead letter bucket
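The shared flow might look roughly like this in Python. It is a sketch under assumptions: the function name, the `analyzed/` prefix, and the dead letter bucket parameter are illustrative, and the S3 client is injected. The calls `download_file`, `copy_object`, and `delete_object` are standard boto3 S3 client methods; S3 has no native "move", so a move is a copy followed by a delete.

```python
import logging
import os
from urllib.parse import unquote_plus


def handle_event(event, s3_client, do_magic, dead_letter_bucket, tmp_dir="/tmp"):
    """Process every S3 record in a Lambda event, then archive or dead-letter it."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        source = {"Bucket": bucket, "Key": key}
        name = os.path.basename(key)
        try:
            local_path = os.path.join(tmp_dir, name)
            s3_client.download_file(bucket, key, local_path)
            do_magic(local_path)  # implemented by the service itself
            s3_client.copy_object(Bucket=bucket, Key=f"analyzed/{name}",
                                  CopySource=source)
        except Exception:
            logging.exception("processing failed for %s", key)
            s3_client.copy_object(Bucket=dead_letter_bucket, Key=name,
                                  CopySource=source)
        # Remove the original only after one of the copies succeeded.
        s3_client.delete_object(Bucket=bucket, Key=key)
```

Each service would wrap this in its own Lambda entry point and pass in its own `do_magic`.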

Like any messaging system, we have a dead letter bucket. It contains all messages that, for some reason, failed to be analyzed. A scheduled Lambda task that runs once per hour moves all of these messages back to the incoming folder so we can try to reanalyze them. In distributed systems things fail once in a while, but retrying the same message will usually succeed. We use internal metrics to track the number of messages we retried, in order to pinpoint real bugs.
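A retry task along these lines could be sketched as follows. The function name, bucket parameters, and key layout are assumptions, since the post does not describe how the dead letter bucket is organized. It uses the boto3 `list_objects_v2` paginator together with `copy_object` and `delete_object`, and returns the number of messages it moved so the caller can report it as a metric.

```python
def requeue_dead_letters(s3_client, dead_letter_bucket, queue_bucket,
                         prefix="incoming/"):
    """Move every object in the dead letter bucket back to the incoming folder."""
    moved = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=dead_letter_bucket):
        # "Contents" is absent when the bucket (or page) is empty.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3_client.copy_object(
                Bucket=queue_bucket,
                Key=f"{prefix}{key}",
                CopySource={"Bucket": dead_letter_bucket, "Key": key},
            )
            s3_client.delete_object(Bucket=dead_letter_bucket, Key=key)
            moved += 1
    return moved
```

Copying a file back into the incoming folder creates a new object there, so the regular suffix trigger fires again without any extra wiring.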

Summary

S3 is a versatile tool. Its elastic storage, combined with a powerful API and Lambda integration, gives it superpowers. I’ve demonstrated two possible uses:

  1. Easy and secure file uploads from various clients → a kind of API gateway
  2. Messaging system

You are more than welcome to share your experience with S3 and how you extended it beyond its original purpose.


More by Efi Merdler-Kravitz
