part 1 : overview
part 2 : testing and continuous delivery strategies
part 3 : ops
part 4 : building a scalable push notifications system
part 5 : building a better recommendation system
Just before Yubl’s untimely demise we did an interesting piece of work to redesign the system for sending targeted push notifications to our users to improve retention.
The old system relied on MixPanel both for selecting users and for sending out the push notifications. Whilst MixPanel was great for getting us basic analytics quickly, we soon found that our use cases outgrew it. The most pressing limitation was that we were not able to query users based on their social graph to create targeted push notifications, e.g. to notify an influencer’s followers when he or she publishes a new post or runs a new social media campaign.
Since all of our analytics events are streamed to Google BigQuery (using a combination of Kinesis Firehose, S3 and Lambda) we have all the data we need to support the complex use cases the product team has.
What we needed was a push notification system that can integrate with BigQuery results and is capable of sending millions of push notifications in a batch.
From a high level, we need to support 2 types of notifications.
Ad-hoc notifications are driven by the marketing team, working closely with influencers and the BI team to match users with influencers or content that they might be interested in. Example notifications include:
Scheduled notifications are driven by the product team. These notifications are designed to nudge users to finish the sign-up process or to come back to the platform after they have lapsed. Example notifications include:
For the scheduled notifications, we wanted to test out different messages/layouts to optimise their effectiveness over time. To do that, we needed to support A/B testing as part of the new system (which MixPanel already supports).
We should be able to create multiple variants (each with a percentage), along with a control group who will not receive any push notifications.
A/B test groups are configured in the Lambda function, which can be easily and quickly changed and redeployed, and the changes are source controlled and peer reviewed.
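As a rough illustration of what such a configuration could look like, the sketch below assigns each user to a variant or to the control group. It assumes users are bucketed by hashing their ID, which is only one possible approach. The names VARIANTS, bucketFor and assignVariant, as well as the example messages and percentages, are hypothetical.

```javascript
const crypto = require('crypto');

// Hypothetical configuration: the percentages add up to less than 100,
// and every user left over falls into the control group.
const VARIANTS = [
  { name: 'A', percentage: 40, message: 'Your friends miss you!' },
  { name: 'B', percentage: 40, message: 'See what you have missed' }
];

// Maps a user ID to a stable bucket in the range [0, 100).
// Including the test name means different tests split users differently.
function bucketFor(userId, testName) {
  const hash = crypto
    .createHash('sha256')
    .update(`${testName}:${userId}`)
    .digest();
  return hash.readUInt32BE(0) % 100;
}

// Returns the variant a user belongs to, or null for the control group.
function assignVariant(userId, testName, variants = VARIANTS) {
  const bucket = bucketFor(userId, testName);
  let upperBound = 0;
  for (const variant of variants) {
    upperBound += variant.percentage;
    if (bucket < upperBound) {
      return variant;
    }
  }
  return null; // control group: no push notification is sent
}
```

Because the bucket is derived from the user ID rather than drawn at random, a user stays in the same group every time the scheduled function runs, as long as the configuration does not change.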
For the ad-hoc notifications, we don’t want to get in the way of the marketing team doing their job, so the process for creating ad-hoc push notifications should be as frictionless as possible. However, we also don’t want the marketing team to operate completely without oversight and run the risk of long-term damage by spamming users with unwanted push notifications (which might cause users to disable notifications or even rage quit the app).
The compromise we reached was an automated approval process whereby:
A simple form to request push notifications to be sent to users identified by the query
Once a request has been submitted, the requester and the approvers will both receive an email detailing the number of users selected by the query, the JSON payload, etc. In the email to the approvers, there are also buttons to approve/reject the request, as well as to send the proposed push notification to the approvers so they can see how the message would appear on both Android and iOS devices.
We decided to use S3 as the source for a send-batch-notifications function because it allows us to pass a large list of users (remember, the goal is to support sending push notifications to millions of users in a batch) without having to worry about pagination or limits on payload size.
The function will work with any JSON file in the right format, and that JSON file can be generated in many ways:
We also considered moving the device registrations to SNS but decided against it because it doesn’t provide a useful enough abstraction to justify the effort to migrate (which involves client work) and the additional cost for sending push notifications. Instead, we used node-gcm and apn to communicate with GCM and APN directly.
Lambda has a hard limit of 5 mins execution time (it might be softened in the near future), and that might not be enough time to send millions of push notifications.
Our approach to long-running tasks like this is to write the Lambda function as a recursive function.
A naive recursive function would process the payload in fixed size batches and recurse at the end of each batch whilst passing along a token/position value to allow the next invocation to continue from where it left off.
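A minimal sketch of that naive approach might look like the following. To keep it self-contained, the dependencies are injected: loadUsers, sendBatch and recurse are hypothetical names, as is the BATCH_SIZE value. In a real Lambda function, recurse would asynchronously invoke the same function again with the new event.

```javascript
const BATCH_SIZE = 500; // hypothetical batch size

// Builds a handler from injected dependencies (all hypothetical):
//   loadUsers()      -> Promise<Array>  e.g. reads and parses the JSON file
//   sendBatch(users) -> Promise         e.g. sends the push notifications
//   recurse(event)   -> Promise         e.g. invokes this function again
function makeHandler({ loadUsers, sendBatch, recurse }) {
  return async function handler(event) {
    // The position is the token that tells us where the last invocation stopped.
    const position = event.position || 0;
    const users = await loadUsers();
    const batch = users.slice(position, position + BATCH_SIZE);

    if (batch.length > 0) {
      await sendBatch(batch);
    }

    const nextPosition = position + batch.length;
    if (nextPosition < users.length) {
      // Pass the position along so the next invocation continues from here.
      await recurse({ ...event, position: nextPosition });
    }

    return { position: nextPosition, done: nextPosition >= users.length };
  };
}
```

With this shape, a list of 1,200 users and a batch size of 500 results in three invocations, each of which reloads the full list before sending its batch.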
In this particular case, we have additional considerations because the total number of work items can be very large:
To minimise the number of recursions, our function would use context.getRemainingTimeInMillis() to check how much time is left in this invocation.
When caching the content of the JSON file from S3, we also need to compare the ETag to ensure that the content of the file hasn’t changed.
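The sketch below shows one way these two ideas could fit together. It is an illustration under stated assumptions rather than our actual code: getETag, download, sendBatch and recurse are hypothetical injected dependencies, and BATCH_SIZE and MIN_REMAINING_MS are made-up values. It also assumes the cache lives in a module-level variable, which survives between invocations only when the container is reused.

```javascript
const BATCH_SIZE = 500;         // hypothetical batch size
const MIN_REMAINING_MS = 10000; // hypothetical safety margin before recursing

// Module-level cache, kept between invocations on a reused container.
let cached = null;

// Returns the cached user list if the file's ETag is unchanged,
// otherwise downloads the file again and refreshes the cache.
async function loadUsers({ getETag, download }) {
  const etag = await getETag();
  if (!cached || cached.etag !== etag) {
    cached = { etag, users: await download() };
  }
  return cached.users;
}

function makeHandler({ getETag, download, sendBatch, recurse }) {
  return async function handler(event, context) {
    const users = await loadUsers({ getETag, download });
    let position = event.position || 0;

    // Keep sending batches for as long as there is work and time left,
    // instead of recursing after every single batch.
    while (
      position < users.length &&
      context.getRemainingTimeInMillis() > MIN_REMAINING_MS
    ) {
      const batch = users.slice(position, position + BATCH_SIZE);
      await sendBatch(batch);
      position += batch.length;
    }

    if (position < users.length) {
      // Out of time: hand the remaining work to the next invocation.
      await recurse({ ...event, position });
    }

    return { position, done: position >= users.length };
  };
}
```

The safety margin needs to be comfortably larger than the time one batch takes to send, otherwise an invocation could time out in the middle of a batch.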
With this setup, the system was able to easily handle JSON files with more than 1 million users during our load test (sorry Apple and Google for sending all those fake device tokens :-P).