We built a system that integrates with BigQuery results and is capable of sending millions of push notifications in a batch, using a combination of Lambda, S3, API Gateway, DynamoDB and SES.

The Road So Far
- part 1: overview
- part 2: testing and continuous delivery strategies
- part 3: ops
- part 4: building a scalable push notifications system
- part 5: building a better recommendation system

Just before Yubl's untimely demise, we did an interesting piece of work to redesign the system for sending targeted push notifications to our users to improve retention.

The old system relied on MixPanel for both selecting users as well as sending out the push notifications. Whilst MixPanel was great for getting us basic analytics quickly, we soon found our use cases outgrew it. The most pressing limitation was that we were not able to query users based on their social graph to create targeted push notifications, e.g. to notify an influencer's followers when he/she publishes a new post or runs a new social media campaign.

Since all of our analytics events are streamed to Google BigQuery (using a combination of Kinesis Firehose, S3 and Lambda), we have all the data we need to support the complex use cases the product team has.

What we needed was a push notification system that could integrate with BigQuery results and was capable of sending millions of push notifications in a batch.

Design Goals

From a high level, we needed to support 2 types of notifications.

Ad-hoc notifications are driven by the marketing team, working closely with influencers and the BI team to match users with influencers or content that they might be interested in. Example notifications include:
- users who follow Accessorize and other fashion brands might be interested to know when another notable fashion brand joins the platform
- users who follow an influencer might be interested to know when the influencer publishes a new post or is running a social media campaign (usually with give-away prizes, etc.)
- users who have shared/liked music-related content might be interested to know that Tinie Tempah has joined the platform

Scheduled notifications are driven by the product team; these notifications are designed to nudge users to finish the sign up process or to come back to the platform after they have lapsed. Example notifications include:
- day-1 unfinished sign up: notify users who didn't finish the sign up process to come back to complete the process
- day-2 engagement: notify users to come back and follow more people or invite friends on day 2
- day-21 inactive: notify users who have not logged into the app for 21 days to come back and check out what's new

A/B testing

For the scheduled notifications, we want to test out different messages/layouts to optimise their effectiveness over time. To do that, we wanted to support A/B testing as part of the new system (which MixPanel already supports).

We should be able to create multiple variants (each with a percentage), along with a control group who will not receive any push notifications.

A/B test groups are configured in the Lambda function, which can be easily and quickly changed and redeployed, and the changes are source controlled and peer reviewed.

Oversight vs Frictionless

For the ad-hoc notifications, we don't want to get in the way of the marketing team doing their job, so the process for creating ad-hoc push notifications should be as frictionless as possible. However, we also don't want the marketing team to operate completely without oversight and run the risk of long term damage by spamming users with unwanted push notifications (which might cause users to disable notifications or even rage quit the app).

The compromise we reached was an automated approval process whereby:
- the marketing team will work with BI on a query to identify users (e.g.
followers of Tinie Tempah)
- fill in a request form, which informs designated approvers via email
- approvers can send themselves a test push notification to see how it will be formatted on both Android and iOS
- approvers can approve or reject the request
- once approved, the request will be executed

A simple form to request push notifications to be sent to users identified by the query

Once your request has been submitted, the requester and the approvers will both receive an email detailing the no. of users selected by the query, the JSON payload, etc. In the email to the approvers, there are also buttons to approve/reject the request, as well as to send the proposed push notification to the approvers so they can see how the message would appear on both Android and iOS devices.

Implementation

We decided to use S3 as the source for the send-batch-notifications function because it allows us to pass a large list of users (remember, the goal is to support sending push notifications to millions of users in a batch) without having to worry about pagination or limits on payload size.

The function will work with any JSON file in the right format, and that JSON file can be generated in many ways:
- by the cron jobs that generate scheduled notifications
- by the approval system after an ad-hoc push notification is approved
- by the approval system to send a test push notification to the approvers (to visually inspect how the message will appear on both Android and iOS devices)
- by members of the engineering team when manual interventions are required

We also considered moving the device registrations to SNS but decided against it because it doesn't provide an abstraction useful enough to justify the effort to migrate (which involves client work) and the additional cost for sending push notifications. Instead, we used node-gcm and apn to communicate with GCM and APN directly.
Recursive Functions FTW

Lambda has a hard limit of 5 mins execution time (it might be softened in the near future), and that might not be enough time to send millions of push notifications.

Our approach to long-running tasks like this is to write the Lambda function as a recursive function.

A naive recursive function would process the payload in fixed size batches and recurse at the end of each batch whilst passing along a token/position value to allow the next invocation to continue from where it left off.

In this particular case, we have additional considerations because the total number of work items can be very large:
- minimising the no. of recursions required, which equates to the no. of Invoke requests to Lambda and carries a cost implication at scale
- caching the content of the JSON file to improve performance (by avoiding loading and parsing a large JSON file more than once) and reduce S3 cost

To minimise the no. of recursions, our function would:
- process the list of users in small batches of 500
- at the end of each batch, call context.getRemainingTimeInMillis() to check how much time is left in this invocation
- if there is more than 1 min left in the invocation then process another batch; otherwise recurse

When caching the content of the JSON file from S3, we also need to compare the ETag to ensure that the content of the file hasn't changed.

With this setup the system was able to easily handle JSON files with more than 1 million users during our load test (sorry Apple and Google for sending all those fake device tokens :-P).

Like what you're reading but want more help? I'm happy to offer my services as an independent consultant and help you with your serverless project: architecture reviews, code reviews, building proof-of-concepts, or advice on leading practices and tools.

I'm based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale.
I operate predominantly in the UK but I'm open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run an in-house workshop to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:
- authentication & authorization with API Gateway & Cognito
- testing & running functions locally
- CI/CD
- log aggregation
- monitoring best practices
- distributed tracing with X-Ray
- tracking correlation IDs
- performance & cost optimization
- error handling
- config management
- canary deployment
- VPC
- security
- leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we're in Manning's Early Access Program (MEAP).