Real World Architecture in the Cloud Using Event-Driven Techniques to Build a PDF Rendering…

Jon Christensen and Chris Hickman of Kelsus discuss an example of a real-world architecture in the Cloud using event-driven techniques to add a new feature, such as building a PDF-rendering pipeline for an existing application.

Some of the highlights of the show include:

Form/Document/PDF: Common request when building applications for enterprises to automate a business process previously done manually
Online forms-based application can perform regular quality checks at various locations to generate evidence that work is being scheduled and performed properly
Forms are submitted into system; application lets others view, share, and forward results of quality checks via a PDF version, rather than having to log in to system
Making a PDF is not easy-peasy because it’s application-specific and generates a rendering problem; it’s CPU expensive and time consuming, not instantaneous
Failure is possible due to complexity and number of moving pieces involved; challenging to incorporate it into existing application with large code base that works fine
Loose Coupling: Keep it separate from everything else and use events; free wired-in facilities are available in AWS
Triggering an event doesn’t take a lot of work; use pub-sub design pattern to publish event and subscribe to list
New feature is separate application that only listens to messages by using Amazon SNS; create Amazon SQS to subscribe to specific SNS event
What happens when something goes wrong? Takes longer? Runs out of memory? Additional application code needs to be in Lambda, or it’s lost forever
Perfect Solution: A queue because there’s an item of work that needs to happen, handle failures and retry, and have more than one worker/listener that can do the work
Things that go immediately don’t offer observability; subscribe to SNS topic with an SQS for bulk of code to poll queue and look for messages
Message retrieved from queue is hidden from others for a certain amount of time; visibility timeout should be long enough to process it and prevent it from being rebuilt
Make process more extensible by admitting different background tasks; this architecture deals well with tasks that are lengthy, complicated, and need to be done asynchronously
Amazon feature launched last year offers ability for Lambda functions to subscribe to SQS to get rid of scaffolding code; yet making PDFs still doesn’t work very well
Make PDF: Pull up message to get submitted data, which needs to be in HTML format

Links and Resources

Kelsus
Secret Stache Media
CheckWise
AWS
Amazon S3
Amazon Simple Notification Service (SNS)
Amazon Simple Queue Service (SQS)
AWS Lambda
AWS ECS
PhantomJS
Puppeteer: Headless Chrome Node API
Docker

Transcript

Rich: In episode 52 of Mobycast, we discuss a real world example of using event-driven techniques to add a new feature to an existing application. Welcome to Mobycast, a weekly conversation about cloud native developments, AWS, and building distributed systems. Let’s jump right in. Jon: Welcome, Chris, to another episode of Mobycast. Chris: Hey, Jon. Good to be back. Jon: Good to have you and sadly, missing Rich. I will admit that we’re missing Rich because we were supposed to record this earlier today when he was available. But here in our local ski mountains, we got about almost 20 inches of snow last night and we had to postpone Mobycast for yours truly to get a little skiing done. It was mandatory. Chris: Opportunity calls. Jon: Yes, it does. That’s why I live here. 
What have you been up to this week, Chris? Chris: You know, you may have got 20 inches of snow but you’re up in the mountains of Colorado, right, that’s supposed to happen. We got more snow here in Seattle. We got another inch of snow yesterday which is fine. It’s not like the foot-and-a-half that we got a few weeks back, but still, come on, it’s March. It’s March in Seattle and for us to get — just stop. I’m ready for Spring. Jon: Yeah, [inaudible 00:01:14]. Chris: Yes, I do. I’m sick of the snow. I’m sick of the cold. I’m ready for Spring. Even went out to Phoenix and weather was cold out there, so very much looking forward to — come on, come on Spring. Jon: Right, right. Given that it’s still cold and wintery, maybe by the time you’re listening to this or wherever you are listening to this, for us here, cold wintery means let’s get technical. Instead of talking big ideas and tell people stuff, we’re going to talk about some technical stuff today. We’re going to revisit something that we’ve talked about generally before. We’re going to talk specifically about a real-world architecture in the cloud using event-driven techniques to build a PDF rendering pipeline. This is a super common thing. It’s just such a common request if you’re building applications for enterprises especially if you’re doing something that automates a business process that was done more manually before. It’s just so common for people to say, “Hey, can we also get this form or this document,” or whatever is the outcome of the business process that you got that you’re creating a web application around. They say, “Can we get this as a PDF and can we put it in this bucket because that will solve our compliance issues,” or whatever it is that they have. Maybe Chris you can jump in and talk a little bit and talk about the existing application and the requirements and situation for building this architecture. 
Chris: Part of this was we’ve built a forms-based application for a business where they need to perform regular quality checks at the various locations that they manage. They’re a services company, thousands and thousands of locations across the US. Performing this service, they need to have audit quality and make sure that the work is being done properly, it’s being done as scheduled, and have evidence that that was done, so have pictures taken, descriptions of the work and what not. We’ve built this web-based application that’s responsive because it can be done using a laptop or a phone or a tablet. It’s form-based, the folks doing these quality inspections can fill out that form, take pictures with the camera, attach that to the particular quality check, and then submit that. It goes into the system and of course, the application allows folks to view these submissions, and the managers can view the submissions that have been happened and query across the various locations, and query on results and what not and see that. That was the existing application that we had. They’re popular, very core to the business and what it was doing. The next step was like, “Well, now this is getting traction, it’s very useful.” They kind of have some new requirements for, “Well, we want it to be easier to share the results of these quality checks and be able to easily forward it to a different office or a different manager and not necessarily have them log in to the system to do it, use a password or teach them how to use the software and what not.” The idea for these submissions that are in this application that are web-based, let’s create a PDF version of this. Now that PDF is something that’s atomic that can be shared. You can email to someone or you can go put it on a file so you can throw it onto Slack or wherever it may be, but a really easy way to do this and it doesn’t require anyone to log in or really know anything about it. Anyone can view a PDF. 
That was the feature that we wanted to add to it. Jon: And they can print it out and do all kinds of good stuff with it. Chris: Absolutely, yes. Jon: Okay. Making PDFs is not a problem. It’s easy-peasy. Chris: Yeah. Just tell them, “Install Adobe Acrobat or Microsoft Office and just save as PDF, print as PDF.” It’s not a trivial thing to do to say, “We have these submissions.” Essentially, it’s data and then it has templating around it to do the presentation of it, to format it, and you can do it online, but in the application itself, how do you actually make a PDF of that? That was the crux of the problem. A few things come into play there. One is that making a PDF is pretty time consuming and pretty CPU expensive. Even if you just have a document and you go to print it and save as PDF, even that takes a little bit of time. It’s not instantaneous. There is process and time that’s involved. It’s not an instant thing that happens. Jon: I have a little commentary to this because it just blows my mind that this is still a problem. Because in 2011 when I was running a company called CheckWise, we were also building checklists that people filled out and they kept temperature logs and guess what, same exact requirement, “Let’s build a PDF from these completed checklists so that we can store them as paper,” because we were going away from the paper process. It was this same problem. I don’t think it got any easier in 10 years, okay, six years. Six or seven years, it’s still the same. It just bums me out. It’s like these low-level things, “Oh, convert this data to that data that everybody uses and does all the time,” are the things that kill me because of just what you said. I made a joke about it being easy-peasy, and you said, “Oh, yeah. Of course. All you have to do is right click and save as PDF.” Everybody in the world being used to that, especially people that pay for software to get built, that’s what they’re used to. 
And then when they hear, “Oh we have to build a bigger architecture for this,” they’re like, “What? No, you don’t. Just right click and save as PDF. It’s easy.” And it just kills me that it’s not. Okay, continue. Chris: It’s one of those things that is so application-specific. It’s a rendering problem. Jon: It should just be PDF has a really clear JSON-like structure and we’re just going to switch from this JSON structure to that JSON structure. No, it’s not that easy. Chris: We can talk forever around this. In a previous lifetime, I had, again, built a PDF rendering pipeline. I actually had to use the PDF language. There’s actually a PDF spec and if you print out the PDF spec document, it’s 500, 600 pages long. It’s really almost a programming language. Jon: It takes 45 minutes to save as PDF. Chris: It does. We digress. But yeah, the important part here is that building a PDF is not a simple task. Not only is it time-consuming, it’s CPU expensive, it’s also complicated. The code for doing this, it’s not just these three lines of code. It’s significant. Jon: The other important thing about it is that it could be a little brittle. It could fail. Chris: Absolutely, yeah. Whenever something gets complicated, as the number of moving pieces goes up, the odds of failure increase as well. These are all considerations in figuring out how we’re going to do this. Again, keep in mind that we already have an existing application with rather a large code base that’s working just fine. How do we go ahead and incorporate this? This is where our favorite friend, loose coupling, comes into play. One of the great ways of achieving it is by keeping this separate from everything else and using things like events. That was definitely the strategy here: how can we make sure that we’re building this in a way that takes into account that it’s going to be CPU expensive, it’s going to be time-consuming, so synchronous processing is not going to work. 
We can’t do something like whenever someone clicks on the submit button to complete one of these quality checks, they have to sit there and wait for a minute or two for some PDF that’s being created, and while it’s doing it, it’s maybe impacting other users of the system. That’s not an option. And then again, as we’ve already talked about, trying to add all this code for PDF rendering into the existing code base doesn’t feel right either. We need to decouple this and so let’s use events to do it. It’s the perfect example of using events to do that, so that’s what we did. Jon: Usually, when I think of using an event, my first thought is, “Inside of AWS, is there some natural thing that just has an event already that I can take advantage of?” Did you do something like that, or did you have to write some code to make the event happen? Chris: There are facilities in AWS where you can get that for free, where it’s already wired in. Jon: Right. That would be like an S3 bucket: when an object gets written to it, it can generate an event automatically. Chris: Exactly. If these submissions that we’re creating were actually files that we can dump on S3 then that would be an option perhaps, but that’s not the case. These submissions, it’s all code. It’s actually going into multiple backend stores, some NoSQL, some relational, so there wasn’t a free way of triggering an event there. But the good news is that it’s not a lot of work to actually trigger that event. We’re going to be using the pub-sub design pattern here. We’re going to publish an event and then we’re going to have something that subscribes to that list and listens to that and does something with that. In AWS land, we can use SNS as the way of publishing these things, of triggering these events. 
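
As a rough illustration of the publish step Chris introduces here, the following is a minimal Python sketch, not the code Kelsus actually wrote. It assumes a client that behaves like `boto3.client("sns")`. The function names, the `submission.created` event name, and the `submissionId` field are all hypothetical. The client is passed in so the main application only needs one extra call after it has saved the submission.

```python
import json


def build_submission_created_message(submission_id: str) -> str:
    """Serialize the event payload (hypothetical shape).

    The message carries only an identifier; subscribers fetch the
    submission themselves if they need the full data.
    """
    return json.dumps({"event": "submission.created", "submissionId": submission_id})


def publish_submission_created(sns_client, topic_arn: str, submission_id: str) -> str:
    """Publish the event to an SNS topic and return the SNS message ID.

    `sns_client` is assumed to behave like boto3.client("sns"), whose
    publish() call accepts TopicArn and Message keyword arguments.
    """
    response = sns_client.publish(
        TopicArn=topic_arn,
        Message=build_submission_created_message(submission_id),
    )
    return response["MessageId"]
```

In the main application this would be called once, after validation and database writes succeed and before success is returned to the caller.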
It becomes pretty straightforward. In our primary application, we just have to make some code changes so that, as we receive these submissions for these quality checks, after we’ve done the validation that’s required and created them in our databases and what not, and before we return success to the caller, we make a call to SNS to say, “Just publish this event.” We’re going to send the message to SNS, basically just a message that says, “Hey, the submission was just created.” That’s all the main application has to do. There’s really little to nothing else that it has to do as part of this. Its job is done and the team that’s working on the main application can continue on their way and they don’t need to do anything else for adding this new feature. The new feature can now be completely isolated from that. It’d be very modular, decoupled, and we can build a separate application for doing that. All that separate application needs to do now is listen to these messages. To facilitate that, we can take advantage of SNS fan-out. By publishing to an SNS topic a message that, “Hey, this submission was created,” anyone that’s interested in that kind of event can subscribe to that topic, and there are multiple ways that you can subscribe to that topic with different types of listeners, but anyone that’s interested can do so. We set it up such that we have an SQS queue that subscribes to that particular SNS topic such that whenever a message is published to that topic, it then ends up creating a message in SQS on that queue. Jon: The idea there is that you want all those PDFs that need to get built, you want that work that’s required to sit in a queue so that if the thing that builds the PDF is down or unavailable or if it gets overloaded, that you won’t lose any PDFs that are [inaudible 00:16:10]. If you were just to go call some function, it’s like, “Alright, I’m building my PDFs right now. 
I’ve got the message, I’m building it,” then you could run into some trouble. Having a queue there, that kind of saves up all the work you’re supposed to do, makes it possible to have a little bit of failure in your system. Right? Chris: Yeah. This is definitely one of the things that it was programmed for. Another listener that you could set up for this SNS topic would be a Lambda function. You could say, “Oh, I’ll have a Lambda function that creates a PDF.” But yeah, you would have this issue of, “Okay, what happens when there is a failure? What happens if it takes longer than the standard Lambda timeout? What if I run out of memory in the Lambda function?” All the other various things can go wrong. That then ends up being lost forever unless you do additional application code in your Lambda to increase [inaudible 00:17:14]. Jon: Right. It ends up being like, “Ha, I kind of made a queue after all.” Chris: Yeah. You would actually have to do something like that. You need some way of persistence to do the retry. A queue here makes perfect sense because you have an item of work, you need it to happen, you want to be able to handle failures and do retry, you want to have more than one worker or listener that can do the work on this. Jon: Also, there’s just a really important point about observability. It’s so easy to go check a queue and be like, “Oh my goodness. Look at that thing. It’s got 100,000 PDFs it’s supposed to have built and it’s not done it. The queue is full, what’s wrong?” Whereas if things were just going immediately, you don’t have that observability, so that’s another piece of this. Chris: Absolutely. Although if you get that deep in your queue that’s when you declare SQS bankruptcy. Jon: Yeah. I used 100,000 because I wanted it to sound really like we’re operating in a huge scale. Chris: Yes. We subscribe to the SNS topic with an SQS queue and then the bulk of the code becomes this worker process. 
Its job is polling that queue, looking for messages. And whenever a message is available, it pulls it off the queue. It does its processing, so it’s going to unpack that message, look at the payload, understand like, “Okay, what submission is this?” It has its application logic for going and fetching that submission from the data stores, making the HTML version of it, and then doing the conversion to PDF, and then storing the result of that. When it’s done, then it can now delete that message from the queue. Jon: Yeah, that was the part I was just about to ask you about. I haven’t worked with SQS, honestly, but I’ve worked with other queues in the past. Just that ability to warn off other listeners like, “Hey, I’m working on this thing. Do not grab this one because I have got it.” And then, “Okay, I am done with it. Let’s get it out of the queue.” Those are fairly easy to work with features of SQS. Chris: Those are decisions that you’ll be faced with, I mean, once you wire up to it. Once a message is retrieved from the queue, it’s now hidden from other folks that are polling that queue for up to a certain amount of time. This is the visibility timeout because you need to handle the case for what happens if that worker dies? Jon: Right. You can’t depend on it to tell you that it’s done. You have to kind of have a way of undoing the fact that it got pulled off the queue. Chris: That’s why you have to really take that visibility timeout into account for how long it takes to process this. If you say your visibility timeout is 60 seconds but it takes you two minutes to process it, that’s going to be a problem because it now means it’s just going to keep being rebuilt by your workers. Jon: Not a problem for Adobe. They do want that additional PDF in the world. Chris: Sure. Or just heating up the universe. Of course, AWS loves it too, right. Jon: Yeah. Chris: “Go spend all the CPU cycles.” Jon: But that makes sense. 
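
The worker loop described above (receive, process, delete only on success) can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the worker from the episode. It assumes a client that behaves like `boto3.client("sqs")`, that the queue receives the standard SNS JSON envelope (raw message delivery turned off), and that the payload has a hypothetical `submissionId` field. The `render_pdf` callable and the 300-second default are placeholders.

```python
import json


def extract_submission_id(sqs_body: str) -> str:
    """Unwrap the SNS envelope that an SQS subscription receives.

    Without raw message delivery, the SQS body is a JSON document whose
    "Message" field holds the string originally published to the topic.
    """
    envelope = json.loads(sqs_body)
    payload = json.loads(envelope["Message"])
    return payload["submissionId"]  # hypothetical field name


def poll_once(sqs_client, queue_url: str, render_pdf, visibility_timeout: int = 300) -> int:
    """Receive one batch, render each submission, and delete on success.

    Returns the number of messages processed successfully. The visibility
    timeout should comfortably exceed the time one render takes; otherwise
    the message reappears and another worker rebuilds the same PDF.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling instead of a tight loop
        VisibilityTimeout=visibility_timeout,
    )
    processed = 0
    for message in response.get("Messages", []):
        try:
            render_pdf(extract_submission_id(message["Body"]))
        except Exception:
            # Do not delete: the message becomes visible again after the
            # timeout and will be retried by this or another worker.
            continue
        sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed
```

A real worker would call `poll_once` in a loop and would usually pair the queue with a dead-letter queue so a message that always fails does not retry forever; that part is left out here.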
With a visibility timeout, basically the thing is invisible for a certain amount of time and it becomes visible again if you haven’t deleted it before that timeout. Chris: Yes. It’s up to the code to say, “Yup, I’m done. I’ve finished this. Go ahead and delete it.” Jon: Cool. Chris: Some interesting things to point out with an architecture like this: one, again, we could’ve gone straight from our application publishing to an SQS queue instead of SNS, but doing so limits the extensibility. Publishing to SNS allows us to extend this even further. We might have something where maybe we now have a requirement that whenever a submission is generated for a quality check, if it’s a failing quality check, it needs to create a ticket in some other enterprise application that’s used for managing urgent issues. This is really easy now: because it’s an SNS topic and a message is being published to it, we can add another listener to it. Now we could do something like maybe it is a Lambda function. Its job is to add itself as a subscriber to that topic, look at the submissions, see if it’s failing or not. If it is failing, then it can be the one that’s responsible for doing the API call to some other system and registering this as an urgent issue that has to be fixed. It’s really easy to extend. None of the PDF code has to change, none of the application has to change. It’s just using these events. By using something like SNS and leveraging SNS fan-out, whoever’s interested in learning about that can do so. That’s one thing to point out here. Another one is that with the workers, that process, we can make that more extensible as well by admitting all sorts of different background tasks. This kind of architecture is really good for dealing with tasks that are, again, lengthy, complicated, and need to be done asynchronously. 
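
To make the fan-out extension concrete, here is a hedged Python sketch of a second subscriber: a Lambda function attached to the same SNS topic that opens a ticket for failing quality checks. The event shape (`Records[].Sns.Message`) is what Lambda receives from an SNS trigger. The `passed` flag and `submissionId` field are assumptions about the payload, and `create_ticket` stands in for the API call to the ticketing system. In the setup Chris describes, the function might instead have to fetch the submission to find out whether it failed.

```python
import json


def failing_submission_ids(event: dict) -> list:
    """Return the IDs of failing submissions in an SNS-triggered Lambda event."""
    failing = []
    for record in event.get("Records", []):
        payload = json.loads(record["Sns"]["Message"])
        # "passed" is a hypothetical field; only an explicit False counts.
        if payload.get("passed") is False:
            failing.append(payload["submissionId"])
    return failing


def handler(event, context, create_ticket=print):
    """Lambda entry point; `create_ticket` is a placeholder for the real API call."""
    ids = failing_submission_ids(event)
    for submission_id in ids:
        create_ticket(submission_id)
    return {"ticketsCreated": len(ids)}
```

Neither the publisher nor the PDF worker changes when this subscriber is added, which is the point of publishing to a topic rather than directly to one queue.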
But basically, it’s a framework where it’s essentially a pool of workers that are listening to a queue, pulling messages off, understanding what they are, and then doing the work on them and marking them as being complete. You could have another requirement here for building complicated data reports, CSV format, that maybe take minutes to generate or something like that. That could be another message type that gets admitted into this as well. We can extend our worker to tag these messages for what kind of operation they should be. Is this generate PDF or is this generate CSV report? And create this pluggable model in your worker code so that you’re not redoing all the scaffolding for each different type of these asynchronous tasks you want to do. Instead, you can have it be very modular and you just write a module that you can then plug in to that worker. That’s kind of an architectural pattern that we’ve adopted here. One of the things to touch on a little bit is, for the longest time, you couldn’t wire SQS up directly to Lambda. If you wanted to process SQS messages, you had to go write code to do that. You had to build basically an application and figure out how to run it and deploy it and all that kind of stuff. A big new feature that Amazon launched last year was the ability for Lambda functions to subscribe to SQS queues which, for certain things, is really great. You can get rid of all of that scaffolding code and instead just write the core logic. In certain situations, that model’s going to work really well for a lot of people. For us, for doing PDFs, it doesn’t work so well. To try to build a PDF inside of a Lambda function, it’s so complicated, there’s so much code there, there’s so much that could go wrong, and it’s so lengthy that it just doesn’t make sense. In this particular case, having the scaffolding, the actual dedicated worker for it, for us, it makes sense. 
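
The pluggable worker model mentioned above, where each message is tagged with the kind of operation it needs, might look something like this in Python. It is a sketch of the general pattern only: the `taskType` and `payload` field names, the registry, and both handlers are hypothetical, and the handler bodies are placeholders for real rendering or report code.

```python
import json
from typing import Callable, Dict

# Registry mapping a task type to the module-level function that handles it.
HANDLERS: Dict[str, Callable[[dict], str]] = {}


def task(task_type: str):
    """Decorator that plugs a handler into the worker for one task type."""
    def register(func):
        HANDLERS[task_type] = func
        return func
    return register


@task("pdf")
def generate_pdf(payload: dict) -> str:
    return f"pdf for {payload['submissionId']}"  # placeholder for real rendering


@task("csv_report")
def generate_csv_report(payload: dict) -> str:
    return f"csv report {payload['reportId']}"  # placeholder for real report code


def dispatch(message_body: str) -> str:
    """Route one queue message to its handler based on the task type tag."""
    message = json.loads(message_body)
    handler = HANDLERS.get(message["taskType"])
    if handler is None:
        raise ValueError(f"no handler registered for {message['taskType']!r}")
    return handler(message["payload"])
```

The polling, retry, and delete scaffolding stays in one place; adding a new background task means writing one more decorated function.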
But just something to keep in mind as folks think about this, there are options there. You just need to look at your particular situation and what makes more sense for you. Jon: Right on, yeah, that makes sense. We have to wrap up but I know there’s probably a couple of listeners that are wondering, “Wait, you said you’re going to make some PDFs,” so they’re probably wondering, “What did you actually use to make those PDFs, where’d you put them, and what happens when they’re made?” Chris: Right. Those workers, when they pull up these messages and say, “Hey, I need to generate a PDF,” one of the first things they have to do is go get that submission, that data, and it needs to be in basically HTML format. One of the easiest ways to make a PDF is to go from HTML to PDF. The first iteration that we did this with used a library tool called PhantomJS. That worked reasonably well, but there were some performance issues with it; it was pretty heavy, pretty resource-intensive. Recently, in the past year, we switched over to Headless Chrome, basically using Puppeteer as the frontend for driving Headless Chrome. Much more efficient, we can make PDFs much quicker now. These worker nodes are running inside Docker containers on top of ECS in AWS. In order for Headless Chrome to work correctly with that, we had some issues. Specifically, Headless Chrome needs scratch space to build these PDF files and it wants to use shared memory, and ECS did not allow for setting a size for that and bumping it past the default size. It turned out we originally did want to use Headless Chrome, but we couldn’t because of that. Then with some lobbying and some other folks out there having the same kind of issues, Amazon updated ECS so that you could set that. Once that setting became available to increase the size of the shared memory then we were off to the races. 
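
For readers wondering what the shared memory setting looks like, the sketch below builds an ECS container definition as a Python dict. It assumes the setting Chris refers to is the `sharedMemorySize` field under `linuxParameters` in an ECS container definition, which sets the size of `/dev/shm` in MiB. The container name, memory figure, and function name are illustrative, not taken from the episode.

```python
def worker_container_definition(image: str, shared_memory_mib: int) -> dict:
    """Build an ECS container definition with a larger /dev/shm for headless Chrome.

    Assumption: the relevant ECS setting is linuxParameters.sharedMemorySize
    (value in MiB). All other values here are illustrative placeholders.
    """
    if shared_memory_mib <= 0:
        raise ValueError("shared_memory_mib must be positive")
    return {
        "name": "pdf-worker",  # hypothetical container name
        "image": image,
        "essential": True,
        "memory": 2048,  # MiB; illustrative, size this for your own workload
        "linuxParameters": {"sharedMemorySize": shared_memory_mib},
    }
```

The resulting dict could be placed in the `containerDefinitions` list of a task definition, for example when registering it through the AWS SDK.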
Now we can build those PDFs efficiently and quickly and then when we’re done, of course, we throw them into an S3 bucket, and now they’re available for downloading or for streaming back. Jon: Very cool. This was pretty cool because it was a little bit of a premise of our previous event-driven architecture episode, but it just felt so much more real diving into the specifics of this particular problem that we solved with event-driven architecture. Thank you for laying that out for us, Chris. Chris: Anytime. It’s been fun. Jon: Alright, talk to you next week. Chris: Alright. Thanks! See yah. Rich: Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode along with show notes and other valuable resources is available at mobycast.fm/52. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.