There are many articles out there that cover standard optimization techniques like SSR, PWA, tree shaking, and similar. This article is similar, but also different, as it focuses on things unique to serverless environments, in this case AWS Lambda specifically.
My name is Sven, and I’m one of the founders of Webiny. Don’t know Webiny? — It’s a serverless CMS powered by React, GraphQL, and Node.
Prep work
To get our measurements and help us identify the problem(s), we will be using webpagetest.org to run our requests, gather timing data, and get a better understanding of what a user is seeing, and experiencing, on their end.
What we will be looking at is the “first view”, meaning how the load times look for a user who has never visited the page before. This is important, as the browser cache can hide many of the bottlenecks.
From the chart at the top of the page, the most meaningful metric for us was the “Time to Start Render”. If you look closely, you’ll see that it took almost 2 seconds just to start rendering the page 😱. This is due to the nature of a single page app (SPA). Basically, you first need to download a massive JS bundle (1), which the main thread then needs to process (2), before the page can display any content.
(1) Download JS bundle. (2) Wait until the main thread processes the JS bundle.
However, that’s just part of the story. Once the main thread processes the JavaScript bundle, it fires off several API calls to the API Gateway. At this stage, the user sees the notorious spinner and no actual content. Here is a filmstrip of the visual events:
What we see here is a poor user experience. The user sees an empty page for about 2 seconds, and then a spinner for another second. That additional second is caused by the API requests that follow once the JS bundle is loaded; they are needed to retrieve and render the content.
If this were a normal VPS, the time cost of these API calls would be mostly predictable, but when dealing with serverless, you have the infamous “cold start”. To make matters even worse, in the case of the Webiny Cloud Platform, the Lambdas are part of a VPC, which means an ENI has to be initiated for each new Lambda instance, and that increases the cold start time drastically.
Here are some timing metrics for booting a Lambda inside a VPC and outside a VPC:
Image is taken from https://www.freecodecamp.org/news/lambda-vpc-cold-starts-a-latency-killer-5408323278dd/
Conclusion: there is a 10x increase in the cold start time when a Lambda is inside a VPC (ouch! 🤕).
There is also another cost bundled into the API timings: latency. Since it’s my browser (the client) that’s executing the API requests, each request needs to travel from my computer, over the Internet, all the way to the origin, and back. That round trip is repeated for each API request.
Based on the above, we identified a few challenges that we need to tackle: the big JS bundle that blocks rendering until it is downloaded and processed, the API requests and their round-trip latency, and the Lambda cold starts made drastically worse by the VPC.
We could have chipped away at the bundle size, or at the API timings, but we actually opted for a third option. What if we don’t need the API requests at all, and what if we don’t need the JS bundle at all? This would eliminate all our pain points.
Our first idea was to generate a static HTML snapshot of the rendered page and serve that to the user.
Webiny Cloud, the serverless infrastructure based on AWS Lambda that hosts Webiny websites, already had the ability to detect bots. When a bot was detected, instead of serving the JS version of the page, the request was rerouted to a Puppeteer instance, which rendered the page using headless Chrome and served the resulting HTML back to the bot. The main reason for that was SEO, as many bots don’t read JavaScript. So we had the idea of using that same functionality to serve the same output to regular users.
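Here is a minimal sketch of that prerendering step, assuming a Node environment with the puppeteer package; the function name and options are illustrative, not Webiny’s actual implementation:

```typescript
import puppeteer from "puppeteer";

// Render a URL with headless Chrome and return the resulting HTML.
async function renderPage(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until the SPA has finished its network activity,
    // so the API-driven content is present in the DOM.
    await page.goto(url, { waitUntil: "networkidle0" });
    return await page.content(); // the fully serialized HTML
  } finally {
    await browser.close();
  }
}
```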
This usually works well for a non-JS environment, but when you try serving pre-rendered content to a client with a real JS-enabled browser, the page renders, but then, once the JS files load, the React components don’t know where to mount. This causes a big pile of errors in the console, so this solution wasn’t really helpful in our case.
The benefit of SSR is that all the API requests stay within the local network. Since they are handled by a machine, or a function, running inside the VPC, there is no client-to-origin latency when communicating with a Lambda, unless there is a cold start in question.
An additional benefit is that we get back an HTML snapshot that React components know how to mount onto once the JS files are loaded.
And finally, we don’t need that big JS bundle and the API calls to display the page. The bundle can be loaded asynchronously, and it won’t block the main thread.
Basically, SSR solves most of our challenges…well, kind of.
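To make that concrete, here is a minimal sketch of an SSR server, assuming Express and react-dom/server; the App component, the fetchPageData loader, and the window.__APP_STATE__ key are illustrative, not Webiny’s actual code:

```typescript
import express from "express";
import React from "react";
import { renderToString } from "react-dom/server";
import { App } from "./App"; // hypothetical root component

// Hypothetical data loader; in reality this would call the GraphQL API.
async function fetchPageData(path: string): Promise<object> {
  return { path };
}

const server = express();

server.get("*", async (req, res) => {
  // The API calls happen here, on the server, close to the Lambdas,
  // so the client never pays the round-trip latency for them.
  const data = await fetchPageData(req.path);

  // Render the app to an HTML string with the data already applied.
  const html = renderToString(React.createElement(App, { data }));

  // Inject the data so the client can hydrate without repeating the
  // API calls, and defer the bundle so it doesn't block rendering.
  // (A real implementation would escape "<" in the serialized state.)
  res.send(`<!DOCTYPE html>
<html>
  <head><script defer src="/bundle.js"></script></head>
  <body>
    <div id="root">${html}</div>
    <script>window.__APP_STATE__ = ${JSON.stringify(data)};</script>
  </body>
</html>`);
});

server.listen(3000);
```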
At this point, here is where we got to:
No more API calls, and we actually get to see our page before that big JS bundle is downloaded. But if you look closely at the first request, you’ll see it took almost 2 seconds for the server to return a response for the document. Let’s take a deeper look into this.
What’s happening here is that we are starting a Node server, doing the SSR with all the API requests and JS processing, and then returning the final output. The problem is that, on average, this takes around 1-2s.
Our SSR server needed to do all that work before returning the first byte of the response to the user, which causes a very long wait for the first byte. It’s almost the same amount of work as before; it just happens on the server side, inside the SSR workflow, instead of on the client side.
Wait. You said server? Isn’t this supposed to be serverless? We did try doing SSR in a Lambda function, but it turned out to be a very intensive process (you need to drastically increase the allocated memory size to get more CPU resources), plus there are the cold starts we talked about before... So for now, the ideal setup is using a Node server to download the site’s SSR bundle and render it.
Back to the SSR results: looking at the film strip, the timing is not much different from what we had when we were doing client-side rendering.
Blank screen for 2.5 seconds 😡
Although it might not look like we’ve made any progress, we actually have! We got an HTML snapshot with all the content, which is ready to be hydrated by React, and there is no need for any API calls, since all the required data is already injected into the HTML.
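On the client side, hydration then looks roughly like this (again a sketch; window.__APP_STATE__ matches the key injected by the hypothetical server above):

```typescript
import React from "react";
import ReactDOM from "react-dom";
import { App } from "./App"; // the same root component the server rendered

// Reuse the server-injected state instead of firing API calls on load.
const data = (window as any).__APP_STATE__;

// hydrate() attaches React to the existing server-rendered markup
// instead of rebuilding the DOM, so the snapshot stays intact.
ReactDOM.hydrate(
  React.createElement(App, { data }),
  document.getElementById("root")
);
```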
The only problem is that generating this HTML snapshot takes too long. At this point, we could either invest more time into optimizing SSR, or simply cache the output and serve the snapshot from something like a Redis cache, which is exactly what we did.
Once a user visits a Webiny website, we first check a centralized Redis cache for an existing HTML snapshot and, if there is one, serve it straight from the cache. On average, this brought the “time to first byte” down to a range of 200–400ms. This is where we actually started seeing massive improvements in speed.
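The hot path then becomes a simple cache lookup. A minimal sketch, assuming the ioredis client and a hypothetical renderWithSsr helper that runs the SSR described above:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Hypothetical: runs the SSR flow and returns the HTML snapshot.
declare function renderWithSsr(url: string): Promise<string>;

// Serve the cached snapshot when we have one; fall back to SSR otherwise.
async function handleRequest(url: string): Promise<string> {
  const cached = await redis.get(`ssr:${url}`);
  if (cached) {
    return cached; // cache hit: no SSR work on the hot path
  }
  const html = await renderWithSsr(url); // cache miss: render once
  await redis.set(`ssr:${url}`, html);
  return html;
}
```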
Right from the first view, the user gets to see the page content in under a second.
Let’s have a deeper look at the waterfall chart now:
The red line shows the 800ms mark, which is when the page content is fully loaded. You can also see that the JS bundles were loaded around the 1.3s mark, but that didn’t affect the time at which the user got the content. The same applies to the API calls and the main thread processing; both are no longer required to display the page content.
Note that the timings of the JS bundle, API calls, and main thread still matter, as that is the time by which the page becomes “interactive”, but for crawlers and the user’s perception of “speed”, it doesn’t matter.
If this were a “dynamic” page, say one displaying the signed-in user in the header, the SSR would load a generic page, meaning one where the user is not signed in, and only after the page, JS bundle, and API calls are processed (time to interactive) would the header change to display the signed-in user.
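The pattern for that is to server-render the generic state and personalize after hydration. A hedged sketch; the Header component and the /api/me endpoint are illustrative:

```typescript
import React, { useEffect, useState } from "react";

// The SSR snapshot always contains the signed-out variant; the
// signed-in user appears only after hydration (time to interactive).
function Header() {
  const [user, setUser] = useState<{ name: string } | null>(null);

  useEffect(() => {
    // Runs only in the browser, never during renderToString.
    fetch("/api/me") // illustrative endpoint, not a real Webiny API
      .then((res) => (res.ok ? res.json() : null))
      .then(setUser)
      .catch(() => setUser(null));
  }, []);

  return React.createElement(
    "header",
    null,
    user ? `Hi, ${user.name}` : "Sign in"
  );
}
```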
A few weeks later we discovered that our proxy wasn’t closing the client connection at the right place when it triggered the SSR as a background process. A single-line fix brought the TTFB down to the 50–90ms mark and visual complete to ~600ms.
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
The cache invalidation part is definitely true. Either you have a very short TTL and refresh the cache often, introducing occasional long page load times, or you have a mechanism to invalidate the cache based on certain events.
In our case, to get around this problem, we introduced a short TTL of 30s, but we also added the option to serve a stale cache entry and, at the same time, refresh the content in the background. This way we offset any latency and cold-start issues that the Lambdas might introduce.
This works as follows: a user visits a Webiny website, we check the HTML cache, and if there is an existing snapshot, we serve it, even if that snapshot is several days old. We serve the old snapshot to the visitor within those few hundred milliseconds, and in parallel, we trigger a job to generate a new snapshot and replace the old cache entry. That job usually takes just a few seconds, as we also introduced a mechanism to always keep a bucket of pre-warmed Lambdas, so we don’t pay the big cold-start cost when generating new snapshots.
This way we always serve from the cache, and the content is refreshed on subsequent visits whenever the cache is older than 30 seconds.
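Here is a sketch of that serve-stale-and-refresh logic, assuming the same redis client and hypothetical renderWithSsr helper from the earlier sketch; the 30s window comes from our setup, everything else is illustrative:

```typescript
const TTL_SECONDS = 30; // freshness window

// Serve whatever snapshot exists; if it's older than the freshness
// window, regenerate it in the background (on a pre-warmed Lambda).
async function serveWithRevalidate(url: string): Promise<string> {
  const [cached, createdAt] = await Promise.all([
    redis.get(`ssr:${url}`),
    redis.get(`ssr:${url}:created`),
  ]);

  if (cached) {
    const age = Date.now() - Number(createdAt ?? 0);
    if (age > TTL_SECONDS * 1000) {
      // Fire and forget: the visitor already has the (stale) page
      // while a fresh snapshot is generated in parallel.
      void regenerateSnapshot(url);
    }
    return cached;
  }

  // Cold path: no snapshot yet, so render synchronously once.
  return regenerateSnapshot(url);
}

// Hypothetical: triggers the SSR job and writes the snapshot back.
async function regenerateSnapshot(url: string): Promise<string> {
  const html = await renderWithSsr(url);
  await Promise.all([
    redis.set(`ssr:${url}`, html),
    redis.set(`ssr:${url}:created`, String(Date.now())),
  ]);
  return html;
}
```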
This is definitely an area where we will introduce additional improvements. For example, we are looking at adding the option to automatically refresh the cache every time a user publishes a page. However, this is not a silver bullet.
For example, say you have a homepage that displays the last 3 blog posts. If you create and publish a new post, technically the cache will be regenerated for that new post only, but the homepage will still be stale.
We are still investigating the best approach, but so far the focus has been on sorting out the performance challenges. At this point, we believe we did quite a good job.
Our starting point was client-side rendering, where, on average, the visual complete metric was 3.3s. Now visual complete is at ~600ms. And what’s also important, there is no more spinner.
SSR is the key player here, but without proper caching, you are just moving the timing metrics from the client to the server, so the end time to “visually complete” won’t change much.
SSR has the additional benefit of offsetting the CPU bottleneck on older mobile devices. That is something we have not measured in this test, but our current implementation should get around that problem as well.
Overall, doing SSR is hard, and adding serverless on top makes it even harder. The solution requires code changes, additional infrastructure, and an intelligent caching mechanism, but the benefits are great, and most importantly, your users will appreciate them.
Let me know if you decide to give Webiny a spin. You can host your custom apps on the Webiny managed platform, and you’ll get all these amazing performance improvements out of the box. In case you have any questions or feedback to share, drop me a message on Twitter @SvenAlHamad, or use the comment form below.