Building Scalable E-commerce Infrastructure on Magento

by Robbie Thompson, November 20th, 2024

Too Long; Didn't Read

This article highlights the key challenges Ruroc faced and the solutions they implemented. It serves as a guide for developers to build scalable e-commerce infrastructure on Magento while avoiding similar pitfalls.

Introduction

At Ruroc, the process of building the infrastructure to handle over 12,000 concurrent users on Black Friday with no drop in performance took over 3 months and a lot of trial and error. This article summarises the key obstacles we ran into and how we fixed them. It can act as a guide to help other developers build scalable e-commerce infrastructure on Magento and avoid the pitfalls we encountered.


Despite Magento’s use across over 100,000 e-commerce sites, including large-scale stores like Helly Hansen and Ford, there’s not a lot of publicly available information on how to run it at scale. [1] Our hype-drop business model relied on our website being able to support tens of thousands of concurrent users, all arriving on the site simultaneously.


Magento eventually allowed us to do this, but it required a highly tuned infrastructure that was fit for purpose. Our users all had a high intent to convert and were trying to do so as fast as possible before products sold out. This demand profile, characterized by uncached calls to cart functions and a narrow time frame, causes what is known as the “thundering herd” problem, and makes it the most challenging type of traffic surge to scale for. [2]


Identifying the Problems

Load Testing

Before you start, you need to establish a baseline for what your infrastructure is capable of. The reason we started this process in the first place was that we noticed performance issues even with steadily increasing site traffic. We found CPU utilization on our web server was increasing exponentially, and the query response time for the database was also increasing in line with it, eventually causing transactions to lock.


In our case, we knew that this baseline was only a few hundred users, but we needed to be able to monitor our progress to make sure we were addressing the right issues.


We used JMeter to build out our testing suite since this was what we were familiar with, but there are plenty of newer tools (Gatling, k6, Locust) that do the same job.


The most common mistake in load testing is to test endpoints in isolation and then see how many requests each one can support. This isn’t how your users actually behave, so analyzing your infrastructure this way tells you little about its real capacity. [3]


In production, your users may land on one page, but they will all then take different journeys. X% of users may go straight to a product page, Y% may explore your categories, and Z% may drop off entirely. When building your testing scripts, your goal should be to build customer journeys that emulate traffic that behaves as close to your actual traffic as possible.
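As a rough illustration of a weighted journey mix, the sketch below assigns each simulated user to a journey in proportion to a traffic split. The journey names and percentages are hypothetical placeholders rather than our real figures, and the same idea can be expressed in whichever load-testing tool you use.

```python
import random
from collections import Counter

# Hypothetical journey mix; replace the weights with figures from your own analytics.
JOURNEYS = {
    "landing_to_product": 0.5,   # X% go straight to a product page
    "landing_to_category": 0.3,  # Y% explore categories
    "landing_then_exit": 0.2,    # Z% drop off entirely
}


def pick_journey(rng: random.Random) -> str:
    """Choose one journey for a simulated user, weighted by the traffic mix."""
    names = list(JOURNEYS)
    weights = [JOURNEYS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(users: int, seed: int = 0) -> Counter:
    """Assign a journey to each simulated user and count how many took each one."""
    rng = random.Random(seed)
    return Counter(pick_journey(rng) for _ in range(users))
```

Calling `simulate(10000)` gives a count per journey that should roughly follow the configured split, which is a quick sanity check before wiring the same weights into your test scripts.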


When you’re building out your test suite, you should be doing this on your staging infrastructure, not your local development environment. There will always be slight differences between production and your local environment, and you want to ensure that you’re building and testing your scripts against production-like infrastructure.


On that note, ensure that your staging infrastructure is configured and scaled to the same spec that you want production to be running for high-traffic events. Better yet, test against production during off-peak hours if you can, else you’ll never be fully certain of its capacity. Once you know how to quantify the capacity your infrastructure can handle, you can start improving it.

Identifying Bottlenecks

You’ll need to build up an understanding of what your infrastructure is actually spending time and resources on. We used New Relic to identify different groups of requests that we wanted to address:


  1. Uncached calls - requests that weren’t being cached in our CDN (Fastly’s) full-page cache.
  2. High transaction time - requests that were taking up the most time. The longer a request takes to process, the fewer threads are available to handle other requests.
  3. Most common calls - requests that may not be the most resource-heavy but are common across all sessions, for example because they fire on every page or are a sub-request of another request.


Start by just finding these requests and putting them into these groups. Once you’ve done this, even without digging into the underlying code, you should have a much better idea of where to focus your optimizations.
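The grouping step can be sketched as a small classifier over per-request statistics exported from your monitoring tool. The field names and thresholds here are assumptions made for illustration; they are not New Relic’s data model, and a request can land in more than one group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestStats:
    path: str
    cache_hit_rate: float     # fraction of responses served from the full-page cache
    avg_ms: float             # average transaction time in milliseconds
    calls_per_session: float  # how often a typical session triggers this request


def classify(
    stats: RequestStats,
    max_hit_rate: float = 0.5,   # hypothetical threshold
    slow_ms: float = 500.0,      # hypothetical threshold
    common_calls: float = 1.0,   # hypothetical threshold
) -> List[str]:
    """Return every group the request falls into."""
    groups = []
    if stats.cache_hit_rate < max_hit_rate:
        groups.append("uncached")
    if stats.avg_ms >= slow_ms:
        groups.append("high_transaction_time")
    if stats.calls_per_session >= common_calls:
        groups.append("most_common")
    return groups
```

A request that appears in all three groups is usually the best place to start.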


Fixing the Problems

Once you know where your main bottlenecks are, it’s time to start implementing optimizations. Here are some high-level things to focus on which you’ll find will cover a lot of issues at once:

Full Page Cache

Magento has various caching mechanisms operating at different levels across the infrastructure. Your first goal should be to move as much as you can to whichever service you’re using for your full-page cache. In a typical Magento stack this is Varnish, but in my view, unless you’re trying to optimize for costs and not performance, then this is an unnecessary component to manage yourself. If you can, save yourself the headache and use a CDN for your FPC.


We used Fastly as they have an incredibly good Magento module, even though they are on the more expensive side. [4] Time was of the essence for us so we didn’t have the option of building an integration ourselves, but if you have the time then you can integrate with any modern CDN and use their FPC. Whatever you choose to use, they can all achieve the same result.


For every uncached request, you need to understand both the input and the output of what it’s doing. If either the input or output contains or requires dynamic/private data such as customer information, then you’re probably not going to be able to put it behind FPC. [5] If it doesn’t, then you’re in luck, and you should make that response cacheable so that your infrastructure doesn’t have to deal with the load. You won’t find many requests like this in core Magento, but third-party modules are common culprits for not caching responses, as caching can make the development process slightly harder. [6]
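The decision rule above can be written down as a tiny function. This is only a sketch of the reasoning using standard HTTP `Cache-Control` values; it is not how Magento or the Fastly module decides cacheability internally, and the `Endpoint` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    path: str
    reads_private_input: bool     # e.g. depends on the session, customer, or cart
    returns_private_output: bool  # e.g. the response contains customer data


def cache_control_for(endpoint: Endpoint, ttl_seconds: int = 3600) -> str:
    """Suggest a Cache-Control header: shared caching only when nothing private is involved."""
    if endpoint.reads_private_input or endpoint.returns_private_output:
        return "private, no-store"
    # s-maxage applies to shared caches such as a CDN.
    return f"public, s-maxage={ttl_seconds}"
```

Going through each uncached request and filling in those two booleans honestly is most of the work; the header itself is the easy part.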

Serverless / Edge Functions

For the remaining uncached calls, you now need to consider whether these can be moved to serverless functions. Again, most major CDN providers offer this (Cloudflare has Workers, CloudFront has Functions), but we used Fastly, who have Edge Compute.


A great example of a call that can be moved to a serverless function is GeoIP lookup. When a user first hits your website you need to redirect them to the relevant store depending on their location. This obviously can’t be put behind FPC else you’d be redirecting everyone to the same store, but you can move this to a serverless function as these functions do have context of the user’s IP address.


In this function, you simply need to get the user’s IP address, map it to a country (your CDN may already provide the country as a variable to the function), and then redirect to the corresponding store. This is a great request to move to the edge, as every user will be calling it, and depending on your current implementation it can have a high transaction time due to needing to map the IP address to a country.
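The core of such a function is a lookup table and a redirect. The sketch below is written in Python for readability; in practice you would write it in whichever language your edge platform supports. It assumes the platform already supplies a two-letter country code, and the store URLs are made-up examples.

```python
from typing import Dict, Optional, Tuple

# Hypothetical store mapping; example.com stands in for your real domains.
DEFAULT_STORE = "https://example.com/"
STORE_BY_COUNTRY: Dict[str, str] = {
    "GB": "https://example.com/uk/",
    "US": "https://example.com/us/",
    "DE": "https://example.com/de/",
}


def redirect_for(country_code: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Build a redirect (status, headers) to the store matching the visitor's country."""
    code = (country_code or "").upper()
    location = STORE_BY_COUNTRY.get(code, DEFAULT_STORE)
    # The redirect depends on the visitor, so it must not be stored in a shared cache.
    return 302, {"Location": location, "Cache-Control": "private, no-store"}
```

Unknown or missing country codes fall back to the default store, so the function never needs to reach the Magento web servers.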


Dig deep into all of your uncached calls and figure out if there are ways for your serverless functions to process the request without having access to the Magento codebase. You may have to write hard-coded subpar code, but the reduction in resource utilization will be worth it.

High Transaction Time Requests

The requests that take up the most time usually do so because they’re interfacing with other services (database, Elasticsearch, third-party APIs, etc.), or because they’re running code that performs a lot of operations (data transformation, filesystem reads/writes, etc.).


There’s no simple fix for reducing transaction time, as each request will be performing functionality that is specific to a certain use case. However, there are some general changes you can try to implement:


  1. Reduce interfacing with other services. If the code is querying the database, try and move the results into Redis so they can be re-used later. Determine whether the database needs to be involved at all - are you able to make assumptions and hard-code the results instead of querying for them?
  2. Remove reliance on third-party APIs. Even if the API is performant, there is still a minimum time it will take to retrieve results across the internet, and during that time your web server is doing nothing while holding up other requests. [7] Try and bring the functionality into your codebase if you can. For example, instead of querying a third-party API for a GeoIP lookup, store the MaxMind GeoIP database in your filesystem/Redis instance and use that instead. This will almost always be more efficient than calling a third party.
  3. Make your operations more efficient. You’ve probably written sloppy code before that is “good enough for now” - now is the second best time to fix it.
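The first suggestion in the list is the classic cache-aside pattern: check the cache, and only run the query on a miss. The sketch below uses a small in-memory class as a stand-in for Redis so that it runs on its own; `TTLCache` and `cached_query` are illustrative names, not Magento or Redis APIs.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for Redis, used only to keep this sketch self-contained."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._items: Dict[str, Tuple[float, Any]] = {}
        self._clock = clock

    def get(self, key: str) -> Any:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def cached_query(cache: TTLCache, key: str, ttl_seconds: float,
                 run_query: Callable[[], Any]) -> Any:
    """Return the cached result if present; otherwise run the query and store it."""
    value = cache.get(key)
    if value is None:  # note: a query that returns None is never cached
        value = run_query()
        cache.set(key, value, ttl_seconds)
    return value
```

The expiry time is the trade-off to tune: longer values save more database work, at the cost of serving staler results.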

Third-Party Modules

Next, you’ll want to look at what third-party modules your codebase is using. You’ve probably got some that were installed a while ago that you no longer need, or some that were nice-to-haves that aren’t providing that much value. Put these modules under a microscope and really figure out if you need them. If you do, what functionality from them do you need? Are you able to rebuild these modules yourself with just the functionality you need to reduce bloat?


Most modules have thousands of lines of unnecessary code as they’ve been built up over the years to support needs from other brands that have requirements that you don’t. [8] In my experience, it’s often easier to find the code you do need and move that out than it is to try and optimize all of the existing code.


For some of the larger third-party modules, of course, it won’t be practical for you to go through and untangle everything to rebuild that functionality yourself. In these cases, try and look for the worst pieces of code - anything that’s touching the cache, dealing with user-specific data, database queries, etc. You need to build an understanding of exactly what code is actually running per request so that you can decide if it can be optimised.

Monitoring Cache

The number of items across your caching infrastructure should gradually increase over time. These items have expiry timestamps, so individual entries will drop out, but the general trend should be positive. When you perform a deployment, part of that process should be running the setup:upgrade command, which should clear both the Redis cache and the FPC. [9] The number of cached items should drop to zero when this command runs, and then quite quickly pick up again as core framework items get cached when requests are made.


If your infrastructure is healthy, this is the only time that you should see the number of cached items drop to zero. In our case, when we looked into this we noticed that ours was struggling to ever hold above zero, which was immediately concerning.
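A simple way to check for this is to compare cache-size samples against your deployment times. The sketch below assumes you can export `(timestamp, item_count)` samples from your monitoring and know when deployments ran; the function name and the five-minute grace window are hypothetical.

```python
from typing import List, Sequence, Tuple


def unexpected_cache_flushes(
    samples: Sequence[Tuple[float, int]],  # (timestamp in seconds, cached item count)
    deployments: Sequence[float],          # timestamps at which deployments started
    grace_seconds: float = 300.0,          # assumed window in which a flush is expected
) -> List[float]:
    """Return timestamps where the cache was empty outside any deployment window."""
    flagged = []
    for timestamp, count in samples:
        if count > 0:
            continue
        near_deployment = any(
            0 <= timestamp - deployed_at <= grace_seconds
            for deployed_at in deployments
        )
        if not near_deployment:
            flagged.append(timestamp)
    return flagged
```

Any timestamp this returns is worth investigating, since something other than a deployment emptied the cache.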


I said earlier that third-party modules are usually the culprits for performance issues, but in reality, they’re just the most addressable and where you should focus your efforts first. We didn’t know it at the time, but we’d just stumbled upon a core issue that we couldn’t ignore.

Core Issues

We added some logging around all of the interfaces that touched the cache so that we could see exactly what cache was being cleared, and where these calls were coming from. After analyzing the stack trace, this is what we found:


  1. A user adds a product to their cart, triggering the shipping rate calculation
  2. This calculation then requests the shipping providers to calculate rates, including Temando
  3. Temando starts to build a request to their API to retrieve the rates
  4. As part of this request, Temando retrieves an access token from the admin panel
  5. Temando clears the entire cache so that they can retrieve the latest access token
  6. The request is made, and shipping rates are calculated
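The kind of logging that surfaced this chain can be sketched as a thin wrapper that records the call stack whenever the whole cache is cleared. This is a generic Python illustration of the idea, not the code we added to Magento; `AuditedCache` and its dict backend are hypothetical.

```python
import logging
import traceback
from typing import Any, Dict, Optional

logger = logging.getLogger("cache_audit")


class AuditedCache:
    """Wraps a cache backend and logs who asked for a full clear."""

    def __init__(self, backend: Optional[Dict[str, Any]] = None):
        self._backend: Dict[str, Any] = backend if backend is not None else {}

    def get(self, key: str) -> Any:
        return self._backend.get(key)

    def set(self, key: str, value: Any) -> None:
        self._backend[key] = value

    def size(self) -> int:
        return len(self._backend)

    def clear(self) -> None:
        # Capture the most recent frames, dropping this method's own frame.
        caller_stack = "".join(traceback.format_stack(limit=10)[:-1])
        logger.warning(
            "Full cache clear of %d items requested from:\n%s",
            len(self._backend),
            caller_stack,
        )
        self._backend.clear()
```

With a wrapper like this in place, every full clear leaves a stack trace in the logs, which is what lets you trace an empty cache back to a specific caller.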


We weren’t sure what we were expecting to find when we were trawling through dozens of classes and methods trying to piece together what was going on, but this was laughable. What made it even worse was not only that the Temando module was a third-party module that they paid Magento to include in the core framework, but that the module was disabled. This meant that everyone running the same Magento version as us was constantly having their cache cleared, at best making their website significantly slower, and at worst bringing it down when a significant number of users were adding to the cart.


The module was removed from our codebase, we reported the issue to Temando and Adobe, and the issue was resolved in the next Magento release. I’m confident now that nothing this egregious exists in the Magento codebase currently (we would have noticed it), but if this can sneak its way into core Magento, then there will certainly be other unexpected issues to look out for. It’s crucial to keep a close eye on third-party modules, the state of your caching infrastructure, and even Magento core to keep ahead of possible issues.


Infrastructure Enhancements

The typical services you need for the Magento stack in your infrastructure are sufficient for achieving scalability, but unless you’ve optimized for it before, you’re probably missing out on some easy optimizations.

Database and Redis Replication

Your database and Redis instances can only scale so far vertically, so your traffic capacity will always be limited by them. It is possible to scale these horizontally, but the Magento codebase is not designed for this, making it very difficult to do so. Magento Enterprise did offer a way of partially doing this by splitting some of the core tables into separate databases (orders, products, checkout), but this functionality was deprecated. [10] You may want to consider this approach if you’re dealing with traffic beyond ours, but we scaled to support 20,000+ concurrent users without having to do this.


To do this, we configured both the database and Redis into master and slave instances so that we could have separate instances for handling write and read queries respectively. Magento has this functionality out of the box, allowing you to add slave instances horizontally as your read traffic grows. Theoretically, with this approach your infrastructure can keep absorbing more read queries by adding slaves, while your write queries remain limited by how far you can vertically scale the master instance. [11]
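To make the read/write split concrete, here is a minimal routing sketch: reads rotate across replicas and everything else goes to the primary. It illustrates the concept only and is not Magento’s implementation; the class name and connection labels are hypothetical, and it ignores real-world concerns such as replication lag and reads inside transactions.

```python
import itertools
from typing import Sequence


class ConnectionRouter:
    """Send SELECT statements to replicas (round-robin) and all other statements to the primary."""

    def __init__(self, primary: str, replicas: Sequence[str]):
        self._primary = primary
        # With no replicas configured, every query falls back to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        statement = sql.lstrip().upper()
        if self._replicas is None or not statement.startswith("SELECT"):
            return self._primary
        return next(self._replicas)
```

Adding a replica means adding one more entry to the list, which is why read capacity grows horizontally while write capacity stays tied to the size of the primary.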

Segmenting Web Servers

In the default Magento stack everything runs on one webserver - the frontend, the admin, and the cronjobs. This means that if any one of those services comes under high load, the other services will also be impacted as they’re all running on the same instance. To prevent this, you should split these three services out into their own respective instances:


  1. Webserver - this becomes the template instance that all of your users’ requests are routed to. This instance should be configured to auto-scale horizontally so that your capacity can adjust depending on the level of traffic. You’ll also need to configure a load balancer to route traffic accordingly.

  2. Admin - this is where all of the traffic to your admin panel is routed. We didn’t scale this horizontally as our admin traffic never exceeded a single instance’s capacity. If you have unusually high admin traffic (e.g. your website is a marketplace where retailers can log in to the admin), then you’ll want to scale this horizontally. This can be done the same way you scale the regular webserver horizontally.

  3. Cron - this instance is solely for running the cronjobs each minute. This instance should never be scaled horizontally, as the crons should only be running once per minute. You can configure this to auto-scale vertically, but be mindful of the time it’ll take for the instance to be terminated and the new one to be spun up. This could cause jobs to be killed inadvertently, or the cronjob to miss schedules. Unless you have tight budget constraints, I would recommend disabling auto-scaling and just configuring it to be a larger instance for when it’s needed.


Tangible Results

The great thing about optimization for scale is that it’s not hard to see the results. The increased capacity of our infrastructure after carrying out this process meant we were able to take $1 million in revenue in less than 20 minutes at midnight on Black Friday. Being able to successfully handle the largest spike in traffic our store would see, without any dip in performance for our users, made it worth the sleepless nights and months of work.


You should revisit these methodologies regularly as your infrastructure, particularly your codebase, will naturally change over time. It’s a constant effort to ensure your infrastructure is truly scalable, but the initial time investment will last you for years.


References

[1] BuiltWith. (2011). Websites using Magento. [Online]. builtwith.com. Last Updated: 13th November 2024. Available at: https://trends.builtwith.com/websitelist/Magento [Accessed 20 November 2024].


[2] Venktesh Subramaniam. (2019). The thundering herd — Distributed Systems rate limiting. [Online]. Medium. Last Updated: 6th August 2019. Available at: https://medium.com/@venkteshsubramaniam/the-thundering-herd-distributed-systems-rate-limiting-9128d2 [Accessed 20 November 2024].


[3] Jeremy Carey-Dressler. (2016). Your Load Test Model Is Broken: How to Understand and Correct the Data. [Online]. StickyMinds. Last Updated: 5th September 2016. Available at: https://www.stickyminds.com/article/your-load-test-model-broken-how-understand-and-correct-data [Accessed 20 November 2024].


[4] Tara Maurer. (2023). Best CDN Providers 2024: Azure vs AWS vs Fastly vs Cloudflare vs Akamai vs Cloudfront vs Incapsula vs Rackspace & More. [Online]. We Rock Your Web. Last Updated: 24th May 2023. Available at: https://www.werockyourweb.com/best-cdn/ [Accessed 20 November 2024].


[5] Firebear Studio. (2020). How to optimize private content blocks in Magento 2 for better performance. [Online]. Firebear Studio. Last Updated: 14th May 2020. Available at: https://firebearstudio.com/blog/optimize-private-content-blocks-in-magento-2-performance.html [Accessed 20 November 2024].


[6] Oleksandr Drok. (2023). Common issues and a few hacks with Magento 2 Full Page Cache. [Online]. Mirasvit. Last Updated: 24th October 2023. Available at: https://mirasvit.com/blog/common-issues-and-few-hacks-with-magento-2-full-page-cache.html [Accessed 20 November 2024].


[7] Hayley Brown. (2024). What are the pros and cons of using third-party APIs?. [Online]. Cyclr. Last Updated: 6th September 2024. Available at: https://cyclr.com/blog/what-are-the-pros-and-cons-of-using-third-party-apis [Accessed 20 November 2024].


[8] Hiếu Lê. (2024). Speeding Up Your Magento Website: Is Hyva Theme The Best Choice. [Online]. BSS Commerce. Last Updated: 22nd August 2024. Available at: https://bsscommerce.com/blog/speed-up-magento-website-by-hyva-theme/ [Accessed 20 November 2024].


[9] Jisse Reitsma. (2024). Setup a Magento module via setup:upgrade or not?. [Online]. Yireo. Last Updated: 7th April 2024. Available at: https://www.yireo.com/blog/2024-04-07-enabling-magento-modules-with-setup-upgrade [Accessed 20 November 2024].


[10] Magento, an Adobe Company. (2020). Overview of the split database solution. [Online]. Adobe Experience League. Last Updated: 15th April 2024. Available at: https://experienceleague.adobe.com/en/docs/commerce-operations/configuration-guide/storage/split-db/ [Accessed 20 November 2024].


[11] Binadox. (2024). Vertical Scaling vs Horizontal Scaling: Choosing the Right Approach. [Online]. Binadox. Last Updated: 7th October 2024. Available at: https://www.binadox.com/blog/vertical-scaling-vs-horizontal-scaling-choosing-the-right-approach/ [Accessed 20 November 2024].