Holiday Season: Predicting and Tolerating High Load in Business - Dealing With Unexpected Anomalies

Written by jmeshcheryakova | Published 2023/01/10
Tech Story Tags: business | time-series | predictions | anomaly-detection | highload | holiday | machine-learning | forecasting

TL;DR: The holiday season brings an increased load on businesses. It's possible to predict and manage high load, along with setting up monitoring systems for early anomaly detection. A practical example is included.

“This holiday season is all one more reason to cry…”
- (the best-ever Christmas song)

Table of Contents

  1. How holidays affect people and businesses.
  2. Is it possible to manage a high load and minimize its impact?
  3. How to get accurate predictions for high load, and do we need the exact numbers?
  4. What size of margins is sufficient?
  5. Anomalies: how to predict the unpredictable?
  6. Example: assemble the anomaly criteria and alerts system
  7. Conclusion

How Holidays Affect People and Businesses

Most people have been stuck in traffic at least once in their lives while trying to pick up last-minute Christmas gifts, or have been unable to call friends on New Year's Eve due to the high load on the mobile network.

The holiday season (a time of joy and celebration) usually brings an increased load on many aspects of life. This is especially evident in areas such as travel, postal services, and banking, where demand rises as more people travel and send gifts.

The high load can lead to delays, congestion, and other issues that can be a source of stress for individuals and businesses.

Companies have to work harder to meet the increased demand for their products or services.

Failure to tolerate the higher pressure can lead to customer disappointment and negatively impact a company's image.

This can be particularly damaging for businesses that rely directly on positive customer experiences to drive sales and attract new customers, but it also affects B2B products.

While a supplier may need to ramp up product manufacturing and distribution, a service provider has to handle an increased number of operations.

Is It Possible to Manage a High Load and Minimize Its Impact?

If it weren't possible, I wouldn't be writing this article. For many years, businesses of all sizes, from small bakeries to large manufacturing companies, have been studying how demand changes over time (trends) and periodically (seasonality).

For example, a bakery might produce more heart-shaped cakes around Valentine's Day.

To prepare for a higher volume of production, businesses introduce scaling of all kinds: increased staffing levels (e.g., hiring seasonal workers), paid overtime, producing goods ahead of time, and moving critical software services to clouds that support auto-scaling out of the box.

But to activate scaling at the right moment, businesses need predictions.

How to Get Accurate Predictions for High Load, and Do We Need the Exact Numbers?

Specialists analyze worldwide trends, perform social studies, and rely on historical data. However, no known method gives a perfect forecast. The real world is far too complicated, and it is impossible to account for all the factors and include them in any sort of model.

Since manual processing of data is time-consuming and not very accurate, modern businesses tend to rely on machine learning (ML) methods to approach the task.

They need to identify the target value (or several values) to predict: demand for products or services, traffic on a website or social media platform, or the number of customer service calls. The task then becomes a time-series forecasting problem.

I won't describe the approaches to time-series forecasting in this article (all the standard techniques are applicable), but I will highlight what is special about them when we are interested in high load as a narrower class of tasks.

The main thing is that, in many cases, we don't need exact numbers. The insight we want is "whether this day or week will see high demand" rather than "precisely how many customers to expect": flagging the fact of high load and its degree rather than producing a precise number.

For example, if a company needs to prepare for an open-air festival in a small town that is expected to attract a large number of visitors, it's important to know if the demand for hotel rooms will be high and plan accordingly.

However, it's hard to predict the exact number of visitors as many make last-minute decisions that are affected by factors such as the weather.
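
As a minimal illustration, a numeric forecast can be collapsed into a coarse load flag. The baseline and the 1.3x/2x thresholds below are hypothetical and would need tuning for a real business:

```python
def load_flag(forecast: float, baseline: float) -> str:
    """Map a numeric forecast onto a coarse load level.

    The 1.3x and 2x thresholds are illustrative placeholders,
    not values from any real system.
    """
    ratio = forecast / baseline
    if ratio >= 2.0:
        return "extreme"
    if ratio >= 1.3:
        return "high"
    return "normal"

print(load_flag(forecast=180, baseline=100))  # 'high'
```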

When it comes to software applications, the load is often mapped to a number of servers (nodes), with large margins included so that resources are never fully utilized. Scaling then becomes a discrete decision: capacity is added one whole machine at a time.

Whether it's a need for 0.05 or 0.7 machines, it still means allocating an additional node.
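
A back-of-the-envelope sketch of that rounding, with illustrative capacity numbers:

```python
import math

def nodes_needed(predicted_load: float, capacity_per_node: float,
                 margin: float = 0.2) -> int:
    """Round the predicted load (plus a safety margin) up to whole nodes.

    A fractional need of 0.05 or 0.7 of a node still rounds up
    to one additional machine.
    """
    return math.ceil(predicted_load * (1 + margin) / capacity_per_node)

# 1050 rps with a 20% margin on 1000-rps nodes: 1.26 nodes -> 2 machines
print(nodes_needed(1050, 1000))
```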

The key principle of machine learning remains: use the simplest model that produces sufficient results for a particular task.

What Size of Margins Is Sufficient?

Businesses can implement backup plans to cover potential increases in demand, but extra capacity increases costs, and resources that go unused eat into revenue.

That's why it's important to get the forecast as accurate as possible and include margins that balance between being overly optimistic and overly pessimistic.

For example, imagine a small newsstand that normally sells around 100 papers a day. A trivial prediction would be 100. But the owner knows that based on news and other sporadic factors, he can sometimes sell more items, and he doesn't want to lose his customers.

So, in an optimistic approach (if the forecast is precise enough), he adds 5%, and in a pessimistic approach, he adds 10%.

Running the numbers shows that the optimistic approach generates more revenue (fewer papers are recycled), but it leaves 5 unhappy customers in total (on the 6th and on the 10th day).

In the case of papers, it makes sense to maximize revenue: perhaps the late-evening customers didn't firmly expect to find a paper, so they weren't too upset.
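
A toy recreation of that comparison follows. The demand series is made up, chosen only to reproduce the outcome described above:

```python
BASELINE = 100  # papers sold on a typical day

def evaluate(margin: float, demand: list) -> tuple:
    """Return (lost_sales, unsold_papers) for a fixed stocking margin."""
    stocked = round(BASELINE * (1 + margin))
    lost = sum(max(d - stocked, 0) for d in demand)
    unsold = sum(max(stocked - d, 0) for d in demand)
    return lost, unsold

# Hypothetical demand over 10 days; days 6 and 10 spike above 105.
demand = [98, 101, 105, 99, 100, 108, 97, 102, 100, 107]
for margin in (0.05, 0.10):
    lost, unsold = evaluate(margin, demand)
    print(f"margin {margin:.0%}: lost sales={lost}, unsold={unsold}")
# margin 5%: lost sales=5, unsold=38
# margin 10%: lost sales=0, unsold=83
```

With the optimistic 5% margin, far fewer papers go to waste, at the price of those five missed sales.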

But if we're talking about a bank or a hospital instead of a newsstand, the damage from failing to handle a higher number of transactions or patients is severe, and management would likely accept higher costs.

It is important to consider the nature of the business, the cost of overproduction, and the cost of not meeting demand when defining the margin. The margin is a balance, and it needs to be tuned to the particular case.

Anomalies: How to Predict the Unpredictable?

Let's assume that we've built a model that produces fair predictions most of the time, capturing general trends as well as seasonal, weekly, and daily changes, and that the business has developed measures to deal with high load and even has contingency plans in place for anomalies.

But how do we detect such anomalies?

If a business notices an anomaly early enough, it can minimize the negative impact of high load and maintain a high level of service for its customers. To achieve this, we can compare the predicted data with real-time data coming from the field.

If there is a significant discrepancy, an alert can be triggered, and proactive measures can be activated, either automatically or through human intervention.
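
In its simplest form, such a check is a relative-deviation threshold. The 15% tolerance below is an assumed placeholder; the example section derives a statistically grounded threshold instead:

```python
def is_anomalous(predicted: float, actual: float, tolerance: float = 0.15) -> bool:
    """Flag a significant discrepancy between the forecast and live data.

    tolerance is a hypothetical relative threshold, not a recommended value.
    """
    return abs(actual - predicted) / predicted > tolerance
```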

Example: Assemble the Anomaly Criteria and Alerts System

I once worked for a B2B company* that processed millions of requests per day, with one stage of the processing performed as a nightly job, which typically finished before users became active again in the morning.

However, one night it failed, processing only around 60% of requests, which resulted in serious issues the next morning and additional costs to mitigate the impact. Fortunately, an ad-hoc solution was quickly implemented, preventing any damage to the company's reputation.

But the lesson was learned. The company wanted to investigate the situation and ensure that it had mechanisms in place to activate alerts as soon as a mild anomaly was detected, in order to mitigate risks in the future.

To do this, we assembled a dataset by aggregating transaction numbers for the last 10 years since the product was launched in production. By performing Exploratory Data Analysis (EDA) on the data, we noticed weekly patterns, a trend, and a clear outlier on April 17th, 2022.
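
The aggregation step might look roughly like this (the file name and schema are assumptions, not the company's actual data):

```python
import pandas as pd

# Hypothetical schema: one row per request with a 'timestamp' column.
requests = pd.read_csv("requests.csv", parse_dates=["timestamp"])

daily = requests.set_index("timestamp").resample("D").size()
by_weekday = daily.groupby(daily.index.dayofweek).mean()  # weekly pattern
outliers = daily[daily > daily.quantile(0.999)]           # extreme days
print(outliers)
```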

The daily volume was ~2.5 times higher than the historical maximum. Such a deviation would have a big impact on the model's accuracy; therefore, it was essential to determine whether it could be explained naturally and predicted.

65% of requests that day had a null monetary amount associated with them. When we dug deeper, we found that the difference was caused by a single business user who had mistakenly requested past data with every request.

So, the conclusion was made: this was a true anomaly caused by an unpredictable external factor – the perfect candidate for an early anomaly detection challenge.

Since the company is interested in the total number of requests accumulated during the day (to be handled by the nightly job), and in knowing whether the actual numbers run much higher than predicted, I suggested splitting the day into hourly buckets and comparing predicted and actual requests for each bucket.

By doing this, we transformed the problem into multi-target regression, predicting a number for each hour of the day. This allowed us to compare predicted and actual numbers throughout the day and spot mismatches long before the end of the day.

To check the hypothesis, we split the day into four day parts.
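
A minimal sketch of the multi-target framing follows; the features, the synthetic data, and the model choice are illustrative, not the pipeline we actually used:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# X: one row per day with assumed features (day of week, month, trend index).
# y: 24 targets per day, the request count for each hourly bucket.
rng = np.random.default_rng(0)
X = rng.random((365, 3))
y = rng.poisson(lam=1000, size=(365, 24)).astype(float)

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100,
                                                   random_state=0))
model.fit(X, y)
hourly_forecast = model.predict(X[-1:])  # 24 predicted bucket counts for a day
```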

When we built the model, the standard deviation (std) of relative error did not exceed 3.4% for day parts and 5% for the entire day.

The error distribution was close to normal, so it was natural to apply the 3-sigma rule to formulate the anomaly criterion: if the difference between predicted and actual counts exceeds 3 standard deviations, an alert is triggered.

When we tested this criterion for April 17th, we noticed that the alarm would have been triggered as early as 6 AM, when the requests for the first buckets were calculated.

Monitoring systems can maintain events of different severity levels, such as a warning for a mismatch breaking 2 sigmas, and a critical alert for breaking 3 standard deviations.
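
A sketch of such a tiered check, assuming the per-bucket standard deviation of relative error is known from historical evaluation:

```python
def alert_level(predicted: float, actual: float, sigma: float) -> str:
    """Classify the mismatch for a bucket.

    sigma is the standard deviation of the model's relative error,
    estimated on historical data (e.g., ~0.034 for day parts above).
    """
    deviation = abs(actual - predicted) / predicted
    if deviation > 3 * sigma:
        return "critical"
    if deviation > 2 * sigma:
        return "warning"
    return "ok"

print(alert_level(predicted=10_000, actual=11_200, sigma=0.034))  # 'critical'
```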

Since the system operates on a daily basis and the buckets add intraday granularity, this provides a layer of early detection and improves the company's ability to respond to unexpected events in real time.

The suggested criteria are just an example and should be adjusted to particular business scenarios based on the nature of the data. However, the described approach can also be applied to low-load detection by looking at deviations in the negative direction.

For example, in retail, many offline shops are closed on holidays, and delivery from online shops may not work on New Year's Eve, which can affect warehouse loads and the number of people needed to work those days.

Conclusion

In conclusion, while it is challenging to predict high load with 100% accuracy, businesses can use data and machine learning techniques to build models that anticipate and prepare for high-load periods and to set up early detection of anomalies. By these means, they can minimize the negative impact and maintain a high quality of service for their customers.

*The name of the company is left out for privacy, and the real data was altered.


Written by jmeshcheryakova | Senior ML Engineer / Member of AI Technical Board at DataArt. Enjoy sports and eco-activism.
Published by HackerNoon on 2023/01/10