Tackling the Augean Stables of Customer Support: Our Success Story

Once, the Business Support department at Social Discovery Group faced the daunting challenge of clearing a backlog of 1,500 tickets that had accumulated over four years. Managing such a high volume of issues was a serious undertaking, and we constantly struggled to meet our KPIs. Despite our best efforts, tickets kept getting shuffled from one sprint to the next, leaving customers frustrated and us overwhelmed.

In this article, we would like to share our experience of tackling this seemingly impossible task, which reminded us of the legendary sixth labor of Hercules. We will take you through the challenges the SDGroup team faced and the steps we took to rebuild our department's processes. We adopted the STATIK approach, which proved incredibly effective in helping us clear the backlog and bring much-needed relief to our team. So, if you're looking for practical insights into tackling a backlog of tickets, keep reading!

How We Ended Up With the 1,500-Ticket Queue

As the Business Support department, we are responsible for handling tickets that contain client complaints and suggestions about our services. For example, if a client experiences difficulties with the payment system on our website, they may raise an issue with the SDG technical support team. Our colleagues there gather all relevant information, including reproduction steps and screenshots, and create a ticket in Jira, assigning it to our department. We then run additional tests to ensure that the problem is not a temporary glitch on the customer's side, but a real issue on ours. If confirmed, we create a bug report, prioritize it, and forward it to the development team for resolution.

On average, we used to receive 20-30 tickets daily. We spent 2-3 hours reproducing each problem, but whenever we required assistance from other departments, such as case details, service restarts, data from the database, or analysis from the dev team, our tasks tended to get stuck. Our colleagues were often unable to respond promptly, and there was no separate team of developers assigned to our bug reports. Furthermore, our development tickets had lower priority than business tasks. As a result, even high-priority tickets could remain unresolved for a few months, while lower-priority tasks remained on the board for years.

As you can see, this flow led to too much uncertainty in meeting deadlines, which caused dissatisfaction for both us and our clients. The situation resulted in several issues:

Clients were stressed and frustrated when their problems were not resolved quickly enough, and we were unable to provide a clear timeline.
Our support team was frustrated, as clients demanded updates for unresolved issues almost every day.
The support team wrote to us and received the same response: "The problem is not solved yet."
Our team was overloaded and unable to pick up old tasks and link them as recurring cases.
We felt despair, as we looked at the daunting number of 1,500 unresolved tasks on our board. We knew that we needed to change our approach to make any headway.

This is what the ticket life cycle looked like during that period.

Let's take a look at some of the issues we encountered while managing the issue backlog:

The support team lacked scripts to promptly document and filter issues.
Tasks in the "New" status often became stuck in limbo. When we didn't have enough information or access to resolve the issue, it would move to the "Escalated" status, and we would seek assistance from colleagues in other departments. Unfortunately, it could take weeks or even months for them to respond. And sometimes, we missed comments due to the overwhelming volume of incoming tasks.
Once we eventually verified the issue, we shifted it to the "Awaiting fix" status. Due to a chronic shortage of dev resources, these tickets could remain unresolved for years. Clients didn’t close the tickets, hoping that we would receive more resources in the future.
The substantial task backlog resulted in duplicates. It was difficult to merge and close them since we often struggled to remember which tasks had been started in prior years.

This poorly organized approach resulted in a backlog of 1,500 unresolved tickets over four years.

The STATIK Approach and the Sixth Labor of Hercules

We realized that our department was too focused on processing tickets rather than addressing the core purpose of resolving client issues: improving their loyalty towards Social Discovery Group products and detecting vulnerabilities in our services and websites. You may wonder, "Why not hire more staff if you can't handle the workload?" However, experience has shown that by establishing efficient processes, we can go without extra resources.

Similarly, Hercules completed his labor alone by redirecting the river's flow towards the Augean stables, which cleared them in just one day. To streamline our ticket board, we referred to Mike Burrows' book "Kanban from the Inside" and implemented STATIK (Systems Thinking Approach to Introducing Kanban), a systematic strategy for introducing the Kanban method.

While implementing the STATIK approach, we followed these five steps:

1. We identified what customers expect of our Business Support department and concluded that they need to be satisfied with the support provided. To accomplish this, we need to ensure the following:

Transparency in our work on each task, so that clients understand what is happening with their requests at any given moment.
Promptness in our work on each task, with a definite time for feedback at every stage of the issue.

2. We defined both the internal and external sources of dissatisfaction. The internal source is what hindered us and caused frustration in our own work.

The external source is what caused frustration for our customers and hindered their experience.

3. We analyzed the sources and nature of our workload. We examined the tickets submitted to our department and categorized them by the departments they came from, their frequency, and the customers' expectations regarding response times and solutions.

4. We evaluated our current capabilities. At this stage, we assessed how efficiently we were handling the tickets and determined how many tasks we could realistically manage in a week.

Additionally, we calculated the average time it took to go from ticket creation to releasing the bug fix to production.

5. We rebuilt the ticket life cycle and created a new process.
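
The capability evaluation in step 4 can be illustrated with a short Python sketch. It computes the average lead time from ticket creation to production release and the number of tickets released per week. The ticket keys, dates, and function names are hypothetical and are not taken from our Jira board; in practice the dates would come from a Jira export.

```python
from collections import Counter
from datetime import date
from statistics import mean

# Hypothetical sample: (ticket key, created, released to production).
TICKETS = [
    ("BS-1", date(2023, 1, 2), date(2023, 1, 20)),
    ("BS-2", date(2023, 1, 3), date(2023, 2, 14)),
    ("BS-3", date(2023, 1, 9), date(2023, 1, 23)),
    ("BS-4", date(2023, 1, 10), date(2023, 2, 15)),
]


def average_lead_time_days(tickets):
    """Mean number of days from ticket creation to production release."""
    return mean((released - created).days for _, created, released in tickets)


def weekly_throughput(tickets):
    """Number of tickets released per ISO (year, week)."""
    counts = Counter()
    for _, _, released in tickets:
        iso = released.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


if __name__ == "__main__":
    print("Average lead time (days):", average_lead_time_days(TICKETS))
    print("Released per ISO week:", weekly_throughput(TICKETS))
```

Comparing the weekly throughput with the 20-30 tickets arriving daily shows at a glance whether a queue can shrink at all under the current process.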

Our initial ticket life cycle

Using the insights we gained, we developed a new task life cycle.

Next, we implemented a comprehensive approach to resolve all the previously mentioned problems. We took the following steps:

We created scripts for our support team to filter tasks at the initial level.
We introduced a temporary "Back to Reporter" status for tasks that required customer input to investigate the issue. If the necessary information was not provided within three days, the task would be automatically closed.
We removed the "Escalated" status and replaced it with a more specific "Problem Confirmation" status.
We spent over two months reviewing and closing duplicate tasks, leaving us with only 800 unique tickets out of the 1,500 we started with.
We implemented Service Level Agreements (SLAs) and set timers for each status. The "Problem Confirmation" status had a timer of 7 days, while the "Development" status had a timer of 30 days.
Having tasks in the "Problem Confirmation" status motivated us to ping our colleagues for a response. We changed our communication approach, using private messages or Slack threads instead of commenting in Jira. This reduced the response time to one day versus the weeks it took before.
Although we didn't put a timer on the "Awaiting Fix" status, we allocated more time for customer communication and discussion. If a task was deemed unfeasible, we informed the customer and prioritized more important tasks. This led to a 50% reduction in on-hold tasks. Additionally, we started organizing meetings where customers could explain the importance of a task, helping us prioritize our workload better.
For the "Development" status, we set a timer of 30 days and limited the number of tasks in this status to four. This helped us focus on fixing the bug within the set time frame and avoid being overwhelmed with unnecessary visual noise of 20-40 tasks in the "Development" status on the board.
To ensure our practices are sustainable, we brought in the 90th percentile principle, which meant that 90% of tasks had to be completed on time, as determined by the timers for each status. We monitored this using Jira's Control Chart and the Jira-helper plugin, which we used to calculate the 90th percentile.
Finally, we introduced additional motivation for the team, promising a bonus if the 90th percentile principle was followed, just as Augeas promised Hercules a tenth of his cattle as a reward.
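
The rules in the list above can be expressed as a minimal Python sketch. The timers (7 and 30 days), the limit of four tasks in "Development", and the three-day "Back to Reporter" window come from the steps described; the function names, the nearest-rank percentile method, and the treatment of the three-day boundary are assumptions made for illustration. This is not how Jira or the Jira-helper plugin computes these values.

```python
import math

# Timers in days, as described above; "Awaiting Fix" has no timer.
SLA_DAYS = {"Problem Confirmation": 7, "Development": 30}
# Work-in-progress limit per status.
WIP_LIMITS = {"Development": 4}
# Days a reporter has to reply before the ticket is closed.
BACK_TO_REPORTER_DAYS = 3


def percentile_90(values):
    """Nearest-rank 90th percentile (assumed method, chosen for simplicity)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def sla_met(status, days_in_status):
    """True when at least 90% of tickets left the status within its timer."""
    return percentile_90(days_in_status) <= SLA_DAYS[status]


def can_pull(status, tickets_in_status):
    """True when another ticket may enter the status without breaking its limit."""
    limit = WIP_LIMITS.get(status)
    return limit is None or tickets_in_status < limit


def should_auto_close(days_without_reply):
    """True once the reporter has been silent for the full window (assumed inclusive)."""
    return days_without_reply >= BACK_TO_REPORTER_DAYS


if __name__ == "__main__":
    confirmation = [1, 2, 2, 3, 3, 4, 5, 5, 6, 12]  # hypothetical durations
    print(sla_met("Problem Confirmation", confirmation))
    print(can_pull("Development", 4))
    print(should_auto_close(3))
```

With the nearest-rank method, a single outlier such as the 12-day ticket in the sample does not break the rule as long as nine out of ten tickets stay within the timer, which is the point of targeting the 90th percentile rather than the maximum.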

By implementing this new approach and rebuilding our department's processes, we were finally able to tackle a problem that had plagued us for four long years. The solution didn't require any extra time, effort, or budget, yet we significantly reduced the number of tasks piling up. In just five months, with a team of only two intelligent employees, we managed to whittle down the queue from 1,500 tickets to a mere 150.

This experience has taught us the importance of identifying the root cause of a problem in order to address it effectively. It has also underscored the importance of having well-designed processes in place, because poor processes inevitably lead to a buildup of issues, as we discovered firsthand.

Written by Dimitri Andrews, Software Testing Engineer at Social Discovery Group