Implementing Enterprise Incident Management Frameworks

by Julius Černiauskas, February 13th, 2024

Too Long; Didn't Read

Due to the large scale of the business, enterprise-grade companies face an increased risk of incidents that might have far-reaching implications. Having a robust incident management framework allows faster response and, as such, is a way to protect revenue, customers, and future business operations.

At some point, every company will need to implement a risk management framework. Some may be more prone to the attention of bad actors (e.g., financial institutions) and need a well-defined course of action earlier, but even relatively safe industries such as publishing or mining will eventually need a risk management framework.


The necessity arises due to three factors compounding upon each other, and every industry or company may be analyzed using these three layers:

  • External actors. These are usually most active in industries where fraud or hacking may net substantial financial gain.
  • Internal actors. Intentionally or not, employees, contractors, and other related personnel may make mistakes or take actions that expose the company to risk.
  • Company size. As size increases, so does the number of internal actors and potential touchpoints for risk. For many companies, the threat from external actors also rises as the business becomes more visible on a global scale.


Relatively safe industries will have a smaller number of external actors affecting them at any time; however, any successful business eventually expands both in operations and employees, which increases the likelihood of something going wrong. Therefore, developing a risk management framework should be at the top of the priority list for any growing company.

Separating incidents by severity

Not every negative event is equal. Some may be minor communication risks (e.g., receiving a negative review from a customer), while others could have long-lasting and wide-ranging effects on the continuity of operations (e.g., a major technological failure). Grading incidents by severity is a cornerstone of any incident management framework.


A well-established methodology for this comes from ITIL practices. While ITIL has largely been built for the tech industry, its incident response strategy can be easily adapted to fit any sector, company, or even group of people. Usually, all incidents are segmented into five levels:


  • S1. Critical severity that can impact business operations so heavily that further activity may be compromised. Essential services are affected, causing issues for customers and degrading their experience.
  • S2. Major severity that has the potential to negatively impact business operations. Some essential services or a limited number of customers are affected.
  • S3. Medium severity that negatively affects a limited number of customers without causing a failure in regular business operations.
  • S4. Low severity that affects a limited number of customers without significantly impacting business operations.
  • S5. Minor severity that has no influence on business operations and doesn’t degrade service quality (e.g., a formatting error in an important document).


S5 is often excluded from an enterprise risk management plan because most such incidents are resolved internally without any significant negative effect on the business.
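The five levels above can be written down as a small data type so that tooling and people use the same vocabulary. The following is only a minimal sketch: the names `Severity`, `is_more_severe`, and `in_enterprise_plan` are hypothetical, and treating S5 as always out of scope is an assumption that a given company may not share.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five severity levels; a LOWER number means a MORE severe incident."""

    S1 = 1  # critical: essential services affected, operations may be compromised
    S2 = 2  # major: some essential services or a limited number of customers affected
    S3 = 3  # medium: limited number of customers affected, operations keep running
    S4 = 4  # low: limited number of customers affected, no significant impact
    S5 = 5  # minor: no influence on operations or service quality


def is_more_severe(a: Severity, b: Severity) -> bool:
    """Return True when level `a` is more severe than level `b`."""
    return a < b


def in_enterprise_plan(level: Severity) -> bool:
    """Assumption: S5 is handled internally and left out of the enterprise plan."""
    return level != Severity.S5
```

Because the numbering runs in the opposite direction to the severity, an explicit helper such as `is_more_severe` helps avoid confusion over what "above S3" means.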

Setting up response teams

For all incidents more severe than S3 (that is, S1 and S2), an incident response team that is always on call should be created, as most of these issues need to be resolved within a short timeframe. Anything less severe than S3 is considered an inconvenience that may be resolved during working hours. The handling of S3 itself may depend heavily on the industry; however, most strategies still put it in the “something that can be solved during working hours” basket.
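The routing rule described above is simple enough to sketch in a few lines. This is an illustration rather than a prescribed implementation: the `Response` enum, the `route` function, and the `s3_needs_on_call` flag are hypothetical names, and the flag exists only to model industries that treat S3 as urgent.

```python
from enum import Enum


class Response(Enum):
    ON_CALL = "page the on-call incident response team"
    WORKING_HOURS = "resolve during working hours"


def route(severity: int, s3_needs_on_call: bool = False) -> Response:
    """Map a severity level (1 = S1 ... 5 = S5) to a response mode."""
    if not 1 <= severity <= 5:
        raise ValueError(f"unknown severity level: {severity}")
    if severity <= 2:
        # S1 and S2 need to be resolved within a short timeframe.
        return Response.ON_CALL
    if severity == 3 and s3_needs_on_call:
        # Industry-specific override for S3 (an assumption of this sketch).
        return Response.ON_CALL
    # By default S3, S4, and S5 wait for working hours.
    return Response.WORKING_HOURS
```

With the defaults, `route(2)` selects the on-call team, while `route(3)` falls into the working-hours basket unless the override is switched on.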


A proper course of action should be well documented for all incidents, especially anything rated S2 or more severe. Most incident response plans include:

  1. Dedicated teams and points of contact. All parties should be informed ahead of time that they will be involved in an incident recovery team. Each team that’s involved should have a primary and secondary contact, either of whom should be available for any recovery process.
  2. Initial steps with clear accountability. When incidents happen, reaction times are often vital. Therefore, all initial actions should be decided beforehand. For example, after the initial discovery, one person should be responsible for getting the incident recovery team together, establishing communication channels, and assigning action points (unless previously defined otherwise).
  3. Clearly defined expectations. Some incidents may not be resolved within an hour or two; however, teams should know that, according to severity, certain issues may need complete focus and the complete removal of any other work.
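The three elements above could be captured in a structured plan that is checked before an incident ever happens. The sketch below is one possible shape under stated assumptions: `TeamContacts`, `ResponsePlan`, and their fields are hypothetical names, and the choice of S1 and S2 as the levels that demand complete focus is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TeamContacts:
    team: str
    primary: str
    secondary: str

    def __post_init__(self) -> None:
        # Every involved team needs two distinct, named contacts.
        if not self.primary or not self.secondary:
            raise ValueError(f"{self.team} needs a primary and a secondary contact")
        if self.primary == self.secondary:
            raise ValueError(f"{self.team} contacts must be two different people")


@dataclass
class ResponsePlan:
    coordinator: str  # the one person accountable for the initial steps
    teams: list[TeamContacts]
    # Assumption: S1 and S2 require dropping all other work.
    full_focus_severities: set[int] = field(default_factory=lambda: {1, 2})

    def first_steps(self) -> list[str]:
        """Initial actions, decided beforehand, with a clear owner."""
        return [
            f"{self.coordinator}: gather the incident recovery team",
            f"{self.coordinator}: establish communication channels",
            f"{self.coordinator}: assign action points",
        ]

    def requires_full_focus(self, severity: int) -> bool:
        return severity in self.full_focus_severities
```

Validating contacts when the plan is written, rather than during an incident, means a missing secondary contact is discovered while there is still time to fix it.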


The core of an incident response team should remain largely static, regardless of the incident. At the start of an incident, it’s often better to invite everyone who might be able to help and later release those who have nothing to contribute.


Finally, the composition of incident response teams tends to be similar across organizations. C-level executives, Legal, Risk Management, and Public Relations (or Communications) teams are usually involved in every recovery. Others, such as Account Management, Developers, and anyone else directly involved in making the fixes happen, may come and go.

Standardizing recovery reports

One final step in implementing an enterprise risk management framework is producing a two-fold recovery report (sometimes referred to as a post-mortem). One version of the report is intended for internal audiences. It should clearly describe the incident, its genesis, the recovery efforts, and the eventual result. Atlassian has developed a fairly extensive list of templates that can be used for a post-mortem.


Another version of the report is intended to be delivered to customers. Differences arise as customers are largely interested in the reason (and whether the provider is liable), resolution, compensation (if applicable), and next steps or assurances. Most customers won’t be interested in the technical details and genesis of the problem.


Additionally, informing customers about a high-severity incident while it is still being resolved is considered good practice because, in these cases, some of them are already affected. Sending out an email with the incident details and the expected time of resolution will give the company some time to work in peace towards solving the issue.


Just like with the internal recovery report, a template should be prepared for most cases, as writing each report from scratch during or after an incident is highly prone to human error. These situations are often intense and charged with emotion, so any risk that can be minimized should be minimized.


Finally, most customer-facing communication should be sent by a reasonably high-ranking person within the organization, such as a C-level executive or someone with a similar title. There are numerous reasons for such an approach, but it’s primarily done from an account management standpoint: the goal is to maximize customer satisfaction by communicating through an important person.
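A prepared customer-facing template might look something like the sketch below, which uses Python's standard `string.Template`. The wording of the template, the field names, and the `render_customer_report` helper are all hypothetical; the fields simply mirror what customers are said to care about above (reason, resolution, compensation, next steps) plus a senior sender.

```python
from string import Template

# Hypothetical wording; each company would adapt this to its own tone.
CUSTOMER_REPORT = Template(
    "Subject: Incident report: $title\n"
    "\n"
    "What happened: $reason\n"
    "How it was resolved: $resolution\n"
    "Compensation: $compensation\n"
    "Next steps: $next_steps\n"
    "\n"
    "$sender_name, $sender_title\n"
)


def render_customer_report(**fields: str) -> str:
    """Fill in the template; substitute() raises KeyError if a field is missing."""
    return CUSTOMER_REPORT.substitute(**fields)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a forgotten field stops the report from being rendered instead of sending customers a message with an unfilled placeholder.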

Conclusion

Risk is an inevitable part of doing business. As companies expand, the potential for problems rises in tandem, meaning that, at some point, incidents will start to happen more regularly. Implementing a risk management framework that allows companies to respond to incidents quickly and effectively is, essentially, a way to protect revenue, customers, and business operations.