What Are Self-Healing Systems & How Can You Develop One?
When people get injured, their bodies self-heal. What if technology could do the same?
Companies are racing to develop self-healing systems, which could improve quality, cut costs and boost customer trust. For example, IBM is experimenting with ‘self-managing’ products that configure, protect and heal themselves.
A self-healing system can discover errors in its functioning and make changes to itself without human intervention, thereby restoring itself to a better-functioning state. There are three levels of self-healing systems, each of which has its own size and resource requirements:
Application Level
In typical applications, problems are documented in an ‘exceptions log’ for further examination. Most problems are minor and can be ignored. Serious problems may require the application to stop (for example, an inability to connect to a database that has been taken offline).
By contrast, self-healing applications incorporate design elements that resolve problems. For example, applications built with Akka arrange their actors in a hierarchy, and when an actor fails, its supervisor decides how to handle the failure (for example, by restarting the actor). Many such libraries and frameworks facilitate applications that self-heal by design.
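To make the supervision idea concrete, here is a minimal sketch in plain Python. It is not Akka's API: the `Supervisor` class, its `max_restarts` parameter and the `make_flaky_worker` helper are hypothetical names invented for illustration. The sketch only shows the general pattern of a parent retrying a failed child and escalating once it runs out of retries.

```python
from typing import Callable


class Supervisor:
    """Runs a child task and restarts it when it raises, up to a limit."""

    def __init__(self, max_restarts: int = 3) -> None:
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self, child: Callable[[], str]) -> str:
        while True:
            try:
                return child()
            except Exception as error:
                if self.restarts >= self.max_restarts:
                    # Out of retries: escalate to whoever supervises us.
                    raise RuntimeError("child failed permanently") from error
                self.restarts += 1  # "restart" the child by calling it again


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical worker that fails a fixed number of times, then succeeds."""
    remaining = {"failures": failures_before_success}

    def worker() -> str:
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise ConnectionError("database unavailable")
        return "done"

    return worker
```

With `Supervisor(max_restarts=3).run(make_flaky_worker(2))`, the worker fails twice, is restarted twice and then returns normally. A worker that keeps failing past the limit causes the supervisor to raise instead, which is the point where a real hierarchy would hand the problem to the next level up.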
System Level
Unlike application-level self-healing, system-level self-healing does not depend on a programming language or specific components. Rather, it can be generalized and applied to all services and applications, independent of their internal components.
The most common system-level errors include process failures (often resolved by redeploying or restarting) and response time issues (often resolved by scaling and descaling). Self-healing systems conduct health checks on different components and automatically attempt fixes (such as redeploying) to return them to their desired states.
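The check-and-fix cycle described above can be sketched as a single pass over a list of services. This is an illustration under stated assumptions: the `Service` record and its `is_healthy` and `restart` callables are hypothetical stand-ins for whatever probe and redeploy mechanism a real platform provides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe
    restart: Callable[[], None]     # hypothetical restart or redeploy action


def heal_once(services: List[Service]) -> List[str]:
    """Check every service once and restart the unhealthy ones.

    Returns the names of the services that were restarted.
    """
    restarted = []
    for service in services:
        if not service.is_healthy():
            service.restart()
            restarted.append(service.name)
    return restarted
```

In practice a function like this would run on a schedule. Note that it knows nothing about how each service is written, which is what makes this kind of healing independent of the application's internals.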
Hardware Level
Hardware level self-healing redeploys services from an unhealthy node to a healthy one. It also conducts health checks on different components. Since true hardware level self-healing (for example, a machine that can heal failed memory or repair a broken hard disk) does not exist, current hardware level solutions are essentially system level solutions.
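As a rough sketch of moving services off an unhealthy node, the function below takes a map of nodes to the services they run and reassigns everything from the unhealthy nodes. The `redeploy` function, the node names and the "fewest services first" placement rule are all assumptions made for illustration; real schedulers weigh many more factors.

```python
from typing import Dict, List, Set


def redeploy(placements: Dict[str, List[str]], unhealthy: Set[str]) -> Dict[str, List[str]]:
    """Return a new placement map with services moved off unhealthy nodes.

    Each displaced service goes to the healthy node that currently runs
    the fewest services (ties broken by node name).
    """
    healthy = {
        node: list(services)
        for node, services in placements.items()
        if node not in unhealthy
    }
    if not healthy:
        raise RuntimeError("no healthy node available")

    for node in sorted(placements):
        if node not in unhealthy:
            continue
        for service in placements[node]:
            target = min(healthy, key=lambda n: (len(healthy[n]), n))
            healthy[target].append(service)

    # Unhealthy nodes stay in the map, but with nothing deployed on them.
    result: Dict[str, List[str]] = {node: [] for node in placements if node in unhealthy}
    result.update(healthy)
    return result
```

The input map is left unmodified, so the caller can compare the old and new placements before acting on them.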
Reactive Healing
Reactive healing is healing in response to an error and is already in widespread use. For example, redeploying an application to a new physical node in response to an error, thereby preventing downtime, is reactive healing.
The desirable level of reactive healing depends on how much risk a system can tolerate. For example, if a system relies on a single data center, the chance of the entire data center losing power and taking every node down may be so slim that designing the system to respond to it is an unnecessary expense. However, if it is a critical system, it may make sense to design it to recover automatically after such an event.
Preventive Healing
Preventive healing aims to stop errors before they occur. Take the example of preventing slow-response errors by using real-time data. You periodically send an HTTP request to a service to check its health and to decide whether its resources are well used. If the service takes more than 500 milliseconds to respond, the system scales it, and if it responds in less than 100 milliseconds, the system descales it.
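The threshold rule can be sketched in a few lines of Python. The 500 and 100 millisecond limits come from the example above; the function names, the constant names and the "hold" outcome for the range in between are assumptions added for illustration.

```python
import time
import urllib.request

SCALE_UP_MS = 500.0    # slower than this: scale
SCALE_DOWN_MS = 100.0  # faster than this: descale


def measure_response_ms(url: str, timeout: float = 5.0) -> float:
    """Time one HTTP GET against a health endpoint, in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.monotonic() - start) * 1000.0


def scaling_decision(response_ms: float) -> str:
    """Map a measured response time to 'scale', 'descale' or 'hold'."""
    if response_ms > SCALE_UP_MS:
        return "scale"
    if response_ms < SCALE_DOWN_MS:
        return "descale"
    return "hold"
```

A caller would feed `measure_response_ms(...)` into `scaling_decision(...)` on each check. Keeping the measurement separate from the decision makes the rule easy to test without a network.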
However, relying only on real-time data can be troublesome if response times fluctuate a lot, because the system will scale and descale constantly. This can consume a lot of resources in a rigid architecture, and a smaller amount in a microservices architecture.
Combining real-time and historical data is a better (and also more complex) preventive healing approach. Using our response time example, you design a system that stores response time, memory and CPU information and uses an appropriate algorithm to process it alongside real-time data to predict future needs. So, if memory usage has been increasing steadily for the past hour and reaches a critical point of 90 percent, your system determines that scaling is appropriate, thereby preventing errors.
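One possible reading of the memory example, sketched in Python: scale only when the latest reading is at the critical level and the stored history shows a rising trend. The 90 percent threshold comes from the text above; the least-squares slope is just one simple choice of "appropriate algorithm", and the function names are hypothetical.

```python
from typing import Sequence

CRITICAL_MEMORY_PERCENT = 90.0


def trend_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples plotted against their index."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def should_scale(history: Sequence[float], current: float) -> bool:
    """Combine historical data (trend) with real-time data (current reading).

    history: past memory usage in percent, oldest first (for example, the last hour).
    current: the latest real-time memory usage in percent.
    """
    rising = trend_per_sample(history) > 0
    return rising and current >= CRITICAL_MEMORY_PERCENT
```

With a history of 60, 65, 70, 75, 80, 85 and a current reading of 91, this returns True. A sudden spike to 92 after a flat history returns False, which is how the historical data dampens the constant scaling and descaling that real-time data alone can cause.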
Principles
Roadmap
Designing systems and applications that are self-healing (or, even better, that automatically determine when errors might occur and prevent them) can improve quality, cut costs and boost customer trust. Even the best systems still require human intervention, but they can be designed so that the intervention is light-touch and easy for the human. Self-healing hardware, unlike self-healing software and services, is still in the sci-fi realm. That gap is leading to a newfound appreciation for biology and spurring fresh interest in biological computing.
Previously published at https://lansaar.com/self-healing-systems/