Microservice.add(observability) != Microservice.add(monitoring)

Written by kayalvizhi | Published 2020/06/22
Tech Story Tags: observability | microservices | resiliency | monitoring | services | inter-service-communication | logging | hackernoon-top-story

TL;DR: A single transaction can flow through many independently deployed microservices, or pods, and discovering where a performance bottleneck occurred provides valuable information. Microservices ain’t easy, but they’re necessary. Distributed systems are pathologically unpredictable, and an obvious area of added complexity is communication between services. A primary microservices challenge is understanding how the individual pieces of the overall system interact. In short, observability is not a panacea; it is the ability to make use of the data collected from the system.

If you are reading this, you are probably not a novice in the microservices field, so let me just scratch the surface before moving on to observable microservices. What was once a monolithic application has been transformed into a microservices-based application:
  • Each service owns a single responsibility.
  • The large database is also broken into smaller units, each managed by its respective service.
  • The services communicate with the external world via REST and GraphQL APIs.
  • They communicate with each other internally via an enterprise message bus.

Runtime Complexities with Microservices

Microservices ain’t easy, but they’re necessary. Distributed systems are pathologically unpredictable, and some things actually become more difficult. An obvious area of added complexity is communication between services.
A primary microservices challenge is understanding how the individual pieces of the overall system interact. A single transaction can flow through many independently deployed microservices, or pods, and discovering where a performance bottleneck occurred provides valuable information. Walking through a typical flow quickly gives a sense of this complexity.
  • As the sketch shows, OrderService calls InventoryService.
  • A bug has escaped the quality assurance gates and reached InventoryService’s production environment.
  • When a user places an order, OrderService checks with InventoryService for the availability of the goods.
  • InventoryService fails to respond because of the bug, and a connection timeout occurs.
  • OrderService responds to the user with “Something went wrong. Please try again after some time.”
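The flow above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the `Supplier` stands in for the remote call to InventoryService, and the class shape, method names and timeout value are hypothetical, not taken from a real system.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Sketch of OrderService. The Supplier is a hypothetical stand-in for the
// remote availability check against InventoryService.
class OrderService {
    static final String FALLBACK = "Something went wrong. Please try again after some time.";

    private final Supplier<Boolean> inventoryCheck;
    private final long timeoutMillis;

    OrderService(Supplier<Boolean> inventoryCheck, long timeoutMillis) {
        this.inventoryCheck = inventoryCheck;
        this.timeoutMillis = timeoutMillis;
    }

    String placeOrder(String item) {
        // Run the (simulated) remote call asynchronously so we can bound the wait.
        CompletableFuture<Boolean> check = CompletableFuture.supplyAsync(inventoryCheck);
        try {
            boolean available = check.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return available ? "Order placed: " + item : "Out of stock: " + item;
        } catch (TimeoutException | ExecutionException e) {
            // The user only sees a generic message; the real cause is left in the logs.
            System.err.println("inventory check failed for " + item + ": " + e);
            return FALLBACK;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return FALLBACK;
        }
    }
}
```

Notice that the caller learns nothing about why the order failed. That gap is exactly what the team has to close when they start debugging.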
When the team sets out to fix the issue and understand its root cause, where do they start? What do they search for? They look into the logs, the de facto tool for debugging in a production environment.
A log is an immutable, discrete, timestamped event: what happened, at what time, what was requested, what was sent, and so on.
With hundreds or a few thousand lines, reading them manually would be fair enough. But remember that we are not dealing with a development environment; these are production logs, where millions of events are recorded. Manual scanning is impossible, isn’t it?
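To make the idea concrete, here is a minimal sketch of a log event as an immutable value, with a filter that narrows a large list down to one service's errors in a time window. The field names and the `LogSearch` helper are assumptions made for illustration; in practice a tool such as ELK would do this searching at scale.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// An immutable, discrete, timestamped event (field names are illustrative).
final class LogEvent {
    final Instant timestamp;
    final String service;
    final String level;
    final String message;

    LogEvent(Instant timestamp, String service, String level, String message) {
        this.timestamp = timestamp;
        this.service = service;
        this.level = level;
        this.message = message;
    }
}

class LogSearch {
    // Returns ERROR events from one service with from <= timestamp < to.
    static List<LogEvent> errorsFrom(List<LogEvent> events, String service,
                                     Instant from, Instant to) {
        return events.stream()
                .filter(e -> e.service.equals(service))
                .filter(e -> e.level.equals("ERROR"))
                .filter(e -> !e.timestamp.isBefore(from) && e.timestamp.isBefore(to))
                .collect(Collectors.toList());
    }
}
```

Even this small filter only works if the events carry consistent, structured fields, which is a decision made when the service is coded, not when the incident happens.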

Observability

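Observability is not a new term. It has a long history stemming from engineering and control theory, where it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
Not so clear? Let us look at a few examples of systems that are coded for observability.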
Load Balancer
It is a reverse proxy that distributes application traffic across a number of application servers.
It also routinely monitors the health of the application server instances, so that if an instance is down, it avoids sending requests to the failed instance. Once the failed instance recovers, the load balancer routes traffic to it again. The load balancer uses health check pings to measure the health of the instances. All right, we know this. So what is observability here?
The load balancer does not know anything about the internals of the application. Instead, it knows the state of the system, meaning the health of the instances, from the external outputs, that is, the health check pings.
So, the load balancer is enabled with observable code, agree?
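A rough sketch of that behaviour, assuming a hypothetical `ping` predicate that stands in for the health check call and simple round-robin selection (real load balancers offer many more strategies):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Round-robin balancer that only routes to instances whose last ping succeeded.
class HealthAwareBalancer {
    private final Map<String, Boolean> healthy = new LinkedHashMap<>();
    private final Predicate<String> ping; // hypothetical health check: true means "up"
    private int next = 0;

    HealthAwareBalancer(List<String> instances, Predicate<String> ping) {
        this.ping = ping;
        for (String instance : instances) {
            healthy.put(instance, true); // assume healthy until the first check
        }
    }

    // Called periodically: the only thing the balancer learns is the external output.
    void runHealthChecks() {
        healthy.replaceAll((instance, previous) -> ping.test(instance));
    }

    String pick() {
        List<String> up = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : healthy.entrySet()) {
            if (entry.getValue()) {
                up.add(entry.getKey());
            }
        }
        if (up.isEmpty()) {
            throw new IllegalStateException("no healthy instances");
        }
        return up.get(Math.floorMod(next++, up.size()));
    }
}
```

The balancer never inspects the application itself. It infers the state of each instance purely from the ping result, which is the control-theory idea in miniature.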
Autoscaling
Auto-scaling helps us ensure that the right number of instances is available to handle the application traffic. It can launch or terminate instances based on the traffic.
In this example, the scaling policy is configured to launch new instances when the load crosses 65%. When utilisation is at 33%, it does nothing. When utilisation crosses 65%, it scales out, launching the new instances as configured.
Again, we all know this. But what is observability here?
It constantly observes utilisation through monitoring, and scales out and in based on the configured scaling policy. So, auto-scaling is enabled with observable code.
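The policy can be sketched as a small decision function. The 65% scale-out threshold comes from the example above; the scale-in threshold, the instance limits and the one-instance step are assumptions added only to make the sketch complete.

```java
// Decides the desired instance count from the observed utilisation.
class ScalingPolicy {
    private final double scaleOutAbovePercent; // 65 in the example above
    private final double scaleInBelowPercent;  // hypothetical, not given in the example
    private final int minInstances;            // hypothetical limit
    private final int maxInstances;            // hypothetical limit

    ScalingPolicy(double scaleOutAbovePercent, double scaleInBelowPercent,
                  int minInstances, int maxInstances) {
        this.scaleOutAbovePercent = scaleOutAbovePercent;
        this.scaleInBelowPercent = scaleInBelowPercent;
        this.minInstances = minInstances;
        this.maxInstances = maxInstances;
    }

    int desiredInstances(int current, double utilisationPercent) {
        if (utilisationPercent > scaleOutAbovePercent) {
            return Math.min(current + 1, maxInstances); // scale out by one instance
        }
        if (utilisationPercent < scaleInBelowPercent) {
            return Math.max(current - 1, minInstances); // scale in by one instance
        }
        return current; // within the band: do nothing
    }
}
```

With a policy built as `new ScalingPolicy(65, 20, 1, 5)`, a reading of 33% leaves the count unchanged, while a reading of 70% asks for one more instance.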

Observability is Not APM

Observability might mean different things to different people. So, is observability the new monitoring? Like any IT trend, it is difficult to pin down, and many draw conclusions without much analysis. For some, it’s the old wine of monitoring in a new bottle.
But observability is not APM (Application Performance Monitoring).
So, what is observability? Logs, metrics and traces are often known as the three pillars of observability. There are many powerful tools in the open source and commercial markets, such as ELK, Prometheus and Zipkin.
Simply having these tools configured does not mean that our application is observable. They generate a myriad of events and logs. The real question is what needs to be observed so that our application is resilient.
So, in short, observability is not a panacea; it is the ability to make use of the data inferred from these tools.
So, how do we use the inferred data?
  • Retry & Schedule for Later - When an incident occurs, we can orchestrate different services to retry at different times. Retrying immediately may not help; adding a back-off time would.
  • Fail Over & Fall-back - When a downstream application fails for any reason, a fallback service call reduces the failure rate and increases the resiliency of the system.
  • Notify Controllers - Code the services to send notifications about failures.
  • Communicate - Communicate gracefully with the caller of the service about what is going on, when the request is likely to be fulfilled, and so on.
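The first three ideas can be combined in a small plain-Java helper. This is a sketch under assumptions, not a production implementation: the method name, the doubling back-off and the `notifier` hook are all hypothetical choices made for illustration.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

class ResilientCall {
    // Tries the primary call up to maxAttempts times, doubling the wait between
    // attempts, reports every failure to the notifier, then uses the fallback.
    static <T> T withRetryAndFallback(Supplier<T> primary, Supplier<T> fallback,
                                      int maxAttempts, long initialBackoffMillis,
                                      Consumer<String> notifier) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return primary.get();
            } catch (RuntimeException e) {
                notifier.accept("attempt " + attempt + " failed: " + e.getMessage());
                if (attempt == maxAttempts) {
                    break; // out of attempts, go to the fallback
                }
                try {
                    Thread.sleep(backoff); // back off instead of retrying immediately
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
                backoff *= 2;
            }
        }
        return fallback.get();
    }
}
```

A caller might pass a cached inventory lookup as the fallback and a method that posts to an alerting channel as the notifier. The fourth idea, communicating gracefully, is then a matter of what the caller tells the user when the fallback was used.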
Resilience4j is a lightweight fault tolerance library inspired by Netflix Hystrix. I like its lightweight, modular structure, where I can pull in specific modules for specific capabilities such as circuit breaking, rate limiting, retry and bulkhead. I have used it to code the observable microservices in our organisation.
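To show what circuit breaking means, here is a deliberately simplified plain-Java sketch. It is not Resilience4j's API or implementation: the library offers a much richer, configurable circuit breaker, whereas this version just counts consecutive failures. The class name, thresholds and injectable clock are assumptions for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified circuit breaker: opens after N consecutive failures, then
// short-circuits to the fallback until the open period has elapsed.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injectable time source, in milliseconds
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // do not even try the failing dependency
        }
        try {
            T result = action.get(); // closed, or a trial call after the open period
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // open (or re-open) the circuit
            }
            return fallback.get();
        }
    }
}
```

In a service, the clock would typically be `System::currentTimeMillis`. The breaker's open or closed state is itself a useful signal to expose as a metric, since it tells you which dependency the system has stopped trusting.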

Conclusion

The goal of observable microservices is not to collect logs, traces and metrics. It is to build a culture of engineering based on facts and feedback.
Observability is about being data-driven, especially during debugging, and it thereby gives the SRE/developer team simplified monitoring.

Written by kayalvizhi | Principal Software Architect, Microservices & Cloud Computing enthusiast, Hands-on Java Developer
Published by HackerNoon on 2020/06/22