Nicolas A Perez

@anicolaspp

Circuit Breaker in Lagom Microservices

One of the main goals behind microservices is to build systems or subsystems that are isolated from each other, where each one has a single responsibility and, by working together, they form bigger and more complex systems.

In most cases, we need communication between different services. That communication should be resilient and fault-tolerant. It is important to guard against cascading failures by avoiding the propagation of errors across services.

Let’s compare Lagom to other widely adopted libraries.

Lagom

Defining microservices in Lagom is an easy task. First, we need to define our service interface, or contract. This contract is what will be exposed to clients.

For simplicity, let’s implement a small microservice that returns a list of countries and then use it to show some important constructs.

The service exposes /countries, which should return a list of Country objects.

I don’t intend to go over how Lagom works, but you can look at the Lagom docs.

Now, we need to implement our service.

Notice that the list of countries is coming from the countriesRepository and the service is using the underlying data access implementation through the ICountriesRepository interface.
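
The shape described above can be sketched in plain Java. This is only a minimal stand-in that uses the JDK's CompletionStage and leaves out Lagom's own service descriptor types. The single name field on Country, the InMemoryCountriesRepository class and the CountriesServiceSketch class are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

// Assumed shape of the Country class; the real one may carry more fields.
final class Country {
    final String name;

    Country(String name) {
        this.name = name;
    }
}

// Data access contract that the service depends on.
interface ICountriesRepository {
    CompletionStage<List<Country>> getCountries();
}

// Hypothetical in-memory implementation, used here only for illustration.
final class InMemoryCountriesRepository implements ICountriesRepository {
    @Override
    public CompletionStage<List<Country>> getCountries() {
        return CompletableFuture.completedFuture(
            Arrays.asList(new Country("Cuba"), new Country("Canada")));
    }
}

// The service reaches the data layer only through the interface, so a
// failure in the repository surfaces as a failed CompletionStage.
final class CountriesServiceSketch {
    private final ICountriesRepository countriesRepository;

    CountriesServiceSketch(ICountriesRepository countriesRepository) {
        this.countriesRepository = countriesRepository;
    }

    CompletionStage<List<Country>> getCountries() {
        return countriesRepository.getCountries();
    }
}
```

Because the service depends on the interface rather than on a concrete class, the data access implementation can be swapped without touching the service.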

This service will be used by others, potentially by another service. Whoever uses the CountriesService implementation is a client of the service.

At this point, we have introduced at least two points of failure: the first is the CountriesService itself, and the second is the underlying data access that the CountriesService uses. The question is how to manage failures of these dependencies so that they don't cascade and cause entire subsystems to fail.

Circuit Breakers

Circuit Breakers are a standard way to avoid cascading failures by stopping downstream calls once a failure has occurred on the downstream dependency.

There are a few things we need from a circuit breaker.

  • It should open itself once a number of consecutive errors have occurred on a downstream service.
  • It should close itself after a certain time AND once the downstream service has recovered.
  • It should offer some kind of monitoring to help us understand the state of the breaker.
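
The three requirements above can be sketched as a small state machine. This is a simplified illustration, not the implementation used by Lagom or Hystrix. The class name SimpleCircuitBreaker, its parameters and the injectable clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Minimal circuit breaker sketch; not the implementation used by Lagom or Hystrix.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;          // consecutive failures before opening
    private final long resetTimeoutMillis;  // time to stay open before allowing a trial call
    private final LongSupplier clock;       // injectable time source in milliseconds

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    // Monitoring hook: reports the current state, moving from OPEN to
    // HALF_OPEN once the reset timeout has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= resetTimeoutMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs the downstream call unless the breaker is open.
    <T> T call(Supplier<T> downstream) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit breaker is open");
        }
        try {
            T result = downstream.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    // A successful call (including a trial call) closes the breaker.
    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    // Opens after maxFailures consecutive errors, or on a failed trial call.
    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= maxFailures) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

The breaker only closes again when a trial call succeeds after the timeout, which covers both halves of the second requirement: time has passed and the downstream service has recovered.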

The way to go for Circuit Breakers seems to be Hystrix, an open source library created by Netflix. Hystrix provides a solid implementation of circuit breakers that has been widely adopted by the industry.

Even though Hystrix provides a great number of circuit breaker patterns, there is a cost associated with adding such a library to an existing, ongoing project. Code complexity tends to increase, and each new component comes with a learning curve for the development team.

Let’s do a comparison between what Hystrix offers and what we already have in the Lagom framework.

Managing Data Access Failures

From the point of view of CountriesService, the underlying implementation of the data access to retrieve the list of countries is a potential point of failure. Network call timeouts, connection drops, or degradation of the data access subsystem are only a few examples of what can go wrong.

By using Hystrix we can avoid degradation of our service when there is a problem with the data access layer.

From the point of view of CountriesService, ICountriesRepository is a dependency. Since circuit breakers are a client's concern, we should manage them on the client side, which is CountriesService in this case.

We can create a Hystrix command that wraps the call to ICountriesRepository in the following way.

Our service needs to be changed so that it uses the command instead of calling the repository directly.

If there is a problem accessing the underlying data that ICountriesRepository uses, the circuit breaker (the Hystrix command) will be activated as expected.

Notice that the complexity of this pattern increases rapidly since for every call to the ICountriesRepository we will need to create a Hystrix command.

Based on the microservice principle that each microservice must own its data, should we say that if a microservice fails to access its underlying storage (Cassandra, Kafka, MySQL, etc.), the microservice itself has failed?

Lagom takes this into consideration: exceptions in the data layer of a service can be propagated through the service, which activates a circuit breaker at the service level that protects the clients of our service. This implies that the extra protection we can get from Hystrix is redundant at this point.

Data as a Service

Let’s suppose our service’s data comes from another service. In this case, our service is a client of the data service. Let’s look at an alternative implementation of our service.

In this case, our service is almost identical. We are wrapping the call to the dependency within a Hystrix command that manages the circuit breaker logic if something goes wrong with the dependency.

In some cases, we want to provide a fallback result if our dependency is down. We can do that within the Hystrix command by implementing getFallback() in the following way.

If there are errors on the ContentService dependency, the fallback value will be provided until the breaker is closed, which means our dependency is ready to be used again.
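
The command-with-fallback shape can be illustrated without the library itself. The sketch below is a hand-rolled stand-in that only mirrors the run() and getFallback() method pair. It is not the Hystrix class and it tracks no breaker state. FallbackCommand, RemoteCountriesSource and GetCountriesCommand are hypothetical names.

```java
import java.util.Collections;
import java.util.List;

// Hand-rolled stand-in that mirrors the run()/getFallback() shape.
// This is NOT the Hystrix class; breaker state tracking is left out.
abstract class FallbackCommand<R> {
    // The call to the dependency goes here.
    protected abstract R run() throws Exception;

    // Value to return when the dependency call fails.
    protected abstract R getFallback();

    // Runs the wrapped call and returns the fallback if it throws.
    final R execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }
}

// Hypothetical client interface for the remote dependency.
interface RemoteCountriesSource {
    List<String> fetchCountries() throws Exception;
}

// One command class is needed per call to the dependency.
final class GetCountriesCommand extends FallbackCommand<List<String>> {
    private final RemoteCountriesSource source;

    GetCountriesCommand(RemoteCountriesSource source) {
        this.source = source;
    }

    @Override
    protected List<String> run() throws Exception {
        return source.fetchCountries();
    }

    @Override
    protected List<String> getFallback() {
        // An empty list is an assumed fallback; a cached value would also work.
        return Collections.emptyList();
    }
}
```

Even in this reduced form, every call to the dependency needs its own command class, which is where the extra code comes from.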

On the other hand, Lagom offers the same functionality already without introducing new concepts. Let’s see how.

Notice that we are using .exceptionally to provide the fallback value. Also, Lagom activates a circuit breaker when the service dependency malfunctions, which is the same behavior we get from Hystrix. At the same time, we don't need to create a Hystrix command for each call to the dependencies, which reduces the complexity of our code.
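
The .exceptionally function is part of the JDK's CompletionStage, so the fallback itself can be shown with plain Java. This sketch assumes the dependency call returns a CompletionStage of country names and that an empty list is an acceptable fallback. The ExceptionallyFallback class and its methods are hypothetical, and the circuit breaker that Lagom places around the call is not part of the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

final class ExceptionallyFallback {
    // Returns the dependency's result, or an empty list if the stage failed.
    static CompletionStage<List<String>> withFallback(CompletionStage<List<String>> call) {
        return call.exceptionally(error -> Collections.<String>emptyList());
    }

    // Helper that builds an already-failed stage, standing in for a
    // dependency that is down.
    static <T> CompletionStage<T> failed(Throwable error) {
        CompletableFuture<T> future = new CompletableFuture<>();
        future.completeExceptionally(error);
        return future;
    }
}
```

The fallback is attached at the call site with a single method call, so no extra class is needed per dependency call.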

Service Clients

From the point of view of our client (a service that calls our service in order to obtain certain functionality), it does not matter which circuit breaker library we use, since circuit breakers are the client's responsibility, not the service's. Our clients should have a way to monitor exceptions coming from our service so they can take the required steps to avoid cascading failures.

If the client of our service is another Lagom service, this functionality is already built in, as we saw when using ContentService. If the client of our service is another kind of application, it can use whatever circuit breaker strategy and library it likes. Again, this is a client concern, not a service concern.

Conclusions

  • If, when implementing a Lagom service, we need to call dependencies, these dependencies are accessed through managed and unmanaged Lagom services. Calls to them already have circuit breakers built in, plus fallback functionality through the .exceptionally function.
  • If our Lagom service accesses a data layer directly and this access fails, we might want our service to fail, since the data is owned by the service and the service cannot perform its function properly. This should activate breakers on the client side of our service so that clients take the required measures.
  • Hystrix is very powerful, but at this level, it does not offer different functionality from what we already have in Lagom.
  • Lagom's circuit breaker functionality is much simpler than what Hystrix offers, but it integrates well with the other Lagom pieces, keeps things simple, and reuses already known language features.
  • When using Lagom, we should only throw exceptions at the service layer to signal an internal service error, so that clients of our service take the necessary steps to prevent cascading failures. We tend to throw exceptions to express business logic; this should be avoided and modeled in a different way, since exceptions activate circuit breaker logic in the upstream services even when there is no actual failure in our service.
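
One way to keep business outcomes out of the exception channel, as the last point suggests, is to return them as values. The sketch below assumes a lookup by country code in which "not found" is an expected result. CountryLookup and findByCode are hypothetical names, and the in-memory map stands in for a real data source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

final class CountryLookup {
    // Hypothetical in-memory data, standing in for a real data source.
    private final Map<String, String> namesByCode = new HashMap<>();

    CountryLookup() {
        namesByCode.put("CU", "Cuba");
        namesByCode.put("CA", "Canada");
    }

    // A missing country is an expected business outcome, so it is returned
    // as an empty Optional; the stage itself still completes successfully.
    CompletionStage<Optional<String>> findByCode(String code) {
        return CompletableFuture.completedFuture(
            Optional.ofNullable(namesByCode.get(code)));
    }
}
```

With this shape, an unknown code never looks like a service failure to the caller, so only genuine errors are left to count against a circuit breaker.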

If you really need circuit breakers around your data access calls (Cassandra and others), Hystrix seems to be a good choice.
