With the evolution of micro services, software development world has changed for ever. With a smaller code to maintain, clearer contract to follow and ability to push the components to production independently, developers' life has become much easier.
With smaller code at hand, it also brought opportunities to experiment with newer technology stacks/architectural patterns e.g. Asynchronous message based communication, communication over JSON(Rest), in memory caching and more.
Looking at all these, it paints a very rosy picture of how a highly available and scalable enterprise software will look like and how it fits in agile methodology of delivering features faster and at frequent intervals.
While most of it is true, faster delivery of end to end features are still alluding most of the development teams and high availability doesn’t come for free even in the world of micro services.
What’s stopping teams from delivering end to end features faster? Why availability is still a concern? Let’s take a closer look.
What micro services and the newer technology stacks promised is to cut down on the the development effort significantly and it surely does. But does it also cut down other parts of the sdlc phase?
While it simplified the development cycle, micro services added a significant overhead on monitoring, identifying the failure point and testing in general.
Each service has several dependencies now. Some of them are direct dependencies and many of them are transitive.Each team keeps on adding new features which in turn adds new dependencies for the upstream clients. Keeping a tab on someone else’s world is practically impossible. So the very first challenge is to setup an environment with all the dependencies where functional end to end test cases can run. In many cases all these services can fit and run on one single box. For such cases life is still quite manageable. There comes a times when all these services may not run on the same box because of many underlying infrastructure gotchas, e.g. Multiple services listening on the same ports, too many services to handle on one box, conflicting dependencies(Java 1.7 vs Java 1.8).
The more and more services we have, chances of getting into this mess is very high.
The resiliency of the system comes from the fact that every part of the system is able to handle at least some reasonable level of errors and faults. Be it a downstream service unavailability, network latency, caching issues or db issues; distributed system is full of such implicit non functional requirements and corresponding handling.
Our testing can’t be completed until we do a certain level of negative/fault tolerance testing.
Setting up environment for this is even harder. Considering hundred of tests needing fault at different levels and each setup needing a whole bunch of boxes, the scenario here looks pretty scary. It won’t be an exaggeration to say that many teams/companies actually won’t have a setup to do such testing at scale.
The approach we are going to discuss is to add an intelligent routing component/proxy layer which can inspect the requests and forward it to the right services and if required inject the desired fault during testing.
We can have multiple instances of the same component running to make sure higher availability of the test infrastructure.
For simplicity let’s say we have 3 services providing restful endpoints for different entities. ServiceA is listening on port A and has a downstream dependency on ServiceB listening on port B which further depends on ServiceC which listens to port C.
Lets also assume we have a common header request-id which will identify the whole request flow and will be passed along in all the downstream/side stream service calls.
Test Setup with Proxy layer handling proxying and fault injection
We can use nginx+lua to setup the proxy layer.
You can read more about nginx and lua here:
We will be using openresty for the proxy implementation.
Requirements for the proxy:
We can use Redis to store service and hosts mapping. For each request for a specific service our Lua code can get the upstream details from Redis and forwards the request in round robin fashion.
sample redis entry:
Redis Key: service_discovery
Sub key: 9001
value: server1, server2
Basic approach: in test setup step generate a request-id and store the required fault details in a shared place which can be accessed by routing layer. Routing layer, upon receiving the request, inspects the request-id and checks against Redis if there is any fault injection required in the immediate service call. If not, then it continues as earlier, else, it parses the stored fault information and responds accordingly.
The format could look like:
Redis Key: faultinjection
Sub key: request_f7a7bb674caec