Swapping out your monolith for micro-services (or just using micro-services in the first place) is, in theory, an obvious thing to do. After all, the larger your monolith (“one big thing”!) gets, the harder it is for you to keep track of everything in it. The architecture eventually gets to a place where you just can’t visualize it all in your head. Oh, loose-coupling helps, as do componentization, layered architectures, queues/service-buses, and whatnot, but, in the end, you’re limited by the irreducible complexity of the system.
So, you do the obvious. You extend loose-coupling to the breaking point by, literally, breaking apart the individual components into distinct micro-services. Now, each of these micro-services can be visualized in its entirety by its development team, and you can task somebody else with visualizing how all the individual components fit together.
So simple, right? You’ve basically decoupled the individual components of your system, and ushered in a new era of speed, efficiency, and apple-pie for lunch.
Or have you?
The thing is, “decouple” above carries a lot of water! If, for example, the API to one of the components is constantly changing, then every other team that uses this API will need to constantly refactor its implementation to keep up. So yeah, you need to spend a lot of time up-front making sure that you have strong (and stable!) APIs/Interfaces/Contracts between your components.
(And that’s a good thing, not just for micro-services, but for everything that you build 🙌)
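To make “stable contract” a little more concrete, here is a minimal sketch of a check that flags breaking changes between two versions of a response shape. Everything in it is an assumption made up for illustration: the `breaking_changes` helper, the `ORDER_V*` contracts, and the idea of describing a contract as a plain field-to-type mapping. Real setups tend to lean on schema tooling instead, but the rule of thumb is the same: adding fields is usually safe, while removing or retyping them breaks your consumers.

```python
from typing import Dict, List

# Hypothetical contract format: field name -> type name, as published by a service.
Contract = Dict[str, str]


def breaking_changes(old: Contract, new: Contract) -> List[str]:
    """List the changes from old to new that would break existing consumers."""
    problems = []
    for field, type_name in old.items():
        if field not in new:
            # Consumers reading this field would now fail.
            problems.append(f"removed field: {field}")
        elif new[field] != type_name:
            # Consumers parsing this field would now get the wrong type.
            problems.append(f"changed type of {field}: {type_name} -> {new[field]}")
    # Fields that exist only in `new` are additive and are not flagged.
    return problems


# Illustrative contracts (made up for this sketch).
ORDER_V1 = {"id": "string", "total_cents": "int"}
ORDER_V2 = {"id": "string", "total_cents": "int", "currency": "string"}  # additive
ORDER_V3 = {"id": "string", "total": "float"}  # drops total_cents
```

Run against these, V1 to V2 comes back clean, while V1 to V3 reports the removed `total_cents` field. A check along these lines could sit in the provider's build, so that the team changing the API hears about the breakage before the teams consuming it do.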
That said, your individual services also have soft dependencies. Maybe you decided to use Cassandra for the time-series data in your component. You know that everybody else is using Postgres, right? And that while it’s not that much worse for you to have used Postgres instead, the operations load of maintaining — and paying for! — Cassandra is going to be waaaaay higher?¹
The point here being that while some things are necessary, such as making sure that the contracts between components are clear and well-defined, others, such as the opacity of the implementation, are much more loose and ill-defined. You will need to trade off individual efficiency for the greater good (Cassandra vs Postgres above), and this co-ordination burden will actually be greater than it was when you were just working with that monolith. After all, in monolith-world, it was axiomatic that everybody used Postgres (“That’s the database. You want to store time-series data? Use it!”), but now architectural decisions need to be co-ordinated across teams!
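One lightweight way to surface these soft dependencies is to have each service declare them, and then flag anything that strays from what the rest of the organization already runs. The sketch below is purely illustrative: `APPROVED_DATASTORES`, the `SERVICES` manifest, and `needs_review` are all hypothetical names, and the service names are invented. The output is a conversation starter, not a veto.

```python
# Hypothetical org-wide list of datastores that operations already supports.
APPROVED_DATASTORES = {"postgres"}

# Hypothetical per-service declarations, e.g. collected from each team's repo.
SERVICES = {
    "billing": {"datastore": "postgres"},
    "metrics": {"datastore": "cassandra"},
}


def needs_review(services, approved):
    """Return the names of services whose datastore is not on the approved list."""
    return sorted(
        name for name, spec in services.items()
        if spec["datastore"] not in approved
    )
```

Here `needs_review(SERVICES, APPROVED_DATASTORES)` flags only the `metrics` service, which is exactly the Cassandra conversation from above, just raised before the operations bill arrives.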
And that, my friends, is the point I’m trying to make. Moving to micro-services is not a free pass when it comes to implementation. At the end of the day, you are building out a distributed system, and if you want one that is fault-tolerant, you have to internalize that your choices will impact other people.
When you decoupled your system, all those interactions didn’t magically vanish; they just moved into a different domain. In a typical monolith, while you get architectural decisions that are pretty consistent across the whole system, you have human co-ordination issues that are a righteous PITA. For example, an updated database schema might ripple across the whole system, resulting in a bunch of different teams having to co-ordinate their development (and release dates!) to get this implemented, slowing pretty much everything else down. The larger your system gets, the more…entertaining 🙄…the process of actually getting releases out the door becomes.
With micro-services though, you trade off these human co-ordination issues for system co-ordination issues. You are, in effect, building a distributed system, and to make sure that it is robust and resilient, you have to make sure that you’ve got all your ducks in a row. Dependency management, version control, deployment pipelines, concurrency and deadlocks, oh, there are an infinity of issues that crop up in this world, and you need to deal with all of them.
“Complexity never goes away, it just moves up the food chain”. It’s true, and it’s just what you did. You traded off the PITA of human co-ordination for the entirely different — but equally horrific — PITA of system co-ordination. If you think your life got easier, well, it didn’t. It’s just that the illusion of control is much greater with system co-ordination issues, and this illusion allows us to delay the inevitable day of reckoning till we can actually afford to deal with it, or we’ve gone out of business. Mind you, now that I think of it, that’s not such a bad tradeoff…
¹ I recall a scenario where one team used the TICK stack, while all the other teams used Prometheus. The rationale was that the lead dev on that team had never used Prometheus, and went with TICK because “speed to market”. It took quite a while to unwind the whole mess…