Whoever came up with the famous saying that you can have only two of three options – fast, cheap or good – was probably not an observability engineer.
But they might as well have been, because when it comes to observability, deciding whether you want fast, affordable or in-depth insights has traditionally been one of the first tasks for engineering teams. Historically, the monitoring architectures and tools we depended on didn't allow us to have it all, at least not in the context of distributed, cloud-native apps.
Fortunately, that has changed. It turns out that, by rethinking the fundamentals of your approach to monitoring, you can have it all – when it comes to observability at least. You can get deep insights quickly, without paying through the nose for your monitoring workflow.
Sounds too good to be true? Well, keep reading for a look at how to square the circle surrounding cost and depth (and speed, too) in regard to cloud-native observability.
The engineers of generations past didn't have to choose among cost, depth and speed when it came to observability. They got to have it all without even trying.
This is because they were working with centralized, monolithic apps. In that context, it didn't cost much time or money to collect monitoring data in quantities sufficient to enable fully informed management decisions. In most cases, you simply integrated a lightweight monitoring SDK into your monolith and let it collect basic metrics and log data. The design was simple, the process was simple and the costs were low.
This approach worked perfectly well in a world dominated by monoliths and single-node application deployments. No one questioned it, because there was no reason to question it.
The problem that many engineering teams have run into over the past decade or so is that, when you take a conventional, monolith-friendly monitoring architecture and try to graft it onto a distributed, cloud-native app, you can no longer have it all.
You can't necessarily monitor quickly because integrating an SDK into all your microservices takes time. Plus, you have loads more data to collect, since you're not just dealing with some basic metrics from a single app. Instead, you’ve got a whole host of logs and metrics (and don't forget your traces!) from a bunch of microservices.
It's also very hard to collect all of this cost-effectively. You can run into steep egress bills just to move the data into a place where you can analyze it, and then you have to pay storage fees on top of that. Glacier-tier storage may be cheap, but it adds up when you have reams and reams of monitoring data that you need to retain for years.
Now, one way to speed up cloud-native observability and reduce costs would be to collect only a random subset of the data, instead of trying to collect and analyze every single log, metric and trace available to you. But then you'd be sampling – and no one wants to sample blindly, because random sampling means you may miss important information simply because you never collected or analyzed it.
So, if you depend on a traditional observability strategy for distributed apps, you end up facing what I like to call the cost-depth trade-off. You can observe quickly and cheaply. Or you can observe in depth, but at a high cost in terms of time and effort. You can't have it all.
Fortunately, things don't have to be this way. If you step back and rethink your approach to observability, you realize that you can have observability that is both affordable and in-depth.
Here's the trick: Instead of trying to collect and analyze every bit of data available, or sampling it randomly, you sample it intelligently by identifying the most interesting data right at the source and sending only that data to your observability platform. You also translate the data into granular, actionable metrics, so it's primed for analysis as soon as it hits your observability platform.
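To make that a little more concrete, here's a minimal sketch of what selecting data at the source can look like. All of the names, thresholds and classes below are hypothetical, not any particular product's API: full-fidelity records are kept only for errors and slow requests, while everything else is folded into compact per-route metrics before it ever leaves the node.

```python
from collections import defaultdict
from dataclasses import dataclass

SLOW_THRESHOLD_MS = 500  # hypothetical cutoff: anything slower is "interesting"


@dataclass
class Span:
    route: str
    duration_ms: float
    status_code: int


class SourceSideCollector:
    """Hypothetical collector that runs right next to the workload."""

    def __init__(self):
        self.interesting = []               # full-fidelity records to ship
        self.counts = defaultdict(int)      # cheap per-route aggregates
        self.latency_sums = defaultdict(float)

    def observe(self, span: Span) -> None:
        # Keep errors and slow requests verbatim; everything else only
        # contributes to the aggregate metrics.
        if span.status_code >= 500 or span.duration_ms > SLOW_THRESHOLD_MS:
            self.interesting.append(span)
        self.counts[span.route] += 1
        self.latency_sums[span.route] += span.duration_ms

    def flush(self):
        # Only the interesting spans plus compact metrics leave the node.
        metrics = {
            route: {
                "count": self.counts[route],
                "avg_latency_ms": self.latency_sums[route] / self.counts[route],
            }
            for route in self.counts
        }
        return self.interesting, metrics


collector = SourceSideCollector()
collector.observe(Span("/checkout", 42.0, 200))    # folded into metrics only
collector.observe(Span("/checkout", 812.0, 200))   # kept in full: slow request
spans_to_ship, metrics_to_ship = collector.flush()
```

The design choice is the point: the decision about what's worth keeping happens next to the workload, not after everything has already been shipped and stored.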
Still too good to be true, right? The thing is, an instrumentation-based approach to monitoring essentially means that in order to measure key things about your code, you have to “wrap” parts of it with external code segments that manage the monitoring tool's logic.
Make that logic too complex, and you pay for it in overhead. Generally speaking, the more you analyze the data coming out of your application before sending it on, the bigger the overhead's impact on the application itself.
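As a rough illustration (the names here are made up, not a real SDK), a wrapped function in Python might look like the sketch below. Whatever the wrapper does runs inline with every call, so the heavier its logic, the more each request pays:

```python
import time
from functools import wraps


def record_duration(name: str, seconds: float) -> None:
    # Stand-in for a real exporter; assume it buffers the measurement
    # and ships it to the monitoring backend later.
    pass


def instrumented(fn):
    """Hypothetical monitoring wrapper around an application function."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Everything in this block runs on every single call. The more
            # analysis you do here (evaluating sampling rules, serializing
            # payloads, inspecting arguments), the bigger the overhead.
            record_duration(fn.__name__, time.perf_counter() - start)
    return wrapper


@instrumented
def handle_checkout(order_id: str) -> None:
    ...  # your application logic
```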
That's why conventional monitoring tools fall back on simple logic: collect it all, or sample it randomly.
Without diving too deep into the details, that limitation can be lifted by emerging technologies like eBPF, which move data collection and filtering into the operating system kernel rather than into your application code.
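As one small, hedged illustration of the idea (assuming a Linux host with the bcc toolkit installed and root privileges), the sketch below counts tcp_sendmsg calls per process entirely inside the kernel using an eBPF hash map. User space reads only the aggregated counts, and the application itself is never wrapped or modified:

```python
import time

from bcc import BPF  # assumes the bcc toolkit is installed

# The eBPF program aggregates in kernel space: a hash map keyed by PID.
prog = r"""
BPF_HASH(counts, u32, u64);

int kprobe__tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0;
    u64 *val = counts.lookup_or_try_init(&pid, &zero);
    if (val) {
        (*val)++;
    }
    return 0;
}
"""

b = BPF(text=prog)
print("Counting tcp_sendmsg calls per PID for 10 seconds...")
time.sleep(10)

# Only the compact, pre-aggregated counts cross into user space.
for pid, count in b["counts"].items():
    print(f"pid {pid.value}: {count.value} sends")
```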
The result is much less data for you to transfer, analyze and store. But at the same time, because you're focusing on relevant data, you don't have to lie awake at night wondering if the data you've selected will reliably yield the in-depth insights you need.
If you like analogies, here's one that sums up the observability strategy we're talking about here: It's akin to sorting through each bundle of dried grass as you build a haystack, checking to see if there's a needle inside.
That way, you catch the needles early on, without having to wait until you have a whole haystack in place to sort through it and try to pull out the needle. The needles are easy to find because they never get buried inside the haystack in the first place. In fact, you have no haystack to deal with at all, because you can discard the hay you don't care about before it turns into a costly mess of a haystack.
I think that this approach to observability – which is the one that inspired the design of groundcover – is key for any team that wants to maintain reliable visibility into its systems without paying a fortune in monitoring and data storage costs. The volume and complexity of cloud-native log, metric, and tracing data will only increase, making conventional monitoring and observability strategies less and less viable for cloud-native environments.
If you want to have it all, you can. Adopt an observability architecture that lets you home in on relevant data right at the source, and you're golden.