The nature of application development has transformed over the last decade. Ten years ago, when I was building software, even the coolest new startups were building Rails monoliths with centralized logging and reserved VMs.
Since then, we’ve seen microservices, containers, distributed systems, and edge computing explode in popularity. New tools have allowed us to scale our teams and software in ways I couldn’t have imagined a decade ago, but while these approaches have made environments more dynamic, they’ve also made software more complex and fragmented. It’s tough to get a holistic view of your entire application when it’s running on 200 pods deployed to 12 nodes distributed across the world.
Observability has become both more critical and more difficult for operations teams that need to track, manage, and optimize the performance and availability of these environments.
I’ve been talking to a lot of people about observability lately, and I’ve realized that the complexity of modern application architectures creates an enormous challenge for executives and technical decision-makers. In this post, I’ll offer some insights from others in the industry on these issues and some thoughts on the tools and best practices that can help make observability more manageable.
Developers, DevOps teams, and decision-makers need a way to extract data from their systems so they can identify potential failures in real time. Gabriel Rodriguez, Developer Advocate at Tyk, shared several ways that observability is used across the organizations they work with:
“Developers will need insights into transactional processing…The SRE/DevOps function will need telemetry data for the environment…The IT or Security Manager may need insights into throttling, port exposure, whitelisting, policy adherence, & data integrity.”
Traditionally, if your entire application ran on a single server (or just a few replicas), you could pipe all the logs into a central store and have a pretty good idea of what was going on. Unfortunately, this is rarely the way modern applications are built. Between containers, PaaS, serverless, and self-hosted services, observability data is now widely scattered across multiple clouds.
Aggregating and organizing observability data is a huge challenge, but just because it’s hard does not mean we can ignore it.
“As more organizations describe their computing environments as IoT, micro-service, multi-cloud, ‘real-time’, distributed computing…observability needs to be at the forefront of the design process. Logging, traffic, transactional messaging, and telemetry data are equally important as feature development.” - Gabriel Rodriguez, Tyk.io
Nate Matherson, Co-founder & CEO of ContainIQ, agreed, pointing out that observability is important for both engineering teams and end-users of a product:
“For engineering teams, by increasing focus on observability, they can get better insight into the root causes of issues that happen and improve system performance. For the end-users of a product, a businesses' focus on observability means better reliability, less downtime, and a more productive environment.” - Nate Matherson, ContainIQ
Observability is also important for decision-makers. Research shows that more than ever, end-users expect a personalized experience. Insight into how customers interact with your application and how your application performs during those interactions allows your product team to better support the customer experience.
As I pointed out, many traditional monitoring tools are designed for monolithic systems, often observing the health and behavior of a single application or server. Just a few of the challenges I’ve seen include:
The accelerated rate at which new technology is released and implemented has resulted in an unmanageable volume of data and more complex, dynamic environments to monitor. It’s impossible for IT teams using manual tools or traditional monitoring to understand how everything in their environment works together. Teams need ways to understand interdependencies and eliminate blind spots across the expanding environments.
Containers and microservices provide the speed and agility necessary for modern application development. But the dynamic nature of microservices architecture creates issues with real-time visibility into workloads running within containers.
Without proper tooling, IT teams can’t complete end-to-end tracing from user requests through microservices to isolate the root cause of anomalies. So, they have to turn to the engineers and architects who built the system or (in the worst case) simply guess what went wrong.
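To make end-to-end tracing a little more concrete, here is a minimal sketch in plain Python with no tracing library involved. It assumes each service records a span that carries a shared trace ID plus the ID of its parent span. The `Span` fields, the service names, and the `slowest_path` helper are all hypothetical and only meant to illustrate the idea, not how any particular tool works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one user request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: float


def slowest_path(spans: list[Span], trace_id: str) -> list[str]:
    """Walk one trace from its root, following the slowest child at each hop."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    path = []
    current = children.get(None, [])
    while current:
        slowest = max(current, key=lambda s: s.duration_ms)
        path.append(slowest.service)
        current = children.get(slowest.span_id, [])
    return path


# Made-up spans for two requests.
spans = [
    Span("t1", "a", None, "gateway", 480.0),
    Span("t1", "b", "a", "checkout", 450.0),
    Span("t1", "c", "a", "catalog", 20.0),
    Span("t1", "d", "b", "payments", 430.0),
    Span("t2", "e", None, "gateway", 15.0),
]
print(slowest_path(spans, "t1"))  # ['gateway', 'checkout', 'payments']
```

With the trace ID tying the spans together, the slow request points straight at the payments hop instead of leaving the team to guess.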
Using a variety of tools and dashboards, teams sift through an increasing amount of data to define thresholds for normal behavior in a constantly changing environment. But how can you monitor issues you aren’t aware of and don’t understand?
In a recent report on Innovation Insight for Observability, Gartner states, “Static dashboards with human-generated thresholds do not scale to these modern environments and are inflexible in assisting the resolution of unforeseen events.”
Teams often end up stitching together pieces of information from multiple static dashboards using timestamps or guesswork to figure out how different events might have contributed to the system failure.
While most engineers know the value of modern observability tooling and best practices, it’s not always easy to make the business case for these tools. Rodriguez believes that part of the reason is that engineers struggle to communicate the value of observability to decision-makers:
“Engineers are notoriously bad at describing value propositions. They need to translate the complexity of the tech stack into hours that can be put into a Return on Investment (ROI) discussion. Stop technically explaining what ‘observability’ tooling does. Instead, explain how it is a revenue source. It simply is ‘cost avoidance’ revenue versus transactional revenue.”
In other words, the engineering cost of debugging increasingly complex systems is much higher than the cost of implementing a comprehensive observability plan in the long term, but it does require sacrifices in the short term.
A highly observable system should translate to less downtime, fewer serious support requests, and improved engineering morale. But making that case isn’t always easy in an organization that values product velocity over cleaning up technical debt.
“Modern tools are getting smarter and are helping teams cut through too much noise. Modern tools have also focused on reducing setup time and the ongoing maintenance of the tooling post-setup.” - Nate Matherson, ContainIQ
The right observability tools can give you unified visibility with a clear understanding of how your application is performing. They make it easier to monitor and troubleshoot issues by centralizing your data and providing smarter insights into key metrics on performance, usage, and user behavior.
While the landscape is constantly changing, there are several tools that I’ve found to make observability in modern distributed systems much easier.
Assuming that you’re using Kubernetes as a platform for your microservices, ContainIQ has a native monitoring and tracing platform. When set up inside your cluster, you’re able to see request-level data as it flows through your application and into each service.
“Our goal has been to introduce an out-of-the-box solution that takes less time to maintain going forward,” Matherson of ContainIQ told me. “We are focused on helping engineers avoid alert fatigue and to get to the root of issues before they impact end-user experience.”
Open-source products like Jaeger, Prometheus, and OpenTelemetry are also good options if you’d prefer to manage your own solution.
On the other hand, if you build on a managed API platform like Tyk, they provide observability into your entire system as well. This can give you insight into each API transaction as it moves through your services, database, caching, etc.
Whatever combination of tooling you choose, there are a few things you should consider:
To understand the current state of your app, you need to be able to correlate application events with your user’s actions. Real-time insights let you view user actions as they occur, so you can understand what users are doing, where they are having problems, and how you can improve their experience.
Data aggregation and visualization play a huge role in making observability actionable. Observability tools that pull data from multiple sources to provide interactive visual summaries are a huge step up from reading logs.
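As a rough illustration of what aggregation means at the smallest scale, the sketch below merges events from two sources into one timeline and counts errors per source. The event tuple layout `(timestamp, source, level, message)` and the sample events are assumptions made up for this example. Real tools do far more, but the core step of putting everything on one clock looks similar.

```python
import heapq
from collections import Counter


def merged_timeline(*sources):
    """Merge per-source event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


# Hypothetical events: (timestamp, source, level, message)
app_events = [
    (1.0, "app", "INFO", "request start"),
    (3.5, "app", "ERROR", "checkout failed"),
]
db_events = [
    (2.0, "db", "WARN", "slow query"),
    (3.4, "db", "ERROR", "connection reset"),
]

timeline = merged_timeline(app_events, db_events)
errors_by_source = Counter(src for _, src, level, _ in timeline if level == "ERROR")

for timestamp, source, level, message in timeline:
    print(f"{timestamp:>5.1f}  {source:<4} {level:<5} {message}")
```

Read in order, the merged timeline shows the database connection reset landing just before the checkout failure, which is the kind of relationship that is hard to spot when each log lives in its own place.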
Finally, it’s worth pointing out that a steep learning curve or a difficult implementation can be a show-stopper. Observability tools should support the languages and frameworks you use and integrate easily with your container platform and the rest of your tooling, including any communication or alerting systems.
Tooling is only part of the solution though. Your team still has to adopt an observability mindset during the software design and implementation phase if they want to get the most out of these tools. It’s really hard to generalize this advice, but here are a few recommendations:
There are more great resources for learning about DevOps than ever before. For example, Google’s SRE (Site Reliability Engineering) Guide offers a fantastic set of engineering practices for building reliable systems.
Rodriguez noted the 60/40 rule as one of his big takeaways from the guide:
“The rule effectively was to automate everything, and you would know when automation was sufficient when 40% of a developer’s time was dedicated to making automation more efficient and 60% of the time was on new feature development.”
Guides like Google’s are a great way to learn from established players without spending a lot of time or money on training.
Monitoring tools are great, but typically give little insight into application performance as their focus is on infrastructure rather than code. To increase observability, you need to use tools that can show you how an application is performing, allow you to drill down to a specific area of code when an error occurs, and give insight into how users are being affected by the issue. This allows you to see where exactly in the code changes are needed.
While too many alerts can be disruptive, it’s good to have a few business-critical alerts related to user experience. Alerts on high latency and request failure rates can let you know when users cannot use your service and alert developers if application performance falls below a certain threshold.
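Here is a minimal sketch of what a pair of user-facing alerts could look like, assuming you already collect per-request latency and success data for a recent time window. The `Request` type, the `check_alerts` function, and the thresholds (a 500 ms p95 latency limit and a 5% failure rate) are hypothetical values chosen for illustration. Yours should come from your own service-level goals.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool


def check_alerts(window, p95_limit_ms=500.0, max_failure_rate=0.05):
    """Return alert messages for one window of requests; an empty list means healthy."""
    if len(window) < 2:
        return []  # too little data for a meaningful percentile
    alerts = []
    # With n=20, the last cut point is the 95th percentile.
    p95 = quantiles([r.latency_ms for r in window], n=20)[-1]
    if p95 > p95_limit_ms:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {p95_limit_ms:.0f} ms")
    failure_rate = sum(not r.ok for r in window) / len(window)
    if failure_rate > max_failure_rate:
        alerts.append(f"failure rate {failure_rate:.1%} exceeds {max_failure_rate:.1%}")
    return alerts


# Made-up traffic: one healthy window, one where 10% of requests are slow and failing.
healthy = [Request(100.0, True)] * 100
degraded = [Request(100.0, True)] * 90 + [Request(900.0, False)] * 10

print(check_alerts(healthy))   # []
print(check_alerts(degraded))
```

Alerting on a percentile rather than the average keeps a handful of very slow requests from hiding behind many fast ones.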
Monitoring tools can help you identify anomalies and errors before they become critical, but with hundreds or even thousands of alerts, it's difficult to distinguish the non-critical errors from the important ones. Many observability tools group errors by incident, which makes it easier to figure out whether an alert is relevant and may even point you to the root cause of the incident, leading to faster resolution.
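Grouping by incident can be sketched in a few lines. The version below assumes each alert is a `(timestamp_seconds, service, error)` tuple and treats alerts with the same service and error that arrive within five minutes of each other as one incident. Both the tuple layout and the five-minute window are assumptions for illustration, and real tools use much richer fingerprints than this.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    error: str
    first_seen: float
    last_seen: float
    count: int = 1


def group_alerts(alerts, window_s=300.0):
    """Fold alerts into incidents keyed by (service, error) and closeness in time."""
    latest = {}      # most recent incident per (service, error)
    incidents = []
    for ts, service, error in sorted(alerts):
        key = (service, error)
        incident = latest.get(key)
        if incident is not None and ts - incident.last_seen <= window_s:
            incident.last_seen = ts
            incident.count += 1
        else:
            incident = Incident(service, error, first_seen=ts, last_seen=ts)
            latest[key] = incident
            incidents.append(incident)
    return incidents


# Made-up alerts: (timestamp in seconds, service, error)
alerts = [
    (0, "payments", "Timeout"),
    (30, "payments", "Timeout"),
    (45, "catalog", "HTTP 500"),
    (90, "payments", "Timeout"),
    (1000, "payments", "Timeout"),
]

for incident in group_alerts(alerts):
    print(incident.service, incident.error, incident.count)
```

Five raw alerts collapse into three incidents: the first burst of payment timeouts, a single catalog error, and a later payment timeout that starts a new incident because it falls outside the window.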
In a video series on the challenges of shifting to a modern observability strategy, Nancy Gohring, senior analyst at 451 Research, talks about the importance of collecting the right types of data in large volumes and variety:
“You are going to want to make sure that you have good context about what's happening, and this is particularly important if your application is built using microservices. You need to understand how neighboring or dependent services might have an impact on the performance of the service that you're responsible for.”
When an incident occurs, the more data you have, the better. Data that spans multiple teams, systems, and services will give you a greater understanding of a specific issue and better enable you to handle unexpected outages.
“Engineering leaders shouldn't silo the role of observability to a specific individual or team. Instead, the entire org should have some knowledge of observability best practices and know how current tooling at the organization works.” - Nate Matherson
Implementing observability tools won’t make a difference if they aren’t being used. Observability must become a culture within the organization through onboarding training and continuing education. Show its value by incorporating observability output into company and team meetings. Ensuring that all team players implement best practices and use observability tools creates a culture of data-driven decision-making that results in more robust systems and reduced outages.
Modern distributed development environments have created new challenges and complexities that have forced us to adopt better methods for monitoring infrastructure and application state and for identifying and resolving issues.
Implementing modern observability tools and best practices can help achieve the end-to-end visibility that DevOps and SRE teams need to support successful digital transformation with fewer service interruptions and better user experiences.