As data flows through, energy is delivered to activate new opportunities. Oftentimes, we focus on specialized components, the vital organs of our software systems.
What can we learn by tapping into the connectors themselves, pulling insights from the streams? Here's a quick overview of bootstrapping an observability strategy for APIs.
Data lives everywhere. When it comes to measuring success, anticipating problems, or looking for our next opportunity, we instinctively scrape, scrub, polish, and analyze information to the best of our ability.
Finding signals in the noise has been a natural activity for living beings since the dawn of vigilance.
As time has progressed, we've applied these data-crunching instincts to our digital assets, as well. For modern business, this practice is one for survival.
With so much of our business being driven by APIs, are we searching for the right signals?
The acronym MELT (Metrics, Events, Logs, and Traces) defines our starting point.
The process of communicating and recording these signals is called telemetry.
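As a mental model, each of the four MELT signal types can be sketched as a plain record. The field names below are illustrative, not a standard schema:

```javascript
// Illustrative shapes for the four MELT signal types.
// Field names here are hypothetical, not tied to any particular standard.
const now = Date.now();

const metric = { name: "http.server.duration", value: 87, unit: "ms", timestamp: now };
const event = { name: "applicant.adoption.request", attributes: { "pet.id": "p-42" }, timestamp: now };
const log = { severity: "INFO", body: "Adoption request received", timestamp: now };
const span = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  name: "GET /applicants/:id",
  startTime: now,
};

// Metrics aggregate, events and logs record moments, traces connect them.
console.log(metric.unit, event.name, log.severity);
```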
Most interactions we have with APIs over the network are fairly high-level. We send a blob of JSON. We receive a blob of JSON. Profit! 💰 What signals can we acquire from what lies below?
The association of signal trackers with our systems is called instrumentation.
When it comes to the lower-level components, we can often take advantage of automatic instrumentation. This can surface in the form of wrapping components within a standard library or adding listeners along connection paths.
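As a sketch of the idea, here's a wrapper that times a function call. The instrument helper is hypothetical, but real auto-instrumentation libraries patch a module's exports in much the same way:

```javascript
// Wrap a function so that calling it also emits a timing signal.
// This mirrors how auto-instrumentation monkey-patches library code.
function instrument(fn, onSignal) {
  return function (...args) {
    const start = Date.now();
    try {
      return fn.apply(this, args);
    } finally {
      onSignal({ name: fn.name, durationMs: Date.now() - start });
    }
  };
}

const signals = [];
const fetchUser = (id) => ({ id, name: "Ada" });
const tracedFetchUser = instrument(fetchUser, (s) => signals.push(s));

const user = tracedFetchUser(7);
// user is unchanged, and signals now holds one { name: "fetchUser", ... } record
```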
Today, we strive to capture more signals than ever before. We see both virtual machine metrics and application logs being shipped to storage and analysis tools.
But what about all the business-y stuff in-between? How are we capturing measurements for business Key Performance Indicators (KPIs)? We look to instrument the domain.
Domain events are the result of applying a command in the business domain to a specific context.
Whether captured or not, these events are happening all the time. What kind of questions may we ask of these insights?
What's the average length of time between discount codes being offered and being applied at checkout?
What's the correlation between in-app product announcements and newsletter sign-ups?
When appointments are canceled, what behavior directly precedes this action?
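To make the first question concrete, here's a minimal sketch, assuming each domain event carries a name and a timestamp. The event names and shapes are illustrative:

```javascript
// Two captured domain events for one discount code.
const events = [
  { name: "discount.offered", code: "SPRING10", timestamp: 1000 },
  { name: "discount.applied", code: "SPRING10", timestamp: 61000 },
];

// Offer-to-checkout latency falls out of simple event arithmetic.
const offered = events.find((e) => e.name === "discount.offered");
const applied = events.find((e) => e.name === "discount.applied");
const latencyMs = applied.timestamp - offered.timestamp;
console.log(latencyMs); // 60000
```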
As the questions we ask evolve, so too must our methods of collecting these signals.
When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: it now hops the network. Our tools are still coming to grips with this seismic shift. — Charity Majors, Observability — a 3-Year Retrospective
To reap the benefits of a distributed system, we sacrifice the convenience of having one-stop inspection. It wasn't always this way, and that's one aspect which makes upgrading our observability strategy difficult.
Let's take a look at how the observability tooling landscape has evolved.
Logging and monitoring solutions started when we were writing code close to the metal. An open source stack of long-standing tools used to dominate the landscape.
What many observability articles tend to ignore is that this stack is still heavily deployed and in-use today. Some of us are still here, and that's okay.
As virtual machines—and eventually cloud infrastructure—gained traction over running on bare metal servers, we saw a shift in how we approach signal-gathering. This gave rise to prominent stacks in the observability space, such as the ELK stack:
Elasticsearch - full-text search engine for log storage and querying
Logstash - log ingestion
Kibana - log visualization and alerting
It is common to run these stacks—or some combination thereof—in parallel. Many organizations are still here, and it makes sense. For the most part, they're incredibly robust and mature solutions.
However, we are in the midst of yet another sea change. There's one last stop on the map.
There has been a dramatic shift to cloud-native infrastructure. And for systems running in self-managed data centers, containers are beginning to take over as the atomic unit of application deployment.
On top of this, Kubernetes has grown to be the dominant container orchestrator (Note: in the Kubernetes world, an atomic unit is known as a Pod and consists of one or more related containers).
Why so many options? This world is still maturing, and it has become significantly more complex.
The mere existence of OpenTelemetry, discussed more in the next section, shows that the number of options in this space is growing at a fast pace.
No matter where businesses are in their journey today, observability of containers and their interactions is likely to become an important initiative.
OpenTelemetry is a tool-agnostic observability framework for communicating telemetry. OpenTelemetry.io defines the project as:
OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.
Many Application Performance Monitoring (APM) tools are adding support for OpenTelemetry, as well. Check the OpenTelemetry Registry for more information.
Here's an example of adding auto-instrumentation to a Node.js application.
/* tracing.js */
// Require dependencies
const opentelemetry = require("@opentelemetry/sdk-node");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");

const sdk = new opentelemetry.NodeSDK({
  traceExporter: new opentelemetry.tracing.ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
This gives an excellent jumpstart, helping us acquire signals with low effort. Load it ahead of the application with node --require ./tracing.js app.js.
Distributed Traces offer a superpower for API observability. They allow us to track requests through our distributed systems, and we can even include domain events in our context propagation.
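One common propagation mechanism is the W3C Trace Context traceparent HTTP header, which carries the trace and span identifiers between services. The hex id values below are illustrative:

```javascript
// Build a W3C Trace Context "traceparent" header value.
// Format: version-traceId-spanId-flags, all lowercase hex.
const traceId = "4bf92f3577b34da6a3ce929d0e0e4736"; // 16 bytes
const spanId = "00f067aa0ba902b7"; // 8 bytes
const traceparent = `00-${traceId}-${spanId}-01`; // 01 = sampled
console.log(traceparent);
```

A downstream service reads this header and continues the same trace instead of starting a new one.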
A trace has a few components worth noting.
Parent (t1)
└── Trace (t2)
    ├── Span (s1)
    │   ├── Event (e1)
    │   └── Event (e2)
    └── Span (s2)
Traces can be nested, creating tree-based observability structures.
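A toy model of that nesting, assuming ids are handed out by a simple counter (real tracers generate random ids and manage context for us):

```javascript
// Each span records its parent, forming a tree under the trace.
let nextId = 0;
function startSpan(name, parent = null) {
  return { id: ++nextId, name, parentId: parent ? parent.id : null, events: [] };
}

const root = startSpan("GET /applicants/:id");
const dbSpan = startSpan("db.query", root);
dbSpan.events.push({ name: "applicant.lookup", timestamp: Date.now() });

console.log(dbSpan.parentId === root.id); // true
```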
Spans log segments of a trace. Here's an example of a server receiving an HTTP request:
// This server is artificial and for example only
import { SemanticAttributes } from "@opentelemetry/semantic-conventions";
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

// Acquire a tracer from the global provider
const tracer = trace.getTracer("applicant-service");

async function onGet(request, response) {
  const span = tracer.startSpan("GET /applicants/:id", {
    attributes: {
      [SemanticAttributes.HTTP_METHOD]: "GET",
      [SemanticAttributes.HTTP_FLAVOR]: "1.1",
      [SemanticAttributes.HTTP_URL]: request.url,
      [SemanticAttributes.NET_PEER_IP]: "192.0.2.5",
    },
    kind: SpanKind.SERVER,
  });

  const user = await getUser();
  response.send(user.toJson());

  span.setStatus({ code: SpanStatusCode.OK });
  span.end();
}

server.on("GET", "/applicants/:id", onGet);
Span events are a special type of structured logging. They can be associated with a trace, giving insight into what domain events are happening in the broader context of a full interaction.
An example of adding events to a span:
import { trace, context } from "@opentelemetry/api";

// Get the current span from the active context
const span = trace.getSpan(context.active());

// Perform the action
applicant.adopt(pet);

// Record the action as a span event
span.addEvent("applicant.adoption.request", {
  "applicant.id": applicant.id,
  "pet.id": pet.id,
  "applicant.eligibilityScore": applicant.eligibilityScore,
});
This is only a high-level overview. Check out the OpenTelemetry docs for more details!
A significant driver of containerization is a shift in architectural trends to break apart monolithic applications. Containers and microservices have a symbiotic relationship. A co-evolution is occurring in this space.
In the VMware State of Observability Report 2021, the findings point to several reasons for a rise in the complexity of managing cloud applications.
Legacy telemetry strategies are not enough. How do we start pushing forward an initiative to improve?
Observability creates a window into the organic flow of information that moves through our systems. It allows us to ask the important questions that impact our business. The good news is we almost certainly have familiarity with some of the practices involved. As the landscape continues to grow, it takes a lot of effort to stay ahead of the curve. That's expected. Following an observability initiative is a long-term approach to ensuring survival. Evolution, as we know, requires patience. 🧘
Additional reading:
A Three-Phased Approach to Observability by New Relic
Distributed Systems Observability by Cindy Sridharan
Observing is not Debugging (and other misnomers) by Kislay Verma
Splunk's State of Observability 2021