A common architectural design pattern these days is to break up an application monolith into smaller microservices. Each microservice is then responsible for a specific aspect or feature of your app. For example, one microservice might be responsible for serving external API requests, while another might handle data fetching for your frontend.

Designing a robust and fail-safe infrastructure in this way can be challenging; monitoring the operations of all these microservices together can be even harder. It's best not to rely only on your application logs for an understanding of your systems' successes and errors. Setting up proper monitoring will provide you with a more complete picture, but it can be difficult to know where to start. In this post, we'll cover the service areas your metrics should focus on to ensure you're not missing key insights.

Before Getting Started

We're going to make a few assumptions about your app setup. Don't worry: you don't need to use any specific framework to start tracking metrics. However, it does help to have a general understanding of the components involved. In other words, how you set up your observability tooling matters less than what you track.

Since a sufficiently large set of microservices requires some level of coordination, we're going to assume you are using Kubernetes for orchestration. We're also assuming you have a time series database like Prometheus or InfluxDB for storing your metrics data. You might also need an ingress controller, such as the one Kong provides, to control traffic flow, and a service mesh, such as Kuma, to better facilitate connections between services.

Before implementing any monitoring, it's essential to know how your services actually interact with one another. Writing out a document that identifies which services and features depend on one another, and how availability issues would impact them, can help you set baseline numbers for what constitutes an appropriate threshold.
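One way to start the dependency document described above is to keep it as data, so that the question "what is affected if this service goes down?" can be answered mechanically. The following Ruby sketch is only an illustration: the service names and the `impacted_by` helper are hypothetical and do not come from any particular tool. It walks the map to list every service that directly or indirectly depends on a failed one.

```ruby
require 'set'

# Hypothetical dependency map: each service lists the services it calls directly.
# The service names are illustrative assumptions, not part of any real setup.
DEPENDENCIES = {
  'frontend'    => ['api_gateway'],
  'api_gateway' => ['cart', 'catalog'],
  'cart'        => ['database'],
  'catalog'     => ['database'],
  'database'    => []
}.freeze

# Returns every service that directly or transitively depends on `failed`,
# i.e. the services that could be affected if `failed` becomes unavailable.
def impacted_by(failed, dependencies = DEPENDENCIES)
  impacted = Set.new
  queue = [failed]
  until queue.empty?
    current = queue.shift
    dependencies.each do |service, deps|
      next unless deps.include?(current)
      # Set#add? returns nil when the service was already recorded.
      queue << service if impacted.add?(service)
    end
  end
  impacted.to_a.sort
end

puts impacted_by('database').inspect # ["api_gateway", "cart", "catalog", "frontend"]
puts impacted_by('cart').inspect     # ["api_gateway", "frontend"]
```

A list like this gives you a first guess at which dashboards and thresholds deserve attention when a given service degrades; it does not replace measuring the actual impact.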
Types of Metrics

You should be able to see data points from two perspectives: Impact Data and Causal Data.

Impact Data represents information that identifies who is being impacted. For example, if there's a service interruption and responses slow down, Impact Data can help identify what percentage of your active users is affected.

While Impact Data determines who is being affected, Causal Data identifies what is being affected and why. Kong Ingress, which can monitor network activity, can give us insight into Impact Data. Meanwhile, Kuma can collect and report Causal Data.

Let's look at a few data sources and explore the differences between Impact Data and Causal Data that can be collected about them.

Latency

Latency is the amount of time between a user performing an action and its final result. For example, if a user adds an item to their shopping cart, the latency would measure the time between the item addition and the moment the user sees a response that indicates its successful addition. If the service responsible for fulfilling this action degraded, the latency would increase, and without an immediate response, the user might wonder whether the site was working at all.

To properly track latency in an Impact Data context, it's necessary to follow a single event throughout its entire lifetime. Sticking with our purchasing example, we might expect the full flow of an event to look like the following:

1. The customer clicks the "Add to Cart" button
2. The browser makes a server-side request, initiating the event
3. The server accepts the request
4. A database query ensures that the product is still in stock
5. The database response is parsed, a response is sent to the user, and the event is complete

To successfully follow this sequence, you should standardize on a naming pattern that identifies both what is happening and when it's happening, such as customer_purchase.initiate, customer_purchase.queried, customer_purchase.finalized, and so on. Depending on your programming language, you might be able to provide a function block or lambda to the metrics service:

    statsd.timing('customer_purchase.initiate') do
      # ...
    end

By providing specific keywords, you can home in on which segment of the event was slow when a latency issue occurs.

Tracking latency in a Causal Data context requires you to track the speed of an event between services, not just the actions performed. In practice, this means timing service-to-service requests:

    statsd.histogram('customer_purchase.initiate') do
      statsd.histogram('customer_purchase.external_database_query') do
        # ...
      end
    end

This shouldn't be limited to capturing the overall endpoint request/response cycles. That sort of latency tracking is too broad and ought to be more granular. Suppose you have a microservice with an endpoint that makes internal database requests. In that case, you might want to time the moment the request was received, how long the query took, the moment the service sent its response, and the moment the originating client received that response. This way, you can pinpoint precisely where time is spent as the services communicate with one another.

Traffic

You want your application to be useful and popular, but an influx of users can be too much of a good thing if you're not prepared! Changes in site traffic can be difficult to predict. You might be able to serve user load on a day-to-day basis, but events (both expected and unexpected) can have unanticipated consequences. Is your eCommerce site running a weekend promotion? Did your site go viral because of some unexpected praise? Traffic variances can also be affected by geolocation. Perhaps users in Japan are experiencing traffic load in a way that users in France are not. You might think that your systems are working as intended, but all it takes is a massive influx of users to test that belief.
If an event takes 200ms to complete, but your system can only process one event at a time, it might not seem like there's a problem, until the event queue is suddenly clogged up with work.

Similar to latency, it's useful to track the number of events being processed throughout the event's lifecycle to get a sense of any bottlenecks. For example, tracking the number of jobs in a queue, the number of HTTP requests completed per second, and the number of active users are good starting points for monitoring traffic.

For Causal Data, monitoring traffic involves capturing how services transmit information to one another, similar to how we did it for latency. Your monitoring setup ought to track the number of requests to specific services, their response codes, their payload sizes, and so on: as much about the request and response cycle as necessary. When you need to investigate worsening performance, knowing which service is experiencing problems will help you track down the possible source much sooner.

Error Rates

Tracking error rates is rather straightforward. Any 5xx (or even 4xx) issued as an HTTP response by your server should be tagged and counted. Even situations that you've accounted for, such as caught exceptions, should be monitored because they still represent a non-ideal state. These issues can act as warnings for deeper problems stemming from defensive coding that doesn't address actual problems.

Kuma can capture the error codes and messages thrown by your service, but this represents only a portion of actionable data. For example, you can also capture the arguments which caused the error (in case a query was malformed), the database query issued (in case it timed out), the permissions of the acting user (in case they made an unauthorized attempt), and so on. In short, capturing the state of your service at the moment it produces an error can help you replicate the issue in your development and testing environments.
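As a rough illustration of capturing that state, the Ruby sketch below counts errors by status class and endpoint and builds a structured record of the context around each one. Everything in it is an assumption made for the example: the `record_error` helper, the field names, and the in-memory `ERROR_COUNTS` hash, which stands in for whatever metrics backend you actually use.

```ruby
require 'json'
require 'time'

# Hypothetical in-memory error counter keyed by "<status class>:<endpoint>".
# A real setup would send these counts to your metrics backend instead.
ERROR_COUNTS = Hash.new(0)

# Builds a structured record of the service state at the moment of an error
# and increments a counter for it. All field names here are illustrative.
def record_error(status:, endpoint:, error:, arguments: {}, query: nil,
                 user_role: nil, now: Time.now.utc)
  status_class = "#{status / 100}xx" # 503 -> "5xx", 404 -> "4xx"
  ERROR_COUNTS["#{status_class}:#{endpoint}"] += 1
  {
    timestamp: now.iso8601,
    status: status,
    status_class: status_class,
    endpoint: endpoint,
    error_class: error.class.name,
    message: error.message,
    arguments: arguments, # inputs that triggered the error (malformed query?)
    query: query,         # database query issued (did it time out?)
    user_role: user_role  # permissions of the acting user (unauthorized attempt?)
  }
end

begin
  raise ArgumentError, 'quantity must be positive'
rescue ArgumentError => e
  # Caught exceptions are still counted: they represent a non-ideal state.
  record = record_error(status: 400, endpoint: '/cart/items', error: e,
                        arguments: { quantity: -1 }, user_role: 'customer')
  puts JSON.generate(record)
end
```

Keeping the counter and the context record separate reflects the two uses: the count feeds dashboards and alerts, while the record is what you would consult when reproducing the failure.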
Saturation

You should track the memory usage, CPU utilization, disk reads/writes, and available storage of each of your microservices. If your resource usage regularly spikes during certain hours or operations, or increases at a steady rate, this suggests you're overutilizing your server. While your server may be running as expected, once again, an influx of traffic or other unforeseen occurrences can quickly topple it.

Kong Ingress only monitors network activity, so it's not ideal for tracking saturation. However, there are many tools available for tracking this with Kubernetes.

Implementing Monitoring and Observability

Up to now, we've discussed the kinds of metrics that will be important to track in your cloud application. Next, let's dive into some specific steps you can take to implement this monitoring and observability.

Install Prometheus

Prometheus is an open-source monitoring system that has become a go-to standard, and it is easy to install and integrate with your Kubernetes setup. Installation is especially simple if you use Helm.

First, we create a monitoring namespace. Next, we use Helm to install Prometheus. We make sure to add the Prometheus charts to Helm as well:

    $ kubectl create namespace monitoring
    $ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    $ helm repo add stable https://kubernetes-charts.storage.googleapis.com/
    $ helm repo update
    $ helm install -f https://bit.ly/2RgzDtg -n monitoring prometheus prometheus-community/prometheus

The values file referenced at https://bit.ly/2RgzDtg sets the data scrape interval for Prometheus to ten seconds.

Enable Prometheus Plugin in Kong

Assuming you are using Kong Ingress Controller (KIC) for Kubernetes, your next step will be to create a custom resource, a KongPlugin resource, which integrates into the KIC. Create a file called prometheus-plugin.yml:

    apiVersion: configuration.konghq.com/v1
    kind: KongClusterPlugin
    metadata:
      name: prometheus
      annotations:
        kubernetes.io/ingress.class: kong
      labels:
        global: "true"
    plugin: prometheus

Install Grafana

Grafana is an observability platform that provides excellent dashboards for visualizing the data scraped by Prometheus. We use Helm to install Grafana as follows:

    $ helm install grafana stable/grafana -n monitoring --values http://bit.ly/2FuFVfV

You can view the bit.ly URL in the above command to see the specific configuration values for Grafana that we provide upon installation.

Enable Port Forwarding

Now that Prometheus and Grafana are up and running in our Kubernetes cluster, we'll need access to their dashboards. For this article, we'll set up basic port forwarding to expose those services. This is a simple but not very secure way to get access, and it is not advisable for production deployments.

    $ POD_NAME=$(kubectl get pods --namespace monitoring -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
    $ kubectl --namespace monitoring port-forward $POD_NAME 9090 &
    $ POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
    $ kubectl --namespace monitoring port-forward $POD_NAME 3000 &

The above commands expose the Prometheus server on port 9090 and the Grafana dashboard on port 3000.

Those simple steps should be sufficient to get you up and running. With Kong Ingress Controller and its integrated Prometheus plugin, capturing metrics with Prometheus and visualizing them with Grafana are quick and simple to set up.

Conclusion

Whenever you need to investigate worsening performance, your Impact Data metrics can help orient you on the magnitude of the problem: they should tell you how many people are affected. Likewise, your Causal Data identifies what isn't working and why.
The former points you to the plume of smoke, and the latter takes you to the fire.

In addition to all of the above, you should also consider the rate at which your metrics are changing. For example, say your traffic numbers are increasing. Observing how quickly those numbers are moving can help you determine when (or if) it'll become a problem. This is essential for managing upcoming work with regular deployments and changes to your services. It also establishes what an ideal performance metric should be. Google wrote an entire book on site reliability, which is a must-read for any developer.

If you're already running Kong alongside your clusters, plugins such as the Prometheus plugin shown earlier integrate directly with Prometheus, which means less configuration on your part to monitor and store metrics for your services.
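To make the rate-of-change point from the conclusion concrete, here is a minimal Ruby sketch that estimates how fast a metric is moving between two samples and how long it would take to reach a threshold. The helper names, the sample values, and the assumption of linear growth are all illustrative; real traffic is rarely this regular.

```ruby
# Change in the metric per minute between two samples.
# Each sample is a hash with a Time under :at and a number under :value.
def rate_per_minute(first, last)
  elapsed_minutes = (last[:at] - first[:at]) / 60.0
  raise ArgumentError, 'samples must be in chronological order' unless elapsed_minutes.positive?

  (last[:value] - first[:value]) / elapsed_minutes
end

# Minutes until the metric reaches `threshold`, assuming the current rate holds.
# Returns 0.0 if the threshold is already met, nil if the metric is not growing.
def minutes_until(threshold, first, last)
  remaining = threshold - last[:value]
  return 0.0 if remaining <= 0

  rate = rate_per_minute(first, last)
  return nil unless rate.positive?

  remaining / rate
end

# Hypothetical samples: requests per second observed ten minutes apart.
earlier = { at: Time.at(0),   value: 400.0 }
later   = { at: Time.at(600), value: 500.0 }

puts rate_per_minute(earlier, later)       # 10.0 (requests/s gained per minute)
puts minutes_until(1000.0, earlier, later) # 50.0 (minutes at the current rate)
```

An estimate like this is only as good as the two samples behind it, so in practice you would compute it over a longer window or rely on the rate functions your time series database already provides.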