If you are new to “Service Mesh” and “Envoy”, I have a post explaining both of them here.

This is the second post in the series on observability with Envoy service mesh; you can read the first post, about distributed tracing, here.

With microservices you cannot be in the dark when it comes to monitoring; you need to at least know that something is going wrong. Let us look into how Envoy can help us get some information on what is going on with our services.

With a service mesh, all the traffic goes through the mesh, meaning no service talks to another service directly. A service makes a call to Envoy and Envoy routes the call to the destination service, so Envoy has context about both the incoming and the outgoing traffic. Envoy generally provides metrics about the incoming requests, the outgoing requests, and the state of the Envoy instance.

Setup

Here is an overview of what we are trying to build.

[Figure: overall setup]

Statsd

Envoy supports publishing metrics in 2 or 3 formats, but for this post we will use the statsd format.

With that said, the flow will be: Envoy pushes the metrics to statsd, from statsd we pull the metrics using Prometheus (a time series database), and then we visualise the metrics using Grafana.

In our setup diagram I have mentioned statsd exporter instead of statsd for a reason. We are not going to run statsd as such; we are going to run a converter (a service) which accepts data in statsd format and exposes it in Prometheus format. It gets the job done for us.

Envoy’s metrics can be broadly classified into two types:

Counter: An ever-increasing metric. E.g.: total number of requests
Gauge: A metric that can go up or down, like an instantaneous value.
E.g.: current CPU utilisation

Let us look at an Envoy configuration with a stats sink.

Lines 8–13 tell Envoy that we need metrics in statsd format, what the prefix for our stats is (usually your service name), and the location where our statsd sink lives.
Lines 55–63 configure the statsd sink in our environment.

That is all the configuration that is needed to get stats out of Envoy. Now if you look at lines 2–7, there are two things happening:

Envoy exposes an admin endpoint on port 9901, which you can use to dynamically change the log level, view the current configuration, view stats, etc.
Envoy can also generate access logs similar to nginx, which you can use to understand your traffic. The format of the access log is also configurable; lines 29–33 do exactly that.

You need to add the same stats configuration to the sidecar Envoys of the other services in our system (yes, every service has its own Envoy sidecar). The services themselves are written in Go and they do not do much except call other services through Envoy. You can look at the service and Envoy configurations here.

Right now we only have statsd exporter in the picture. With this, if we run the Docker containers (docker-compose build and docker-compose up) and send some traffic to Front Envoy (localhost:8080), Envoy will start sending metrics about the traffic to our statsd exporter, which will convert the metrics to Prometheus format and expose them on port 9102.

This is what the stats look like in statsd exporter.

[Figure: metrics from statsd exporter in Prometheus format]

There will be hundreds of stats; in the above screenshot we are seeing the latency metrics for communication between Service A and Service B. The metrics in the above image are in Prometheus format:

metric_name ["{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"] value [ timestamp ]

You can read more about it here.

Prometheus

We are going to use Prometheus as our time series database to store our metrics.
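As a quick aside, the exposition format that the statsd exporter serves can be made concrete with a few lines of Python. This is only a minimal sketch under stated assumptions, not the parser Prometheus itself uses: the function name parse_sample and the metric name in the usage example are hypothetical, and the regular expressions assume label values never contain a closing brace.

```python
import re

# One sample line: metric_name{label="value",...} value [timestamp]
# Simplifying assumption: label values never contain "}".
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'    # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'             # optional {label="value",...}
    r'\s+(?P<value>\S+)'                      # sample value
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'       # optional integer timestamp
)
# One label pair: name="value" (escaped characters in the value are tolerated)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format sample line into (name, labels, value, timestamp)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    timestamp = match.group("timestamp")
    return (
        match.group("name"),
        labels,
        float(match.group("value")),
        int(timestamp) if timestamp is not None else None,
    )


if __name__ == "__main__":
    # Hypothetical metric name, used only for illustration.
    print(parse_sample('demo_upstream_rq_time{quantile="0.99"} 12'))
```

Pointing a function like this at each non-comment line returned by the exporter on port 9102 is an easy way to check which labels a metric carries before writing queries against it.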
Prometheus is not just a time series database; it is a monitoring system in itself, but in our setup we will use it as a datastore for our metrics. An important thing to note is that Prometheus is a pull-based system, which means you have to tell Prometheus where to scrape the metrics from. In our case that will be our statsd exporter.

Adding Prometheus to the equation is very straightforward: we just need to pass the scrape targets (statsd exporter) as a configuration file to Prometheus. Here is what the configuration will look like.

scrape_interval is the frequency at which Prometheus will pull metrics from the target.

So now we should have Prometheus up, and some data in Prometheus as well. Let’s fire up localhost:9090 and see what it has.

[Figure: prometheus query page]

As we can see, our metrics are available. You can do a lot more than just select existing metrics; you can read about the Prometheus query language here. Prometheus can also plot graphs based on our queries, and it has an alerting system as well.

If we load up the targets page in Prometheus, we see all the scraping targets and the health of those targets.

[Figure: prometheus targets]

Grafana

Grafana is an awesome visualisation and monitoring solution which supports a lot of backends like Prometheus, Graphite, InfluxDB, ElasticSearch, etc.

Grafana has two major components that we need to configure:

1. Datasource: The backend from which Grafana will get the metrics. You can configure the datasource using a configuration file, which will look like this.

2. Dashboard: This is where you visualise the metrics from your data source. Grafana supports a wide variety of visual elements like Graphs, Single Stats, Heatmaps, etc., and you can extend this and build your own using plugins.

The only problem I have with Grafana is that there is no standard way of developing these dashboards as code. There are some third-party libraries which support this, and we will use the one from weaveworks called
grafanalib.

Here is the dashboard that we are trying to build, expressed as Python code.

We are building graphs for 2xx, 5xx and latency. Lines 5–22 are important: they extract the service names available in our setup as Grafana variables. This makes our dashboard dynamic, meaning we will be able to select the source and destination service for which we want to view these statistics. More about variables here.

You have to use the grafanalib command to generate the dashboard from the above Python file:

generate-dashboard -o dashboard.json service-dashboard.py

Beware, the generated dashboard.json is not easy to read.

So we just need to pass the dashboard and the datasource while starting up Grafana. And when you visit http://localhost:3000, you will be greeted with:

[Figure: grafana dashboard]

There you go, you have your 2xx, 5xx and latency charts, and you also see the dropdowns where you can select the source and destination services.

There is more to Grafana than what we have discussed: there is a powerful query editor and an alert system. More importantly, everything is extensible using plugins and applications; check out an example here. If you are visualising metrics of common services like redis, rabbitmq, etc., Grafana has a repository of public dashboards from which you can just import them and use them. One more good thing about Grafana is that you can create and manage everything with configuration files and code, without dabbling much with the UI.

I would urge you to play with Prometheus and Grafana to figure out more. Thanks for your time. Please leave your feedback as comments.

You can find all the code and configuration files here: dnivra26/envoy_monitoring on GitHub (demo for monitoring microservices with Envoy service mesh, Prometheus and Grafana).