If you’ve ever worked with complex high-load cloud infrastructure in Microsoft Azure, you probably understand how important it is to have an appropriate monitoring system.
Unfortunately, there aren’t always ready-made solutions that cover all the required metrics and still leave room for customization. Even if you find one, it will likely cost your company a fortune, particularly if you have a complex multicomponent infrastructure.
First, here are a few words about the company and why we implemented this solution, which will give you a better understanding of what we’re dealing with.
Wirex's entire platform is hosted in the Microsoft Azure cloud. We have a complex multicomponent infrastructure that handles thousands of transactions and a million events each day. As a fintech product with 3.5 million clients, our platform works with cryptocurrencies in addition to traditional currencies, so we can't allow customers to experience any difficulties during transactions or any delays when exchanging currencies. This brings certain risks and challenges, and it demands a monitoring system that is more advanced and dynamic than usual.
Before going any further, let’s take a brief look at what we’re dealing with and what our main targets are. Our goal is to monitor every single Virtual Machine instance and every microservice, and to receive metrics for every single environment.
Let's start at the beginning with the tools Microsoft offers out of the box. Azure has the standard Azure Monitor, a great instrument that gives you insights into various services and resources, such as Key Vaults, Storage Accounts, Virtual Machines, and Service Health statuses. It also lets you set up auto-scaling rules, which is really useful to have in one place. What’s more, you can configure alerts: Azure Monitor has convenient settings for sending notifications about service health to any preferred messenger, and, as an added bonus, it integrates easily with third-party products.
Another essential tool from Microsoft is PowerShell, which is worth mentioning as a unique tool with its own scripting language. It provides a large number of modules for working with Azure, and it lets you reach aspects and metrics of resources that are sometimes hidden and hard to find.
Of course, Microsoft provides many more tools, but we are not going to cover them all. The main point to understand is that we’re still not able to get all the information we need.
In addition to Microsoft tools, we have third-party products that we use to monitor our infrastructure.
First of all, we gathered all of our metrics in one place: Application Insights, Azure Cache, databases, Virtual Machines, App Services, Key Vaults, Storage Accounts, and many more. Below is just a brief look at what we have visualized and customized in Grafana.
Unfortunately, we were unable to display information on Scale Sets and microservices, and this information is not just fundamental: it’s the main target of our monitoring. The issue is that the Azure Monitor integration does not provide metrics for Virtual Machine Scale Sets. We also didn't know what was happening with our microservices on each instance, so we set out to solve this problem.
What kind of solution did we come up with?
Our first answer is Prometheus. It's an open-source product that is almost as popular in the monitoring world as Kubernetes is in cluster orchestration.
According to the official documentation, Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can then be used to visualize the collected data.
Source: https://prometheus.io/docs/introduction/overview/
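To make the moving parts concrete, here’s a minimal illustrative sketch of a Prometheus configuration (the file paths and target names are assumptions, not our production values): the global scrape settings, the rule files that aggregate data or fire alerts, and the scrape jobs themselves.

```yaml
# prometheus.yml - a minimal illustrative sketch, not a production config
global:
  scrape_interval: 15s      # how often Prometheus scrapes its targets
  evaluation_interval: 15s  # how often recording/alerting rules are evaluated

rule_files:
  - /etc/prometheus/rules/*.yml   # aggregation and alerting rules live here

scrape_configs:
  - job_name: prometheus          # Prometheus can scrape its own metrics
    static_configs:
      - targets: ['localhost:9090']
```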
What’s useful is that Prometheus includes a local on-disk time-series database and optionally integrates with remote storage systems. A local database might sound like a disadvantage at first, but in practice it makes the setup more robust: there are fewer points of failure, and reads and writes go directly to local disk.
Of course, a big plus is that Prometheus can discover Scale Sets automatically through its Azure service discovery configuration. Why is this a plus for us? Usually, when you write a Prometheus configuration, you have to set each target individually, listing its IP address and name. When you have an infrastructure with more than 150 Virtual Machine Scale Sets and more than 20 instances in each set, that's not convenient at all.
The question is how to discover each Scale Set, especially the new instances that appear after auto-scaling. Would you still be able to monitor them? The answer is “yes”: they are discovered automatically and show up under their own names in Grafana, which is extremely useful.
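As a sketch of what this looks like, here’s a scrape job using Prometheus’s `azure_sd_configs` service discovery, with a relabeling step that copies the Scale Set and machine names onto each target. The credentials are placeholders, and the label choices are our illustration rather than the one true setup:

```yaml
scrape_configs:
  - job_name: windows-scale-sets
    azure_sd_configs:
      - subscription_id: <subscription-id>   # placeholder credentials
        tenant_id: <tenant-id>
        client_id: <client-id>
        client_secret: <client-secret>
        port: 9182                           # port the exporter listens on
    relabel_configs:
      # expose the Scale Set name as a regular label, so Grafana can group by it
      - source_labels: [__meta_azure_machine_scale_set]
        target_label: scale_set
      # use the machine name rather than a raw IP as the instance label
      - source_labels: [__meta_azure_machine_name]
        target_label: instance
```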
Prometheus also has its own query language, PromQL, which lets you customize your queries and extract exactly the information you need.
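For instance, a rules file can record an aggregated series and raise a threshold alert from it. The metric name below comes from Windows Exporter’s cpu collector, but treat the whole file as a hypothetical example rather than our exact rules:

```yaml
# rules/windows.yml - illustrative recording and alerting rules
groups:
  - name: windows-instances
    rules:
      # record per-instance CPU usage as a new time series
      - record: instance:cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)
      # fire when an instance stays above 90% CPU for 10 minutes
      - alert: HighCpuUsage
        expr: instance:cpu_usage:percent > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```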
So, here’s the list of the main Prometheus benefits:
- a local on-disk time-series database, with optional remote storage integrations;
- automatic discovery of Virtual Machine Scale Sets in Azure;
- flexible querying and aggregation with PromQL.
How does Prometheus know what’s happening in each particular instance? And how do we connect Prometheus with our instances? The answer is with Windows Exporter.
Windows Exporter is an extra tool developed by the Prometheus community. It's deployed on each instance of a Virtual Machine Scale Set, and it collects the metrics that were configured before deployment. It ships with a set of collectors, such as cpu, memory, logical_disk, net, os, service, and process.
The most important thing here is that we’re able to display information on processes.
Our microservice exists on an instance as a process, so getting metrics for processes allows us to monitor it. For each service we get its CPU time, working set, virtual bytes, I/O, start time, and so on.
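As an illustration, Windows Exporter accepts a config file that enables specific collectors and filters which processes to track. Key names have changed between exporter versions (older releases used whitelist/blacklist instead of include/exclude), and the service-name pattern below is hypothetical:

```yaml
# config.yml for windows_exporter - a sketch, adjust to your exporter version
collectors:
  enabled: cpu,logical_disk,net,os,process,service,system
collector:
  process:
    include: "MyMicroservice.*"   # hypothetical pattern matching our service processes
log:
  level: info
```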
Also, if you have extra services deployed on your instance that need to be monitored, this isn’t a problem.
In these situations, we use Terraform, DSC configurations, and Chocolatey.
We deploy a DSC configuration that describes the procedure for installing the exporter, delivered as a Virtual Machine extension in Azure.
This lets us manage the infrastructure and control our resources: we can deploy the exporter on every instance and set up continuous configuration deployment. That eliminates configuration drift, so all our instances stay consistent with each other, which is really remarkable.
Here is a monitoring scheme.
On the left, you can see a Service Fabric cluster which consists of several Scale Sets; the same scheme could even be adapted for Kubernetes. Each Scale Set has its own installed exporter and, of course, the deployed application microservices.
Then we have Prometheus with its TSDB, and Grafana, which uses Prometheus as a data source.
What does the exporter do? It runs on localhost on a specific port (9182 by default) and exposes all the predefined metrics over HTTP. Prometheus then scrapes those metrics at the configured scrape interval, say every 5, 10, or 15 seconds, so you get near-instant updates.
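The interval can also be overridden per job when some targets need fresher data than the global default; a small sketch, where the address is a hypothetical instance:

```yaml
scrape_configs:
  - job_name: windows-fast
    scrape_interval: 10s            # override the global default for this job
    metrics_path: /metrics          # the endpoint the exporter serves
    static_configs:
      - targets: ['10.0.0.4:9182']  # hypothetical instance address
```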
For security, you can configure your entire monitoring infrastructure as a separate local network and whitelist the required connections through firewall rules, without exposing any resources to the public.
Lastly, you need to configure Grafana to use Prometheus as a data source and shape the output with PromQL in order to display all the data on beautiful dashboards.
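This can be clicked together in the Grafana UI or, for a reproducible setup, described in Grafana’s data source provisioning file. Here’s a sketch, assuming Prometheus is reachable under the hostname `prometheus` on a shared Docker network:

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana proxies queries server-side
    url: http://prometheus:9090   # assumes a shared Docker network
    isDefault: true
```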
We have a Bitbucket repository with Ansible playbooks that describe the installation and configuration of Prometheus. We also have TeamCity as our main Continuous Integration (CI) tool; we use it for our deployment project builds, allowing them to deploy and update. Ansible serves as the configuration management tool for deploying Prometheus and Grafana as Docker containers. Ansible can live on a separate instance or be configured as a TeamCity runner; either option is possible. If you choose the runner, it is adapted as a TeamCity plugin.
The pipeline itself is quite simple: TeamCity pulls the Prometheus configuration from the repository and then triggers Ansible, which deploys Prometheus as a Docker container. All of this runs through TeamCity agents.
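A stripped-down sketch of such a playbook, using the community.docker collection (the host group, paths, and image tag are illustrative):

```yaml
# deploy-prometheus.yml - illustrative Ansible playbook
- name: Deploy Prometheus as a Docker container
  hosts: monitoring          # hypothetical inventory group
  become: true
  tasks:
    - name: Copy the Prometheus configuration
      ansible.builtin.copy:
        src: files/prometheus.yml
        dest: /etc/prometheus/prometheus.yml
        mode: "0644"

    - name: Run the Prometheus container
      community.docker.docker_container:
        name: prometheus
        image: prom/prometheus:latest
        restart_policy: unless-stopped
        published_ports:
          - "9090:9090"
        volumes:
          - /etc/prometheus:/etc/prometheus
          - prometheus-data:/prometheus   # named volume for the local TSDB
```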
This is how you can implement your automation pipeline. It could take less than 20 seconds to deploy a new container or update it with our pre-deployed configuration.
Thanks to Grafana and Prometheus, we've got all the metrics we need (and more) for all our Virtual Machine Scale Sets: information both on the instances themselves and on each microservice running on every instance.
We monitor about 370 microservices in every environment, from development to production, and we’ve got metrics for all Scale Sets. We have a better view of what is happening in production and, as a bonus, we understand what's going on with each service in development before it’s deployed to production. This means we can be notified and react quickly.
We’re not worried about targeting and auto-naming when auto-scaling goes up or down on our Virtual Machine Scale Sets: we’re still able to monitor each service and instance. We've also configured threshold alerts and can receive notifications for a huge number of different events. On top of that, we get great customization options: Grafana lets us build dashboards the way we want, writing queries to pull all the metrics we need.
And thanks to these tools, we can continuously deploy, rework, redesign, and upgrade our infrastructure.