In this article, I will show you how to set up an open source time series platform to monitor your Docker Swarm cluster and send Slack notifications when an anomaly is detected.
Components of our monitoring stack:
Telegraf: plugin-driven server agent for collecting and reporting metrics.
InfluxDB: scalable time series database for metrics, events, and real-time analytics.
Chronograf: real-time visualization tool for building graphs on top of data.
Kapacitor: framework for processing, monitoring, and alerting on time series data.
Slack: real-time team messaging application.
Note: all the code used in this post is available on my GitHub.
1 — Swarm Setup
If you already have a Swarm cluster, you can skip this part. If not, use the following script to set up a Swarm with 3 nodes (1 manager & 2 workers):
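A minimal sketch of such a script, assuming Docker Machine with the VirtualBox driver. The driver, the node names (manager, worker1, worker2) and the join_cmd helper are illustrative assumptions, and the script in the repository may differ:

```shell
#!/bin/sh
# Sketch: 3-node Swarm (1 manager, 2 workers) built with Docker Machine.
# Assumptions: docker-machine is installed; driver and node names are illustrative.
DRIVER="${DRIVER:-virtualbox}"
MANAGER="manager"
WORKERS="worker1 worker2"

# Build the command a worker runs to join the Swarm (2377 is Swarm's management port).
join_cmd() {
  printf 'docker swarm join --token %s %s:2377' "$1" "$2"
}

setup_swarm() {
  # Create one VM per node.
  for node in $MANAGER $WORKERS; do
    docker-machine create --driver "$DRIVER" "$node" || return 1
  done

  # Initialise the Swarm on the manager and fetch the worker join token.
  manager_ip=$(docker-machine ip "$MANAGER") || return 1
  docker-machine ssh "$MANAGER" "docker swarm init --advertise-addr $manager_ip" || return 1
  token=$(docker-machine ssh "$MANAGER" "docker swarm join-token -q worker") || return 1

  # Join each worker, then list the nodes from the manager.
  for node in $WORKERS; do
    docker-machine ssh "$node" "$(join_cmd "$token" "$manager_ip")" || return 1
  done
  docker-machine ssh "$MANAGER" "docker node ls"
}

if command -v docker-machine >/dev/null 2>&1; then
  setup_swarm
else
  echo "docker-machine not found; install it before running this script" >&2
fi
```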
Issue the following commands:
chmod +x setup.sh
./setup.sh
The output of the above command is as follows:
2 — Stack Setup
Once created, connect to your manager node via SSH, and clone the following repository:
To start all of these containers I’m using docker-compose:
Issue the following command to deploy the stack:
docker stack deploy --compose-file docker-compose.yml tick
Wait for the nodes to pull the images from Docker Hub:
Once the images are pulled, you should see the services running:
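The same check can be done from the manager's shell. The sketch below assumes the stack name tick used in the deploy command; all_converged is a hypothetical helper that succeeds only when every service reports as many running replicas as desired:

```shell
# Succeeds when every "current/desired" replica pair read from stdin matches.
# Empty input (no services found) counts as a failure.
all_converged() {
  awk '{ split($1, r, "/"); if (r[1] != r[2]) bad = 1 } END { exit (NR == 0 || bad) }'
}

if command -v docker >/dev/null 2>&1; then
  # List the services of the stack with their replica counts.
  docker stack services tick

  # Check the replica column only.
  if docker stack services --format '{{.Replicas}}' tick | all_converged; then
    echo "all services are up"
  else
    echo "some services are still starting; re-run in a few seconds"
  fi
fi
```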
Open http://IP:8888 (the Chronograf dashboard) in your browser and configure the data source:
3 — System usage Dashboard
Click on “create dashboard”, and assign a name to the dashboard:
Before adding graphs, we will use a feature called dashboard template variables to create dynamic, interactive graphs. Instead of hard-coding things like node name and container name in our metric queries, we will use variables in their place. Click on “Template Variables” at the top of the dashboard created earlier, and create a variable called :host: as follows:
Note: currently, there’s no way to set the hostname for services created in Swarm global mode (GitHub). That’s why we get a list of IDs instead of names.
You can now use the dropdown at the top of the dashboard to select the different options for the :host: template variable:
Now it’s time to create our first graph, so click on the “Add Graph” button.
3.1 — Memory usage per Node
To create a query, you can either use the Query Builder or, if you’re already familiar with InfluxQL, you can manually enter the query in the text input:
SELECT mean("free") AS "mean_free", mean("used") AS "mean_used", mean("total") AS "mean_total" FROM "vm_metrics"."autogen"."mem_vm" WHERE time > :dashboardTime: AND "host"=:host: GROUP BY :interval: FILL(null)
Our query calculates the average of the field keys free, used, and total in the measurement mem_vm, restricted to the selected node and grouped by time interval.
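To see what the template variables stand for, the same query can be sent straight to InfluxDB's HTTP API with concrete values substituted. This is a sketch under assumptions: the InfluxDB URL, the 15-minute window, the 1-minute interval and the node-1 host value are placeholders, not values taken from the dashboard:

```shell
INFLUX_URL="${INFLUX_URL:-http://localhost:8086}"  # assumption: InfluxDB's HTTP port is reachable
HOST_ID="${HOST_ID:-node-1}"                        # placeholder for a value of the :host: variable

# :dashboardTime: becomes a relative time, :host: a quoted tag value,
# and :interval: a time() bucket.
QUERY="SELECT mean(\"free\") AS \"mean_free\", mean(\"used\") AS \"mean_used\", mean(\"total\") AS \"mean_total\" FROM \"vm_metrics\".\"autogen\".\"mem_vm\" WHERE time > now() - 15m AND \"host\"='${HOST_ID}' GROUP BY time(1m) FILL(null)"

echo "$QUERY"

# Run the query; the JSON response holds one row per 1-minute bucket.
curl -sG --max-time 5 "$INFLUX_URL/query" \
  --data-urlencode "db=vm_metrics" \
  --data-urlencode "q=$QUERY" \
  || echo "InfluxDB not reachable at $INFLUX_URL" >&2
```

Any window or bucket size works here, as long as the bucket is shorter than the window.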
You can change the graph type and the format of the X and Y axes by clicking on the “Options” tab:
One visualization on a dashboard isn’t spectacularly interesting, so I added a couple more graphs to show you more possibilities:
3.2 — CPU usage per Node
SELECT mean("usage_user") AS "mean_usage_user", mean("usage_system") AS "mean_usage_system" FROM "vm_metrics"."autogen"."cpu_vm" WHERE time > :dashboardTime: AND "host"=:host: GROUP BY :interval: FILL(null)
3.3 — Disk usage per Node
SELECT mean("free") AS "mean_free", mean("total") AS "mean_total", mean("used") AS "mean_used" FROM "vm_metrics"."autogen"."disk_vm" WHERE time > :dashboardTime: AND "host"=:host: GROUP BY :interval: FILL(null)
We end up with a beautiful dashboard like this:
Let’s create another dashboard to monitor the Docker containers running on the cluster.
4 — Swarm Services Dashboard
Create a second dashboard called “Services”, and create a template variable to store the list of services running on the cluster:
You can now filter metrics by service name:
4.1 — Memory usage per Service
SELECT mean("usage_percent") AS "mean_usage_percent" FROM "docker_metrics"."autogen"."docker_container_mem_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
4.2 — CPU usage per Service
SELECT mean("usage_percent") AS "mean_usage_percent" FROM "docker_metrics"."autogen"."docker_container_cpu_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
4.3 — Network Transmit/Receive
SELECT mean("tx_packets") AS "mean_tx_packets", mean("rx_packets") AS "mean_rx_packets" FROM "docker_metrics"."autogen"."docker_container_net_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
4.4 — IO Read/Write per Service
SELECT mean("io_serviced_recursive_write") AS "mean_io_serviced_recursive_write", mean("io_serviced_recursive_read") AS "mean_io_serviced_recursive_read" FROM "docker_metrics"."autogen"."docker_container_blkio_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
Result:
Note: you can take this further and filter metrics by the node on which the service is running by creating another template variable:
Let’s see what happens if we create another service on the Swarm:
docker service create --name api --constraint node.role==worker -p 5000:5000 mlabouardy/books-api
If you go back to Chronograf, you should see the service has been added automatically to the container dropdown list:
And that’s it! You now have the foundation for building beautiful data visualizations and dashboards with Chronograf.
Kapacitor is the last piece of the puzzle. We now know how to store, query, and visualize metrics; the next step is to act on them with alerting and proactive monitoring.
So on the “Configuration” tab, click on “Add config”:
Add a new Kapacitor instance as below, and enable Slack:
Note: update the Slack channel & Webhook URL if you didn’t update the kapacitor.conf file at the beginning of this tutorial. You can get a Webhook URL by going to this page:
5 — Alerts definition
5.1 — High Memory Utilization Alert
Navigate to the “Rule Configuration” page by visiting the “Alerting” page and clicking on the “Build Rule” button in the top right corner:
We will trigger an alert if the memory usage is over 60%:
Next, we select Slack as the event handler and configure the alert message:
Note: there’s no need to include a Slack channel in the Alert Message section if you specified a default channel in the initial Slack configuration.
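Behind the scenes, the rule builder produces a Kapacitor TICKscript. A hand-written approximation of the rule above might look like the sketch below. The used_percent field, the task name mem_alert and the message text are assumptions, and the script Chronograf generates is more elaborate:

```shell
# Write a TICKscript that alerts when memory usage on a node exceeds 60%.
cat > mem_alert.tick <<'EOF'
stream
    |from()
        .database('vm_metrics')
        .retentionPolicy('autogen')
        .measurement('mem_vm')
        .groupBy('host')
    |alert()
        // "used_percent" is an assumed field name; adjust it to your schema.
        .crit(lambda: "used_percent" > 60)
        .message('{{ .Level }}: memory usage on {{ index .Tags "host" }} is {{ index .Fields "used_percent" }}%')
        .stateChangesOnly()
        .slack()
EOF

# Register and enable the task if the kapacitor CLI is available.
if command -v kapacitor >/dev/null 2>&1; then
  kapacitor define mem_alert -type stream -tick mem_alert.tick -dbrp vm_metrics.autogen
  kapacitor enable mem_alert
fi
```

With stateChangesOnly(), Slack receives a message when the threshold is crossed and another when the value drops back, rather than one per data point.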
5.2 — High CPU Utilization Alert
Create a second rule to trigger an alert if CPU usage is over 40%:
Alert endpoint:
Save the rule, and you’re all set!
Now that our alert rules are defined, let’s test them out by creating some load on our cluster.
6 — Stress Testing
I used stress, a tool for generating workload. It can produce CPU, memory, I/O, and disk stress.
6.1 — Stressing the CPU
docker run --rm -it progrium/stress --cpu 4 --timeout 20s
Note: replace ‘4’ with a value that matches the number of cores on your CPU.
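One way to pick that number is to read the core count on the node itself. A small sketch, assuming a Linux node where nproc (GNU coreutils) or getconf is available:

```shell
# Number of CPU cores on this node; falls back to getconf if nproc is missing.
CORES=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "stressing $CORES cores"

# Start one stress worker per core.
if command -v docker >/dev/null 2>&1; then
  docker run --rm -it progrium/stress --cpu "$CORES" --timeout 20s
fi
```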
After a few seconds, you should receive a Slack notification:
Kapacitor triggers an alert and also sends a recovery notification (status OK) once the alert condition is resolved.
6.2 — Stressing the Memory
docker run --rm -it progrium/stress --vm 3 --timeout 20s
It will stress memory using three processes, each allocating about 256 MB (override with the --vm-bytes option).
Let it run for a couple of seconds:
That’s it! You’ve successfully set up a highly scalable, distributed monitoring platform for your Swarm cluster using only open source projects.
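For example, to allocate less memory per worker, the total can be worked out up front. A sketch in which the 128 MB per-worker figure is an arbitrary example value:

```shell
WORKERS=3
MB_PER_WORKER=128

# Total memory the workers will allocate: 3 x 128 MB.
TOTAL_MB=$((WORKERS * MB_PER_WORKER))
echo "total allocation: ${TOTAL_MB} MB"

if command -v docker >/dev/null 2>&1; then
  docker run --rm -it progrium/stress --vm "$WORKERS" --vm-bytes "${MB_PER_WORKER}M" --timeout 20s
fi
```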