In this article, I will show you how to set up an open source time series platform to monitor your Docker Swarm cluster and send notifications to Slack in case of anomaly detection.

Components of our monitoring stack:

- Telegraf: plugin-driven server agent for collecting and reporting metrics.
- InfluxDB: scalable time series database for metrics, events, and real-time analytics.
- Chronograf: real-time visualization tool for building graphs on top of data.
- Kapacitor: framework for processing, monitoring, and alerting on time series data.
- Slack: real-time team messaging application.

Note: all the code used in this post is available on my GitHub.

1 — Swarm Setup

If you already have an existing Swarm cluster, you can skip this part. If not, use the following script to set up a Swarm cluster with 3 nodes (1 manager & 2 workers). Issue the following commands:

```
chmod +x setup.sh
./setup.sh
```

2 — Stack Setup

Once the cluster is created, connect to your manager node via SSH and clone the following repository:

```
git clone https://github.com/mlabouardy/swarm-tick.git
```

To start all of these containers I'm using docker-compose. Issue the following command to deploy the stack:

```
docker stack deploy --compose-file docker-compose.yml tick
```

Wait for the nodes to pull the images from DockerHub. Once pulled, you should see the services running.

Open your browser on http://IP:8888 (the Chronograf dashboard) and properly configure the data source.

3 — System usage Dashboard

Click on "create dashboard" and assign a name to the dashboard.

Before adding graphs, we will use a concept called Dashboard Template Variables to create dynamic and interactive graphs. Instead of hard-coding things like node name and container name in our metric queries, we will use variables in their place. So click on "Template Variables" at the top of the dashboard created earlier, and create a variable called :host: .

Note: currently, there's no solution to set the hostname for services created in Swarm global mode (see the related GitHub issue). That's why we get a list of IDs instead of names.

You can now use the dropdown at the top of the dashboard to select the different options for the :host: template variable.

Now it's time to create our first graph, so click on the "Add Graph" button.

3.1 — Memory usage per Node

To create a query, you can either use the Query Builder or, if you're already familiar with InfluxQL, you can manually enter the query in the text input:

```
SELECT mean("free") AS "mean_free", mean("used") AS "mean_used", mean("total") AS "mean_total" FROM "vm_metrics"."autogen"."mem_vm" WHERE time > :dashboardTime: AND "host"=:host: GROUP BY :interval: FILL(null)
```

Our query calculates the average of the field keys free, used, and total in the measurement mem_vm, filtered by node name and grouped by time interval.

You can change the graph type and the X and Y axes format by clicking on the "Options" tab.

One visualization on a dashboard isn't spectacularly interesting, so I added a couple more graphs to show you more possibilities.

3.2 — CPU usage per Node

```
SELECT mean("usage_user") AS "mean_usage_user", mean("usage_system") AS "mean_usage_system" FROM "vm_metrics"."autogen"."cpu_vm" WHERE time > :dashboardTime: AND "host"=:host: GROUP BY :interval: FILL(null)
```

3.3 — Disk usage per Node

```
SELECT mean("free") AS "mean_free", mean("total") AS "mean_total", mean("used") AS "mean_used" FROM "vm_metrics"."autogen"."disk_vm" WHERE time > :dashboardTime: AND "host"=:host: GROUP BY :interval: FILL(null)
```

We end up with a beautiful dashboard like this.
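Before moving on, it can be useful to confirm that Telegraf is actually writing the measurements the queries above rely on. A minimal sketch, assuming the stack was deployed under the name tick (as above) and that the InfluxDB service is named influxdb in the compose file, so the container name starts with tick_influxdb (both assumptions; adjust to your setup):

```
# List the measurements Telegraf has written to the vm_metrics database.
# The name filter "tick_influxdb" is an assumed container name derived from
# the stack name; `docker ps -f name=...` matches on substrings.
docker exec -it $(docker ps -q -f name=tick_influxdb) \
  influx -database 'vm_metrics' -execute 'SHOW MEASUREMENTS'
```

If mem_vm, cpu_vm, and disk_vm show up in the output, the node dashboards have data to draw.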
Let's create another dashboard to monitor the Docker containers running on the cluster.

4 — Swarm Services Dashboard

Create a second dashboard called "Services", and create a template variable to store the list of services running on the cluster. You can now filter metrics by service name.

4.1 — Memory usage per Service

```
SELECT mean("usage_percent") AS "mean_usage_percent" FROM "docker_metrics"."autogen"."docker_container_mem_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
```

4.2 — CPU usage per Service

```
SELECT mean("usage_percent") AS "mean_usage_percent" FROM "docker_metrics"."autogen"."docker_container_cpu_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
```

4.3 — Network Transmit/Receive

```
SELECT mean("tx_packets") AS "mean_tx_packets", mean("rx_packets") AS "mean_rx_packets" FROM "docker_metrics"."autogen"."docker_container_net_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
```

4.4 — IO Read/Write per Service

```
SELECT mean("io_serviced_recursive_write") AS "mean_io_serviced_recursive_write", mean("io_serviced_recursive_read") AS "mean_io_serviced_recursive_read" FROM "docker_metrics"."autogen"."docker_container_blkio_docker" WHERE time > :dashboardTime: AND "com.docker.swarm.service.name" = :container: GROUP BY :interval: FILL(null)
```

Note: you can take this further and filter metrics by the node on which the service is running by creating another template variable.

Let's see what happens if we create another service on the Swarm:

```
docker service create --name api --constraint node.role==worker -p 5000:5000 mlabouardy/books-api
```

If you go back to Chronograf, you should see that the service has been added automatically to the container dropdown list.

And that's it! You now have the foundation for building beautiful data visualizations and dashboards with Chronograf.

Kapacitor is the last piece of the puzzle. We now know how to store, query, and read metrics; now we need to act on them to do something like alerting or proactive monitoring. So on the "Configuration" tab, click on "Add config". Add a new Kapacitor instance as below, and enable Slack.

Note: update the Slack channel & Webhook URL in case you didn't update the kapacitor.conf file at the beginning of this tutorial. You can get a Webhook URL from the Slack Incoming Webhooks page.

5 — Alerts definition

5.1 — High Memory Utilization Alert

Navigate to the "Rule Configuration" page by visiting the "Alerting" page and clicking on the "Build Rule" button in the top right corner. We will trigger an alert if the memory usage is over 60%.

Next, we select Slack as the event handler and configure the alert message.

Note: there's no need to include a Slack channel in the Alert Message section if you specified a default channel in the initial Slack configuration.

5.2 — High CPU Utilization Alert

Create a second rule to trigger an alert if CPU usage is over 40%. Configure the alert endpoint as before, save the rule, and you're all set!

Now that our alert rules are defined, let's test them out by creating some load on our cluster.

6 — Stress Testing

I used stress, a tool for generating workload. It can produce CPU, memory, I/O, and disk stress.

6.1 — Stressing the CPU

```
docker run --rm -it progrium/stress --cpu 4 --timeout 20s
```

Note: depending on the type of your CPU, make sure to replace '4' accordingly.

After a few seconds, you should receive a Slack notification: Kapacitor triggers an alert, and also sends a recovery message (Status OK) once the alert is resolved.
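If you'd rather not hard-code the worker count, you can derive it from the host instead. A small variation on the command above, assuming the node runs Linux with the coreutils nproc utility available:

```
# Spawn one stress worker per available CPU core; nproc reports the
# number of processing units on the host running the container.
docker run --rm -it progrium/stress --cpu $(nproc) --timeout 20s
```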
6.2 — Stressing the Memory

```
docker run --rm -it progrium/stress --vm 3 --timeout 20s
```

It will stress the memory using three processes, each allocating about 256 MB (override this with the --vm-bytes option). Let it run for a couple of seconds.

That's it! You've successfully set up a highly scalable, distributed monitoring platform for your Swarm cluster using only open source projects.
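When you're done experimenting, you can tear everything down. A hedged cleanup sketch: the stack name tick comes from section 2, while the node names below are hypothetical and should match whatever your setup.sh created:

```
# Remove the monitoring stack from the cluster.
docker stack rm tick

# If setup.sh provisioned the nodes with docker-machine, remove them too.
# "manager", "worker1", and "worker2" are assumed names; check with
# `docker-machine ls` first.
docker-machine rm manager worker1 worker2
```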