When You Can't Rely on the Prometheus Up Metric

by Esca, November 14th, 2021

Too Long; Didn't Read

The up metric has the value 1 whenever Prometheus can reach a pod to scrape its metrics. When scraping goes through a Kubernetes service, each scrape collects metrics from only one pod, so this setup cannot tell how many pods are ready. When Prometheus scrapes each pod directly, it can reach all the pods, and the up metric of each pod has the value 1 even when the pod is not in the ready state or its readiness probe is failing. That is a false positive.


The up metric has the value 1 when Prometheus can reach the pod to scrape its metrics. It can be useful for monitoring pod readiness (in some cases) if the scraping is done through the Kubernetes service, but it causes a false positive when Prometheus scrapes directly from the pod.

This is the request flow when the metric scraping is done via a Kubernetes service.

Here, the Kubernetes service works as a load balancer that routes each scraping request to one of the pods. So each time, Prometheus collects the metrics from only one pod, and this setup cannot tell how many pods are ready.

That's when scraping directly from the pods comes into the picture.

With this topology, Prometheus can reach all the pods, and the up metric of each pod will have the value 1 even when the pod is not in the ready state or its readiness probe is failing. This does not happen when scraping through a Kubernetes service, because the service won't send the request to pods that are not ready; it returns 503 instead.
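To make the difference concrete, here is a small toy model in plain Java. It is only a sketch: the Pod class, its reachable and ready flags, and the method names are hypothetical stand-ins, not Prometheus or Kubernetes APIs. It shows that a direct scrape reports up = 1 for any reachable pod, while a service would only route to pods that are ready.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UpVersusReady {
    /** Hypothetical pod state: reachable over the network, and passing its readiness probe. */
    static final class Pod {
        final String name;
        final boolean reachable;
        final boolean ready;

        Pod(String name, boolean reachable, boolean ready) {
            this.name = name;
            this.reachable = reachable;
            this.ready = ready;
        }
    }

    /** Direct scrape: up is 1 whenever the pod can be reached, regardless of readiness. */
    static int upWhenScrapedDirectly(Pod pod) {
        return pod.reachable ? 1 : 0;
    }

    /** Names of the pods a service would route to: only reachable pods that are ready. */
    static List<String> serviceEndpoints(List<Pod> pods) {
        List<String> names = new ArrayList<>();
        for (Pod pod : pods) {
            if (pod.reachable && pod.ready) {
                names.add(pod.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<Pod> pods = Arrays.asList(
                new Pod("app-0", true, true),
                new Pod("app-1", true, false)); // reachable, but readiness probe failing
        for (Pod pod : pods) {
            System.out.println(pod.name + " up=" + upWhenScrapedDirectly(pod) + " ready=" + pod.ready);
        }
        System.out.println("service endpoints: " + serviceEndpoints(pods));
    }
}
```

In this model app-1 is the false positive: it is scraped successfully, so up stays at 1, yet it would receive no traffic through the service.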

To avoid this false positive, we need to introduce a custom gauge metric that indicates the readiness of the pod. I chose the descriptive name pod_readiness for it. But how do we update the value of the metric?

In the picture above, I use a servlet filter to catch the HTTP response from the actuator and set the value of the pod_readiness metric accordingly.
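The core of that filter logic can be sketched in plain Java. This is a stdlib-only stand-in, not the actual filter: the class name PodReadinessTracker and the onReadinessResponse hook are hypothetical, and the rule "any 2xx status means ready, anything else means not ready" is an assumption. In a real application the hook would presumably be called from the servlet filter, and the value would be exposed through the metrics library the application already uses.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class PodReadinessTracker {
    // Backing value for the pod_readiness gauge: 1 = ready, 0 = not ready.
    private final AtomicInteger podReadiness = new AtomicInteger(0);

    /** Hypothetical hook: call with the HTTP status of each readiness response. */
    public void onReadinessResponse(int httpStatus) {
        // Assumption: a 2xx status means the pod is ready; anything else (e.g. 503) means it is not.
        boolean ready = httpStatus >= 200 && httpStatus < 300;
        podReadiness.set(ready ? 1 : 0);
    }

    /** Current gauge value. */
    public int value() {
        return podReadiness.get();
    }

    /** Renders the gauge in the Prometheus text exposition format. */
    public String toPrometheusText() {
        return "# TYPE pod_readiness gauge\n"
                + "pod_readiness " + value() + "\n";
    }

    public static void main(String[] args) {
        PodReadinessTracker tracker = new PodReadinessTracker();
        tracker.onReadinessResponse(200);
        System.out.print(tracker.toPrometheusText());
        tracker.onReadinessResponse(503);
        System.out.print(tracker.toPrometheusText());
    }
}
```

The gauge starts at 0 so that a pod which has never answered a readiness check is not counted as ready.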

Once the metric is collected from the pods, we can design Prometheus queries to monitor the number of ready pods and fire an alert if necessary.
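As an illustration of what such a query computes, the Java sketch below sums the scraped pod_readiness values and compares the total with a threshold. The pod names, the minReady threshold, and the method names are hypothetical. In PromQL the same idea would be roughly sum(pod_readiness) compared against a minimum, with label selectors that depend on your setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReadyPodCheck {
    /** Sums the pod_readiness samples, mirroring a sum() aggregation over the gauge. */
    static int readyPods(Map<String, Integer> podReadinessByPod) {
        int total = 0;
        for (int value : podReadinessByPod.values()) {
            total += value;
        }
        return total;
    }

    /** Hypothetical alert condition: fire when fewer than minReady pods report ready. */
    static boolean shouldAlert(Map<String, Integer> podReadinessByPod, int minReady) {
        return readyPods(podReadinessByPod) < minReady;
    }

    public static void main(String[] args) {
        // One pod_readiness sample per pod, as collected by scraping each pod directly.
        Map<String, Integer> samples = new LinkedHashMap<>();
        samples.put("app-0", 1);
        samples.put("app-1", 0);
        samples.put("app-2", 1);

        System.out.println("ready pods: " + readyPods(samples));
        System.out.println("alert: " + shouldAlert(samples, 3));
    }
}
```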

Cheers