When You Can't Rely on the Prometheus Up Metric

by Esca
The up metric has the value 1 when Prometheus can reach the pod to scrape its metrics. It can be useful for monitoring a pod's readiness (in some cases) if the scraping is done through a Kubernetes service, but it causes a false positive when Prometheus scrapes directly from the pod.
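For reference, up is a synthetic metric that Prometheus records for every scrape target, so querying it returns one series per target. The label values below are purely illustrative:

```
up{job="my-app", instance="10.0.0.12:8080"}  1   # last scrape succeeded
up{job="my-app", instance="10.0.0.13:8080"}  0   # last scrape failed
```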

This is the request flow when metric scraping is done via a Kubernetes service.

[Image: scraping requests routed through the Kubernetes service]

Here, the Kubernetes service works as a load balancer and routes each scraping request to one of the pods. So each time, Prometheus collects metrics from only one pod, and this setup cannot tell how many pods are ready.

That's where scraping directly from the pods comes into the picture.

[Image: Prometheus scraping each pod directly]

With this topology, Prometheus can reach all the pods, and the up metric of each pod will have the value 1 even when a pod is not in the ready state or its readiness probe is failing. This does not happen when scraping through the Kubernetes service, because the service won't send requests to un-ready pods; the scrape gets a 503 instead.
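For example, a readiness check built only on up never notices the problem. Assuming the pods carry the (illustrative) label job="my-app", this expression keeps returning the full pod count even while every readiness probe is failing:

```
# Counts targets whose last scrape succeeded, not pods that are actually ready.
count(up{job="my-app"} == 1)
```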

To avoid this false positive, we need to introduce a custom gauge metric that indicates the readiness of the pod. I chose the descriptive name pod_readiness for it. But how do we update the value of the metric?
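Before it can be updated, the gauge has to be registered with the application's meter registry. Here is a minimal sketch using Micrometer (which Spring Boot Actuator ships with); the class and method names are my own, not taken from the article:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper around the pod_readiness gauge.
public class PodReadinessMetric {

    private final AtomicInteger readiness = new AtomicInteger(0);

    public PodReadinessMetric(MeterRegistry registry) {
        // Register a gauge named "pod_readiness" backed by the AtomicInteger,
        // so each scrape exposes whatever value was last stored in it.
        Gauge.builder("pod_readiness", readiness, AtomicInteger::get)
                .description("1 when the pod reports itself ready, 0 otherwise")
                .register(registry);
    }

    public void markReady(boolean ready) {
        readiness.set(ready ? 1 : 0);
    }
}
```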

[Image: servlet filter updating pod_readiness from the actuator response]

In the picture above, I use a servlet filter to catch the HTTP response from the actuator and set the pod_readiness metric's value accordingly.
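A minimal sketch of such a filter, assuming a javax.servlet based Spring Boot app that exposes its readiness check under /actuator/health and the PodReadinessMetric wrapper from the previous sketch; the filter name and the path are assumptions, not the author's exact code:

```java
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

// Hypothetical filter that mirrors the actuator health status into pod_readiness.
public class ReadinessMetricFilter implements Filter {

    private final PodReadinessMetric podReadiness;

    public ReadinessMetricFilter(PodReadinessMetric podReadiness) {
        this.podReadiness = podReadiness;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Let the actuator handle the request first, then inspect its response.
        chain.doFilter(request, response);

        HttpServletRequest req = (HttpServletRequest) request;
        HttpServletResponse res = (HttpServletResponse) response;

        if (req.getRequestURI().startsWith("/actuator/health")) {
            // 2xx means the health check passed; anything else (e.g. 503) means not ready.
            podReadiness.markReady(res.getStatus() >= 200 && res.getStatus() < 300);
        }
    }
}
```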

Once the metric is collected from the pods, we can design Prometheus queries to monitor the number of ready pods and fire an alert if necessary.
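As a sketch, assuming the same illustrative job="my-app" label and an expected replica count of 3, an alerting expression could look like this:

```
# Fires when fewer than 3 pods currently report themselves ready.
sum(pod_readiness{job="my-app"}) < 3
```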

Cheers
