Introduction

The need for Prometheus High Availability

Kubernetes adoption has grown multifold in the past few months and it is now clear that Kubernetes is the de facto standard for container orchestration. That being said, Prometheus is also considered an excellent choice for monitoring both containerized and non-containerized workloads. Monitoring is an essential aspect of any infrastructure, and we should make sure that our monitoring set-up is highly available and highly scalable in order to match the needs of an ever-growing infrastructure, especially in the case of Kubernetes.

Therefore, today we will deploy a clustered Prometheus set-up which is not only resilient to node failures, but also ensures appropriate data archiving for future reference. Our set-up is also very scalable, to the extent that we can span multiple Kubernetes clusters under the same monitoring umbrella.

Present scenario

The majority of Prometheus deployments use persistent volumes for pods, while Prometheus is scaled using a federated set-up. However, not all data can be aggregated using a federated mechanism, and you often need a separate mechanism to manage Prometheus configuration when you add additional servers.

The Solution

Thanos aims at solving the above problems. With the help of Thanos, we can not only multiply instances of Prometheus and de-duplicate data across them, but also archive data in long-term storage such as GCS or S3.

Implementation

Thanos Architecture

Image Source: https://thanos.io/quick-tutorial.md/

Thanos consists of the following components:

Thanos Sidecar: This is the main component that runs alongside Prometheus. It reads and archives data on the object store. Moreover, it manages Prometheus' configuration and lifecycle. To distinguish each Prometheus instance, the sidecar component injects external labels into the Prometheus configuration. This component is capable of running queries on the Prometheus servers' PromQL interface. Sidecar components also listen on the Thanos gRPC protocol and translate queries between gRPC and REST.

Thanos Store: This component implements the Store API on top of historical data in an object storage bucket. It acts primarily as an API gateway and therefore does not need significant amounts of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts, at the cost of increased startup times.

Thanos Query: The Query component listens on HTTP and translates queries to the Thanos gRPC format. It aggregates the query results from different sources, and can read data from Sidecar and Store. In an HA setup, it even deduplicates the result.

Run-time deduplication of HA groups

Prometheus is stateful and does not allow replicating its database. This means that increasing high availability by running multiple Prometheus replicas is not very easy to do. Simple load balancing will not work: after a crash, for example, a replica might be up again, but querying it will return a gap for the period it was down. A second replica might have been up during that window, but it could be down at another moment (e.g. during a rolling restart), so load balancing on top of those replicas will not work well.

Thanos Querier instead pulls the data from both replicas and deduplicates those signals, filling the gaps, if any, transparently for the Querier consumer.
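To make the deduplication concrete: with the external labels we configure later in this article (cluster: prometheus-ha, replica: <pod name>), the same target scraped by two Prometheus replicas produces two series that differ only in the replica label, roughly like this (a sketch; the job and instance values depend on your scrape configs):

up{cluster="prometheus-ha", replica="prometheus-0", job="kubernetes-apiservers", instance="..."}  1
up{cluster="prometheus-ha", replica="prometheus-1", job="kubernetes-apiservers", instance="..."}  1

When the Querier is started with --query.replica-label=replica (as we do further below), it treats such pairs as one logical series and fills gaps from whichever replica has the data.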
Thanos Compact: The compactor component of Thanos applies the compaction procedure of the Prometheus 2.0 storage engine to block data stored in object storage. It is generally not semantically concurrency-safe and must be deployed as a singleton against a bucket. It is also responsible for downsampling data: performing 5m downsampling after 40 hours and 1h downsampling after 10 days.

Thanos Ruler: It basically does the same thing as Prometheus' rules. The only difference is that it can communicate with Thanos components.

Configuration

Prerequisite

In order to completely follow this tutorial, the following are needed:

- Working knowledge of Kubernetes and kubectl.
- A running Kubernetes cluster with at least 3 nodes (a GKE cluster is used for the purpose of this demo).
- An Ingress Controller and ingress objects (the Nginx Ingress Controller is used for the purpose of this demo). Although this is not mandatory, it is highly recommended in order to decrease the number of external endpoints created.
- Credentials for the Thanos components to access the object store (in this case a GCS bucket); a gcloud/gsutil sketch of these steps follows after the manifest below:
  - Create 2 GCS buckets and name them prometheus-long-term and thanos-ruler.
  - Create a service account with the role Storage Object Admin.
  - Download the key file as JSON credentials and name it thanos-gcs-credentials.json.
  - Create a Kubernetes secret using the credentials:
    kubectl create secret generic thanos-gcs-credentials --from-file=thanos-gcs-credentials.json -n monitoring

Deploying the various components

Deploying Prometheus Service Accounts, Clusterrole and Clusterrolebinding

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: monitoring
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: monitoring
  apiGroup: rbac.authorization.k8s.io

The above manifest creates the monitoring namespace, along with the service account, clusterrole and clusterrolebinding needed by Prometheus.
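Circling back to the prerequisites: one possible way to create the buckets, the service account and the key file with the gcloud/gsutil CLIs is sketched below. The service-account name thanos-sa and <your-project-id> are placeholders of my choosing, not something the article mandates; also note that GCS bucket names are global, so you may need a unique suffix and a matching change to the --objstore.config bucket values used later.

# create the two buckets used by the sidecar/store and the ruler
gsutil mb gs://prometheus-long-term
gsutil mb gs://thanos-ruler
# create a service account and grant it Storage Object Admin
gcloud iam service-accounts create thanos-sa --display-name "thanos object store access"
gcloud projects add-iam-policy-binding <your-project-id> \
  --member "serviceAccount:thanos-sa@<your-project-id>.iam.gserviceaccount.com" \
  --role "roles/storage.objectAdmin"
# download the JSON key that the Kubernetes secret is created from
gcloud iam service-accounts keys create thanos-gcs-credentials.json \
  --iam-account "thanos-sa@<your-project-id>.iam.gserviceaccount.com"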
Deploying Prometheus Configuration configmap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yaml.tmpl: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
      external_labels:
        cluster: prometheus-ha
        # Each Prometheus has to have unique labels.
        replica: $(POD_NAME)

    rule_files:
      - /etc/prometheus/rules/*rules.yaml

    alerting:
      # We want our alerts to be deduplicated
      # from different replicas.
      alert_relabel_configs:
      - regex: replica
        action: labeldrop

      alertmanagers:
        - scheme: http
          path_prefix: /
          static_configs:
            - targets: ['alertmanager:9093']

    scrape_configs:
    - job_name: kubernetes-nodes-cadvisor
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        # Only for Kubernetes ^1.7.3.
        # See: https://github.com/prometheus/prometheus/issues/2916
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
        - action: replace
          source_labels: [id]
          regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
          target_label: rkt_container_name
          replacement: '${2}-${1}'
        - action: replace
          source_labels: [id]
          regex: '^/system\.slice/(.+)\.service$'
          target_label: systemd_service_name
          replacement: '${1}'

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2

    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: (.+)(?::\d+);(\d+)
          replacement: $1:$2

The above ConfigMap creates the Prometheus configuration file template. This template will be read by the Thanos sidecar component, which will generate the actual configuration file; that file will in turn be consumed by the Prometheus container running in the same pod. It is extremely important to add the external_labels section to the config file so that the Querier can deduplicate data based on it.
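For illustration, once the sidecar's reloader has substituted the POD_NAME environment variable, the rendered configuration for the first replica begins roughly like this (a sketch; only the replica value differs between pods):

global:
  scrape_interval: 5s
  evaluation_interval: 5s
  external_labels:
    cluster: prometheus-ha
    replica: prometheus-0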
Deploying Prometheus Rules configmap

This will create our alert rules, which will be relayed to Alertmanager for delivery.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  labels:
    name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yaml: |-
    groups:
      - name: Deployment
        rules:
        - alert: Deployment at 0 Replicas
          annotations:
            summary: Deployment {{$labels.deployment}} in {{$labels.namespace}} is currently having no pods running
          expr: |
            sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
          for: 1m
          labels:
            team: devops
        - alert: HPA Scaling Limited
          annotations:
            summary: HPA named {{$labels.hpa}} in {{$labels.namespace}} namespace has reached scaling limited state
          expr: |
            (sum(kube_hpa_status_condition{condition="ScalingLimited",status="true"}) by (hpa,namespace)) == 1
          for: 1m
          labels:
            team: devops
        - alert: HPA at MaxCapacity
          annotations:
            summary: HPA named {{$labels.hpa}} in {{$labels.namespace}} namespace is running at Max Capacity
          expr: |
            ((sum(kube_hpa_spec_max_replicas) by (hpa,namespace)) - (sum(kube_hpa_status_current_replicas) by (hpa,namespace))) == 0
          for: 1m
          labels:
            team: devops
      - name: Pods
        rules:
        - alert: Container restarted
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} was restarted
          expr: |
            sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0
          for: 0m
          labels:
            team: dev
        - alert: High Memory Usage of Container
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} is using more than 75% of Memory Limit
          expr: |
            ((( sum(container_memory_usage_bytes{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name) / sum(container_spec_memory_limit_bytes{image!="",container_name!="POD",namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100 ) < +Inf ) > 75
          for: 5m
          labels:
            team: dev
        - alert: High CPU Usage of Container
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} is using more than 75% of CPU Limit
          expr: |
            ((sum(irate(container_cpu_usage_seconds_total{image!="",container_name!="POD", namespace!="kube-system"}[30s])) by (namespace,container_name,pod_name) / sum(container_spec_cpu_quota{image!="",container_name!="POD", namespace!="kube-system"} / container_spec_cpu_period{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100) > 75
          for: 5m
          labels:
            team: dev
      - name: Nodes
        rules:
        - alert: High Node Memory Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 80% memory used. Plan Capacity
          expr: |
            (sum (container_memory_working_set_bytes{id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{}) by (kubernetes_io_hostname) * 100) > 80
          for: 5m
          labels:
            team: devops
        - alert: High Node CPU Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 80% allocatable cpu used. Plan Capacity.
          expr: |
            (sum(rate(container_cpu_usage_seconds_total{id="/", container_name!="POD"}[1m])) by (kubernetes_io_hostname) / sum(machine_cpu_cores) by (kubernetes_io_hostname) * 100) > 80
          for: 5m
          labels:
            team: devops
        - alert: High Node Disk Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 85% disk used. Plan Capacity.
          expr: |
            (sum(container_fs_usage_bytes{device=~"^/dev/[sv]d[a-z][1-9]$",id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum(container_fs_limit_bytes{container_name!="POD",device=~"^/dev/[sv]d[a-z][1-9]$",id="/"}) by (kubernetes_io_hostname)) * 100 > 85
          for: 5m
          labels:
            team: devops
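If you want to catch syntax mistakes before the pods pick these rules up, you can lint them locally with promtool, which ships with Prometheus (assuming you have saved the rule group above as alert-rules.yaml):

promtool check rules alert-rules.yaml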
Deploying Prometheus Stateful Set

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 3
  serviceName: prometheus-service
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"
    spec:
      serviceAccountName: monitoring
      containers:
        - name: prometheus
          image: prom/prometheus:v2.4.3
          args:
            - "--config.file=/etc/prometheus-shared/prometheus.yaml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
            - "--storage.tsdb.no-lockfile"
            - "--storage.tsdb.min-block-duration=2h"
            - "--storage.tsdb.max-block-duration=2h"
          ports:
            - name: prometheus
              containerPort: 9090
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus/
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - "sidecar"
            - "--log.level=debug"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://127.0.0.1:9090"
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            - "--reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl"
            - "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml"
            - "--reloader.rule-dir=/etc/prometheus/rules/"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
        - name: prometheus-config
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-config-shared
          emptyDir: {}
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 20Gi
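After applying the manifest, a quick way to confirm that all three replicas came up and that a persistent volume was provisioned for each (the namespace and label come from the manifests above; the file name is simply whatever you saved the manifest as):

kubectl apply -f prometheus-statefulset.yaml
kubectl -n monitoring get pods -l app=prometheus
kubectl -n monitoring get pvc    # expect one prometheus-storage-prometheus-N claim per replica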
"--storage.tsdb.max-block-duration=2h" "sidecar" "--log.level=debug" "--tsdb.path=/prometheus" "--prometheus.url=http://127.0.0.1:9090" "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}" "--reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl" "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml" "--reloader.rule-dir=/etc/prometheus/rules/" false true "ReadWriteOnce" It is the following about the manifest provided above: important to understand Prometheus is deployed as a stateful set with 3 replicas and each replica provisions its own persistent volume dynamically. Prometheus configuration is generated by the Thanos sidecar container using the template file we created above. Thanos handles data compaction and therefore we need to set --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h Prometheus stateful set is labelled as so that each pod gets discovered by the headless service, which we will create next. It is this headless service which will be used by the to query data across all Prometheus instances. We also apply the same label to the and component so that they are also discovered by the Querier and can be used for querying metrics. thanos-store-api: true Thanos Querier Thanos Store Thanos Ruler GCS bucket credentials path is provided using the environment variable, and the configuration file is mounted to it from the secret which we created as a part of prerequisites. GOOGLE_APPLICATION_CREDENTIALS Deploying Prometheus Services apiVersion: v1 kind: Service metadata: name: prometheus-0-service annotations: prometheus.io/scrape: prometheus.io/port: namespace: monitoring labels: name: prometheus spec: selector: statefulset.kubernetes.io/pod-name: prometheus-0 ports: - name: prometheus port: 8080 targetPort: prometheus --- apiVersion: v1 kind: Service metadata: name: prometheus-1-service annotations: prometheus.io/scrape: prometheus.io/port: namespace: monitoring labels: name: prometheus spec: selector: statefulset.kubernetes.io/pod-name: prometheus-1 ports: - name: prometheus port: 8080 targetPort: prometheus --- apiVersion: v1 kind: Service metadata: name: prometheus-2-service annotations: prometheus.io/scrape: prometheus.io/port: namespace: monitoring labels: name: prometheus spec: selector: statefulset.kubernetes.io/pod-name: prometheus-2 ports: - name: prometheus port: 8080 targetPort: prometheus --- apiVersion: v1 kind: Service metadata: name: thanos-store-gateway namespace: monitoring spec: : ClusterIP clusterIP: None ports: - name: grpc port: 10901 targetPort: grpc selector: thanos-store-api: "true" "9090" "true" "9090" "true" "9090" #This service creates a srv record for querier to find about store-api's type "true" We create different services for each Prometheus pod in the stateful set, although it is not needed. These are created only for debugging purposes. The purpose of headless service has been explained above. We will later expose Prometheus services using an ingress object. 
Deploying Thanos Querier

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos
        image: quay.io/thanos/thanos:v0.8.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=replica
        - --store=dnssrv+thanos-store-gateway:10901
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        livenessProbe:
          httpGet:
            port: http
            path: /-/healthy
        readinessProbe:
          httpGet:
            port: http
            path: /-/ready
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-querier
  name: thanos-querier
  namespace: monitoring
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: http
    name: http
  selector:
    app: thanos-querier

This is one of the main components of the Thanos deployment. Note the following:

- The container argument --store=dnssrv+thanos-store-gateway:10901 helps discover all the components from which metric data should be queried.
- The thanos-querier service provides a web interface to run PromQL queries. It also has the option to de-duplicate data across the various Prometheus instances.
- This is the endpoint we will provide to Grafana as the datasource for all dashboards.

Deploying Thanos Store Gateway

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: monitoring
  labels:
    app: thanos-store-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store-gateway
  serviceName: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - "store"
            - "--log.level=debug"
            - "--data-dir=/data"
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            - "--index-cache-size=500MB"
            - "--chunk-pool-size=500MB"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials

This will create the store component, which serves metrics from object storage to the Querier.
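With the Querier and Store Gateway running, you can verify end-to-end querying before wiring up Grafana. One way (a sketch) is to port-forward the querier service and issue a request through its Prometheus-compatible HTTP API; the dedup parameter below is what merges the HA replicas, though parameter handling can differ between Thanos versions:

kubectl -n monitoring port-forward svc/thanos-querier 9090:9090
# in another shell:
curl 'http://localhost:9090/api/v1/query?query=up&dedup=true'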
Deploying Thanos Ruler

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-rules
  namespace: monitoring
data:
  alert_down_services.rules.yaml: |
    groups:
    - name: metamonitoring
      rules:
      - alert: PrometheusReplicaDown
        annotations:
          message: Prometheus replica in cluster {{$labels.cluster}} has disappeared from Prometheus target discovery.
        expr: |
          sum(up{cluster="prometheus-ha", instance=~".*:9090", job="kubernetes-service-endpoints"}) by (job,cluster) < 3
        for: 15s
        labels:
          severity: critical
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-ruler
  serviceName: thanos-ruler
  template:
    metadata:
      labels:
        app: thanos-ruler
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - rule
            - --log.level=debug
            - --data-dir=/data
            - --eval-interval=15s
            - --rule-file=/etc/thanos-ruler/*.rules.yaml
            - --alertmanagers.url=http://alertmanager:9093
            - --query=thanos-querier:9090
            - "--objstore.config={type: GCS, config: {bucket: thanos-ruler}}"
            - --label=ruler_cluster="prometheus-ha"
            - --label=replica="$(POD_NAME)"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: http
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: http
              path: /-/ready
          volumeMounts:
            - mountPath: /etc/thanos-ruler
              name: config
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - configMap:
            name: thanos-ruler-rules
          name: config
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  ports:
    - port: 9090
      protocol: TCP
      targetPort: http
      name: http
  selector:
    app: thanos-ruler

Now, if you fire up an interactive shell in the same namespace as our workloads and check which pods our thanos-store-gateway service resolves to, you will see something like this:

root@my-shell-95cb5df57-4q6w8:/# nslookup thanos-store-gateway
Server:    10.63.240.10
Address:   10.63.240.10#53

Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.2
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.4
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.2
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.8
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.31.2

root@my-shell-95cb5df57-4q6w8:/# exit

The IPs returned above correspond to our Prometheus pods, thanos-store and thanos-ruler.
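If you do not already have a debug pod around, one way to fire up such an interactive shell is sketched below; the pod name and image are arbitrary choices (anything with nslookup, such as busybox, works):

kubectl -n monitoring run my-shell --rm -it --image=busybox -- sh
# inside the pod:
nslookup thanos-store-gateway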
This can be verified as follows:

$ kubectl get pods -o wide -l thanos-store-api="true"
NAME                     READY   STATUS    RESTARTS   AGE    IP           NODE                              NOMINATED NODE   READINESS GATES
prometheus-0             2/2     Running   0          100m   10.60.31.2   gke-demo-1-pool-1-649cbe02-jdnv   <none>           <none>
prometheus-1             2/2     Running   0          14h    10.60.30.2   gke-demo-1-pool-1-7533d618-kxkd   <none>           <none>
prometheus-2             2/2     Running   0          31h    10.60.25.2   gke-demo-1-pool-1-4e9889dd-27gc   <none>           <none>
thanos-ruler-0           1/1     Running   0          100m   10.60.30.8   gke-demo-1-pool-1-7533d618-kxkd   <none>           <none>
thanos-store-gateway-0   1/1     Running   0          14h    10.60.25.4   gke-demo-1-pool-1-4e9889dd-27gc   <none>           <none>

Deploying Alertmanager

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitoring
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      slack_api_url: "<your_slack_hook>"
      victorops_api_url: "<your_victorops_hook>"

    templates:
    - '/etc/alertmanager-templates/*.tmpl'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m
      receiver: default
      routes:
      - match:
          team: devops
        receiver: devops
        continue: true
      - match:
          team: dev
        receiver: dev
        continue: true

    receivers:
    - name: 'default'

    - name: 'devops'
      victorops_configs:
      - api_key: '<YOUR_API_KEY>'
        routing_key: 'devops'
        message_type: 'CRITICAL'
        entity_display_name: '{{ .CommonLabels.alertname }}'
        state_message: 'Alert: {{ .CommonLabels.alertname }}. Summary:{{ .CommonAnnotations.summary }}. RawData: {{ .CommonLabels }}'
      slack_configs:
      - channel: '#k8-alerts'
        send_resolved: true

    - name: 'dev'
      victorops_configs:
      - api_key: '<YOUR_API_KEY>'
        routing_key: 'dev'
        message_type: 'CRITICAL'
        entity_display_name: '{{ .CommonLabels.alertname }}'
        state_message: 'Alert: {{ .CommonLabels.alertname }}. Summary:{{ .CommonAnnotations.summary }}. RawData: {{ .CommonLabels }}'
      slack_configs:
      - channel: '#k8-alerts'
        send_resolved: true
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.3
        args:
          - '--config.file=/etc/alertmanager/config.yml'
          - '--storage.path=/alertmanager'
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager
      - name: alertmanager
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/metrics'
  labels:
    name: alertmanager
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
  - name: alertmanager
    protocol: TCP
    port: 9093
    targetPort: 9093

This will create our alertmanager deployment, which will deliver all the alerts generated as per the Prometheus rules.
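Before applying the Alertmanager ConfigMap, it can be worth validating the configuration locally with amtool, which ships with Alertmanager (assuming the config above is saved as config.yml):

amtool check-config config.yml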
Deploying Kubestate Metrics

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  namespace: monitoring
  name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/mxinden/kube-state-metrics:v1.4.0-gzip.3
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: k8s.gcr.io/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 150m
            memory: 50Mi
          requests:
            cpu: 150m
            memory: 50Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - /pod_nanny
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics

The Kubestate Metrics deployment is needed to relay some important container metrics which are not natively exposed by the kubelet and hence are not directly available to Prometheus.
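To confirm that kube-state-metrics is exposing data which Prometheus will pick up via the prometheus.io/scrape annotation, you can port-forward the service and look at a few of the kube_ series (a sketch):

kubectl -n monitoring port-forward svc/kube-state-metrics 8080:8080
# in another shell:
curl -s http://localhost:8080/metrics | grep -m 5 '^kube_deployment_status_replicas'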
Deploying Node-Exporter Daemonset

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    name: node-exporter
spec:
  template:
    metadata:
      labels:
        name: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v0.16.0
          securityContext:
            privileged: true
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /

The Node-Exporter daemonset runs a node-exporter pod on each node and exposes very important node-related metrics which can be pulled by the Prometheus instances.

Deploying Grafana

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  serviceName: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
  volumeClaimTemplates:
  - metadata:
      name: grafana-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana

This will create our Grafana stateful set and service, which will be exposed using our Ingress object. We should add Thanos Querier as the datasource for our Grafana deployment. In order to do so:

- Click on Add DataSource.
- Set Name: DS_PROMETHEUS
- Set Type: Prometheus
- Set URL: http://thanos-querier:9090
- Save and Test. You can now build your custom dashboards or simply import dashboards from grafana.net; dashboards #315 and #1471 are good ones to start with.
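If you prefer not to click through the UI, Grafana 5+ can also pick the datasource up from a provisioning file; a minimal sketch is below, assuming you choose to mount it at /etc/grafana/provisioning/datasources/ (this file and mount are not part of the manifests above):

apiVersion: 1
datasources:
  - name: DS_PROMETHEUS
    type: prometheus
    access: proxy
    url: http://thanos-querier:9090
    isDefault: true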
Deploying the Ingress Object

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: grafana.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
  - host: prometheus-0.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-0-service
          servicePort: 8080
  - host: prometheus-1.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-1-service
          servicePort: 8080
  - host: prometheus-2.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-2-service
          servicePort: 8080
  - host: alertmanager.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager
          servicePort: 9093
  - host: thanos-querier.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-querier
          servicePort: 9090
  - host: thanos-ruler.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-ruler
          servicePort: 9090

This is the final piece of the puzzle. It exposes all our services outside the Kubernetes cluster and lets us access them. Make sure you replace <yourdomain> with a domain name which is accessible to you and to which you can point the Ingress Controller's service.

You should now be able to access Thanos Querier at http://thanos-querier.<yourdomain>.com. Make sure deduplication is selected. If you click on Stores, you can see all the active endpoints discovered by the thanos-store-gateway service.

Now add Thanos Querier as the datasource in Grafana and start creating dashboards, for example a Kubernetes cluster monitoring dashboard and a Kubernetes node monitoring dashboard.

Conclusion

Integrating Thanos with Prometheus definitely provides the ability to scale Prometheus horizontally, and since Thanos Querier is able to pull metrics from other querier instances, you can practically pull metrics across clusters and visualize them in a single dashboard. We are also able to archive metric data in an object store, which gives our monitoring system effectively unlimited storage while serving metrics from the object storage itself. A major part of the cost of this set-up can be attributed to the object storage (S3 or GCS); this can be further reduced by applying appropriate retention policies to the buckets.

However, achieving all this requires quite a bit of configuration on your part. The manifests provided above have been tested in a production environment. Feel free to reach out should you have any questions around them.

This article was originally published on https://appfleet.com/blog/ha-kubernetes-monitoring-using-prometheus-and-thanos/.