Overview

When I was working at a startup, our goal was to set up a monitoring solution to track infrastructure components like virtual machines and applications, all while staying within a limited budget and a short timeframe. To achieve this, I chose open-source tools such as Prometheus, Grafana, Mimir, and Nginx. Since we were hosted on Google Cloud, the easiest way to get started with infrastructure and application monitoring using these tools was by deploying them on Google Kubernetes Engine (GKE). However, this guide can easily be adapted to set up monitoring on any cloud platform.

The open-source monitoring stack I selected includes:

- Prometheus: A time-series database (TSDB) that collects and stores metrics from infrastructure and applications.
- Mimir: A scalable, long-term storage backend that extends Prometheus by handling large volumes of time-series data.
- Grafana: A rich visualization and monitoring tool that displays collected metrics in dashboards and supports alerting based on thresholds.

Component Descriptions and Flow:

- IoT Devices, Servers, and Applications: The data sources emitting metrics such as CPU usage, memory utilization, and custom application-specific metrics.
- Prometheus (TSDB): Collects and stores time-series metrics from IoT devices, servers, and applications.
- Grafana Mimir (Scaling Layer): Extends Prometheus by providing scalable, durable storage for large-scale metric workloads.
- Grafana (Visualization): Displays collected metrics in customizable dashboards and graphs and provides alerting capabilities.
- NGINX (Ingress Controller): Acts as a reverse proxy and secure access point to the Grafana and Prometheus user interfaces.
- Kubernetes: Orchestrates the entire monitoring stack as containerized services.
- Google Cloud Platform (GCP): Hosts the Kubernetes cluster and the supporting infrastructure.

Cluster Creation:

Below is the Terraform code to create a private Kubernetes cluster in GCP. A similar approach can be used to create private clusters in other cloud environments as well.
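Before applying any Terraform, make sure the GCP APIs this stack relies on are enabled in the target project. A minimal sketch (the project ID below is the prod-monitoring example used later in this guide; substitute your own):

```bash
# Enable the APIs needed for GKE, Compute (subnets, static IPs), and GCS buckets.
gcloud services enable \
  container.googleapis.com \
  compute.googleapis.com \
  storage.googleapis.com \
  --project=prod-monitoring
```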
Note: In this setup, we are using a shared network from another project, so appropriate IAM permissions and network configurations must be applied.

GitHub code repo: https://github.com/pradeep-gaddamidi/Monitoring

Create the Kubernetes cluster using Terraform:

cluster.tf

```hcl
# google_client_config and kubernetes provider must be explicitly specified like the following.
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${module.gke.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.gke.ca_certificate)
}

# Use selected cluster configuration
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"
  version = "30.2.0"

  project_id         = var.cluster_config[local.env].project_id
  name               = var.cluster_config[local.env].name
  region             = var.cluster_config[local.env].region
  zones              = var.cluster_config[local.env].zones
  network            = var.cluster_config[local.env].network
  network_project_id = var.cluster_config[local.env].network_project_id
  subnetwork         = var.cluster_config[local.env].subnetwork
  ip_range_pods      = "${var.cluster_config[local.env].subnetwork}-pods"
  ip_range_services  = "${var.cluster_config[local.env].subnetwork}-services"

  http_load_balancing        = true
  enable_l4_ilb_subsetting   = true
  network_policy             = false
  horizontal_pod_autoscaling = true
  filestore_csi_driver       = false
  enable_private_endpoint    = true
  enable_private_nodes       = true
  remove_default_node_pool   = true
  master_ipv4_cidr_block     = "172.16.0.0/28"

  node_pools = [
    {
      name               = "node-pool"
      machine_type       = var.cluster_config[local.env].machine_type
      node_locations     = join(",", var.cluster_config[local.env].zones)
      min_count          = 1
      max_count          = 1
      local_ssd_count    = 0
      spot               = false
      disk_size_gb       = var.cluster_config[local.env].disk_size_gb
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      enable_gcfs        = false
      enable_gvnic       = false
      logging_variant    = "DEFAULT"
      auto_repair        = true
      auto_upgrade       = true
      service_account    = google_service_account.gke.email
      preemptible        = false
      initial_node_count = 1
      autoscaling        = false
    },
  ]

  node_pools_oauth_scopes = {
    all = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  cluster_resource_labels = {
    environment   = local.env
    project       = var.cluster_config[local.env].project_id
    resource_type = "gke"
    resource_name = var.cluster_config[local.env].name
    customer      = "all"
  }

  node_pools_labels = {
    all = {}
    default-node-pool = {
      default-node-pool = true
    }
  }

  node_pools_metadata = {
    all = {}
    default-node-pool = {
      node-pool-metadata-custom-value = "node-pool"
    }
  }

  node_pools_taints = {
    all = []
    default-node-pool = [
      {
        key    = "default-node-pool"
        value  = true
        effect = "PREFER_NO_SCHEDULE"
      },
    ]
  }

  node_pools_tags = {
    all = []
    default-node-pool = [
      "default-node-pool",
    ]
  }

  master_authorized_networks = [
    {
      cidr_block   = var.cluster_config[local.env].subnetwork_allow
      display_name = "VPC"
    }
  ]
}

resource "google_compute_subnetwork_iam_member" "network_user_service_account" {
  for_each   = { for user in var.cluster_config[local.env].network_user : user => user }
  project    = var.cluster_config[local.env].network_project_id
  subnetwork = var.cluster_config[local.env].subnetwork
  region     = var.cluster_config[local.env].region
  role       = "roles/compute.networkUser"
  member     = "serviceAccount:${each.value}"
}

resource "google_project_iam_member" "hostServiceAgentUser_service_account" {
  for_each = { for user in var.cluster_config[local.env].hostServiceAgent_user : user => user }
  project  = var.cluster_config[local.env].network_project_id
  member   = "serviceAccount:${each.value}"
  role     = "roles/container.hostServiceAgentUser"
}

resource "google_project_iam_member" "serviceAgent_service_account" {
  for_each = { for user in var.cluster_config[local.env].serviceAgent_user : user => user }
  project  = var.cluster_config[local.env].network_project_id
  member   = "serviceAccount:${each.value}"
  role     = "roles/container.serviceAgent"
}
```

In the Terraform configuration above, we use the publicly available Google Terraform module terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster. This approach lets us leverage well-maintained, community-supported code, avoiding the need to develop and maintain complex infrastructure code from scratch.

The permissions required for the service accounts used in this Terraform configuration are detailed below:

| Role | Why it's needed for GKE |
| --- | --- |
| roles/compute.networkUser | Allows nodes and load balancers to use the subnetwork. |
| roles/container.hostServiceAgentUser | Allows GKE to configure networking (firewalls, IPs, etc.) in host/shared VPCs. |
| roles/container.serviceAgent | Allows the GKE control plane to manage itself and use the necessary GCP APIs. |
Terraform Variables:

Below are the variables I used in the Terraform code.

variables.tf

```hcl
variable "cluster_config" {
  description = "Cluster configuration per environment"
  type = map(object({
    project_id            = string
    name                  = string
    description           = string
    regional              = bool
    region                = string
    zones                 = list(string)
    network               = string
    subnetwork            = string
    network_project_id    = string
    machine_type          = string
    disk_size_gb          = number
    subnetwork_allow      = string
    bucket_names          = list(string)
    host_project          = string
    network_user          = list(string)
    hostServiceAgent_user = list(string)
    serviceAgent_user     = list(string)
    static_ips            = list(string)
    # Add more attributes as needed
  }))
  default = {
    nonprod-mon = {
      project_id            = "nonprod-monitoring"
      name                  = "cluster-nonprod"
      description           = "nonprod cluster"
      regional              = true
      region                = "us-central1"
      zones                 = ["us-central1-a", "us-central1-b", "us-central1-c"]
      network               = "nonprod-vpc"
      subnetwork            = "nonprod-us-central1-sb01"
      subnetwork_allow      = "10.226.0.0/22"
      network_project_id    = "nonprod-networking"
      machine_type          = "e2-custom-4-10240"
      disk_size_gb          = "50"
      bucket_names          = ["mon_blocks_storage", "mon_alertmanager_storage", "mon_ruler_storage"]
      host_project          = "nonprod-networking"
      network_user          = ["service-123456789123@container-engine-robot.iam.gserviceaccount.com", "123456789123@cloudservices.gserviceaccount.com"]
      hostServiceAgent_user = ["service-123456789123@container-engine-robot.iam.gserviceaccount.com"]
      serviceAgent_user     = ["service-123456789123@container-engine-robot.iam.gserviceaccount.com"]
      static_ips            = ["internal-ingress"]
    }
    prod-mon = {
      project_id            = "prod-monitoring"
      name                  = "cluster-prod"
      description           = "prod cluster"
      regional              = true
      region                = "us-central1"
      zones                 = ["us-central1-a", "us-central1-b", "us-central1-c"]
      network               = "prod-vpc"
      subnetwork            = "prod-us-central1-sb01"
      subnetwork_allow      = "10.227.0.0/22"
      network_project_id    = "prod-networking"
      machine_type          = "n2-custom-4-32768"
      disk_size_gb          = "100"
      bucket_names          = ["mon_blocks_storage", "mon_alertmanager_storage", "mon_ruler_storage"]
      host_project          = "prod-networking"
      network_user          = ["service-123456789012@container-engine-robot.iam.gserviceaccount.com", "123456789012@cloudservices.gserviceaccount.com"]
      hostServiceAgent_user = ["service-123456789012@container-engine-robot.iam.gserviceaccount.com"]
      serviceAgent_user     = ["service-123456789012@container-engine-robot.iam.gserviceaccount.com"]
      static_ips            = ["internal-ingress"]
    }
  }
}
```

Terraform state:

A GCS bucket is used to store the Terraform state.

backend.tf

```hcl
terraform {
  backend "gcs" {
    bucket = "environments-state"
    prefix = "terraform/state/gke"
  }
}
```

Terraform workspace:

I am using Terraform workspaces, so make sure to select the right workspace before running the Terraform code. The workspace name must match one of the keys in the cluster_config map (nonprod-mon or prod-mon). For example:

```bash
terraform workspace select nonprod-mon
```

In the main.tf file, I defined the workspace like this:

main.tf

```hcl
locals {
  env = terraform.workspace
}
```

This automatically sets the env local variable to match the current Terraform workspace (e.g., nonprod-mon, prod-mon), allowing the configuration to dynamically adjust based on the selected environment.
Static IPs

We need static IP addresses to configure DNS records, allowing us to access services using domain names such as prometheus.company.com or grafana.company.com.

static_ips.tf

```hcl
data "google_compute_subnetwork" "subnet" {
  name    = var.cluster_config[local.env].subnetwork
  project = var.cluster_config[local.env].network_project_id
  region  = var.cluster_config[local.env].region
}

resource "google_compute_address" "static_ips" {
  for_each     = { for ip in var.cluster_config[local.env].static_ips : ip => ip }
  name         = each.value
  address_type = "INTERNAL"
  region       = var.cluster_config[local.env].region
  subnetwork   = data.google_compute_subnetwork.subnet.self_link
  project      = var.cluster_config[local.env].project_id
}
```

Kubernetes Service Account:

We are using a dedicated service account for the Kubernetes nodes to manage their permissions securely and follow best practices.

service_account.tf

```hcl
resource "google_service_account" "gke" {
  account_id   = "gke-${local.env}"
  project      = var.cluster_config[local.env].project_id
  display_name = "Service account for gke"
}
```

Mimir GCS buckets:

We need Google Cloud Storage (GCS) buckets for Mimir's long-term metric storage, allowing us to efficiently scale and persist large volumes of time-series data.

gcs_buckets.tf

```hcl
module "gcs_buckets" {
  source        = "terraform-google-modules/cloud-storage/google"
  version       = "~> 5.0"
  project_id    = var.cluster_config[local.env].project_id
  location      = "US"
  storage_class = "STANDARD"
  names         = var.cluster_config[local.env].bucket_names
  labels = {
    environment   = local.env
    project       = var.cluster_config[local.env].project_id
    resource_type = "gcs"
    customer      = "all"
  }
}

resource "google_storage_bucket_iam_binding" "buckets" {
  for_each = { for bucket in var.cluster_config[local.env].bucket_names : bucket => bucket }
  bucket   = each.value
  role     = "roles/storage.objectAdmin"
  members = [
    "serviceAccount:${google_service_account.gke.email}"
  ]
  depends_on = [module.gcs_buckets]
}
```

Namespaces (Kubernetes):

Once the cluster is set up, create the following namespaces (see the sketch after this list):

- Prometheus
- Mimir
- Grafana
- Nginx-Ingress
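A quick way to create them is with kubectl. Note that the helm install commands later in this guide place Prometheus and Grafana together in a monitoring namespace, Mimir in mimir, and the NGINX controller in ingress; the sketch below follows that layout, so adjust the names if you prefer one namespace per component, as long as they match the -n flags passed to Helm.

```bash
# Namespaces matching the -n flags used by the helm install commands below.
kubectl create namespace monitoring   # Prometheus and Grafana
kubectl create namespace mimir        # Grafana Mimir
kubectl create namespace ingress      # NGINX Ingress Controller
```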
Installation (Helm Charts):

Use Helm charts to install the various monitoring software in their respective namespaces. Below is an example for Prometheus, but you can apply the same approach to install the other software such as Grafana, Mimir, and the NGINX Ingress Controller.

- Prometheus - https://github.com/prometheus-community/helm-charts
- Grafana - https://github.com/grafana/helm-charts
- Mimir - https://grafana.com/docs/helm-charts/mimir-distributed/latest/get-started-helm-charts/
- Nginx controller - https://github.com/kubernetes/ingress-nginx/tree/main/charts/ingress-nginx

Helm Commands:

First, add the Prometheus Helm repository and update it:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Next, pull the Prometheus chart:

```bash
helm pull prometheus-community/prometheus --untar
```

This will create a prometheus/ directory in your current working directory, containing the chart contents. You can modify the default prometheus/values.yaml file before installing it, allowing you to set custom configurations such as the admin password, persistence settings, and service type.

Now, you can install Prometheus with the custom values_prod.yaml file:

```bash
helm install prometheus ./prometheus -f prometheus/values_prod.yaml -n monitoring
```

Similarly, you can install the other components:

```bash
helm install grafana ./grafana -f grafana/values_prod.yaml -n monitoring
helm install mimir ./mimir -f mimir/values_prod.yaml -f mimir/capped-small.yaml -n mimir
helm install nginx-ingress ./nginx-ingress -f nginx/values_prod.yaml -n ingress
```
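Before moving on to configuration, it's worth confirming that the releases deployed cleanly. A minimal check, assuming the namespaces used above:

```bash
# Each release should show STATUS "deployed".
helm list -n monitoring
helm list -n mimir
helm list -n ingress

# All pods should eventually reach Running (or Completed for jobs).
kubectl get pods -n monitoring
kubectl get pods -n mimir
kubectl get pods -n ingress
```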
Configuration (values.yaml files):

Ingress:

An Ingress controller is required to manage Ingress resources. Simply creating an Ingress resource will have no effect unless there is an Ingress controller in place. While there are many Ingress controllers available, including GKE's built-in Ingress, I've chosen the NGINX Ingress Controller for various reasons.

Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined within the Ingress resource. In this tutorial, we are using an internal IP to expose the services. Make sure to configure the following in your Helm values.yaml to ensure proper routing and access:

Helm values.yaml

```yaml
controller:
  service:
    internal:
      enabled: true
      annotations:
        # Create internal LB. More information: https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing
        # For GKE versions 1.17 and later
        networking.gke.io/load-balancer-type: "Internal"
        # For earlier versions
        # cloud.google.com/load-balancer-type: "Internal"
        # Any other annotation can be declared here.
```

Also, provide the static internal IP you created earlier via Terraform in the loadBalancerIP field, like so:

Helm values.yaml

```yaml
loadBalancerIP: "10.x.x.x"
```

Once the NGINX Ingress Controller is installed, it will create a cloud load balancer with your cloud provider (e.g., GCP). Afterward, you need to create Ingress resources to route traffic to the appropriate destinations, such as Grafana and Prometheus.

The Ingress spec contains all the necessary information to configure a load balancer or proxy server. To ensure the traffic is routed correctly, you must include either the ingressClassName: nginx spec field or the kubernetes.io/ingress.class: nginx annotation in your Ingress resources.

Ingress resource:

my-ingress-prod.yaml

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: company-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
    - host: prometheus.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-server
                port:
                  number: 80
```

What it does:

- Ingress Controller: Uses the nginx Ingress controller to handle the routing of external traffic to the internal services.
- Routing Rules:
  - grafana.company.com: Traffic directed to grafana.company.com will be routed to the grafana service, specifically to port 80.
  - prometheus.company.com: Traffic directed to prometheus.company.com will be routed to the prometheus-server service, specifically to port 80.
- Path Handling: Both routes use path: /, meaning any URL that starts with / will be forwarded to the respective services (Grafana or Prometheus).

This configuration ensures that incoming traffic to the specified domains is directed to the correct service inside your Kubernetes cluster, based on the hostname and path.
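Apply the manifest and confirm the NGINX controller has picked it up and exposed it on the internal load-balancer address (a sketch, assuming the file and resource names above):

```bash
kubectl apply -f my-ingress-prod.yaml

# The ADDRESS column should show the internal load balancer IP (10.x.x.x).
kubectl get ingress company-ingress -n monitoring
```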
Prometheus:

If you're using the pull model, Prometheus needs to collect metrics from your targets. To configure this, you can set up the scrape configuration as follows in your values.yaml file:

Helm values.yaml

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
```

If Prometheus requires scaling, it needs to forward the metrics to Mimir for long-term storage. You can configure Prometheus to send a copy of the metrics to Mimir by using the remoteWrite section in the Prometheus Helm values.yaml file, like so:

Helm values.yaml

```yaml
remoteWrite:
  - url: http://company-mimir-nginx.mimir.svc.cluster.local:80/api/v1/push
```

The idea behind using Mimir is to provide long-term storage for Prometheus metrics. This setup allows you to scale Prometheus as needed while avoiding a single point of failure.

I recommend enabling persistent volumes (PVC) for Prometheus pods. This ensures that your data is not lost when remoteWrite is enabled, or, if you're only running a single Prometheus instance, gives you a fallback in case Mimir encounters issues. Enabling persistent storage in the Helm values.yaml file ensures that the data is retained through pod restarts.

Helm values.yaml

```yaml
persistentVolume:
  ## If true, Prometheus server will create/use a Persistent Volume Claim
  ## If false, use emptyDir
  ##
  enabled: true
  accessModes:
    - ReadWriteOnce
  ## Prometheus server data Persistent Volume mount root path
  ##
  mountPath: /data
  ## Prometheus server data Persistent Volume size
  ##
  size: 500Gi
  # storageClass: "-"
  storageClass: "persistent-disk-rwo"
```

Set the retention time carefully (--storage.tsdb.retention.time) in the Helm values.yaml:

Helm values.yaml

```yaml
## Prometheus data retention period (default if not specified is 15 days)
##
retention: "90d"
```

Adjust the above values as per your needs.
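Once Prometheus is running, you can sanity-check scraping by port-forwarding the server service and querying its HTTP API; the service name and port below match the prometheus-server backend referenced in the Ingress resource earlier.

```bash
# Forward the Prometheus server service to localhost.
kubectl -n monitoring port-forward svc/prometheus-server 9090:80 &

# 'up' should return 1 for every healthy scrape target.
curl -s 'http://localhost:9090/api/v1/query?query=up'
```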
Mimir:

Grafana Mimir is an open-source, horizontally scalable, multi-tenant time-series database and monitoring platform. Mimir is fully compatible with Prometheus, meaning it supports the Prometheus data model, query language (PromQL), and scraping mechanism. It can serve as a backend to store Prometheus metrics, enabling you to scale beyond what a single Prometheus server can handle. With efficient data storage and compression techniques, Mimir helps reduce the cost of storing long-term metric data.

Grafana Mimir is useful when you need to:

- Store large volumes of time-series data long-term.
- Scale Prometheus beyond a single instance.
- Use isolated storage with multi-tenancy support.
- Ensure distributed, fault-tolerant metric storage.

Grafana Mimir's architecture is based on the principles of distributed systems, using components such as:

- Distributor: Receives and writes data from Prometheus instances or any compatible scraper.
- Ingester: Stores and processes incoming data. Data is held temporarily in the ingester until it is flushed to long-term storage.
- Store Gateway: Handles retrieving data from persistent storage and serves queries.
- Query Frontend: Manages query execution and routing, ensuring that queries are distributed across the available Mimir instances.
- Storage Backend: In this tutorial, Mimir uses GCS as the storage backend.

The GCS buckets used by Mimir are mon_blocks_storage, mon_alertmanager_storage, and mon_ruler_storage, which we configured in our Terraform code.
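The credentials referenced in the next snippet are a GCP service-account key with access to these buckets. You can generate a key and confirm the buckets are reachable roughly as follows (the service-account email is a placeholder; use an account that holds roles/storage.objectAdmin on the buckets):

```bash
# Create a JSON key for the service account Mimir will use (placeholder email).
gcloud iam service-accounts keys create mimir-gcs-key.json \
  --iam-account=gke-prod-mon@prod-monitoring.iam.gserviceaccount.com

# Confirm the Terraform-created buckets exist and are reachable.
gsutil ls gs://mon_blocks_storage gs://mon_alertmanager_storage gs://mon_ruler_storage
```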
In the Helm values.yaml file, configure the GCS buckets for storage along with the necessary credentials to access them. This allows Mimir to interact with Google Cloud Storage for long-term metric storage.

Helm values.yaml

```yaml
# -- Additional structured values on top of the text based 'mimir.config'. Applied after the text based config is evaluated for templates. Enables adding and modifying YAML elements in the evaluated 'mimir.config'.
# To modify the resulting configuration, either copy and alter 'mimir.config' as a whole or use the 'mimir.structuredConfig' to add and modify certain YAML elements.
structuredConfig:
  limits:
    out_of_order_time_window: 1h
    max_label_names_per_series: 100
  common:
    storage:
      backend: gcs
      gcs:
        service_account: |
          {
            "type": "service_account",
            "project_id": "prod-monitoring",
            "private_key_id": "50885800",
            "private_key": "xxxxx-----PRIVATE KEY-----\n",
            "client_email": "gke-mon@radar-prod-monitoring.iam.gserviceaccount.com",
            "client_id": "108488885",
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://oauth2.googleapis.com/token",
            "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
            "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/gke-prod%40prod-monitoring.iam.gserviceaccount.com",
            "universe_domain": "googleapis.com"
          }
  blocks_storage:
    backend: gcs
    gcs:
      bucket_name: mon_blocks_storage
  alertmanager_storage:
    gcs:
      bucket_name: mon_alertmanager_storage
  ruler_storage:
    gcs:
      bucket_name: mon_ruler_storage
```

Based on your requirements, use either the capped-small.yaml or capped-large.yaml values file to assign compute resources to the Mimir components. These files allow you to configure the CPU and memory limits for Mimir depending on the scale of your deployment.

Additionally, Mimir has an active community on Slack where you can seek help from other members while setting it up in your cluster.

Grafana:

In Grafana, add Mimir as a data source for long-term metric storage. If you're using only a single Prometheus instance, you can also add Prometheus as a data source for backup purposes. Once the data sources are set up, you can visualize the metrics, configure dashboards, and create alerts in Grafana.
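Data sources can be added through the Grafana UI, or scripted against Grafana's HTTP API as in the rough sketch below. The Mimir URL reuses the in-cluster nginx service from the remoteWrite example and assumes Mimir's default /prometheus API prefix; the Grafana hostname and credentials are placeholders.

```bash
# Register Mimir as a Prometheus-type data source via the Grafana HTTP API.
curl -s -X POST "http://grafana.company.com/api/datasources" \
  -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Mimir",
        "type": "prometheus",
        "access": "proxy",
        "url": "http://company-mimir-nginx.mimir.svc.cluster.local:80/prometheus"
      }'
```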
Additionally, enable Persistent Volume Claims (PVC) for Grafana to ensure that data is not lost if the pod restarts. This helps retain the configuration and data through pod lifecycle changes.

Helm values.yaml

```yaml
## Enable persistence using Persistent Volume Claims
## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
persistence:
  type: pvc
  enabled: true
  storageClassName: persistent-disk-rwo
  accessModes:
    - ReadWriteOnce
  size: 10Gi
```

DNS

Once everything is installed and configured, configure the DNS records (e.g., prometheus.company.com and grafana.company.com) to point to the static internal IP (10.x.x.x) you created earlier with Terraform.

After completing this configuration, you should be able to access the metrics in Grafana. From there, you can visualize the data, create custom dashboards, and set up alerts. For more details on creating dashboards and visualizing data in Grafana, refer to the Grafana documentation: https://grafana.com/docs/grafana/latest/dashboards/

Good luck! Feel free to connect with me on LinkedIn.