In Kubernetes, Ingress resources are commonly used to route external traffic to services inside the cluster. Ingress is essential for exposing a service, but there are scenarios in which you do not want search engines to index its content, for example when it is a development environment. This blog post walks you through blocking indexing of a site behind a Kubernetes Ingress by serving a robots.txt file that tells search engine bots not to crawl your content.

Prerequisites

To follow the tutorial, you should have a basic grasp of core Kubernetes objects, Ingress resources, and the official HAProxy ingress controller. You will also need access to a Kubernetes cluster and the permissions required to make configuration changes.

This article assumes that the HAProxy ingress controller is set as the default controller. If it is not, you must add the ingressClassName option to all Ingress code examples.

Step 1: Create an Ingress Kubernetes Resource

In the first part of our journey, we'll set up a small Ingress resource to expose our service outside of the Kubernetes cluster. Pay attention: for the time being, all web crawlers will have access to the service. To apply the code below, use the command kubectl apply -f ingress.yaml.

# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
 name: hackernoon-close-site-indexing-haproxy-example
 annotations:
   cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
 rules:
   - host: hackernoon-close-site-indexing-haproxy-example.referrs.me
     http:
       paths:
         - path: /
           pathType: Prefix
           backend:
             service:
               name: hackernoon-close-site-indexing-haproxy-example-service
               port:
                 number: 80
 tls:
   - hosts:
       - hackernoon-close-site-indexing-haproxy-example.referrs.me
     secretName: hackernoon-close-site-indexing-haproxy-example

Step 2: Modify the Ingress Configuration

The robots.txt file is used to control how search engines crawl a site. It specifies which URLs search engine crawlers may access on your website. The most basic file that restricts access to the whole web service looks like this:

User-agent: *
Disallow: /

With HAProxy, you do not need to add this file to your web server or website, because HAProxy can return the response itself. This is done with the following configuration, which belongs in the backend section for the specific group of servers:

acl robots_path path_beg /robots.txt
http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path

In Kubernetes, annotations control how the HAProxy frontend and backend configuration is adjusted for a single Ingress resource. The full list of HAProxy annotations can be found in the official documentation on GitHub.

In our case, we need to use haproxy.org/backend-config-snippet with the HAProxy snippet that blocks indexing. To do this, open your Ingress resource YAML file and add the following annotation to the metadata section:

# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
 name: hackernoon-close-site-indexing-haproxy-example
 annotations:
   …
   haproxy.org/backend-config-snippet: |
     acl robots_path path_beg /robots.txt
     http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
spec:
  ...

Step 3: Apply the Configuration Changes

After changing the Ingress YAML file, save it and apply it to the Kubernetes cluster with the kubectl command: kubectl apply -f ingress.yaml. The Ingress controller will detect the changes and update the HAProxy configuration accordingly.

Step 4: Verify the Configuration

Request robots.txt to confirm that indexing prevention is working properly. No file is stored anywhere: HAProxy builds the response from the annotation you supplied. Take the external IP or domain associated with your Ingress resource and append /robots.txt to the URL. Example:

$ curl hackernoon-close-site-indexing-haproxy-example.referrs.me/robots.txt
User-agent: *
Disallow: /

As we can see, the response contains robots.txt rules that disallow all crawling.

Step 5: Test Indexing Prevention

To verify that search engines are not indexing your site, you can search for your website on popular search engines. Keep in mind that search results take some time to reflect changes, so the indexing status may not be updated right away.

Conclusion

Annotations make it easy to block search engine indexing when using the HAProxy Kubernetes Ingress controller. By adding the appropriate annotation to your Ingress resource, you can tell search engine bots not to crawl and index your website's content. A similar approach can be used with other ingress controllers, such as NGINX, Traefik, and others. A similar annotation can also be used for Kubernetes Gateway API resources, which are actively replacing Ingresses.

As a final note, robots.txt is a time-honored way for website creators to specify whether or not their sites should be crawled by various bots. However, it turns out that AI crawlers from large language model (LLM) companies frequently ignore the contents of robots.txt and crawl your site regardless. To guard against this, use password protection, noindex, or enterprise load balancer features like HAProxy AI-crawler, which may also be configured as a Kubernetes annotation.
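The noindex option mentioned in the final note can be combined with the robots.txt response in the same annotation. The fragment below is a minimal sketch, not part of the original setup: it assumes the controller accepts http-response directives inside haproxy.org/backend-config-snippet (worth confirming in the annotation documentation for your controller version), and the X-Robots-Tag header value is an illustrative choice.

```yaml
# file: ingress.yaml (illustrative fragment; the X-Robots-Tag line is an assumption)
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hackernoon-close-site-indexing-haproxy-example
  annotations:
    # First two snippet lines: the deny-all robots.txt response from Step 2.
    # Last snippet line (assumption): mark every backend response as non-indexable.
    haproxy.org/backend-config-snippet: |
      acl robots_path path_beg /robots.txt
      http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
      http-response set-header X-Robots-Tag "noindex, nofollow"
# spec: unchanged from Step 1
```

If the assumption holds, requesting any page with curl -I after applying the file should show the X-Robots-Tag header in the response. Crawlers that honor it will drop the page from their index even if they reached it through an external link.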
How to Block Search Engine Indexing in Kubernetes with HAProxy
