I thought I had observability, but it wasn't enough. The picture above says to me that you should know your infrastructure and heed your own advice.

Hey brilliant human, so, as it turns out, a critical environment experienced an issue. How can you prevent issues like this before they affect clients? I'll tell you a little later in this article.

If you are reading this, chances are:
- You have or had a similar DNS issue or DNS symptoms
- You are hardworking and you would like to learn how to avoid stepping on a rake

If you don't have time to read, and you want to see the solution right away, like how you scroll through all the answers on Stack Overflow in search of the marked solution, then read this detailed AWS DNS troubleshooting guide.

One more spoiler: Kubernetes' official documentation refers to using NodeLocal DNSCache. It's the wrong way and will not resolve the root cause.

This article will help you if you have one of the following questions or problems:
- Kubernetes DNS lookup is slow
- How to change the ndots default value of DNS in Kubernetes
- External DNS lookup fails in pods
- Kubernetes kube-dns (or CoreDNS) best practices
- Kubernetes adds the wrong suffix cluster.local at the end of the query
- CoreDNS had thousands of RCode responses
- CoreDNS had high load without any reason at first glance

Most importantly, apps could not resolve DNS queries and could not pull critical data.

RCodes: When CoreDNS encounters an error, it returns an rcode, a standard DNS error code. Errors like NXDomain and FormErr can reveal a problem with the requests CoreDNS is receiving, while a ServFail error could indicate an issue with the function of the CoreDNS server itself.

So, below, I'll tell you about 3 topics:
- Solving DNS issues for Kubernetes (but more about DNS with Linux really)
- DNS best practices
- How to prevent issues like this before they affect clients

The specific DNS issue in detail

An issue has been reported that some apps can't connect to a service.
It could be an external service or an internal service. We checked these services, and everything worked fine. So this was very strange behavior, and not a persistent issue. Looking at all errors in the centralized log system at that specific time, we found a correlation with the DNS service (CoreDNS). We also found problems with that service at that time.

CoreDNS metrics during this incident could look like below.
[Figure: number of queries]
[Figure: number of DNS queries answered with an error]
[Figure: CPU usage]

The DNS config inside the pods looks like below.

/etc/resolv.conf:
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

NDOTS: ndots:n sets a threshold for the number of dots that must appear in a name given to res_query() before an initial absolute query is made. The default for n is 1, meaning that if there are any dots in a name, the name is tried first as an absolute name before any search list elements are appended to it.

The error logs look like the following:
[INFO] 192.168.3.71:33238 - 36534 "A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000473434s
[INFO] 192.168.3.71:57098 - 43241 "A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000066171s
[INFO] 192.168.3.71:51937 - 15588 "A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000137489s
[INFO] 192.168.3.71:52618 - 14916 "A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 41 0.001248388s
[INFO] 192.168.3.71:51298 - 65181 "A IN amazon.com. udp 28 false 512" NOERROR qr,rd,ra 106 0.001711104s

As you can see, the message level is info, which is why we did not see these messages before. You can see that the DNS names are absolutely wrong. NXDOMAIN means that the domain record wasn't found, and NOERROR means that the domain record was found.

CoreDNS integrates with Kubernetes via the Kubernetes plugin, or with etcd with the etcd plugin.

Why is CoreDNS searching external domains in the local domains, and why are there 4 new domain suffixes like .default.svc.cluster.local? This is the logic of these queries:
- The app makes a DNS query, like secretsmanager.ca-central-1.amazonaws.com
- The resolver in the pod decides where this domain should be searched; CoreDNS only answers the queries it receives
- k8s is based on Linux, so that means the domain is searched in the internal (local) system or in the external system
- So, if the domain name has FQDN format, an absolute query (external system) should be made; otherwise, the search goes to the local system
- BUT, we have the ndots option

How the ndots option changes this:
- If the domain name is not an FQDN, search in the internal system. For my-internal-service, CoreDNS returns an internal service IP address in the same namespace.
- If a domain name has FQDN format, check the ndots option and whether there is a dot at the end of the domain name.
- If the domain name has fewer dots than the ndots option, search this domain first in the local domains defined in the search directive. In our example, secretsmanager.ca-central-1.amazonaws.com has only 3 dots, whereas the ndots option has value 5. So the first searches are:
secretsmanager.ca-central-1.amazonaws.com.default.svc.cluster.local
secretsmanager.ca-central-1.amazonaws.com.svc.cluster.local
secretsmanager.ca-central-1.amazonaws.com.cluster.local
secretsmanager.ca-central-1.amazonaws.com.ec2.internal

And only after these queries is an absolute query made:
secretsmanager.ca-central-1.amazonaws.com

- If the domain name has more than 4 dots or has a dot at the end, make an absolute query.
- If the domain name is an absolute FQDN (dot at the end), make an absolute query.

Fully Qualified Domain Name (FQDN): The FQDN consists of two parts: the hostname and the domain name. For example, an FQDN for a hypothetical mail server might be mymail.somecollege.edu. The hostname is mymail, and the host is located within the domain somecollege.edu.

A dot at the end of the domain name means that this domain is an absolute FQDN. E.g., "amazon.com." means: make only an absolute query.

Okay, we found the root cause. There are 2 questions:
- Why were there connection errors in the app?
- How to fix it?

Why were there connection errors in the app?

As we found, CoreDNS received an additional 4 queries for each lookup from an app. At peaks of high load, CoreDNS doesn't have enough resources. We scaled up the number of pods, and for some time we forgot about this issue. After we saw this issue again, we added NodeLocal as mentioned in the official documentation, but this did not resolve the root cause. After some time, the number of queries grows, and we will see this issue again. That is why we should fix the root cause.

How to fix it?

We should reduce the value of the ndots option in resolv.conf. Well, thank goodness k8s allows changing this option in the pod manifest (because if you are using managed Kubernetes, you do not always have permission to change options like this).

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"

So, /etc/resolv.conf in our pod will look like:
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:2

After the root cause is fixed, we can improve the DNS settings with the recommendations below.

DNS for k8s, PRO tips and tricks

If you are using a managed k8s service, to scale up DNS, use HPA (or something like KEDA, Kubernetes Event-driven Autoscaling) instead of manually changing settings, because at any time your provider could overwrite your changes.

HPA example:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  maxReplicas: 20
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

The option above should be enough to not have to use NodeLocal. But if you need to change options in the DNS service, and you can't make these changes (in the managed k8s case), you can install NodeLocal to your k8s cluster. This option adds one more DNS cache point (this will remove load from the main DNS service) and lets you set up the settings as you want.

You should not change the nameserver option in your pods if you have dnsPolicy ClusterFirst.

If one of the CoreDNS pods in your k8s system returns an error, it does not mean that all of them are unhealthy; just retry the DNS query. We can set this up at the system level.

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: attempts
        value: "3"

If you don't have enough time to make a patch for your env, add a "dot" at the end of your DNS query. Then your domain name will be an absolute FQDN, for example, "secretsmanager.ca-central-1.amazonaws.com."

To propagate these changes to all nodes, you should change the kubelet settings, using the resolvConf option for kubelet settings.

Why didn't I know about the problem at first?

What's your problem? Short answer: I skipped the following steps:
- Not enough attention was paid to setting up critical error notifications
- An audit of all critically important components was not carried out in order to check the monitoring settings later
- There was no "preflight checklist" like this checklist

A little bit more about this methodology. As you know, DevOps is not a person, it's a set of best practices and approaches that involve implementing continuous integration and deployment ...

- We have logs, metrics, and traces, often known as the three pillars of observability (as described in Distributed Systems Observability).
- All logs scraped from the current env are sent to a centralized log system like ELK, New Relic, or DataDog. So in one place you can find and analyze almost any issue of your app or system service.
- App traces are sent to the same place as logs and help to find the sequence of events and filter, step by step, all apps related to a specific event/request.
- All metrics from the current env are sent to a centralized monitoring system and correlated with logs and traces.
- We have alerts and notifications about some monitoring issues, e.g., too many pod restarts, not enough disk capacity, etc.

My face, when I first saw a critical issue on the critical env, while I thought all was ok. Pic from "They Live" (1988).

How to prevent issues like this before they affect clients?

But logs and metrics were not enough. Why? We had logs, metrics, traces, we had errors in our logs, but who is continuously watching all logs (10k log records every day - it's ok)?
We dig into logs, metrics, etc. when we are faced with an issue. But it's very expensive when an issue reaches your clients. That is why we should predict problems by their symptoms, and prevent issues before they affect something critical.

In other words: it happened because not enough attention was paid to setting up critical error notifications. DNS is one of the critical systems in the Kubernetes infrastructure. So, as I mentioned above, we should make an env checklist for the critical components and go through it step by step. Some points of this checklist should be like: check that we have logs, metrics, and notifications for the critical issues, and of course verify it in the real infrastructure.

As a result, the following steps should be done:
- Fix the issue
- Make a postmortem (example: Postmortem of database outage of January 31)
- Make a checklist for the critical components, especially for important migrations, updates, and deployments
- Use this checklist at least once in the existing infrastructure, or use it before events like a migration or an update

List of useful links
- Pod's DNS Config
- resolv.conf(5) - Linux man page
- How do I troubleshoot DNS failures with Amazon EKS?
- Fixing EKS DNS
- Production Ready EKS CoreDNS Configuration
- Racy conntrack and DNS lookup timeouts

So, why did I write this story today? Everyone makes mistakes (ex: GitLab.com database incident); those who do nothing and do not learn from their mistakes do not achieve success!

Good luck! I hope this helps. Please feel free to add any comments and connect with me: https://kazakov.xyz
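A footnote on the ndots logic from the section about the search suffixes: the order in which names get tried can be sketched in a few lines of Python. This is only a simplified model of a glibc-style resolver, not CoreDNS code and not a real resolver. The function name candidate_queries is made up for illustration, and the sketch only lists candidate names in order; it sends no DNS traffic and ignores options such as attempts.

```python
def candidate_queries(name, search, ndots=1):
    """List the names a glibc-style resolver would try, in order.

    Simplified model (an assumption, not a real resolver): a real
    resolver stops at the first name that resolves successfully.
    """
    # A trailing dot marks an absolute FQDN: the search list is skipped.
    if name.endswith("."):
        return [name]

    absolute = name + "."
    expanded = [f"{name}.{suffix}." for suffix in search]

    # Enough dots: try the name as-is first, then the search list.
    if name.count(".") >= ndots:
        return [absolute] + expanded

    # Too few dots: walk the search list first, absolute query last.
    return expanded + [absolute]


if __name__ == "__main__":
    # Search list and ndots values taken from the resolv.conf in this article.
    search = [
        "default.svc.cluster.local",
        "svc.cluster.local",
        "cluster.local",
        "ec2.internal",
    ]
    for ndots in (5, 2):
        print(f"ndots:{ndots}")
        for query in candidate_queries(
            "secretsmanager.ca-central-1.amazonaws.com", search, ndots
        ):
            print("  ", query)
```

With ndots:5 the name with 3 dots goes through all four search suffixes before the absolute query, which is the pattern seen in the CoreDNS log. With ndots:2 the absolute query comes first, so a name that resolves never triggers the extra lookups.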

I Thought I Had Observability - My Short DNS Story
