I Thought I Had Observability - My Short DNS Story

Written by kksudo | Published 2022/11/15
Tech Story Tags: devops | dns | k8s | linux | kubernetes

TL;DR: CoreDNS had a high load with no apparent reason at first glance. Apps could not resolve DNS queries and could not pull critical data. The Kubernetes official documentation refers to using NodeLocal DNSCache (https://kubernetes.io/tasks/administer-cluster/nodelocaldns/). It's the wrong way; it will not resolve the root cause. Kubernetes adds the wrong URL suffix cluster.local at the end of the query.

I thought I had observability, but it wasn't enough.

The picture above says to me that you should know your infrastructure and heed your own advice.


Hey brilliant human - so, as it turns out - a critical environment experienced an issue.

How can you prevent issues like this before they affect clients? I'll tell you a little later in this article.

If you are reading this, chances are:

  1. You have or had a similar DNS issue or DNS symptoms
  2. You are hardworking and would like to learn how not to step on a rake

If you don't have time to read and want to see the solution right away, like when you scroll through all the answers on Stack Overflow looking for the accepted one, then read this detailed AWS DNS troubleshooting guide.

One more spoiler: Kubernetes’ official documentation refers to using NodeLocal DNSCache. It’s the wrong way and will not resolve the root cause.

This article will help you if you have one of the following questions or problems:

  1. Kubernetes DNS lookup is slow
  2. How to change ndots default value of DNS in Kubernetes
  3. External DNS lookup fails in pods
  4. Kubernetes kube-dns (or CoreDNS) best practices
  5. Kubernetes adds the wrong URL suffix cluster.local at the end of a query
  6. CoreDNS returned thousands of error RCode responses
  7. CoreDNS had a high load with no apparent reason at first glance
  8. Most importantly, apps could not resolve DNS queries and could not pull critical data

RCodes

When CoreDNS encounters an error, it returns an rcode—a standard DNS error code.

Errors like NXDomain and FormErr can reveal a problem with the requests CoreDNS is receiving, while a ServFail error could indicate an issue with the function of the CoreDNS server itself.
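
If you want to see these rcodes in your own cluster, CoreDNS can log them per query and export them as metrics. Below is a minimal sketch of a CoreDNS ConfigMap with the errors, log, and prometheus plugins enabled (the Corefile your managed cluster ships will differ, and the exact metric names depend on the CoreDNS version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors                 # print errors (e.g. ServFail) to stdout
        log                    # log every query together with its rcode (NOERROR, NXDOMAIN, ...)
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153       # exposes per-rcode response counters for your monitoring system
        forward . /etc/resolv.conf
        cache 30
    }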

So, below, I’ll tell you about 3 topics:

  • Solving DNS issues

  • DNS best practices for Kubernetes (but more about DNS with Linux really)

  • How to prevent issues like this before they affect clients


The specific DNS issue in detail

An issue was reported that some apps couldn't connect to a service, either an external or an internal one. We reviewed these services and everything worked fine, so the behavior was strange and not persistent. Looking at all errors in the centralized log system at that specific time, we found a correlation with the DNS service (CoreDNS); we also found problems with that service at that time.

The CoreDNS metrics during this incident looked like the graphs below: the number of DNS queries, the number of queries whose answers returned an error, and the CPU usage.

The DNS config inside the pods looks like below.

/etc/resolv.conf:

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

NDOTS: n

Sets a threshold for the number of dots that must appear in a name given to res_query() before an initial absolute query is made. The default for n is 1, meaning that if there are any dots in a name, the name is tried first as an absolute name before any search list elements are appended to it.

The error logs look like the following:

[INFO] 192.168.3.71:33238 - 36534 "A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000473434s
[INFO] 192.168.3.71:57098 - 43241 "A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000066171s
[INFO] 192.168.3.71:51937 - 15588 "A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000137489s
[INFO] 192.168.3.71:52618 - 14916 "A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 41 0.001248388s
[INFO] 192.168.3.71:51298 - 65181 "A IN amazon.com. udp 28 false 512" NOERROR qr,rd,ra 106 0.001711104s

As you can see, the message level is info, which is why we did not notice these messages before. You can also see that the DNS names being queried are clearly wrong.

NXDOMAIN means that the domain record wasn't found, and NOERROR means that the domain record was found.

CoreDNS integrates with Kubernetes via the Kubernetes plugin, or with etcd with the etcd plugin.

This is the logic of these queries.

Why is CoreDNS searching for external domains among the local domains, and why are there 4 new domain suffixes like .default.svc.cluster.local?

  1. The app makes a DNS query, like secretsmanager.ca-central-1.amazonaws.com

  2. CoreDNS checks where it should search this domain

    k8s is based on Linux, which means the resolver decides whether to look this domain up in the local (internal) system or in the external system.

    So, if the domain name is in FQDN format, CoreDNS should make an absolute query (external system); otherwise it searches in the local system.

    BUT, we have an ndots option.

    1. If the domain name is not an FQDN, search in the internal system.

      For example, for my-internal-service, CoreDNS returns the IP address of the internal service with that name in the same namespace.

    2. If the domain name has FQDN format, check the ndots option and whether there is a dot at the end of the domain name.

    3. If the domain name has fewer dots than the ndots option, search for this domain first in the local domains defined in the search directive.

      In our example, secretsmanager.ca-central-1.amazonaws.com has only 3 dots, whereas the ndots option has a value of 5.

      So first search:

      secretsmanager.ca-central-1.amazonaws.com.default.svc.cluster.local
      secretsmanager.ca-central-1.amazonaws.com.svc.cluster.local
      secretsmanager.ca-central-1.amazonaws.com.cluster.local
      secretsmanager.ca-central-1.amazonaws.com.ec2.internal
      

      And only after these queries does it make an absolute query: secretsmanager.ca-central-1.amazonaws.com

    4. If the domain name has more than 4 dots or has a dot at the end, make an absolute query.

  3. If the domain name is an FQDN with a dot at the end (an absolute domain name), make an absolute query right away.

FQDN: Fully Qualified Domain Name

The FQDN consists of two parts: the hostname and the domain name. For example, an FQDN for a hypothetical mail server might be mymail.somecollege.edu. The hostname is mymail, and the host is located within the domain somecollege.edu.

A dot at the end of the domain name means that this domain is an absolute FQDN.

E.g., “amazon.com.” means: make only an absolute query.

Okay, we found the root cause. There are 2 questions:

  1. Why were there connection errors in the app?
  2. How to fix it?

Why were there connection errors in the app?

As we found, CoreDNS made 4 additional queries for every query from an app. At the peaks of this load, CoreDNS didn't have enough resources. We scaled up the number of pods and forgot about this issue for some time.

When we saw this issue again, we added NodeLocal as mentioned in the official documentation, but this did not resolve the root cause. After some time the number of queries grows again, and the issue comes back. That is why we should fix the root cause.

How to fix it?

We should reduce the value of the ndots option in resolv.conf.

Well, thank goodness k8s allows changing this option in the pod manifest (because if you are using managed Kubernetes, you do not always have permission to change options like this).

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"

So, /etc/resolv.conf in our pod will look like:

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:2
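
The same dnsConfig block works for workloads managed by controllers; as a sketch, a hypothetical Deployment would carry it in the pod template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server          # hypothetical name
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      containers:
        - name: test
          image: nginx
      dnsConfig:             # same block as in the Pod example above
        options:
          - name: ndots
            value: "2"

Every pod created from this template then starts with ndots:2, without touching any cluster-wide DNS settings.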

After the root cause is fixed, we can improve the DNS settings with the recommendations below.

DNS for k8s, PRO tips and tricks

If you are using a managed k8s service, scale up DNS with an HPA (or something like KEDA, Kubernetes Event-driven Autoscaling) instead of manually changing the replica count, because your provider could overwrite your changes at any time.

HPA example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  maxReplicas: 20
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

The option above should be enough to avoid having to use NodeLocal.

But if you need to change options in the DNS service and you can’t make these changes (in the managed k8s case), you can install NodeLocal in your k8s cluster.

This adds one more DNS cache point (which takes load off the main DNS service) and lets you configure its settings as you want.

You should not change the nameserver option in your pods if you have dnsPolicy: ClusterFirst.

If one of the CoreDNS pods in your cluster returns an error, it does not mean all of them are unhealthy; just retry the DNS query. We can set this retry up in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: attempts
        value: "3"

If you don’t have enough time to make a patch for your env, add a “dot” at the end of your DNS query. Then your domain name will be an absolute FQDN, for example secretsmanager.ca-central-1.amazonaws.com. (note the trailing dot).
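
If your app reads the endpoint from its configuration, one way to do this is to pass the absolute name through an environment variable. A minimal sketch, assuming a hypothetical variable name SECRETS_ENDPOINT that your app reads:

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
      env:
        - name: SECRETS_ENDPOINT   # hypothetical variable your app reads for the endpoint
          # trailing dot = absolute FQDN, so the resolver skips the search list entirely
          value: "secretsmanager.ca-central-1.amazonaws.com."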

To propagate these changes to all pods on all nodes, you should change the kubelet settings using the resolvConf option.
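
A minimal sketch of that kubelet setting, assuming the kubelet is managed via a KubeletConfiguration file and that a custom resolver config (the path below is just an example) exists on each node:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# ... the rest of your kubelet configuration ...
# Example path: the file's nameserver/search/options lines are used by the kubelet
# as the base when it builds /etc/resolv.conf for pods on this node.
resolvConf: "/etc/kubelet/custom-resolv.conf"

After changing it, the kubelet on each node has to be restarted, and only newly created pods pick up the setting.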

So what was the problem? Why didn't I know about it at first?

Short answer: because of the following gaps:

  1. Not enough attention was paid to setting up critical error notifications
  2. An audit of all critically important components was not carried out in order to check the monitoring settings later
  3. There was no “preflight checklist” like this checklist

A little bit more about this methodology.

As you know, DevOps is not a person; it’s a set of best practices and approaches that involve implementing continuous integration and deployment ...

We have logs, metrics, and traces, often known as the three pillars of observability (as described in Distributed Systems Observability).

All logs scraped from the current env are sent to a centralized log system like ELK, New Relic, or Datadog, so in one place you can find and analyze almost any issue in your app or a system service.

App traces are sent to the same place as the logs and help you find the sequence of events, filtering step by step through all the apps related to a specific event/request.

All metrics from the current env are sent to a centralized monitoring system and correlated with the logs and traces.

We have alerts and notifications about monitoring issues, e.g., too many pod restarts, not enough disk capacity, etc.


My face when I first saw a critical issue on the critical env, while I thought everything was OK.

Pic from "They Live" (1988).

How to prevent issues like this before they affect clients?

But logs and metrics were not enough. Why?

We had logs, metrics, and traces, and we had errors in our logs, but who continuously watches all logs (10k log records every day is normal)? We dig into logs, metrics, etc. only when we are faced with an issue. But it is very expensive when an issue reaches your clients first. That is why we should recognize problems by their symptoms and prevent issues before they affect something critical.

In other words: not enough attention was paid to setting up critical error notifications, and DNS is one of the critical systems in the Kubernetes infrastructure. An example of such a notification is sketched below.
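
As an illustration, here is a minimal sketch of an alert on the CoreDNS error rate, assuming the Prometheus Operator is installed and CoreDNS metrics are scraped (the exact metric name, coredns_dns_responses_total below, depends on your CoreDNS version):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-errors
  namespace: kube-system
spec:
  groups:
    - name: coredns
      rules:
        - alert: CoreDNSHighErrorRate
          # fire when more than 5% of all responses over 5 minutes are SERVFAIL
          expr: |
            sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
              / sum(rate(coredns_dns_responses_total[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "CoreDNS is returning too many SERVFAIL responses"

A similar rule for a spike in NXDOMAIN responses would have caught the ndots problem described above much earlier.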

So, as I mentioned above, we should build an environment checklist for the critical components and go through it step by step. Some points of this checklist should be: verify that we have logs, metrics, and notifications for critical issues, and of course verify them in the real infrastructure.

As a result, the following steps should be done:

  • Fix the issue
  • Write a postmortem (example: Postmortem of database outage of January 31)
  • Make a checklist for the critical components, especially for important migrations, updates, and deployments
  • Use this checklist at least once in the existing infrastructure, or use it before events like a migration or update

So, why did I write this story today?

Everyone makes mistakes (e.g., the GitLab.com database incident); those who do nothing and do not learn from their mistakes do not achieve success!

Good luck! I hope this helps.

Please feel free to add any comments and connect with me at https://kazakov.xyz.


Written by kksudo | DevOps with love. Certified Kubernetes Administrator. I also enjoy riding a motorcycle.
Published by HackerNoon on 2022/11/15