I thought I had observability, but it wasn't enough.
The picture above says to me that you should know your infrastructure and heed your own advice.
Hey, brilliant human. So, as it turns out, a critical environment experienced an issue.
How can you prevent issues like this before they affect clients? I’ll tell you a little later in this article.
If you are reading this, chances are:
If you don't have time to read and want to see the solution right away, the way you scroll through all the answers on Stack Overflow looking for the accepted one, then read this detailed AWS DNS troubleshooting guide.
One more spoiler: the official Kubernetes documentation suggests using NodeLocal DNSCache. That is the wrong way and will not resolve the root cause.
This article will help you if you have one of the following questions or problems:
RCodes
When CoreDNS encounters an error, it returns an rcode—a standard DNS error code.
Errors like NXDomain and FormErr can reveal a problem with the requests CoreDNS is receiving, while a ServFail error could indicate an issue with the function of the CoreDNS server itself.
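These rcodes are also exported by the CoreDNS prometheus plugin, which makes them easy to graph and alert on. Below is a minimal sketch of a Prometheus recording rule that breaks responses down by rcode; the rule name is illustrative, and older CoreDNS releases expose coredns_dns_response_rcode_count_total instead of coredns_dns_responses_total.

groups:
  - name: coredns-rcodes
    rules:
      # Per-rcode response rate over the last 5 minutes.
      - record: coredns:responses:rate5m_by_rcode
        expr: sum by (rcode) (rate(coredns_dns_responses_total[5m]))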
So, below, I’ll tell you about 3 topics:
Solving DNS issues
DNS best practices for Kubernetes (though really more about DNS on Linux in general)
How to prevent issues like this before they affect clients
An issue was reported: some apps couldn't connect to a service, sometimes an external one, sometimes an internal one. We checked these services and everything worked fine, so the behavior was very strange and not persistent. Looking at all the errors in the centralized log system at that specific time, we found a correlation with the DNS service (CoreDNS): it was also having problems at that time.
CoreDNS metrics during this incident looked like the graphs below. This one shows the number of queries.
Here you can see the number of DNS responses that returned an error.
Here you can see CPU usage.
The DNS config inside the pods looks like this.
/etc/resolv.conf:
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
ndots:n
Sets a threshold for the number of dots that must appear in a name given to res_query() before an initial absolute query is made. The default for n is 1, meaning that if there are any dots in a name, the name is tried first as an absolute name before any search list elements are appended to it.
The error logs look like the following:
[INFO] 192.168.3.71:33238 - 36534 "A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000473434s
[INFO] 192.168.3.71:57098 - 43241 "A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000066171s
[INFO] 192.168.3.71:51937 - 15588 "A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000137489s
[INFO] 192.168.3.71:52618 - 14916 "A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 41 0.001248388s
[INFO] 192.168.3.71:51298 - 65181 "A IN amazon.com. udp 28 false 512" NOERROR qr,rd,ra 106 0.001711104s
As you can see, the message level is INFO, which is why we did not notice these messages before. You can also see that the DNS names are clearly wrong.
NXDOMAIN means that the domain record wasn't found, and NOERROR means that the domain record was found.
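By the way, CoreDNS only prints per-query lines like the ones above when the log plugin is enabled. Below is a minimal sketch of that change, assuming the default coredns ConfigMap in kube-system; keep the rest of your existing Corefile as it is, and note that the log plugin is verbose, so you may want it only while debugging.

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        log            # prints one line per query, including the rcode
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
    }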
CoreDNS integrates with Kubernetes via the kubernetes plugin, or with etcd via the etcd plugin.
Here is the logic behind these queries.
Why are external domains being searched in the local domains, and why are there 4 extra domain suffixes like .default.svc.cluster.local?
The app makes a DNS query, like secretsmanager.ca-central-1.amazonaws.com
The resolver inside the pod (configured via /etc/resolv.conf) decides where and in what form this name will be looked up; CoreDNS just answers the queries it receives.
k8s runs on Linux, which means the standard Linux resolver rules apply: the name is looked up either as an absolute (external) name or via the local search domains.
So, if the domain name is in FQDN format, an absolute query (external system) is made; otherwise the search happens in the local system.
BUT, we have the ndots option.
If the domain name is not an FQDN and names an internal service, e.g. my-internal-service, CoreDNS returns the IP address of the internal service in the same namespace.
Whether a name counts as an FQDN depends on the ndots option and on whether the name ends with a dot.
If the domain name has fewer dots than the ndots option, the name is first searched in the local domains defined in the search directive.
In our example, secretsmanager.ca-central-1.amazonaws.com has only 3 dots, whereas the ndots option has the value 5.
So it first searches:
secretsmanager.ca-central-1.amazonaws.com.default.svc.cluster.local
secretsmanager.ca-central-1.amazonaws.com.svc.cluster.local
secretsmanager.ca-central-1.amazonaws.com.cluster.local
secretsmanager.ca-central-1.amazonaws.com.ec2.internal
And only after these queries does it make an absolute query: secretsmanager.ca-central-1.amazonaws.com
If the domain name has 5 or more dots, or ends with a dot, an absolute query is made right away. In other words, an absolute FQDN (with a trailing dot) always results in a single absolute query.
FQDN: Fully Qualified Domain Name
The FQDN consists of two parts: the hostname and the domain name. For example, an FQDN for a hypothetical mail server might be mymail.somecollege.edu. The hostname is mymail, and the host is located within the domain somecollege.edu.
A dot at the end of the domain name means that this domain is an absolute FQDN.
E.g. “amazon.com.” means: make only an absolute query.
Okay, we found the root cause. There are 2 questions:
As we found, CoreDNS received 4 additional queries for every query from an app. At peak load CoreDNS did not have enough resources. We scaled up the number of pods and forgot about the issue for some time.
When we saw the issue again, we added NodeLocal DNSCache as mentioned in the official documentation, but this did not resolve the root cause: as the number of queries grows, the issue will return. That is why we have to fix the root cause.
We should reduce the value of the ndots option in resolv.conf.
Well, thank goodness k8s allows changing this option in the pod manifest (because with managed Kubernetes you do not always have permission to change options like this at the cluster level).
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"
So, /etc/resolv.conf in our pod will look like:
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:2
After the root cause is fixed, we can improve the DNS settings with the recommendations below.
If you are using a managed k8s service, scale DNS with an HPA (or something like KEDA, Kubernetes Event-driven Autoscaling) instead of manually changing settings, because your provider could overwrite your changes at any time.
HPA example:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  maxReplicas: 20
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
The setup above should be enough to avoid having to use NodeLocal DNSCache.
But if you need to change options in the DNS service and you can't make these changes (the managed k8s case), you can install NodeLocal DNSCache in your cluster. It adds one more DNS caching layer (which removes load from the main DNS service) and lets you tune its settings as you want.
You should not change the nameserver option in your pods if you have dnsPolicy ClusterFirst.
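For reference, ClusterFirst is the default dnsPolicy anyway; written out explicitly it looks like the sketch below, with the nameserver coming from the cluster DNS and only the resolver options tuned via dnsConfig.

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  # ClusterFirst is the default: queries go to the cluster DNS (CoreDNS) first.
  # Don't override the nameserver here; only tune resolver options via dnsConfig.
  dnsPolicy: ClusterFirst
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"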
If one of the CoreDNS pods returns an error or is unhealthy, the simplest mitigation is to retry the DNS query. This can be configured with the attempts resolver option in the pod manifest:
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: attempts
        value: "3"
If you don't have time to patch your environment, add a dot at the end of your DNS names. The name then becomes an absolute FQDN, for example secretsmanager.ca-central-1.amazonaws.com. (note the trailing dot).
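For example, if the endpoint is passed to the app through an environment variable, the trailing dot can be added right there. A sketch with a hypothetical variable name:

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: web-server
spec:
  containers:
    - name: test
      image: nginx
      env:
        # SECRETSMANAGER_ENDPOINT is a hypothetical variable name, only for illustration.
        # The trailing dot makes the name an absolute FQDN, so the resolver
        # skips the search-list expansion entirely.
        - name: SECRETSMANAGER_ENDPOINT
          value: "secretsmanager.ca-central-1.amazonaws.com."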
To propagate changes like this to all pods on a node, change the kubelet settings using the resolvConf option, which points kubelet at the resolv.conf file it uses as the basis for pod DNS configuration.
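A minimal sketch of the corresponding kubelet configuration fragment; the path is a hypothetical example of a resolv.conf you manage yourself.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hypothetical path; the default is /etc/resolv.conf.
resolvConf: /etc/kubelet-resolv.conf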
Short answer: I didn't do the next steps.
A little bit more about this methodology.
As you know, DevOps is not a person; it's a set of best practices and approaches that involve implementing continuous integration and deployment ...
We have logs, metrics, and traces, often known as the three pillars of observability (as described in Distributed Systems Observability).
All logs scraped from the environment are sent to a centralized log system like ELK, New Relic, or Datadog, so in one place you can find and analyze almost any issue in your apps or system services.
App traces are sent to the same place as the logs and help reconstruct the sequence of events, filtering step by step through all the apps related to a specific event/request.
All metrics from the environment are sent to a centralized monitoring system and correlated with logs and traces.
We have alerts: notifications about monitoring issues, e.g. too many pod restarts, not enough disk capacity, and so on.
My face when I first saw a critical issue in the critical environment, while I thought everything was OK.
Pic from "They Live" (1988).
But logs and metrics were not enough. Why?
We had logs, metrics, and traces, and we had errors in our logs, but who continuously watches all the logs (10k log records a day is normal)? We dig into logs, metrics, and so on when we are faced with an issue. But it is very expensive when an issue is found by your clients first. That is why we should predict problems from their symptoms and prevent issues before they affect anything critical.
In other words: not enough attention was paid to setting up notifications for critical errors. DNS is one of the critical systems in a Kubernetes infrastructure.
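As a starting point, here is a minimal sketch of a Prometheus alert on CoreDNS errors; the alert name and threshold are illustrative, and it assumes a CoreDNS version that exposes coredns_dns_responses_total.

groups:
  - name: coredns-alerts
    rules:
      - alert: CoreDNSHighServfailRate
        # Fire when more than 1% of responses are SERVFAIL for 10 minutes.
        expr: |
          sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_responses_total[5m])) > 0.01
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 1% of CoreDNS responses are SERVFAIL"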
So, as I mentioned above, we should have a checklist for the critical components of the environment and go through it step by step. Some points of this checklist should be: make sure we have logs, metrics, and notifications for critical issues, and of course verify them against the real infrastructure.
As a result, the next steps should be done:
So, why did I write this story today?
Everyone makes mistakes (e.g. the GitLab.com database incident); those who do nothing and do not learn from their mistakes do not achieve success!
Good luck! I hope this helps.
Please feel free to add any comments and connect with me at https://kazakov.xyz.