If you are having issues with DNS in your cluster, you can use this quick guide to get more information.
DNS logs
When checking the CoreDNS logs, you may see error entries such as:
$> kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[ERROR] plugin/errors: 2 node.domain.com. A: read udp <NODEIP>:45209-><DNSSERVER>:53: i/o timeout
[ERROR] plugin/errors: 2 node.domain.com. AAAA: read udp <NODEIP>:52088-><DNSSERVER>:53: i/o timeout
[ERROR] plugin/errors: 2 node.svc.cluster.local.domain.com. A: read udp <NODEIP>:48902-><DNSSERVER>:53: i/o timeout
[ERROR] plugin/errors: 2 node.svc.cluster.local.domain.com. A: read udp <NODEIP>:38309-><DNSSERVER>:53: i/o timeout
If you do not see any errors in the CoreDNS logs, you need to enable query logging by adding the log plugin to the CoreDNS ConfigMap - see the Kubernetes documentation. Once changed, run:
kubectl rollout restart -n kube-system deployment/coredns
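The change in question is adding the log plugin to the Corefile stored in the coredns ConfigMap (kubectl -n kube-system edit configmap coredns). A minimal sketch is below - your Corefile will almost certainly contain different plugins and options, so only add the log line rather than replacing the whole block:

```
.:53 {
    errors
    log          # add this line to log every query CoreDNS handles
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```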
Repeat the process that pointed to a DNS failure; you should now see entries in the log similar to the following:
[INFO] 192.168.224.131:56527 - 25854 "A IN google.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000082441s
[INFO] 192.168.224.131:53611 - 29229 "A IN google.com. udp 28 false 512" NOERROR qr,rd,ra 54 0.002412358s
[INFO] 192.168.224.131:35919 - 32688 "A IN google.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.00010114s
[INFO] 192.168.224.131:45732 - 54213 "A IN google.com. udp 28 false 512" NOERROR qr,aa,rd,ra 54 0.000064642s
If you believe your pods are not communicating correctly with your CoreDNS pods, you can validate that requests from the pods are actually reaching CoreDNS. The first entry in each log line, in this case 192.168.224.131, is the IP of the pod that generated the lookup. The logs can also be used to validate that the response CoreDNS sent matches the response the pod received, and that it is not being altered along the way. If you are not seeing any requests from your pod's IP, you may have a networking issue preventing the pods from communicating with CoreDNS.
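To see at a glance which pods are reaching CoreDNS, you can tally the query log by source IP. A sketch is below - the heredoc stands in for real logs; on a live cluster, feed awk from kubectl logs -n kube-system -l k8s-app=kube-dns instead:

```shell
# Tally DNS queries per source pod IP from CoreDNS query logs.
# The heredoc below is sample data standing in for real log output.
per_pod=$(awk '$1=="[INFO]" {split($2, a, ":"); print a[1]}' <<'EOF' | sort | uniq -c
[INFO] 192.168.224.131:56527 - 25854 "A IN google.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000082441s
[INFO] 192.168.224.131:53611 - 29229 "A IN google.com. udp 28 false 512" NOERROR qr,rd,ra 54 0.002412358s
EOF
)
echo "$per_pod"
```

A pod that appears in your application's failure logs but never in this tally is likely not reaching CoreDNS at all.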
If pods are having trouble querying external URLs, it may be due to the ndots configuration in the cluster. You can confirm this by checking the logs for the name you are querying after turning on query logging; if ndots is the culprit, you may see entries such as the below:
[INFO] <IP>:59175 - 29555 "A IN test.mytest.com.mytest.com. udp 55 false 512" NXDOMAIN qr,rd,ra 134 0.025318742s
In the above case, we can see a search domain appended to our query: .mytest.com.mytest.com. This indicates that CoreDNS is not treating the name we are passing as an FQDN. If this is the case, reducing the ndots value required for a name to be considered fully qualified, or editing the search domains, can help remedy the situation. Please note: configuring your cluster with an ndots value of '1' for all pods will break pod-to-pod communication.
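The resolver's decision can be sketched as follows: if a name contains fewer dots than ndots, the search domains are appended and tried before the name as given. The ndots and search values below are hypothetical examples, not read from a real cluster:

```shell
# Sketch of the glibc resolver's ndots/search ordering (illustrative values).
ndots=5
search="mytest.com svc.cluster.local cluster.local"
name="test.mytest.com"

dots=$(echo "$name" | awk -F. '{print NF-1}')   # 2 dots in this name
if [ "$dots" -lt "$ndots" ]; then
  # fewer dots than ndots: search domains are tried first
  for d in $search; do
    echo "try: $name.$d"
  done
fi
echo "try: $name."   # the name as given is tried last
```

The first attempt, test.mytest.com.mytest.com, is exactly the NXDOMAIN entry seen in the log above.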
Check nodes
First, check whether all the nodes have the same issue. SSH to each node and run a simple nslookup.
nslookup production.domain.com 192.168.0.1 # check internal DNS
nslookup submit.example.com 8.8.8.8 # check external DNS
cat /etc/resolv.conf
If the nslookup on the nodes fails, the issue lies with your internal routing and should be corrected before continuing.
Check resolv.conf from a pod
Check and output a specific pod's resolv.conf. This file is usually mapped in from the host, so it can be a quick way to check that the resolv.conf file on the hosts is what you expect:
kubectl exec -it -n <namespace> <podname> -- /bin/bash -c "cat /etc/resolv.conf"
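For reference, a pod using the default cluster DNS policy typically has a resolv.conf along these lines. The nameserver IP is the ClusterIP of the kube-dns service and varies per cluster, and the first search domain depends on the pod's namespace - this is an illustrative example, not a canonical file:

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```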
Check Pods
If the nodes are fine, we can check the pods. To begin, pull the dnsutils container from GCR. An example YAML manifest for creating the pod can be found on the Kubernetes DNS debugging page.
With this pod, you can run simple diagnostic DNS tools to check out your environment, for example, run the following to check the resolvers file:
kubectl exec -it dnsutils -- cat /etc/resolv.conf
Now, run the following commands to get the lay of the land.
kubectl exec -it dnsutils -- nslookup kubernetes.default
kubectl exec -it dnsutils -- nslookup problemdomain.com 192.168.0.1 #check internal DNS
kubectl exec -it dnsutils -- nslookup problemdomain.com 8.8.8.8 #check external DNS
kubectl exec -it dnsutils -- dig -x nodeipaddress
"nslookup" has three failure modes to be aware of:
- "SERVFAIL" indicates a failure within the DNS server - it is running, but broken.
- "NXDOMAIN" indicates that the DNS server does not know the domain name.
- "no servers could be reached" normally indicates a routing or other networking issue that makes the DNS server unreachable.
The "dig -x" command checks the reverse DNS entry of the IP address. If configured correctly, it should return a PTR record whose name is the IP address with its octets reversed, ending in "in-addr.arpa". For more information, see RFC 1035.
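The name that "dig -x" queries can be built by hand, which is useful for sanity-checking a PTR record. A sketch using an arbitrary example IP:

```shell
# Build the reverse-lookup name that `dig -x` queries under the hood:
# octets reversed, with ".in-addr.arpa" appended.
ip="192.168.0.1"
rev=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1".in-addr.arpa"}')
echo "$rev"
```

Querying this name directly (dig PTR 1.0.168.192.in-addr.arpa) should return the same answer as dig -x 192.168.0.1.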
Validate DNS requests are not blocked
In some cases, it may appear that requests are not making it out of the pod to CoreDNS. If you believe this to be the case, we can validate it by running a tcpdump on the network interface associated with the pod, which requires the PID of the pod's container. We can find the PID in the container metadata using crictl on the node where the pod is running:
crictl ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
c6fe113dfb9ed a3447b26d32c7 3 minutes ago Running calico-node 0 2a29d79ebe0fa
crictl inspect c6fe113dfb9ed | grep pid
"pid": 2927,
"pid": 1
"type": "pid"
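The first "pid" value can also be pulled out programmatically rather than by eye. A sketch is below - the heredoc stands in for the inspect output; on the node you would pipe in crictl inspect c6fe113dfb9ed instead (note that the exact JSON layout varies between container runtimes):

```shell
# Extract the first "pid" value from `crictl inspect`-style output.
# The heredoc below is sample data standing in for the real command.
pid=$(grep -m1 '"pid"' <<'EOF' | tr -dc '0-9'
    "pid": 2927,
    "pid": 1
EOF
)
echo "$pid"
```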
Once we have the PID of our container, we can run a tcpdump within the netns of our pod; doing this lets us gather information as close to the application as possible. With the PID in hand, run something similar to the below, changing ports and options as needed:
nsenter -t 2927 -n tcpdump udp and port 53
Upon examining the output, you should be able to validate whether the DNS requests you see line up with the DNS failures in your application's logging. If you see failures in the logs but the requests are not showing up on the pod's interface, security software may be blocking the requests, or a process may be reading the pod's /etc/hosts file, preventing lookups from succeeding.
Monitoring CoreDNS
Out of the box, DKP 2.X comes with a default CoreDNS dashboard in Grafana. This dashboard contains a wealth of information, such as the cache hit rate, response codes, and response durations, among other details.
If you are experiencing DNS issues in your cluster, this can be a starting point for any investigation. Along with this, additional metrics exported to Prometheus may help identify the underlying issue. Metrics such as upstream health checks or port exhaustion are important to keep an eye on if you are experiencing DNS issues in your cluster. These can be accessed by navigating to the Prometheus UI and searching for metrics prefixed with 'coredns'.
If there is a particular metric your team finds useful, such as the upstream health check status, it is straightforward to add it to any dashboard: add a panel to the dashboard of your choice, use Prometheus as the data source, and select the metric you wish to monitor.
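For example, a panel tracking upstream health check failures could use a query along these lines. Metric names can vary between CoreDNS versions, so confirm the exact name in your Prometheus UI first:

```
# Rate of failed CoreDNS health checks against each upstream resolver
rate(coredns_forward_healthcheck_failures_total[5m])
```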