Today my Jenkins builds were not working because all of the build slaves were offline. Digging around in the logs showed that the couldn’t connect because of name resolution failures. I use mDNS on my network (the slaves are Mac OS X VMs running on a Mac Mini), and so they were named something like xxxx-1.local and xxxx-2.local I tried just pinging the machines and that refused to resolve the name, too.
I verified that Avahi was running, and then used avahi-resolve --name xxxx-1.local to check the mDNS name resolution. It worked just great.
So why would mDNS be working fine network-wise, but no programs were resolving correctly? It struck me that I didn’t know (or couldn’t remember!) how mDNS tied in to the system. Who hooks in to the name resolution and knows to check for *.local using mdns?
It turns out it’s good old /etc/nsswitch.conf (I should have remembered that)! There’s a line in there:
hosts:          files mdns4_minimal [NOTFOUND=return] dns
That tells the libc resolver (that everything uses) that when it’s looking for a hostname, it should first look in the /etc/hosts file, then it should check mDNS, then if mDNS didn’t handle it, check regular DNS. Wait, so mDNS is built right in to libc??
Nope! On my Debian system there’s a package called libnss-mdns that has a few files in it:
/lib/x86_64-linux-gnu/libnss_mdns.so.2
/lib/x86_64-linux-gnu/libnss_mdns4.so.2
/lib/x86_64-linux-gnu/libnss_mdns4_minimal.so.2
/lib/x86_64-linux-gnu/libnss_mdns6.so.2
/lib/x86_64-linux-gnu/libnss_mdns6_minimal.so.2
/lib/x86_64-linux-gnu/libnss_mdns_minimal.so.2
Those are plugins to the libc name resolver so that random stuff like mDNS doesn’t have to be compiled into libc all the time. In fact, there’s a whole bunch of other libnss plugins in Debian that I don’t even have installed.
So my guess was that this nss-mdns plugin was causing the problem. There are no man pages in the package, but there are a couple README files. I poked around trying random things and reading and re-reading the READMEs many times before this snippet finally caught my eye:
If, during a request, the system-configured unicast DNS (specified in
/etc/resolv.conf) reports anSOArecord for the top-levellocalname, the request is rejected. Example:host -t SOA localreturns something other thanHost local not found: 3(NXDOMAIN). This is the unicast SOA heuristic.
Ok. I doubted that was happening but I decided to try their test anyway:
$ host -t SOA local
local. has SOA record ns1-etm.att.net. nomail.etm.att.net. 1 604800 3600 2419200 900
Crap.
Those bastards at AT&T set their DNS server up to hijack unknown domains! They happily give out an SOA for the non-existant .local TLD. So AT&T’s crappy DNS is killing my Jenkins jobs??? Grrrrr…
The worst part is that I tried to use Cloudflare’s 1.0.0.1 DNS. My router was configured for it. But two things happened: 1. I got a new modem after having connection issues recently, 2. I enabled IPv6 on my router for fun. The new modem seems to have killed 1.0.0.1. I can no longer connect to it at all. Enabling IPv6 gave me an AT&T DNS server through DHCP (or whatever the IPv6 equivalent is).
So, straightening out my DNS (I had to revert back to Google’s 8.8.8.8) caused NXDOMAIN responses to the .local SOA, and that caused mDNS resolution to immediately work, and my Jenkins slaves came back online. Fwew.