Availability-zone-aware DNS Service-Discovery with DNSMasq and Ansible

At OpenSooq we rely heavily on SOA (service-oriented architecture) and run a lot of microservices. Because our environment is highly scalable, dynamic, and constantly changing, we need some form of service discovery.

DNS service discovery allows us to reach our microservices and supporting services by name. It is not an alternative to load balancing but a complement to it. As a rule of thumb, we place two load balancers in each availability zone and use DNS round-robin across them. Each service should contact a service in its own availability zone when one is available. Our initial /etc/resolv.conf looked like this (on a host that belongs to the availability zone named “1a”):

search 1a.opensooq.internal any.opensooq.internal opensooq.internal
nameserver 172.16.0.2

When a service tries to resolve a host named monkey, the resolver actually tries monkey.1a.opensooq.internal, then monkey.any.opensooq.internal, then monkey.opensooq.internal.
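
One way to see which of those candidates actually matched on a given host is to resolve the short name through the system resolver. The following is a minimal sketch, not part of our original setup. It assumes a record named monkey exists (the name is only the example used above) and that getent is available on the target hosts.

```yaml
---
- hosts: web-production
  tasks:
  - name: resolve the short name through the system resolver
    command: getent hosts monkey
    register: lookup
    changed_when: false   # read-only check, never reports a change
  - name: show what the resolver returned
    debug:
      msg: "{{ lookup.stdout }}"
```

getent typically prints the address followed by the fully qualified name that matched, so a zone-local hit should show up as monkey.1a.opensooq.internal. If none of the candidates resolves, getent exits with a non-zero status and the task fails.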

But Amazon DNS is far from perfect and too slow (sometimes it costs 3 ms per lookup), so we set up DNSMasq on all of our servers as a local caching resolver.

We need an automated way to determine the availability zone. Amazon provides an instance metadata service that returns, among other things, the availability zone. For example, if you curl the following URL from inside an Amazon instance, it returns the availability zone:

http://169.254.169.254/latest/meta-data/placement/availability-zone

You could also use Ansible dynamic inventory, but the URL above was good enough.
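
For readers who prefer the dynamic inventory route, here is a rough sketch of what it might look like with the aws_ec2 inventory plugin. This is not what we used. The file name, the region, and the group prefix are assumptions chosen for illustration.

```yaml
# aws_ec2.yml (hypothetical inventory file; the plugin expects this file name suffix)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1            # assumption: replace with your own region
keyed_groups:
  # creates one group per availability zone, e.g. az_us_east_1a
  - key: placement.availability_zone
    prefix: az
```

With such an inventory, each host carries a placement.availability_zone variable, so the short zone name could be derived as {{ placement.availability_zone.split('-')[-1] }} without calling the metadata URL.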

If you want to add Google DNS alongside Amazon’s, add “strict-order” so that DNSMasq queries the upstream servers in the order they are listed. Otherwise, if Google answers faster, internal domains won’t be resolved.
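
A different way to get the same guarantee, which we did not use, is to pin the internal domain to the Amazon resolver with DNSMasq's per-domain server option. The task below is a sketch under that assumption. It reuses the file path and addresses from this post and swaps strict-order for a server= line.

```yaml
- name: dnsmasq configuration (per-domain upstream instead of strict-order)
  copy:
    dest: /etc/dnsmasq.d/opensooq.conf
    content: |
      resolv-file=/etc/resolv.dnsmasq
      cache-size=10000
      listen-address=127.0.0.1
      # send every *.opensooq.internal query to the Amazon resolver only
      server=/opensooq.internal/172.16.0.2
```

With this variant, names under opensooq.internal never reach the public resolver, while all other names still go to the servers listed in /etc/resolv.dnsmasq.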

A simple Ansible playbook like the one below did the job (tested on CentOS 7 and Ubuntu 16.04):

---
- hosts: web-production
  become: yes
  tasks:
  - name: get zone
    uri:
      url: http://169.254.169.254/latest/meta-data/placement/availability-zone
      return_content: yes
    register: zone
  - name: show the full and short zone names
    debug:
      msg: "az is {{zone.content}} short as {{zone.content.split('-')[-1]}}"
  - name: /etc/resolv.dnsmasq
    copy:
      dest: /etc/resolv.dnsmasq
      content: |
            search {{zone.content.split('-')[-1]}}.opensooq.internal. any.opensooq.internal. opensooq.internal.
            nameserver 172.16.0.2
            nameserver 8.8.8.8
  - name: install dnsmasq package
    package:
      name: dnsmasq
      state: present
  - name: dnsmasq configuration
    copy:
      dest: /etc/dnsmasq.d/opensooq.conf
      content: |
            resolv-file=/etc/resolv.dnsmasq
            cache-size=10000
            listen-address=127.0.0.1
            strict-order
  - name: prepend the local dnsmasq to the DHCP-provided nameservers
    lineinfile:
      dest: /etc/dhcp/dhclient.conf
      line: 'prepend domain-name-servers 127.0.0.1;'
    when: ansible_os_family == "RedHat"
  - name: set the DHCP domain-search list for this zone
    lineinfile:
      dest: /etc/dhcp/dhclient.conf
      regexp: '^supersede domain-search '
      line: 'supersede domain-search "{{zone.content.split("-")[-1]}}.opensooq.internal", "any.opensooq.internal", "opensooq.internal";'
    when: ansible_os_family == "RedHat"
  - name: enable service
    service:
      name: dnsmasq
      state: started
      enabled: yes
  - name: overwrite /etc/resolv.conf instead of rebooting or restarting the network
    copy:
      dest: /etc/resolv.conf
      content: |
            ; generated by /usr/sbin/dhclient-script
            search {{zone.content.split('-')[-1]}}.opensooq.internal. any.opensooq.internal. opensooq.internal.
            nameserver 127.0.0.1
  - name: restart dnsmasq service
    service:
      name: dnsmasq
      state: restarted

And that’s it. To make it perfect, replace the final unconditional restart with a handler-triggered restart.
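
A triggered restart could be sketched roughly as follows, using an Ansible handler. Only the dnsmasq configuration task is shown. The handler name is arbitrary, and the same notify line would also go on the /etc/resolv.dnsmasq task.

```yaml
---
- hosts: web-production
  become: yes
  tasks:
  - name: dnsmasq configuration
    copy:
      dest: /etc/dnsmasq.d/opensooq.conf
      content: |
            resolv-file=/etc/resolv.dnsmasq
            cache-size=10000
            listen-address=127.0.0.1
            strict-order
    # notify fires only when this task reports "changed"
    notify: restart dnsmasq
  handlers:
  # runs once at the end of the play, and only if it was notified
  - name: restart dnsmasq
    service:
      name: dnsmasq
      state: restarted
```

With this in place, re-running the playbook against hosts that are already configured leaves dnsmasq untouched, and the unconditional restart task at the end can be dropped.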