High-Performance DNS for The Cloud

DNS is a great example of a service that couldn't possibly work on paper, but performs spectacularly in practice, even with a hodge-podge of implementations all over the Internet. First, the authoritative DNS servers respond with a Time To Live (TTL) value for every record (which you should keep fairly low in a virtual deployment environment). Then the upstream DNS servers cache that same data, albeit usually with different policies (determined by the ISP). And finally, your router, OS, and browser all have their own independent DNS caches (Firefox 2/3 caches all DNS records for 1 minute, IE 5/6/7 for 15 minutes). Talk about a mess!

DNS as a cheap load-balancing tool

In theory, DNS information should not change very often, but in practice a lookup of any high-traffic site often yields a surprisingly low TTL value: amazon.com (60s), cnn.com (300s), bbc.com (360s). Try it yourself:

$ dig amazon.com
# ; <<>> DiG 9.2.4 <<>> amazon.com
# ;; ANSWER SECTION:
# amazon.com. 60 IN A 72.21.206.5
# amazon.com. 60 IN A 72.21.210.11
# amazon.com. 60 IN A 72.21.203.1

# The TTL is the second field of each record in the answer section (60 seconds, in this case)
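If you want to pull the TTL out programmatically, here is a minimal sketch in Python. It assumes you already have dig's answer-section text in a string and that each record follows the usual "name TTL class type data" layout. The names `Answer`, `parse_answers`, and `SAMPLE` are made up for illustration.

```python
from typing import List, NamedTuple


class Answer(NamedTuple):
    name: str
    ttl: int
    rtype: str
    address: str


def parse_answers(dig_output: str) -> List[Answer]:
    """Extract records laid out as: name, TTL, class, type, data."""
    answers = []
    for line in dig_output.splitlines():
        line = line.strip()
        # Skip blank lines and dig's own ';' comment lines.
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        if len(fields) == 5 and fields[1].isdigit() and fields[2] == "IN":
            answers.append(Answer(fields[0], int(fields[1]), fields[3], fields[4]))
    return answers


# Sample text copied from the answer section shown above.
SAMPLE = """\
;; ANSWER SECTION:
amazon.com. 60 IN A 72.21.206.5
amazon.com. 60 IN A 72.21.210.11
amazon.com. 60 IN A 72.21.203.1
"""

records = parse_answers(SAMPLE)
for record in records:
    print(record.address, record.ttl)
```

The TTL that dig reports from a caching resolver counts down between queries, so repeated lookups may show a value lower than the one the authoritative server published.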

The reason for such a short TTL is that DNS can be used as a very cheap load balancer. When multiple A records are specified, the DNS server typically returns them in a rotated or shuffled order, and most clients connect to the first address they receive, which effectively distributes the traffic between parallel nodes in the network. It is a really handy trick for high-traffic sites or geographic load balancing.
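To get a feel for how even the split is, here is a rough simulation. It is a toy model rather than a description of any particular DNS server: it assumes each lookup lands on one of the published addresses with equal probability, and the `simulate_lookups` helper and the lookup count are made up for illustration.

```python
import random
from collections import Counter

# The three A records from the dig output above.
A_RECORDS = ["72.21.206.5", "72.21.210.11", "72.21.203.1"]
LOOKUPS = 30_000


def simulate_lookups(records, lookups, seed=0):
    """Count how many simulated lookups end up at each address."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    return Counter(rng.choice(records) for _ in range(lookups))


counts = simulate_lookups(A_RECORDS, LOOKUPS)
for ip, hits in sorted(counts.items()):
    print(f"{ip}: {hits / LOOKUPS:.1%}")
```

Each address should end up with roughly a third of the lookups. Real traffic is lumpier, because one cached answer at a large ISP can represent thousands of users.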

Keep DNS away from the cloud

A common worry for any virtual or cloud-deployed application is the dreaded 'what if I lose my IP address' scenario. Services like Amazon EC2 do not guarantee the same IP address assignment (although I should mention that Amazon specifically has addressed this problem with the release of its Elastic IP service), making DNS a somewhat fragile piece of the infrastructure.

If you're up to the task, you can manage your own DNS service up in the cloud, although it is probably far more reliable and easier to offload this piece of infrastructure to a company such as Nettica or DynDNS, which manage DNS for a living. Their services are cheap, often come with API access, and offer a distributed DNS service for peace of mind. All you have to do is set a low TTL (5-60 minutes, depending on your comfort level), and you're off to the races.

Seamless migrations & DNS failover

Running your application in the cloud has many advantages, but one pattern that we've found really useful at AideRSS is the ability to promote a staging beta environment to production at the push of a button, thanks to dynamic DNS. Whenever a major release is due, we simply duplicate our entire infrastructure (database, front-end, and all other required components: elastic computing at its best), perform all the testing on a live dataset, and then make it live.

For the most part, the migration path is incredibly simple: all we have to do is update a single DNS record to point to our new staging server. If all goes well, the user lands on the new server and continues to interact with the site without seeing a single maintenance page. Best part: because both versions are live, a rollback is no harder than reverting the IP address to its old value.
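The promote-and-rollback idea can be sketched in a few lines of Python. This is an in-memory toy, not the API of Nettica, DynDNS, or any other provider: the `RecordSwitch` class, the hostname, and the addresses (taken from the reserved documentation range) are all made up for illustration. In a real setup, `promote` and `rollback` would each call your DNS provider's record-update API.

```python
class RecordSwitch:
    """Toy stand-in for one A record managed through a DNS provider."""

    def __init__(self, name, ip, ttl=300):
        self.name = name
        self.ip = ip
        self.ttl = ttl
        self.previous_ip = None

    def promote(self, new_ip):
        """Point the record at the staging server, remembering the old target."""
        self.previous_ip, self.ip = self.ip, new_ip

    def rollback(self):
        """Restore the previous target; only valid after a promote."""
        if self.previous_ip is None:
            raise RuntimeError("nothing to roll back to")
        self.ip, self.previous_ip = self.previous_ip, None


record = RecordSwitch("www.example.com.", "203.0.113.10")

record.promote("203.0.113.20")  # staging becomes production
live_after_promote = record.ip

record.rollback()               # revert to the old server
live_after_rollback = record.ip

print(live_after_promote, live_after_rollback)
```

Keeping the previous address next to the current one is what makes the rollback a single step: nothing has to be rebuilt, because the old environment is still running.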

Working with the DNS cache

Ah, but what about the DNS cache, right? If your ISP decided to cache the IP address of the old server for 24 hours, won't this affect the user experience? To handle this scenario, we make sure to keep our old server around for at least 24 hours, and install the following redirect rules:

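# Start proxying port 80 to the new server (replace NEW_IP with its address)

# Allow forwarded packets, rewrite the destination of incoming port-80 traffic,
# and masquerade so that replies return through this host
$ iptables -A FORWARD -i eth0 -o eth0 -j ACCEPT
$ iptables -t nat -A PREROUTING -p tcp --dport 80 -i eth0 -j DNAT --to-destination NEW_IP
$ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Enable IP forwarding in the kernel
$ echo 1 > /proc/sys/net/ipv4/ip_forward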

The rules above effectively turn your old server into a transparent proxy. Anyone making a request to the old IP address (on port 80) will receive a legitimate response from the new server, until they update their DNS and begin talking to the new server directly. You can monitor the traffic on the old server, and retire it after the DNS caches expire. In practice, 24 hours is more than enough.
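Since it is easy to paste these rules with the placeholder still in them, one option is to generate the commands from a validated address. The Python sketch below only builds the command strings and does not run them. The `proxy_rules` helper, its defaults, and the example address (from the reserved documentation range) are made up for illustration.

```python
import ipaddress


def proxy_rules(new_ip, iface="eth0", port=80):
    """Return the shell commands that forward `port` on this host to new_ip."""
    ipaddress.IPv4Address(new_ip)  # raises ValueError for anything but an IPv4 address
    return [
        "echo 1 > /proc/sys/net/ipv4/ip_forward",
        f"iptables -A FORWARD -i {iface} -o {iface} -j ACCEPT",
        f"iptables -t nat -A PREROUTING -p tcp --dport {port} -i {iface} "
        f"-j DNAT --to-destination {new_ip}",
        f"iptables -t nat -A POSTROUTING -o {iface} -j MASQUERADE",
    ]


rules = proxy_rules("203.0.113.20")
for command in rules:
    print(command)
```

Review the printed commands and run them as root on the old server. Passing an unfilled placeholder such as "NEW IP" raises an error instead of producing a broken rule.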

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify.