DNS round robin for web server failover

Because we were concerned about the reliability of our webservers, we elected to experiment with the following procedure to "round-robin" between our two servers. After 7 months of service we have seen no downside, in spite of much agitation by knowledgeable people that DNS was unsuited for this task. We are particularly interested in what problems those experts anticipated; in our discussions they were not able or willing to articulate the exact problems they foresaw.

What we did

We added a second A record to www.example.com pointing to the present backup webserver:

    www    IN    A    10.0.0.1
    www    IN    A    10.0.0.2

Now our DNS server returns both IP addresses for each query, in random order; this has been standard in BIND since forever. There is an "rrset-order" option that would allow us to fix the order, but we haven't tried that yet. If both webservers are up, there is obviously no problem. If one is down, the questions are: will the browser try the second IP address, and how long does it wait to do so?
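
One way to see what a client is actually handed for a name with two A records is to ask the local resolver and list the addresses in the order received. The following is a minimal sketch in Python, not part of our setup; the function name lookup_ipv4 and the injectable resolver argument are assumptions made for illustration, and the operating system's resolver or a cache may reorder the answers before the program sees them.

```python
import socket

def lookup_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the IPv4 addresses for hostname, in the order the resolver gave them."""
    # Port 80 and SOCK_STREAM only narrow the result to one entry per address.
    infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for family, socktype, proto, canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:  # drop duplicates, keep resolver order
            addresses.append(ip)
    return addresses
```

Calling lookup_ipv4("www.example.com") a few times, a TTL apart, should show both addresses, with the order changing between calls if the answers are being rotated.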

Successes with modern browsers

Using recent (2009) versions of MSIE (version 8), Opera, Safari 4.0.4, Firefox (3.4), Chrome and Konqueror, the worst result was a delay of about 30 seconds before the browser retried at the other IP address and loaded the page. Other than the pause, the process is transparent to the user. It occurs only if the first server tried times out, and only for the first page requested from our site in any browser session.
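
The retry behavior described above amounts to walking the list of returned addresses until one accepts a connection. Here is a minimal sketch of that loop in Python. It is not how any particular browser implements it; the function name connect_with_failover, the 5-second timeout and the injectable connect argument are all assumptions for illustration.

```python
import socket

def connect_with_failover(addresses, port=80, timeout=5.0,
                          connect=socket.create_connection):
    """Try each address in order; return (address, connection) for the first success."""
    last_error = None
    for ip in addresses:
        try:
            conn = connect((ip, port), timeout)
            return ip, conn
        except OSError as err:  # covers timeouts and refused connections
            last_error = err
    # Every address failed (or the list was empty).
    raise OSError(f"no address in {addresses!r} accepted a connection") from last_error
```

With this approach the worst-case pause before the second server is tried is the timeout value, so the delay a user sees depends on how patient the client chooses to be.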

Failures with obsolete browsers

I had access to a few older browsers, such as Firefox 2.0 and Lynx, that never switched. I was also able to test Safari 4.0, Firefox 2.0 and Chrome 3.0 at Adobe BrowserLab, and those also failed to switch. So I can't say exactly when the ability to switch to a working A record was added to each browser, but it has been added.

Is there a downside?

During periods when one server is down, users of non-switching browsers have a 50% chance of getting the bad server in an individual browser session. On the other hand, with two servers the chance that one of them is down is about double the chance that a single server is down. Half the exposure at twice the frequency is close to a wash for the older browsers, and a pretty big win for the newer ones. It is true that a user with an older browser could close his browser and wait 5 minutes (our DNS TTL) for another chance, but probably most users wouldn't do that.
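
The arithmetic behind that comparison can be checked with a few lines of Python. This is a sketch under stated assumptions, not a measurement: each server is assumed to be down independently with the same chance p, a non-switching browser is assumed to pick either address with equal chance, and the 1% downtime figure is hypothetical.

```python
def failure_chance(p, switching):
    """Chance that the first page of a session fails with two round-robin servers.

    p is the assumed chance that any one server is down, independent of the other.
    """
    both_down = p * p
    exactly_one_down = 2 * p * (1 - p)
    if switching:
        # A switching browser fails only when neither server answers.
        return both_down
    # A non-switching browser also fails half the time when one server is down.
    return both_down + 0.5 * exactly_one_down

p = 0.01  # hypothetical: each server down 1% of the time
single_server = p
old_browser = failure_chance(p, switching=False)
new_browser = failure_chance(p, switching=True)
print(single_server, old_browser, new_browser)
```

Under these assumptions the non-switching case works out to p*p + p*(1 - p), which is just p, the same as with a single server, while the switching case drops to p*p.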

This does split our logs over two systems, but that has not been a problem for us, and could be addressed if it were.

You might think that this is something better handled by SRV records. We and some others agree. But browser authors have resisted SRV records, and round robin A records have improved our reliability greatly without introducing any new hardware, software or single point of failure, and without much complication of our configuration either. So we like them.

Thirty seconds is much longer than necessary. See: this posting from the ISC

Comments?

A Czech translation of this article is posted here.

Daniel Feenberg
feenberg@nber.org
11 October 2010