Quantitative Evaluation of DNSBLs
Here are counts of messages that would be blocked by each of 16 DNSBLs, out of 86,252 messages to actual users at nber.org during the last week of February 2005. Messages to non-existent mailboxes are ignored, as they don't actually inconvenience users. Unlike some seemingly similar charts, we have queried all the lists for every message, so the consultation order doesn't affect the result. Lists are queried within a few seconds of mail receipt.
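For readers unfamiliar with the mechanics, a DNSBL is consulted by reversing the octets of the sending host's IPv4 address, appending the list's zone, and asking for an A record; an answer (conventionally in 127.0.0.0/8) means the address is listed, and NXDOMAIN means it is not. A minimal Python sketch of such a query (the zone name is only an example, not a recommendation):

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reverse the IPv4 octets, append the zone."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Listed if the query name resolves to an A record; NXDOMAIN means clean."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

# e.g. is_listed("192.0.2.1", "sbl-xbl.spamhaus.org") queries
# 1.2.0.192.sbl-xbl.spamhaus.org
```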
So t1.dnsbl.net.au looks attractive - it blocks 66% of inbound mail, compared to mail-abuse.org (MAPS), to which we currently subscribe and which blocks only 39%.
But what about false positives? We have no accurate way of counting incorrectly rejected messages (there are essentially no complaints) and no way to make users cooperate in a mass identification. So we took the list of 1,473 persons invited to our conferences over the last year or so and checked the MX servers for their addresses against each list. If many of them were blocked, that would be a red flag indicating that a blocking list was overly enthusiastic. We realize that some ISPs use separate servers for incoming and outgoing mail, so the estimate of blocked servers will be low, but we hope it is not biased among the various DNSBLs.
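The tally behind this proxy can be sketched as follows: resolve each correspondent's domain to its MX hosts, then count a correspondent as blocked if any of those hosts appears on the list (the addresses and host names below are hypothetical, and counting "any MX listed" is one possible convention):

```python
def blocked_correspondents(mx_by_address, listed_hosts):
    """Count correspondents with at least one MX host on the blocking list.
    mx_by_address maps an email address to its domain's MX hostnames;
    listed_hosts is the set of hosts the DNSBL lists.  Each correspondent
    is counted once, regardless of how many messages they might send."""
    return sum(1 for mxes in mx_by_address.values()
               if any(mx in listed_hosts for mx in mxes))

participants = {
    "a@example.edu": ["mx1.example.edu"],
    "b@example.org": ["mx.example.org", "mx2.example.org"],
    "c@example.net": ["mx.example.net"],
}
```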
Our conference participants are Ph.D. economists at universities and government agencies - we expect that they are less likely than average to be blacklisted, but they are representative of our most important (to us) correspondents. These are real people well known to us and with correct addresses.
So 28 (0.7%) of our participants would be unable to write to us if we used T1 as our blocking list, while MAPS does a bit better, blocking only 20 (0.4%). In spite of its controversial reputation, Spamcop does not seem aggressive in this test, with none of our correspondents blocked.
Some readers of this document have suggested that the false positive rates in Table 2 are implausibly low, and indeed they are probably not typical. We also have a list of registered users on our web site. This list has 39,217 email addresses, including many more non-US and non-academic ones; some are obviously throw-away accounts. Table 3 shows the number of hits for each DNSBL among the 39,294 MX hosts of those addresses.
Some of these numbers are so high that I don't know what to make of them, and since we mostly don't know who these people are, I don't want to claim that high numbers here represent mistakes or bad faith by any of the lists. In fact, it is only the blocking of legitimate mail that gives ISPs an incentive to block spam emissions from their own mail servers, so strict lists are performing a public service. Nevertheless, the lists at the top of Tables 2 and 3 are too strict to be practical for our institution. The top and bottom of Tables 2 and 3 are the same lists, so as a relative guide I think the proxy is probably OK. I'd like to run the test with a non-academic list, but that would be less relevant for our decision. Certainly Table 3 confirms our preference for Spamhaus.
Claims by supporters of anti-spam methodologies of very low false positive rates should be taken with a grain of salt. Any technique will have a low rate for its developer, but legitimate mail is much more varied than spam, and casual users are not proficient at tuning anti-spam engines, so end users will rarely match the near-perfect record most techniques advertise.
Furthermore, the denominator of the error rate will include multiple messages from correspondents whose mail is correctly accepted, while rejected correspondents presumably don't write back after being ignored once. This leads to an unrealistically small quoted error rate. Our measure (which admittedly has other defects) doesn't have that problem, since each correspondent is counted only once.
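The denominator effect is easy to see with made-up numbers. Suppose 100 accepted correspondents each write 10 messages, while 1 rejected correspondent writes once and gives up:

```python
# Hypothetical illustration of the denominator bias: accepted correspondents
# keep writing, inflating the denominator; a rejected one writes only once.
accepted = 100          # correspondents whose mail gets through
messages_each = 10      # each keeps writing
rejected = 1            # blocked once, never writes again

per_message_rate = rejected / (accepted * messages_each + rejected)
per_correspondent_rate = rejected / (accepted + rejected)
# per_message_rate is about 0.1%, per_correspondent_rate about 1% -
# a tenfold flattering difference for the same single lost person.
```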
There are five lists above with very low false positive rates, and all of them have rejection rates in the 30-50% range. Apparently, to get better spam control than that we would have to accept a significant number of false positives. However, Spamhaus looks like a good compromise - blocking 50% of all mail, but only 0.02-0.03% of good addresses. On its web page, Spamhaus suggests that the list should block 65% of spam, which is roughly consistent with the numbers above if about one third of all mail is good mail.
One possible figure of merit is the ratio of the false positive proxy to the rejection rate. This ranges from 0.3% (Spamhaus, Abuseat) to 19% (cmsa.biz). That's quite a range, but it isn't really informative. More to the point, what is the marginal reduction in spam from moving from one list to another? In our case, the obvious alternative to Spamhaus would be T1. Switching would reduce our inbound traffic by 18,000 messages but add 1,862 registered correspondents to our rejection list. That is high enough that it has discouraged us from switching. Please note that the absolute level of these marginal ratios depends upon the length of the sampling period (Table 1 measures a flow, the others a stock), so the comparison is good only among these charts. That is, if we doubled the collection period, the numbers in Table 1 would roughly double, while the numbers in Tables 2 and 3 would not change.
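These two comparisons reduce to simple ratios, sketched below with the numbers quoted above (the 0.0015 false-positive proxy passed to figure_of_merit is only an illustrative value, not a measured one):

```python
def figure_of_merit(fp_proxy_rate, rejection_rate):
    """Ratio of the false-positive proxy to the rejection rate; lower is better."""
    return fp_proxy_rate / rejection_rate

def messages_per_lost_correspondent(extra_messages_blocked,
                                    extra_correspondents_blocked):
    """Marginal trade from switching to a stricter list: spam messages
    avoided per additional legitimate correspondent rejected."""
    return extra_messages_blocked / extra_correspondents_blocked

# Switching from Spamhaus to T1: 18,000 fewer messages, 1,862 more
# registered correspondents rejected - under 10 messages saved per
# correspondent lost.
trade = messages_per_lost_correspondent(18000, 1862)
```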
I have other charts (not included here) showing the effects of all possible combinations of the 16 lists. It is a lot of data, but in brief - combining two of the better lists blocks little more mail than the larger of the two alone, and therefore isn't particularly desirable. It does protect against a DDoS against one of the DNSBLs, but that is not a real problem for DNSBL users.
We also have runs in which the DNSBLs were consulted several days after the mail was presented, and the blocking rates are substantially lower. This was a surprise to us, since the rationale for removing an address is not obvious. There is a statistical principle which says that if your detector detects only a small fraction of events, changes in the observed event rate are more likely to reflect changes in the detection rate than changes in the underlying event rate. That suggests that removing an address just because it hasn't mailed a spamtrap lately is probably not justified.
I want to add that we prefer the DNSBL approach to spam control over content analysis, because we don't feel comfortable dropping messages on the floor. With a DNSBL it is quite easy to reject a message, and all false positives are returned to the sender (rather than disappearing into the ether). With content analysis, it isn't so easy to reject a message, and we don't believe that delivery to a spam folder is much help to the sender. Of course, sending a bounce to the (usually forged) return address is out of the question. Content analysis also reduces the pressure on ISPs to discourage outbound spam - an effect we would rather avoid. Various sender authorization techniques have the same effect.
My thanks to Alex Aminoff for Perl programming, and John Reid for suggesting that we check the lists immediately upon receipt of the messages rather than waiting several days. The blacklists were chosen from Jeff Makey's comparison page. Comments are welcome, as are lists of MTAs to check. We are particularly interested in sources of known good mail, such as valid responses to list subscription challenges.
An updated copy of this memo will be kept at http://www.nber.org/sys-admin/dnsbl-comparison.html. 10 March 2005