Son of spam: 4 spam filtering packages tested

24 October 2003 01:40 PM

Tags: exchange, spamkiller, mcafee, surfcontrol, mailessentials, mailmarshal, gfi, smtp



 Spam filtering software

 Anti-spam software:

 GFI MailEssentials
 NetIQ MailMarshal
 NAI/McAfee SpamKiller
 SurfControl

 Specifications
 How we tested
 Look out for...
 Final words
 About RMIT

Can you trust software to block all the spam your company receives and let all your legitimate e-mail through? We evaluate four top spam filtering packages for their accuracy.

Three months ago, in the July edition of Technology & Business, we compiled an overview of five anti-spam filtering applications available at the time. That initial review introduced spam and its concepts and covered the usability and technical implementation of each application. However, it did not look at how accurately each package filtered e-mail.

We are therefore revisiting the anti-spam issue with a more results-based review. We invited the same five vendors back for a head-to-head shootout to show the packages' accuracy in filtering unwanted e-mail while keeping as much useful e-mail as possible. All vendors accepted this challenge except Clearswift, which cited the imminent release of a newly redesigned application. We hope Clearswift will take part in the next accuracy review.

As you will see in this review, testing these packages for accuracy is a tricky business and to do so fairly and accurately took several months.

As detailed in the previous review, anti-spam filters can be set up in any number of ways, utilising black lists, white lists, and custom-made rule sets. Some applications come configured with basic rules; others come as a blank slate. Some also employ quite advanced learning techniques (touted by some vendors as heuristics or Bayesian analysis).

Not so simple
When we sat down to design this test and work out which results we could trust, several issues presented themselves. Spam and spammers are dynamic and constantly evolving, always looking to develop new techniques to get past the filters and deliver their message of "lose weight by eating more" and "XXX wholesale". The tests could therefore only be a snapshot of the period in which they were performed (naturally with some legacy "classic" spam messages thrown in for good measure). With this in mind, we collected a static data set over a couple of weeks to ensure that we had the "latest" in the spammers' arsenal.

In addition to running tests on this set of static data, we also needed to run the software on some live e-mail data to check that the products achieved similar results, since static test data may not be filtered in exactly the same way as mail in a live environment.

To do this, each vendor needed its own test rig so that the live tests could be run simultaneously. We therefore needed a domain name, subdomain records in the name servers, live public IP addresses and so on, all set up before the testing could commence.

The human factor
The last part of the testing, which is often the most difficult--certainly when you consider the rules-based nature of these applications--is the human factor. This is why we invited the individual vendors to send their own engineers to the Labs to install and configure the applications on the servers.

Sure, from a basic installation and administration point of view, the Labs staff could have installed the applications and configured the rule sets themselves, as they did in the previous review. However, this is a far cry from being an "expert" in each application.

It is one thing to do a usability test to ensure that a person with a reasonable level of technical competency can install and configure an application to get it running. That's nothing like the skill of an engineer working for the company that creates and maintains the application, who knows the many little nuances and tweaks needed to achieve the best possible results. Remember, these are not basic antivirus applications that you can just install and then keep current by downloading the latest definition file. The rules on many of these filtering systems are highly complex and evolved.

This is not particularly different from how it works in the real world, anyway. Because the anti-spam market is very competitive, vendors invest a great deal in keeping their products working efficiently. For instance, some vendors run training courses for your staff on the best ways to configure their product. And for your average medium-to-large installation, it's not at all out of the ordinary to have a technician come in to help you install and configure the product.

What we looked for
We designed this review around two tests. The first was a static, or controlled, test using content we had gathered over a period of time that included:

  • defined unwanted e-mail (spam),
  • unsolicited circulars/newsletters (news spam),
  • legitimate e-mails (ham), and
  • solicited circulars/newsletters (news ham).

This ran to more than 1800 items of mail, which we sent to each vendor's application. The static test was run at least twice to ensure accuracy.

The second test was a "live" test that combined several real-world e-mail boxes into one and then split that box out to each of the anti-spam filtering servers the vendors had configured. This test ran for over two weeks; we then took several days' worth of collected mail, manually went through each e-mail that had arrived, and sorted it according to its status.
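
The fan-out described above could be sketched roughly as follows. This is a minimal illustration, not the Labs' actual rig: it assumes each filter server receives its own copy of every message over SMTP, and the `fan_out` and `smtp_send` helpers and the idea of a `(host, recipient)` target list are hypothetical names introduced here.

```python
import smtplib

def smtp_send(host, message, recipient):
    """Deliver one message to one filter server over SMTP.

    The message needs a From header, which send_message uses as the sender.
    """
    with smtplib.SMTP(host) as server:
        server.send_message(message, to_addrs=[recipient])

def fan_out(messages, targets, send=smtp_send):
    """Send a copy of every message to every (host, recipient) target.

    `targets` is a hypothetical list with one entry per vendor test rig.
    Returns the number of deliveries attempted.
    """
    sent = 0
    for message in messages:
        for host, recipient in targets:
            send(host, message, recipient)
            sent += 1
    return sent
```

Passing the `send` function in as a parameter means the same loop can replay a stored corpus or be exercised without touching a real mail server.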

This live testing period was useful to confirm that the static testing was doing its job correctly in a controlled environment. Naturally, if any large differences occurred, then that application and the testing methodology would need to come under closer scrutiny to find out where and why the differences had occurred. Each test would basically act as a validation of the other--but as it turned out there were no discrepancies.
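
One way to picture this cross-check is to compare the rates from the two runs and flag any that drift apart. The sketch below is an assumption about how such a check might look, not the procedure the Labs followed: the count keys, the `rates` and `discrepancies` functions, and the 5 percent tolerance are all hypothetical.

```python
def rates(counts):
    """Return the spam catch rate and legitimate-mail delivery rate for one run."""
    return {
        "spam_caught": counts["spam_blocked"] / counts["spam_total"],
        "ham_delivered": counts["ham_delivered"] / counts["ham_total"],
    }

def discrepancies(static_counts, live_counts, tolerance=0.05):
    """Return the rates that differ between runs by more than `tolerance`.

    The tolerance is an illustrative value; the article does not state
    what size of difference would have counted as "large".
    """
    static_rates = rates(static_counts)
    live_rates = rates(live_counts)
    return {
        name: (static_rates[name], live_rates[name])
        for name in static_rates
        if abs(static_rates[name] - live_rates[name]) > tolerance
    }
```

An empty result corresponds to the outcome reported here, where the two tests agreed.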

Scoring
Once we had run through the static tests, we scored each package and calculated its overall total as follows:

    +1 point for each spam, e-mail, and solicited newsletter filtered correctly

    -2 points for every unwanted spam message allowed through (false negatives)

    -3 points for every unsolicited newsletter allowed through (false negatives) and

    -5 points for every legitimate e-mail blocked incorrectly (false positives).

The rationale behind this scoring is simple: spam allowed through is an annoyance, but legitimate e-mail blocked can have very serious repercussions. Ironically, it is the false negatives that are more likely to get administrators in trouble--especially if the boss receives a pornographic spam or the like--rather than the false positives, which can be a much more serious matter. But then how are people supposed to know they didn't receive an e-mail if they didn't receive it? While newsletters may be important, we acknowledge that they are more difficult to filter correctly and therefore have fewer points deducted for improper handling.
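
The scoring scheme can be sketched in a few lines of Python. This is a minimal illustration rather than the tool the Labs used: the category names, the `score` function, and the `(category, delivered)` representation are all assumptions. Two gaps are filled by assumption as well: a correctly blocked unsolicited newsletter earns +1 like any other spam, and a blocked solicited newsletter scores 0 because no penalty is listed for it.

```python
CORRECT_POINTS = 1

# Keys are (true category, was the message delivered?).
PENALTIES = {
    ("spam", True): -2,       # spam allowed through (false negative)
    ("news_spam", True): -3,  # unsolicited newsletter allowed through
    ("ham", False): -5,       # legitimate e-mail blocked (false positive)
    # Assumption: the article states no penalty for this case, so it scores 0.
    ("news_ham", False): 0,   # solicited newsletter blocked
}

# Categories that should reach the inbox.
WANTED = {"ham", "news_ham"}

def score(outcomes):
    """Total the points for an iterable of (category, delivered) pairs."""
    total = 0
    for category, delivered in outcomes:
        handled_correctly = delivered == (category in WANTED)
        if handled_correctly:
            total += CORRECT_POINTS
        else:
            total += PENALTIES[(category, delivered)]
    return total
```

With these weights, a single blocked legitimate e-mail cancels out five correctly handled messages, which reflects the priority the rationale gives to false positives.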

Live testing
As intended, the live testing did indeed confirm that the static/controlled test results were correct. Given the volume of messages sent via both methods, the live test results were essentially identical.

By its very nature, live testing introduces several variables that are potentially beyond our control, especially the "human" factor in counting and classifying the messages. Naturally, the live testing could only be run once.

Interestingly, the applications whose vendors said they apply "learning" principles to their filtering did indeed sometimes record different results during the static testing when the same data sets were sent through again. However, since the captured test data was limited to fewer than 2000 messages, the variation was not enough to show any great differences in the test results here. Still, this is a good sign that over the course of several months and thousands of messages, these packages may well get better at learning your e-mail patterns and filter better.

That said, these applications did not always produce better results when the "smarts" were activated. In a couple of cases, the results went the other way, but only by one or two messages, and we're confident that with a combination of learning and tweaking, you could improve the accuracy of filtering.


Talkback 1 comments

  1. Anonymous -- 24/01/04

    I am interested in knowing how the market is for spam filters in Australia, I work for FrontBridge Technologies (in Marina Del Rey, California, U.S.A.) and I recently followed up on an inquiry on us from a Sydney based company. As we spoke they mentioned that there are no perimeter based solutions doing business in Australia like ours.

    Just curious and thought the author of this article might like to discuss.

