Winning the War on spam: Comparison of Bayesian spam filters
Introduction

Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Various techniques exist to combat the problem: keyword-based filters, source blacklists, signature blacklists, source verification and combinations of these, to name a few. All of them have problems: keyword filters need to be constantly updated manually and are not very accurate; blacklists also need to be constantly updated, and will always lag behind spammers.

Fortunately, just as we seemed to be losing the war on spam, a new technique appeared on the scene after a paper by Paul Graham: Bayesian filters, our last, best hope for spam-free inboxes. Without going into details on how they work (more information can be found here and here), they are based on statistical methods which give a probability for an e-mail belonging to a given class (usually just two classes are used, spam and not-spam, but this is not a limitation of the technique, and indeed POPFile supports an arbitrary number of classes). The beauty of Bayesian filtering is that the filter can be trained by each individual user simply by categorizing each received e-mail as either spam or not-spam; after the user has categorized a few e-mails the filter will begin to make this categorization by itself, and usually with a very high level of accuracy. If the filter makes a mistake, the user re-categorizes the e-mail; the filter learns from its mistakes. No complicated maintenance is required after the filter is installed; it's so easy even grandma can use it.

Now, even though the basic technique is the same, several software packages exist and the problem is choosing between them; this was my problem when I decided to switch from a keyword-based filter to a Bayesian filter. No good comparisons existed, so I decided to do my own, and I wrote this review so that others can hopefully benefit from my testing as well.
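To make the idea of "a probability for each class, learned from the user's own categorizing" a bit more concrete, here is a minimal sketch in Python of a naive Bayes text classifier. It is only an illustration of the general technique: I make no claim that it matches what SpamBayes or POPFile do internally, and the class name `TinyBayes`, its methods and the sample messages are all made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayes:
    """Toy multinomial naive Bayes classifier (illustrative sketch only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.message_counts = Counter()          # class -> messages trained

    @staticmethod
    def tokenize(text):
        # Very crude tokenizer: lowercase runs of letters, digits, $ and '
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record one message the user has categorized as `label`."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def probabilities(self, text):
        """Return {class: probability}; assumes train() was called at least once."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label, n_messages in self.message_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_messages / total_messages)  # class prior
            for word in self.tokenize(text):
                # Laplace smoothing so an unseen word does not zero the product
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Made-up training data: the user categorizes four messages
bayes = TinyBayes()
bayes.train("cheap pills buy now", "spam")
bayes.train("win money now click here", "spam")
bayes.train("meeting agenda for monday", "not-spam")
bayes.train("lunch on monday?", "not-spam")

probs = bayes.probabilities("buy cheap money now")
best = max(probs, key=probs.get)
print(best, probs)
```

Nothing in the sketch is specific to two classes: calling `train()` with a third label simply adds a third class, which is the property POPFile builds on.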
Choosing filters to test

I started by making a list of requirements the filters had to meet:
The first requirement is there because I wanted the results to be applicable to everyone, and although I personally use Linux, most desktop users still run Windows. Also, for desktop users the most common way to download mail is still the POP3 protocol; many corporate desktops will use IMAP, but SpamBayes does support this option as well. Having an easy way to categorize e-mail is very important, as it is necessary to train the filter from time to time to maintain a high level of accuracy. After searching on the net I ended up with four filters which seemed to fulfill the above requirements: SpamBayes, POPFile, SpamTUNNEL and PASP.
I will give a more detailed review of the first two after presenting the test results, but unfortunately I was not able to include the last two in my test. I was not able to get SpamTUNNEL to work properly under Linux (it might work better on Windows), and it also seems to have no easy way to categorize e-mail (it requires the user to manually export mail and place it in "good" and "bad" directories). PASP requires manual editing of text files to set up, and also seems not to have an easy way to categorize e-mail. For these reasons I excluded SpamTUNNEL and PASP from my test, and was left with only SpamBayes and POPFile.

How I tested

Since Bayesian filters require training, their accuracy will increase with time. For this reason I decided to test over a period of one month (July 1st to July 31st). I consider myself a fairly normal e-mail user, and I don't receive hundreds of spams a day as some unfortunate people do. However, I do receive a fair amount, as can be seen below, and this test should give a very good indication of how these filters will behave for larger amounts of e-mail as well (see the review section for a couple of important points on this). During the test period I checked my e-mail normally via POP3, but through both SpamBayes and POPFile, and trained both once a day using the web-based interface, noting the accuracy. I had no problems with either package. After the first two weeks I decided to stir things up a bit by subscribing to a couple of mailing lists to see how well the filters handled this case; neither filter was fazed, and neither had any accuracy problems.

The results

There are two numbers which are important when comparing spam filters: the proportion of spams missed by the filter (the false-negative ratio) and the proportion of non-spams wrongly tagged as spam (the false-positive ratio).
Of these, the false-positive ratio is by far the most important; if one spam should happen to slip by the filter it is easy to just hit the delete button. However, should a normal mail be tagged as spam you won't even see it, and if it was something important this is obviously a very bad thing. One solution is to look at the spams from time to time, but this quickly becomes tedious and time-consuming if one receives a lot of spams, and somewhat defeats the purpose of spam filtering. Realizing that even Bayesian filtering is not always perfect, SpamBayes implements a solution to this which turns out to work very well in practice: instead of categorizing e-mails as either spam or not-spam, it adds a third category, unsure. Using the web-based configuration tool it is possible to set the cut-off values for when an e-mail goes from unsure to spam or not-spam; the default values seem to work well, so there should be no need to change them. As we shall see, using this method it should never be necessary to manually look at the mails classified as spam.
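The three-way split can be pictured as two cut-offs applied to the spam probability of a message. The sketch below is my own illustration in Python, not SpamBayes code: the function name `categorize`, its parameter names and the cut-off values 0.20 and 0.90 are assumptions chosen for the example, so check the SpamBayes configuration page for the values your installation actually uses.

```python
def categorize(spam_probability, not_spam_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to 'not-spam', 'unsure' or 'spam'.

    The cut-off values are illustrative assumptions, not confirmed defaults.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= spam_cutoff:
        return "spam"
    if spam_probability < not_spam_cutoff:
        return "not-spam"
    # Everything between the two cut-offs is left for the user to decide
    return "unsure"


for p in (0.02, 0.55, 0.97):
    print(p, categorize(p))
```

Moving the two cut-offs further apart sends more mail to "unsure" (more manual work, fewer outright mistakes); moving them closer together does the opposite.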
First I present the most important numbers:

[Graph: Mails wrongly classified as spam]
Although POPFile does fairly well, even one mail wrongly tagged as spam is one too many; one can never be sure an important mail will not be mistaken for spam. SpamBayes is far superior here, never mistaking a real e-mail for spam after the initial training period, because of its "unsure" feature; however, it should be noted that this also means the comparison is not entirely fair.
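For readers who want to compute the same two ratios from their own training log, here is a small Python sketch. The function `error_ratios` and the sample data are hypothetical (the numbers are made up and are not taken from my test); the point is to show how mails marked "unsure" fall outside both ratios unless they are explicitly counted as misses, which is why the comparison is not entirely fair.

```python
from collections import Counter


def error_ratios(results):
    """Compute error ratios from (actual, predicted) pairs.

    actual is 'spam' or 'not-spam'; predicted may also be 'unsure'.
    """
    tally = Counter(results)
    n_spam = sum(n for (actual, _), n in tally.items() if actual == "spam")
    n_normal = sum(n for (actual, _), n in tally.items() if actual == "not-spam")
    missed = tally[("spam", "not-spam")]
    missed_or_unsure = missed + tally[("spam", "unsure")]
    return {
        # normal mail wrongly tagged as spam
        "false_positive": tally[("not-spam", "spam")] / n_normal if n_normal else 0.0,
        # spam that slipped through as normal mail
        "false_negative": missed / n_spam if n_spam else 0.0,
        # same, but treating 'unsure' spam as missed
        "false_negative_incl_unsure": missed_or_unsure / n_spam if n_spam else 0.0,
    }


# Made-up sample: 40 spams and 60 normal mails
sample = (
    [("spam", "spam")] * 37
    + [("spam", "unsure")] * 2
    + [("spam", "not-spam")] * 1
    + [("not-spam", "not-spam")] * 59
    + [("not-spam", "unsure")] * 1
)
ratios = error_ratios(sample)
print(ratios)
```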
[Graph: Number of spams missed]
Again SpamBayes is far superior due to its "unsure" feature, never missing a single spam, but again the comparison is not entirely fair. To also give a more apples-to-apples comparison, in this graph of missed spams I have included the spams classified by SpamBayes as unsure:
We can see here that the basic Bayesian filter is in fact more or less equal in the two packages; however, in real-world use the results of the first two graphs are more important. Although not really important, I also present the following graph:

[Graph: Ratio of normal and spam mails received]
The jump in the number of normal mails after the first two weeks is due to my including a couple of mailing lists in the test, as mentioned in the "How I tested" section.

Ease-of-use: Classifying mail

Here I will give some comments on the day-to-day usability of both products:

SpamBayes v1.0a4

First I want to mention that I switched from SpamBayes v1.0a3 to SpamBayes v1.0a4 when it was released two weeks into my test; I didn't notice any significant changes.
SpamBayes, like POPFile, is controlled via an easy-to-use web interface. The most important screen, training of messages, looks like this:
Mails are divided into three sections: spam, ham (SpamBayes' name for non-spam mail) and unsure. As we saw from the test results, normally there should be no need to look at the spam and ham sections. Unfortunately, the unsure section is at the bottom (if any SpamBayes developers are reading this, the unsure section should really be at the top), so you have to scroll down to it to classify any unsure mails. The process is very quick and easy; just select either "spam" or "ham" using the radio buttons and click "Train". That's it!

Although I didn't test it, I should also note that SpamBayes supports the IMAP protocol, which is becoming more popular due to the advantages of storing the mail on a central server. SpamBayes also supports integration with Microsoft Outlook on Windows (also supported by POPFile via an external plugin).

POPFile v19.0

POPFile is also controlled via an easy-to-use web interface. The message training screen looks like this:
POPFile uses the concept of "buckets", of which you can have an arbitrary number. I used just two: "normal" and "spam". The "Filter by" option is very useful, as you can first look at all the mails classified as spam and check whether there are any normal mails among them; if there are, you use the "Classification" dropdown box to select the correct bucket. After doing this, hit the "Reclassify" button to train the filter. At the bottom there is a "Remove all" button which removes all displayed messages from the history. Afterwards one can look at the normal mails and see if there are any spams there.

The process is relatively easy, but not as quick as with SpamBayes; the default font is larger, which means more scrolling, and the dropdown box means two mouse clicks instead of one. Also, without filtering, the spams and normal mails are not grouped together, which makes it more difficult to check whether all mail is correctly classified. The filter feature of POPFile solves this problem, but again it means more mouse clicking than with SpamBayes.

POPFile also has a feature SpamBayes lacks: magnets. They are used to create simple rules (matching on from/to/subject) which always assign matches to a given bucket. This is useful for creating whitelists and somewhat alleviates the missing "unsure" feature, but normally a mail which SpamBayes classifies as unsure (and POPFile as spam) will be something a bit different from mails you have received before, and thus not likely to be whitelisted. Of course, it also means more manual work: I didn't use it and didn't miss it in SpamBayes.

Conclusions

Both products performed very well and managed to filter out almost all spams. As we can see from the graph of missed spams where the "unsure" spams from SpamBayes are included, the basic Bayesian filter engine in both products seems to be about equal.
What really separates the two products is the "unsure" classification feature of SpamBayes: with POPFile, you can never be sure that an important e-mail is not classified as spam, and you really have to manually look over the mail classified as spam to be sure. With SpamBayes you only have to look at the "unsure" mail, which is only a very small percentage of all mail. In fact, I have configured my mail client to automatically delete spam, but not do anything special with the unsure mail. Thus, I will get a spam in my inbox from time to time, but since it happens quite rarely it is easy to just hit the delete button in this case, and I can be confident I will never miss a normal mail.

My conclusion is that, although both products work very well, SpamBayes is currently the best solution to the spam problem. In fact, after I started using it I hardly notice there is a spam problem at all. The developers of SpamBayes really have created a very powerful weapon in the War on spam, and turned the tide from losing to winning; if every e-mail user started using this product I believe spammers would soon find themselves looking for real jobs!

My recommendation: SpamBayes

Of course, feel free to send me comments on this review.

Update 2003-08-14

After my review was posted to Slashdot I received a lot of positive feedback (thanks to all who wrote to me!). In this addendum to my review I will try to answer and comment on some issues which were raised by various people, but which I didn't mention in my original text.

"Unsure" bucket in POPFile

Several people asked why I didn't simply create a third bucket in POPFile for "unsure" mail, but unfortunately this wouldn't work the same way as SpamBayes' "unsure" feature. The reason is that, in reality, a message is never "unsure"; it's either spam or not. "Unsure" is not a classification, it's the filter's way of saying it doesn't know which classification to assign to the message; thus it wouldn't make sense to assign a message to the "unsure" bucket. Since POPFile assigns messages to buckets based on previous messages assigned to that bucket, no messages would ever be assigned to the "unsure" bucket. Now, although it is not possible to emulate the SpamBayes "unsure" feature in the current version of POPFile, I do hope the developers find it worthwhile to add this feature to a future version. POPFile is actually a quite well-made and polished product, and the lack of said feature is the only major complaint I have.

Multiple buckets in POPFile

Several people commented that POPFile's support for multiple buckets can actually be a very useful feature for sorting your normal mail. For example, if you work in a large company and receive mail from people in several different departments and want to sort your mail by department, it can be a lot of work to create static rules for this, and with hundreds of people in each department it may not even be practical. With a bucket for each department in POPFile, the filter can pick up on subtle hints (SMTP server used, sender domain name, name of department in signature etc.) and in most cases correctly determine from which department the mail originates. Thus, the best current solution may actually be to use both SpamBayes and POPFile - SpamBayes as a first line of defense for filtering out the bulk of the spam, and then POPFile for sorting afterwards.
Since both function as POP3 proxies, it's easy to set up the mail client to connect to POPFile, POPFile to connect to SpamBayes and SpamBayes to connect to your real POP3 server. Of course, it's a bit more work than only using one of them, but still better than wading through hundreds of spams looking for false positives in POPFile, I would think.

POPFile magnets

Another feature which SpamBayes lacks is the POPFile magnets, which are simple rules used to create whitelists, usually to avoid classifying mail from friends and family as spam (here integration with the mail client can be very useful, as the address book can be used as the whitelist). My personal opinion is that the filter should be good enough to correctly filter mail whether it comes from people you know or not; however, I can understand that many people consider this an important feature, and it is not something I think would be hard for the SpamBayes developers to add should they choose to do so.

POPFile accuracy

In my test, POPFile achieved a total accuracy of 97.75%, which I consider quite good. However, I got several responses from POPFile users reporting accuracy ratings around 99%. Now, I did get the impression that my accuracy rating was climbing during the last week of my test, but it may also be the case that the version of POPFile I tested, 19.0, can sometimes be less accurate than the previous version, 18.1 (note that this is just a guess, as I haven't done a comparison). Since Bayesian filtering is based on statistics and probabilities there is no one correct way to do things, and only real-world testing can evaluate the quality of the filter. SpamBayes puts a lot of emphasis on testing and includes extensive testing facilities (command-line based, but they are only meant for use by developers and advanced users); I don't know to what degree POPFile does this, but it would seem POPFile could benefit from more automatic testing for evaluating the effect of changes.
The Mozilla mail client

Several people mentioned that the Mozilla mail client actually includes a Bayesian filter, and, of course, Mozilla is cross-platform. However, for people not already using Mozilla for their mail, switching mail client is a big step and not everyone will be comfortable with Mozilla (I prefer KMail for various reasons I won't go into here). I have tried the version of Mozilla, 1.3.1, included with my Linux distribution; unfortunately the spam filter feature didn't seem to be completely functional there, so I won't comment on it specifically. However, there is no doubt that better integration with the mail client makes for much easier day-to-day operation; simply hitting the "spam" button (or "Junk" as Mozilla calls it) is even easier than visiting a webpage, and I believe better integration is key to getting more people to use filters.

The Opera M2 mail client

The latest release of Opera includes the "revolutionary" M2 mail client, which supposedly includes a Bayesian filter. I did a quick evaluation of M2, and although the spam filter works, it produces a lot of false positives. Also, I didn't find a way to provide feedback to the filter (!), which of course limits its usefulness. Hopefully this will improve in the next version; however, M2 is quite revolutionary in other aspects. Some comments: M2 doesn't use mail folders, but stores all mail in a flat database. You create "views", which are really just a set of rules for which messages should be included in the view; views can also be inherited.
You can also assign a label to each message (the labels seem fixed in the current M2 version, but hopefully it will be possible to edit the labels in a future version; it would also be useful to be able to assign more than one label to each message), but unfortunately the views cannot filter on labels yet (nor is it possible to filter on arbitrary mail headers; this restriction may be because of speed considerations, but it does somewhat limit the usefulness, as mailing lists often include a List-id: header, for example). M2 also lacks some other useful features; for example, I couldn't find a way to view the raw message source, and some features related to encryption are also missing.

Now, I really like Opera as a web browser, and I do believe the flat database approach is the next step in mail client evolution (other mail clients are already starting to implement "views" or "virtual folders", but M2 is designed with this in mind from the start), and I do believe M2 has the potential to, perhaps not revolutionize, but at least significantly improve the way we deal with the ever-increasing amounts of e-mail. I will definitely take a look at the new version when it comes out of beta.

Revisions
Kristian Eide <kreide@online.no> Last modified: 2004-03-04 16:19:03 +0100 (Thu, 04 Mar 2004)