Bayesian Spam Filters

So far (2003-09-11), SpamAssassin has stopped 39 spams and missed 8 (with no false positives). I would like to receive no spam at all. I could drop my threshold (the required_hits variable), but it's already at 5.0, and i don't want to get false positives.

I've read good things about these Bayesian spam filters, so i thought i might try one when i get >100 spam messages that have slipped past SpamAssassin. However, i know nothing about them; this page is a focus for my learning. And, yes, i know SpamAssassin has a Bayesian filtering option; i don't feel like using it!

The really important thing about these Bayesian filters is that they automatically deduce rules about spamicity from a human-classified pool of email, and that they recognise both spam and non-spam, and so can make a reliable distinction between the two. Contrast this to classical approaches, where there is a human-defined set of rules which are all of the 'this is spam' variety. Note that this approach isn't really Bayesian at all; some early work was sort of Bayesian if you looked at it in the right light, so it was called that, and the name has stuck.
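To make the "deduce rules from a human-classified pool" idea concrete, here is a minimal sketch in Python. It is a plain naive-Bayes-style scorer with add-one smoothing, not Graham's or Robinson's actual formulas, and every name in it (`tokenize`, `TinyFilter`, `train`, `score`) is made up for illustration. The point is only that the filter counts tokens in both spam and non-spam, so a word can count for a message as well as against it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokeniser (an assumption): lowercase runs of letters, digits, $ and '
    return re.findall(r"[a-z0-9$']+", text.lower())


class TinyFilter:
    """Hypothetical toy filter: learns token counts from hand-classified mail."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of non-spam messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def score(self, text):
        # Sum of log-likelihood ratios: above 0 leans spam, below 0 leans non-spam.
        total = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for tok in set(tokenize(text)):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.spam_counts[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_counts[tok] + 1) / (self.n_ham + 2)
            total += math.log(p_spam / p_ham)
        return total
```

After training on a few messages of each kind, `score()` comes out positive for text that looks like the spam pool and negative for text that looks like the non-spam pool; where to put the cut-off is a separate decision, much like choosing required_hits.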

Now, a little light reading:

A Bayesian Approach to Filtering Junk E-mail (1998-07), a paper from some boffins at Microsoft Research, which seems to have gone totally unnoticed
A Plan for Spam (2002-08), the article by Paul Graham which kicked off the current interest in Bayesian filtering; there is also an accompanying FAQ
Gary Robinson's notes about Spam Detection (2002-09-16, much updated), in which, amongst many other things, he makes the important and much-neglected point that none of this is really Bayesian at all
Better Bayesian Filtering (2003-01), Paul Graham's second article, with lots more stuff
A Statistical Approach to the Spam Problem (2003-03-01), Gary Robinson's detailed article on the statistics.
Paul Graham's own set of links on the subject

Paul Graham also maintains a helpful list of Bayesian spam filters.

Kristian Eide has done a Comparison of Bayesian spam filters, but amongst his requirements for the software were that it "support both Windows and Linux" and it "support POP3 proxying"; i have no need for Windows support, and certainly no need for POP3 proxying, but i do have a need for being able to run it from procmail. Hence, his comparison is of relatively little use to me.

spamprobe vs bayespam.

So, finally, the software. We have Bogofilter, SpamProbe, ifile, CRM114, SpamBayes, BMF, DSPAM and Annoyance Filter. Of these, it seems that Bogofilter is one of the most popular (being early, well-developed, technically sophisticated, and linked to ESR); SpamBayes is the most technically sophisticated of the Grahamite filters; DSPAM is also very sophisticated, but a bit shadowy; and CRM114 takes the prize for maximum full-on technological power. ifile gets an honourable mention for being very mature. SpamProbe, BMF and Annoyance Filter are probably all a bit rubbish.

Bogofilter

<http://bogofilter.sourceforge.net/>

Based on Graham's article with Robinson's improvements. Not clear if it uses the Graham improvements too. Handles plain text, HTML, MIME multipart, base64, quoted-printable and uuencoding. Written in C, uses Berkeley DB (BDB); fast.
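Why the list of encodings matters: a filter that tokenises the raw message sees only base64 gibberish unless it decodes the body first. The sketch below shows that decoding step using Python's standard `email` package. It has nothing to do with how Bogofilter itself does it (Bogofilter is written in C), `extract_text` is a made-up helper name, and HTML stripping and uuencoded blocks are not covered.

```python
import email
from email import policy


def extract_text(raw_bytes):
    """Return the decoded text of every text/* part of a raw message."""
    # policy.default gives EmailMessage objects, whose get_content()
    # undoes base64 / quoted-printable and applies the declared charset.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():  # visits the message and all of its MIME sub-parts
        if part.get_content_maintype() == "text":
            parts.append(part.get_content())
    return "\n".join(parts)
```

The returned string is what would then be handed to a tokeniser; a message with an unknown or mislabelled charset may still raise an error here, which a real filter would need to catch.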

Doesn't handle Asian (ie multibyte) character sets well.

Originally by ESR.

SpamProbe

<http://spamprobe.sourceforge.net/>

Based on Graham's article; no mention of any improvements. Written in C++, uses BDB; fast.

ifile

<http://www.nongnu.org/ifile/>

A general-purpose email filter. Predates Graham by a long chalk - very mature. Written in C.

CRM114

<http://crm114.sourceforge.net/>

The Controllable Regex Mutilator. Mad bad crazy insane - sparse binary polynomial hash style! Think of it as the Happy Fun Ball of Bayesian filters.

It's not a filter, it's a general framework for text classification schemes or something, but it comes with a filter that works well.
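For the curious, the "sparse binary polynomial hash" idea can be sketched roughly like this. It is based on the general published description of the technique, not on CRM114's source: slide a window of five tokens along the text and, for each new token, emit every combination of the earlier tokens in the window (each kept or skipped) followed by that token. `sbph_features` and the `<skip>` placeholder are made-up names, and the real thing hashes each feature rather than keeping strings.

```python
from itertools import product


def sbph_features(tokens, window=5):
    """Generate sparse phrase features from a list of tokens (illustrative only)."""
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        older = tokens[start:end]  # up to window-1 earlier tokens
        newest = tokens[end]
        # Every keep/skip mask over the older tokens; the newest token is always kept.
        for mask in product((False, True), repeat=len(older)):
            kept = [tok if keep else "<skip>" for tok, keep in zip(older, mask)]
            features.append(" ".join(kept + [newest]))
    return features
```

With a full window of five tokens, each new token yields 2^4 = 16 features, so the classifier sees phrases and near-phrases rather than single words. That goes some way to explaining both the power and the slowness.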

Slow.

Paul Graham is impressed.

Named, of course, after the CRM-114 discriminator from Dr Strangelove. Wing attack plan R!

SpamBayes

<http://spambayes.sourceforge.net/>

Has moved beyond Graham's approach - exciting new features and stuff. Seems to be the most technically advanced of the Grahamite filters.

Written in Python.

Annoyance Filter

Not terribly technically sophisticated, it seems, but it has hella backstory - it's written by the guy behind AutoCAD, who, it seems, is now so rich that he spends his time writing science fiction, searching for extraterrestrial intelligence in unlikely places, and writing spam-filtering software - all of which are mysteriously interconnected.