Another grumpy sysadmin


Mon, 12 Jul 2021

Spam Filtering on Postfix/Dovecot, pt 3: Per-user Adaptive filtering

In part 2, I got Postfix to check for obvious spam and reject it altogether.

For emails that are only probably spam, I want to filter them into the user's Junk folder. Also, I want the system to track email that the user manually moves into and out of their Junk folder, and use that to make the filter better over time. This type of filtering was popularised by Paul Graham in his 2002 essay A Plan for Spam.

One point mentioned in A Plan for Spam is:

each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters.

I have seen some suggest that Graham's point (a) is incorrect, because adaptive filters become more effective the more email they see. Therefore, a shared adaptive filter which sees all the email on a system will learn quicker and be more effective than one that sees only one user's email. This does make sense in the general case (although I have not seen any hard numbers comparing the approaches), but I disagree with the idea of using a shared filter, based on Graham's point (b): each user should have their own precise definition of spam.

Whose spam is it anyway?

For one thing, some people have a genuine interest in the things that other people might mark as spam. As I mentioned in part 1, I get a lot of buy-to-let spam, and I'd like my personal spam filter to bin all of it. But some people (even some honest, upstanding citizens) work in the buy-to-let industry, and would definitely not want email about that going to their Junk folder.

Another reason is that different people use spam filters in different ways. Some people sign up for mailing lists, on purpose, and then after some time simply become uninterested in those emails. Unfortunately, some of those mailing lists can be hard to unsubscribe from. Therefore some people "unsubscribe" from mailing lists by marking them as spam until those emails stop showing up in their Inbox.

There is an argument that this is "wrong". Those emails are not technically spam, because spam is unsolicited, and as the user signed up to that mailing list those emails don't qualify as spam. As a techie (and therefore a born pedant) I have some sympathy with that view.

On the other hand, it is the mark of a good tool that people adapt it to perform tasks unimagined by its original creator. If people create a workflow to manage their environment in a way that works for them, I don't think it's very helpful to tell them that they shouldn't do that. Firstly because it's generally ineffective (some people will do it anyway), but mostly because it gives rise to the kind of resentment users can have toward IT admins for rules that don't seem to benefit anyone, but still make their lives worse. At the very least, you should try to provide users with an alternative tool that solves their problems in a better way, before asking them to stop using the tool they already have in the way you don't like.

Therefore, if two people on the same system sign up to a mailing list, and one person wants to treat it as spam when the other doesn't, let them. Give each user their own spam filter.
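To make that concrete, here is a minimal sketch in Python of what "per-user" means in practice. It is a plain naive Bayes classifier with add-one smoothing, which is a simplification and not the exact formula from A Plan for Spam, and it is not how any of the tools discussed in this series are implemented. The names UserFilter, tokenize, alice and bob are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9'$-]+", text.lower())


class UserFilter:
    """A tiny naive Bayes classifier holding one user's training data."""

    def __init__(self):
        # Number of messages of each class that contained a given word.
        self.words = {"spam": Counter(), "ham": Counter()}
        # Number of messages of each class seen so far.
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message the user has classified as 'spam' or 'ham'."""
        self.words[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) with add-one smoothing, in log space."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in set(tokenize(text)):
            p_spam = (self.words["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# One independent filter per user (hypothetical user names).
filters = defaultdict(UserFilter)

message = "Exclusive buy-to-let mortgage deals for landlords"
filters["alice"].train(message, "spam")   # alice bins buy-to-let email
filters["alice"].train("Minutes from Tuesday's team meeting", "ham")
filters["bob"].train(message, "ham")      # bob works in buy-to-let
filters["bob"].train("You have won a free cruise", "spam")

new_mail = "New buy-to-let mortgage rates for landlords"
alice = filters["alice"].spam_probability(new_mail)
bob = filters["bob"].spam_probability(new_mail)
print(f"alice: {alice:.2f}, bob: {bob:.2f}")
```

The same incoming message scores high for alice and low for bob, because each filter has only ever seen its own user's decisions. A shared filter trained on both users' decisions would see the same words marked as both spam and ham, and would have no way to satisfy both of them.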

Dovecot Antispam

While I'm using Postfix to transfer email to and from the internet (the Mail Transfer Agent), it's not responsible for delivering and managing email on the local system. For that I'm using Dovecot, both as the local delivery agent and as the IMAP server which users connect to in order to read and manage their email.

Dovecot has a plugin, dovecot-antispam, which allows it to train a spam filter based upon users moving emails into and out of the Junk folder. This plugin has a few different backends for working with different spam filters.
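The idea of training on folder moves can be sketched in a few lines of Python. This is only an illustration of the decision being made, not the plugin's actual code: the function name training_action and the folder names are assumptions, and the sketch ignores special cases a real setup has to think about (such as messages deleted straight out of Junk).

```python
def training_action(source_folder, destination_folder, junk_folder="Junk"):
    """Return 'spam', 'ham', or None for a message moved between folders."""
    if source_folder == destination_folder:
        return None      # not really a move, nothing to learn
    if destination_folder == junk_folder:
        return "spam"    # the user says the filter missed this one
    if source_folder == junk_folder:
        return "ham"     # the user says the filter was wrong about this one
    return None          # moves between ordinary folders teach us nothing


for move in [("INBOX", "Junk"), ("Junk", "INBOX"), ("INBOX", "Archive")]:
    print(move, "->", training_action(*move))
```

Whatever comes back as "spam" or "ham" would then be handed, along with the message itself, to whichever spam filter sits behind the plugin.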

CRM114 - the Controllable Regex Mutilator

(Named after The CRM 114 Discriminator from Dr. Strangelove.)

First stop after installation, the crm(1) man page:

CRM114 is a language designed to write filters in. It caters to filtering email, system log streams, html, and other marginally human-readable ASCII that may occasion to grace your computer.

...followed by a lot of information about how the CRM programming language works.

Huh. So, CRM114 isn't a spam filter. It's a programming language for writing many types of filters - including spam filters. Anyway, right at the bottom of the man page, it says:

This manpage describes the crm114 utility as it has been described by QUICKREF.txt, shipped with crm114-20040212-BlameJetlag.src.tar.gz. The DESCRIPTION section is copy-and-pasted from INTRO.txt as distributed with the same source tarball.

Let's track down and have a look at INTRO.txt:

If you are reading this to get information on how to install CRM114 as a mailfilter, you have the _wrong_ document.

But fear not, we _do_ have the document you want. The document you want if you want to know how to install CRM114 as a mailfilter is:

    CRM114_Mailfilter_HOWTO.txt

*sigh* Fine. From The CRM114 & Mailfilter HOWTO:

CRM114 is a *language* designed to write text filters and classifiers in. It makes it easy to tweak code.

Mailfilter is just _one_ of the possible filters; there are many more out there and if Mailfilter doesn't do what you want, it's easy to create one that does.

Mailreaver is another one of the filters, with different (and better, I hope) designs, that can use Mailtrainer (yet another filter) to build even better statistics files.

There are yet other filters written in CRM114; you can read all about them on the web page:

    crm114.sourceforge.net

(and if you create one, and want to share it, put it on a web page and send me an email so I can add a pointer.)

...followed by a list of the 8 (yes, 8) major steps to using CRM114 mailfilter/mailreaver.

I just wanted a spam filter. I got a language for writing spam filters, two different spam filters with different designs, a third filter to do something slightly different, a link to a web page with even more filters on it, and the offer of not just configuring the filters I already have, but of actually rewriting them, or even of writing my own, in a language I've never used before.

I'm definitely getting some choice overload vibes here.

CRM114 mailreaver

The Mailfilter HOWTO quoted above seems to suggest that mailreaver is the preferred CRM114 spam filter, and the dovecot-antispam Configuration documentation includes a reference to it too, so let's skip trying to evaluate all the different possible CRM114 filters for now, and just go with that and see where it leads.

Looking closely, mailreaver is an actual program. When installed, it's marked as executable and includes a shebang line that allows it to run like any other Unix script. The problem is, it's installed outside of the normal PATH, doesn't come with its own man page, and doesn't provide usage information if run as /usr/share/crm114/mailreaver --help. That does not inspire confidence.

Then, from reading the HOWTO, configuring it is really awkward.

What-a-mess!

CRM114/mailreaver doesn't feel like a finished product to me.

It feels like a research project someone got working to scratch their own itch. Then, when it did what they needed, and they knew how to modify it to do anything else they wanted it to, they stopped improving it further. It seems that, despite its technical capabilities, no community formed around it to give it the push to become a genuinely usable bit of software for people who aren't particularly interested in the technology behind how it does what it does. Normally that push might come from either direct contributions from that community, or just from feedback where multiple users all describe the same problems.

It looks like CRM114 is a programming language that has had fewer than a dozen programs written in it - ever. With that in mind, I had a look around the project's release history, and I don't think it's been updated since 2009.

CRM114 may do a really good job of solving its author's problem, which is fine, but it's not going to work for me.

SpamAssassin's Bayesian Classifier to the rescue

Having been defeated by CRM114, I now need to find a different adaptive spam filter that I can plug into dovecot-antispam via its "pipe" backend.

Fortunately, SpamAssassin - which I already have installed and am using for definite global spam rejection - has such a filter in its Bayesian Classifier, which is trained using sa-learn.

I'll look at that in part 4.

posted at: 11:52
