Another grumpy sysadmin : /2021-07-12-spamfiltering-3-adaptive.html

In part 2, I got Postfix to check for obvious spam and reject it altogether.

For emails that are only probably spam, I want to filter them into the user's Junk folder. Also, I want the system to track email that the user manually moves into and out of their Junk folder, and use that to make the filter better over time. This type of filtering was first proposed by Paul Graham in his 2002 essay A Plan for Spam.

One point mentioned in A Plan for Spam is:

each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters.

I have seen some suggest that Graham's point (a) is incorrect, because adaptive filters become more effective the more email they see. Therefore, a shared adaptive filter which sees all the email on a system will learn quicker and be more effective than one that sees only one user's email. This does make sense in the general case (although I have not seen any hard numbers comparing the approaches), but I still disagree with using a shared filter, because of Graham's point (b): each user should have their own precise definition of spam.

Whose spam is it anyway?

For one thing, some people have a genuine interest in the things that other people might mark as spam. As I mentioned in part 1, I get a lot of buy-to-let spam, and I'd like my personal spam filter to bin all of it. But some people (even some honest, upstanding citizens) work in the buy-to-let industry, and would definitely not want email about that going to their Junk folder.

Another reason is that different people use spam filters in different ways. Some people sign up for mailing lists, on purpose, and then after some time simply become uninterested in those emails. Unfortunately, some of those mailing lists can be hard to unsubscribe from. Therefore some people "unsubscribe" from mailing lists by marking them as spam until those emails stop showing up in their Inbox.

There is an argument that this is "wrong". Those emails are not technically spam, because spam is unsolicited, and as the user signed up to that mailing list those emails don't qualify as spam. As a techie (and therefore a born pedant) I have some sympathy with that view.

On the other hand, it is the mark of a good tool that people adapt it to perform tasks unimagined by its original creator. If people create a workflow to manage their environment in a way that works for them, I don't think it's very helpful to tell them that they shouldn't do that. Firstly because it's generally ineffective (some people will do it anyway), but mostly because it gives rise to the kind of resentment users can have toward IT admins for rules that don't seem to benefit anyone, but still make their lives worse. At the very least, you should try to provide users with an alternative tool that solves their problems in a better way, before asking them to stop using the tool they already have in the way you don't like.

Therefore, if two people on the same system sign up to a mailing list, and one person wants to treat it as spam when the other doesn't, let them. Give each user their own spam filter.
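To make that concrete, here is a minimal Python sketch of two users disagreeing about the same kind of email. It is a toy naive Bayes classifier with add-one smoothing, not Graham's exact formula and not the code of any real filter; the UserFilter class, the tokeniser and the sample messages are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class UserFilter:
    """Token statistics for ONE user (hypothetical helper for illustration)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Crude tokeniser (an assumption): lowercase words, digits, ' $ and -
        return set(re.findall(r"[a-z0-9'$-]+", text.lower()))

    def train(self, text, label):
        # label is "spam" or "ham", e.g. set when the user moves a message
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        # Naive Bayes in log-odds form, with add-one smoothing so that
        # unseen tokens do not push the result to exactly 0 or 1.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# One independent filter per user, created on first use.
filters = defaultdict(UserFilter)

# Alice bins buy-to-let mail; Bob works in the industry and wants it.
filters["alice"].train("buy-to-let mortgage deals", "spam")
filters["alice"].train("lunch on friday", "ham")
filters["bob"].train("buy-to-let mortgage deals", "ham")
filters["bob"].train("cheap pills online", "spam")

new_mail = "new buy-to-let deals"
alice_score = filters["alice"].spam_probability(new_mail)
bob_score = filters["bob"].spam_probability(new_mail)
print(f"alice: {alice_score:.2f}  bob: {bob_score:.2f}")
```

The same message scores as likely spam for Alice and likely ham for Bob, which is exactly what a single shared filter can't do.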

Dovecot Antispam

While I'm using Postfix to transfer email to and from the internet (the Mail Transfer Agent), it's not responsible for delivering and managing email on the local system. For that I'm using Dovecot, both as the local delivery agent and as the IMAP server which users connect to in order to read and manage their email.

Dovecot has a plugin, dovecot-antispam, which allows it to train a spam filter based upon users moving emails into and out of the Junk folder. This plugin has a few different backends for working with different spam filters.

pipe - Sends (pipes) email to a spam filter. This is a generic backend that can use any spam filter, but you need to tell dovecot-antispam what program to run to classify emails, what parameters to pass it, and how to tell the filter if an email is spam or ham. It's flexible, but requires some careful configuration.
spool2dir - This is a generic backend like "pipe", but instead of sending emails to the spam filter, it drops them into different directories for spam/ham. You then have to set up "some other system" to monitor those directories and run the spam filter on any emails that show up there. That has all the complexity of the "pipe" backend, and then some more on top, so I'm avoiding that one.
dspam - is a spam filter which is not packaged for Debian, and by the looks of the project was last updated in 2012. I'd have to build it myself from sources. This is not appealing.
CRM114 - "a system to examine incoming e-mail [...] and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires." It's packaged for Debian, and dovecot-antispam has been adapted to work specifically with it. Let's have a closer look at that.
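For what it's worth, the idea behind the "pipe" backend can be sketched in a few lines of Python. This is only my illustration of the concept, not the plugin's actual code (dovecot-antispam is a Dovecot plugin, not a Python script). The train_via_pipe function, the stand-in trainer command and the --spam/--ham flags are placeholders I've made up for the example.

```python
import subprocess
import sys


def train_via_pipe(message_bytes, is_spam, command, spam_flag, ham_flag):
    """Feed one message to an external trainer on stdin.

    command is the trainer's command line as a list; spam_flag or ham_flag
    is appended to tell it which way the user moved the message.
    """
    flag = spam_flag if is_spam else ham_flag
    result = subprocess.run(
        [*command, flag],
        input=message_bytes,
        capture_output=True,
        check=True,  # raise if the trainer exits with an error
    )
    return result.stdout


# Stand-in "trainer" (an assumption, so the sketch runs anywhere):
# it just echoes the flag it was given and how many bytes it read.
stand_in = [
    sys.executable,
    "-c",
    "import sys; print(sys.argv[1], len(sys.stdin.buffer.read()))",
]

message = b"Subject: hello\n\nJust a test.\n"
moved_to_junk = train_via_pipe(message, True, stand_in, "--spam", "--ham")
moved_to_inbox = train_via_pipe(message, False, stand_in, "--spam", "--ham")
print(moved_to_junk, moved_to_inbox)
```

The "careful configuration" part is everything passed in as arguments here: which program to run, and which flags mean spam and ham for that particular filter.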

CRM114 - the Controllable Regex Mutilator

(Named after The CRM 114 Discriminator from Dr. Strangelove.)

First stop after installation, the crm(1) man page:

CRM114 is a language designed to write filters in. It caters to filtering email, system log streams, html, and other marginally human-readable ASCII that may occasion to grace your computer.

...followed by a lot of information about how the CRM programming language works.

Huh. So, CRM114 isn't a spam filter. It's a programming language for writing many types of filters - including spam filters. Anyway, right at the bottom of the man page, it says:

This manpage describes the crm114 utility as it has been described by QUICKREF.txt, shipped with crm114-20040212-BlameJetlag.src.tar.gz. The DESCRIPTION section is copy-and-pasted from INTRO.txt as distributed with the same source tarball.

Let's track down and have a look at INTRO.txt:

If you are reading this to get information on how to install CRM114 as a mailfilter, you have the _wrong_ document.

But fear not, we _do_ have the document you want. The document you want if you want to know how to install CRM114 as a mailfilter is:

CRM114_Mailfilter_HOWTO.txt

*sigh* Fine. From The CRM114 & Mailfilter HOWTO:

CRM114 is a *language* designed to write text filters and classifiers in. It makes it easy to tweak code.

Mailfilter is just _one_ of the possible filters; there are many more out there and if Mailfilter doesn't do what you want, it's easy to create one that does.

Mailreaver is another one of the filters, with different (and better, I hope) designs, that can use Mailtrainer (yet another filter) to build even better statistics files.

There are yet other filters written in CRM114; you can read all about them on the web page:

crm114.sourceforge.net

(and if you create one, and want to share it, put it on a web page and send me an email so I can add a pointer.)

...followed by a list of the 8 (yes, 8) major steps to using CRM114 mailfilter/mailreaver.

I just wanted a spam filter. What I got was a language for writing spam filters, two different spam filters with different designs, a third filter to do something slightly different, a link to a web page with even more filters on it, and the option of not just configuring the filters I already have, but of rewriting them, or even writing my own, in a language I've never used before.

I'm definitely getting some choice overload vibes here.

CRM114 `mailreaver`

The Mailfilter HOWTO quoted above seems to suggest that mailreaver is the preferred CRM114 spam filter, and the dovecot-antispam Configuration documentation includes a reference to it too, so let's skip trying to evaluate all the different possible CRM114 filters for now, and just go with that and see where it leads.

Looking closely, mailreaver is an actual program. When installed, it's marked as executable and includes a shebang line that allows it to run like any other Unix script. The problem is, it's installed outside of the normal PATH, doesn't come with its own man page, and doesn't provide usage information if run as /usr/share/crm114/mailreaver --help. That does not inspire confidence.

Then, from reading the HOWTO, configuring it is really awkward.

There's no default configuration file installed anywhere under /etc, where configuration files are normally found. There is an example configuration file in /usr/share/crm114/mailfilter.cf, but you need to edit it before you can use it. The complication here is that you can't edit that version because your changes will get overwritten if the package gets updated for any reason. So you need to copy it yourself to somewhere like /etc and edit that version.
The configuration file syntax seems unique to CRM114/mailreaver and is just strange. Most configuration files would allow you to set an option, like adding spam headers to scanned emails, with something like add_headers=yes. The way to do this with mailreaver is :add_headers: /yes/.
At least, that's one configuration file. Along with mailfilter.cf, there's also the rewrites.mfp configuration file, which is required but can be empty. If it's not empty, its syntax looks like yourname@yourdomain.yourplace>->MyEmailAddress, where ">->" is the "rewrite operator", whatever that is.
You also have to create the spam/ham statistics files, even though they also can be empty. mailreaver apparently can't create these files itself on first use if they don't exist, so when you create a new user on your mailserver, you have to create empty statistics files for them otherwise the filtering won't work.
The spam/ham statistics files have the extension .css, despite mailfilter being created around 2004, and Cascading Style Sheets (CSS) having been an important part of HTML4 when it was released in 1997, seven years and a dot-com bubble previously. According to The CRM114 FAQ it's because they were originally "CRM Sparse Spectra" files, and I know it's a petty complaint, but reusing a well-established file extension still bothers me.
A lot of the documentation insists that the configuration files and the statistics files have to be in the same directory. This means you couldn't set up a global configuration with per-user filters. Looking a lot further down through the Mailfilter HOWTO, when it finally gives a reference for the mailreaver command options, it does say that there is a separate --config option for using a file other than mailfilter.cf as the config file - but implies that only the file name, rather than a full path to the config file, can be given.
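Taken together, the points above mean every new user needs a directory prepared by hand before filtering can work. A rough Python sketch of that chore might look like the following. It goes purely by the description above and I haven't run it against a real mailreaver setup. The provision_user function is hypothetical, and the statistics file names spam.css and nonspam.css are my assumption; check the HOWTO for the names your version actually expects.

```python
import shutil
import tempfile
from pathlib import Path


def provision_user(user_dir, example_cf, stat_names=("spam.css", "nonspam.css")):
    """Prepare one user's filter directory (hypothetical helper).

    Copies the example config once, then makes sure rewrites.mfp and the
    statistics files exist, creating them empty if they are missing.
    """
    user_dir = Path(user_dir)
    user_dir.mkdir(parents=True, exist_ok=True)

    config = user_dir / "mailfilter.cf"
    if not config.exists():
        # Never overwrite a config the admin has already edited.
        shutil.copyfile(example_cf, config)

    for name in ("rewrites.mfp", *stat_names):
        (user_dir / name).touch(exist_ok=True)

    return sorted(p.name for p in user_dir.iterdir())


# Demo in a throwaway directory, with a fake example config standing in
# for /usr/share/crm114/mailfilter.cf.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example-mailfilter.cf"
    example.write_text(":add_headers: /yes/\n")
    created = provision_user(Path(tmp) / "alice", example)
    print(created)
```

Keeping everything in one per-user directory sidesteps the question of whether the config and statistics files can live apart, at the cost of one copy of the config per user.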

Looking back at the dovecot-antispam sample configuration for mailreaver, it looks like you can actually give a full path to a config file. Also, examining the mailreaver source code maybe backs that up, but the CRM114 language is unlike any other programming language I've ever seen, and at this point I'm fed up enough that I seriously can't be bothered to learn it sufficiently to figure it out for certain.

What-a-mess!

CRM114/mailreaver doesn't feel like a finished product to me.

It feels like a research project someone got working to scratch their own itch. Then, when it did what they needed, and they knew how to modify it to do anything else they wanted it to, they stopped improving it further. It seems that, despite its technical capabilities, no community formed around it to give it the push to become a genuinely usable bit of software for people who aren't particularly interested in the technology behind how it does what it does. Normally that push might come from either direct contributions from that community, or just from feedback where multiple users all describe the same problems.

It looks like CRM114 is a programming language that has had fewer than a dozen programs written in it - ever. With that in mind, I had a look around the project's release history, and I don't think it's been updated since 2009.

CRM114 may do a really good job of solving its author's problem, which is fine, but it's not going to work for me.

SpamAssassin's Bayesian Classifier to the rescue

Having been defeated by CRM114, I now need to find a different adaptive spam filter that I can plug into dovecot-antispam via its "pipe" backend.

Fortunately, SpamAssassin - which I already have installed and am using for definite global spam rejection - has such a filter in its Bayesian Classifier, sa-learn.

I'll look at that in part 4.

posted at: 11:52 | path: / | permanent link to this entry

Another grumpy sysadmin

Mon, 12 Jul 2021

Spam Filtering on Postfix/Dovecot, pt 3: Per-user Adaptive filtering

