Another grumpy sysadmin

Mon, 23 Aug 2021

Spam Filtering on Postfix/Dovecot, pt 4: Per-user Adaptive filtering with sa-learn

In part 3, I tried to get Dovecot with the dovecot-antispam plugin to enable per-user adaptive spam filtering with CRM114.

So, sometimes I have an idea about how I want to solve a problem that is apparently different from how everyone else does it. That tends to be awkward, because the documentation and write-ups other people have done aren't geared to my use case. Sometimes that's because I don't know what I'm doing yet; once I figure that out, the way everyone else does it makes sense and I end up doing that too. This time I tried not to have a strong opinion on how to set up dovecot-antispam, and instead let the documentation and examples guide me to the typical use case.

Unfortunately, that didn't work out.

Now I have to pick an adaptive filtering system that isn't CRM114, and make that work instead.

Choices, choices everywhere.

Looking through the Debian package repository for spam filters, it turns out that there are quite a few. I don't know much about most of them.

Fortunately, Debian has the Popularity Contest. It's an opt-in feature that allows Debian users to have their systems report which packages they have installed, so that the data can be aggregated to get an idea of how popular each package is. For each spam filter I could find, I had a look at the popcon numbers, and at which versions are present in the last three releases of Debian, to get an idea of whether they're still under active development:

Package       Popcon  Bullseye  Buster    Jessie    Notes
bmf                6  0.9.4     0.9.4     0.9.4
bogofilter    43,823  1.2.5     1.2.4     1.2.4     Dependency of evolution-plugin-bogofilter (popcon 31,690), which is recommended by evolution (popcon 44,217)
bsfilter         494  1.0.19    1.0.19    1.0.19
crm114           128  20100106  20100106  20100106
ifile              8  1.3.9     1.3.9     1.3.9
mailfilter        30  0.8.7     0.8.6     0.8.6
qsf                8  -         1.2.7     1.2.7
scmail             4  1.3.4     1.3.4     1.3.4
spamassassin   7,496  3.4.6     3.4.2     3.4.2
spambayes         95  -         1.1       1.1
spamoracle        13  1.6       1.4       1.4
spamprobe        426  1.4       1.4       1.4
sylfilter        286  0.8       0.8       0.8

bogofilter tops the popularity list by a huge margin. So much so in fact that I had to look into why that might have been, and discovered that it works as a plugin with a couple of email clients. Notably, it is installed by default with Evolution, which is very popular and included as part of some of the Desktop Environments shipped by Debian, which accounts for a large majority of its popcon numbers.

As far as non-email-client installs go, bogofilter and spamassassin seem to be in roughly the same ballpark of popularity as each other, and far ahead of all the other adaptive spam filters - including being 50× more popular than CRM114. They have also both received an update for the recent Debian bullseye release. And I already have spamassassin installed.
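For anyone who wants to check the ballpark, here is a rough sketch of the arithmetic using the popcon figures from the table. It assumes (and this is only an estimate, not something popcon reports) that every evolution-plugin-bogofilter install accounts for exactly one bogofilter install, so subtracting one from the other approximates the non-email-client installs.

```python
# Popcon figures copied from the table above.
popcon = {
    "bogofilter": 43_823,
    "evolution-plugin-bogofilter": 31_690,
    "spamassassin": 7_496,
    "crm114": 128,
}

# Assumption: each evolution-plugin-bogofilter install pulled in bogofilter,
# so the difference approximates "standalone" bogofilter installs.
bogofilter_standalone = popcon["bogofilter"] - popcon["evolution-plugin-bogofilter"]

# How many times more popular than CRM114 each candidate is (rounded).
spamassassin_vs_crm114 = round(popcon["spamassassin"] / popcon["crm114"])
bogofilter_vs_crm114 = round(bogofilter_standalone / popcon["crm114"])

print(bogofilter_standalone)    # 12133
print(spamassassin_vs_crm114)   # 59
print(bogofilter_vs_crm114)     # 95
```

Both ratios land comfortably above the 50× mark, and the two leaders are within a factor of two of each other.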

Setting up dovecot-antispam with SpamAssassin's sa-learn.

Although the pipe backend for dovecot-antispam is generic and needs to be told how to run the relevant training program, it's not actually that hard to do so. It also helps that there's a sample of all the required options in the man page, under Configuration. The entire configuration file I ended up with (/etc/dovecot/90-plugin-antispam) is:

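plugin {
    ## Generic options
    # Use the generic "pipe" backend, which feeds each message to an external program
    antispam_backend = pipe

    # Folders treated as trash and as spam, respectively
    antispam_trash = Trash
    antispam_spam = Junk

    ## Pipe plugin options
    # Training program, plus its extra arguments (separated by semicolons)
    antispam_pipe_program = /usr/bin/sa-learn
    antispam_pipe_program_args = --siteconfigpath=/etc/spamassassin/delivery;--local;--username=%u
    # Argument added when training a message as spam / as not spam
    antispam_pipe_program_spam_arg = --spam
    antispam_pipe_program_notspam_arg = --ham

    # Directory for the plugin's temporary files
    antispam_pipe_tmpdir = /tmp
}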

The only part that might need a bit of explanation there is the --siteconfigpath=/etc/spamassassin/delivery - because I want the local delivery spam options to be different from the global prefilter options. In particular, I want the global prefilter to have adaptive filtering disabled, whereas that's the whole point of the local delivery filter. But also, while I want the global prefilter to make use of online spam tests (AskDNS, SPF, etc.), I need --local to have the local delivery filter use local tests only - and you can't specify that in the config file!

Errata 2022-06-22:

This post originally had --configpath=/etc/spamassassin/delivery in the above - it should have been --siteconfigpath.
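To make the pipe options a little more concrete, here is a minimal sketch of the sa-learn command line that the configuration above should produce when a message is moved into or out of Junk. The function name, the user "alice", and the exact ordering of arguments (program, extra arguments, then the spam/ham argument) are assumptions for illustration; the plugin's own source is the authority on the details.

```python
def build_training_command(program, program_args, class_arg, username):
    """Assemble an argv list roughly the way the pipe backend is configured.

    program_args is the semicolon-separated string from the config file;
    %u is replaced by the user name (hypothetical simplification of
    Dovecot's variable expansion).
    """
    extra = [arg.replace("%u", username) for arg in program_args.split(";") if arg]
    return [program, *extra, class_arg]


PROGRAM = "/usr/bin/sa-learn"
PROGRAM_ARGS = "--siteconfigpath=/etc/spamassassin/delivery;--local;--username=%u"

# "alice" is a made-up user for the example.
spam_cmd = build_training_command(PROGRAM, PROGRAM_ARGS, "--spam", "alice")
ham_cmd = build_training_command(PROGRAM, PROGRAM_ARGS, "--ham", "alice")

print(" ".join(spam_cmd))
print(" ".join(ham_cmd))
```

The message itself is not part of the command line; the pipe backend hands it to the program separately.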

Are we nearly there yet?

There's still one piece missing from the puzzle. The dovecot-antispam plugin checks if email is moved into or out of the Junk folder, and runs the training program accordingly. But how does it check if incoming email is spam, and move it into the Junk folder?

It doesn't. You have to set that up separately.

Moving emails with sieve

Second part first - moving incoming spam into the Junk folder. This isn't too hard, if you already have sieve filtering enabled. You can configure a sieve_after directory and drop a sieve file in there to look for the SpamAssassin spam header and move the email appropriately. My /etc/dovecot/sieve-after/99-spamfilter.sieve script is just:

require ["fileinto","regex"];

if header :regex "X-Spam-Status" "^Yes"
{
    fileinto "Junk";
    stop;
}
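As a sanity check on what that rule matches, here is a small Python sketch of the same decision: look at the X-Spam-Status header and pick a folder. It is only an approximation of the sieve behaviour. The case-insensitive match reflects my understanding of sieve's default comparator, and "INBOX" stands in for whatever the rest of the sieve scripts would otherwise do with the message.

```python
import re
from email import message_from_string


def target_folder(raw_message):
    """Return "Junk" if any X-Spam-Status header starts with "Yes"."""
    msg = message_from_string(raw_message)
    for value in msg.get_all("X-Spam-Status", []):
        # Assumption: sieve's default comparator ignores ASCII case.
        if re.search(r"^Yes", value, re.IGNORECASE):
            return "Junk"
    # Assumption: no match means normal delivery (sieve's implicit keep).
    return "INBOX"


# Made-up sample messages for illustration.
spam = "X-Spam-Status: Yes, score=7.2 required=5.0\nSubject: hi\n\nbody\n"
ham = "X-Spam-Status: No, score=-1.0 required=5.0\nSubject: hi\n\nbody\n"
unscanned = "Subject: hi\n\nbody\n"

print(target_folder(spam))
print(target_folder(ham))
print(target_folder(unscanned))
```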

spamc/spamd, again.

The first part, checking if incoming email is spam with SpamAssassin, is trickier. Because of course it is.

The first wrinkle is that emails might (still) be larger than SpamAssassin's recommended limit of 500 kB. spamc skips anything over its size limit and hands it back unprocessed, so the emails have to go through spamc rather than spamassassin. The wrinkle turns into an actual problem when you realise that spamc has no way of passing configuration options (or a configuration path) to spamd. That means there's no way of getting spamc for per-user adaptive filtering to use different options than spamc for global rejection filtering.

The solution? You have to run two spamd daemons, with different configurations, and get the different invocations of spamc to talk to the right one.

As someone who didn't want to run one spamd daemon, this is getting annoying. But I'm far enough down this rabbit hole that I've kind of stopped caring, and don't really have the energy to go back and start evaluating yet another spam filter. So I set up a second daemon, with a different configuration file, and different command-line options (again, because some options can't be specified in the config file), listening on a different port, using a different PID file (because we really do want to run two instances at the same time), and that seems to be working.
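Here is a sketch of how the two daemons might differ on the command line. It is not the actual setup: the PID file paths are invented, port 783 is simply spamd's default, port 785 is taken from the spamc invocation later in this post, and passing --siteconfigpath and --local to the second daemon is an assumption carried over from the sa-learn options. The flags themselves (--daemonize, --port, --pidfile, --siteconfigpath, --local) are standard spamd options.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpamdInstance:
    port: int
    pidfile: str
    siteconfigpath: Optional[str] = None
    local_only: bool = False

    def argv(self) -> List[str]:
        """Build the spamd command line for this instance."""
        args = ["spamd", "--daemonize",
                f"--port={self.port}", f"--pidfile={self.pidfile}"]
        if self.siteconfigpath:
            args.append(f"--siteconfigpath={self.siteconfigpath}")
        if self.local_only:
            args.append("--local")
        return args


def check_no_clash(a: SpamdInstance, b: SpamdInstance) -> None:
    """Two instances can only coexist with distinct ports and PID files."""
    if a.port == b.port:
        raise ValueError("both daemons would listen on the same port")
    if a.pidfile == b.pidfile:
        raise ValueError("both daemons would share a PID file")


# Hypothetical PID file paths; adjust to taste.
prefilter = SpamdInstance(port=783, pidfile="/run/spamd-prefilter.pid")
delivery = SpamdInstance(port=785, pidfile="/run/spamd-delivery.pid",
                         siteconfigpath="/etc/spamassassin/delivery",
                         local_only=True)

check_no_clash(prefilter, delivery)
print(" ".join(prefilter.argv()))
print(" ".join(delivery.argv()))
```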

Making the parts finally fit together

The last part of the last problem is that Dovecot's delivery agent doesn't seem to have a way of passing incoming emails to another program like spamc to actually perform the spam check. And I only want to pass emails to spamc if they're about to be delivered to a local user - not if they're just passing through the email system. Fortunately, I can still fix that on the Postfix side by changing the delivery command from

mailbox_command = /usr/lib/dovecot/dovecot-lda -f "$SENDER" -a "$RECIPIENT"

to

mailbox_command = /usr/bin/spamc -p 785 -u "$USER" -e /usr/lib/dovecot/dovecot-lda -f "$SENDER" -a "$RECIPIENT"
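The shape of that pipeline, as a hedged Python sketch: the message goes through a filter, and whatever the filter returns is piped into the delivery command's standard input, which is what spamc's -e option does. Oversized messages skip the filter and are delivered as-is. The 500 * 1024 byte threshold, the function name, and the two stand-in commands are all assumptions for illustration; in the real setup spamc does both steps itself, and the stand-ins would be spamd and dovecot-lda.

```python
import subprocess
import sys

MAX_SIZE = 500 * 1024  # assumed size threshold, in bytes


def filter_then_deliver(message, filter_cmd, deliver_cmd, max_size=MAX_SIZE):
    """Scan the message unless it is too big, then pipe it to delivery."""
    if len(message) <= max_size:
        scanned = subprocess.run(filter_cmd, input=message,
                                 stdout=subprocess.PIPE, check=True).stdout
    else:
        scanned = message  # too big: pass through unscanned
    # check=True raises on failure; a simplification compared to spamc.
    delivered = subprocess.run(deliver_cmd, input=scanned,
                               stdout=subprocess.PIPE, check=True)
    return delivered.stdout


# Stand-in "filter" that prepends a made-up header, and a stand-in
# "delivery agent" that just echoes its input so the result is visible.
STAND_IN_FILTER = [sys.executable, "-c",
                   "import sys; sys.stdout.write('X-Demo-Scanned: yes\\n' + sys.stdin.read())"]
STAND_IN_DELIVER = [sys.executable, "-c",
                    "import sys; sys.stdout.write(sys.stdin.read())"]

message = b"Subject: hello\n\nbody\n"
print(filter_then_deliver(message, STAND_IN_FILTER, STAND_IN_DELIVER))
print(filter_then_deliver(message, STAND_IN_FILTER, STAND_IN_DELIVER, max_size=1))
```

The second call shows the oversized path: with a deliberately tiny limit, the message arrives at the delivery step without the extra header.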

That's it? Finally?

Pretty much. Yeah. I think I've finally got a setup that does what I originally wanted to do.

I have questions

This has felt like a tortuous and torturous path. Part of me thinks that this shouldn't be as hard as it has been. Spam filtering is a really important part of dealing with emails these days - and has been for a couple of decades. So why hasn't dealing with it been made more streamlined by now?

Part of me wonders if I'm just doing it completely wrong. That maybe there is a really straightforward way of setting this up that I've just missed entirely.

Even so, I have read some of the documentation for the tools I've worked with here, notably SpamAssassin, and I still have some big-picture questions:

Look, I'm a software developer. I know that features don't just happen and someone has to write them, and that every proposed feature starts with minus 100 points. But still, SpamAssassin is 20 years old, and it's not like these are new features that no-one is quite sure how they'd work. These are features that either already exist, or that people have already put a fair amount of work into improving. They're just not consistently available, or quite finished yet.

How long does that last layer of polish take though?

posted at: 14:31