Blog

Monday, 04 October

16:31

Email autoconfiguration, standards, and legacy solutions [Another grumpy sysadmin]

Setting up your email, either on your phone, or on a traditional desktop mail client (yes, some of us still use those), used to be a real pain. You'd need to find the documentation from your email provider that told you which protocols they supported, which servers to connect to (that can be different for sending and receiving email), which port numbers to use, whether to use a secure connection (!), and then figure out where on your email client you needed to put all this information.

These days, it's much easier. There's a standard, RFC-6186, which email providers can use to publish the relevant settings as part of their DNS records. That way, email clients can look all of that information up just from the domain portion of your email address. Simples.
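As a rough illustration of what a client does with RFC-6186, the Python sketch below derives the SRV record names to look up from an email address, and picks a server from records that have already been fetched. The DNS lookup itself is left out (it would need a resolver library), the weight field is ignored for simplicity, and the function names are made up for this example.

```python
from typing import List, NamedTuple, Optional

# Service labels defined by RFC 6186 for mail submission and mail access.
RFC6186_SERVICES = ("_submission", "_imaps", "_imap", "_pop3s", "_pop3")


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def srv_names(email_address: str) -> List[str]:
    """Return the SRV names a client would query for this address."""
    _local, sep, domain = email_address.rpartition("@")
    if not sep or not domain:
        raise ValueError("not an email address: %r" % email_address)
    return ["%s._tcp.%s" % (service, domain.lower()) for service in RFC6186_SERVICES]


def parse_srv(rdata: str) -> SrvRecord:
    """Parse SRV data in presentation format: 'priority weight port target'."""
    priority, weight, port, target = rdata.split()
    # A target of "." means the service is not offered; it becomes "" here.
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))


def best_record(records: List[SrvRecord]) -> Optional[SrvRecord]:
    """Pick the usable record with the lowest priority number (weight ignored)."""
    usable = [record for record in records if record.target]
    return min(usable, key=lambda record: record.priority) if usable else None
```

For alice@example.com this would query names such as _imaps._tcp.example.com and _submission._tcp.example.com, and the port and target of the chosen record become the server settings.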

So why am I writing about it?

Well, before RFC-6186 was created, different email clients came up with their own methods of autoconfiguration. Normally, this was done by publishing the details, in a format they invented, in a special place on a website associated with the email domain.

Now, while the recent versions of these email clients are mostly capable of automatically configuring themselves with RFC-6186 style information, old versions of those email clients still exist in the wild. Also, users who are unable (for whatever reason) to upgrade to newer versions of software - like email clients - are probably the ones you want to be able to help the most. Therefore, supporting these legacy autoconfiguration systems would be really useful.

Don't Repeat Yourself

One principle widely regarded as important, or a "best practice" in software development and system administration, is Don't Repeat Yourself. Briefly, if you have multiple copies of some code or data somewhere, then if they ever need to change it's possible that they won't all be changed together. Copies can get out of synchronisation. Therefore, each piece of code or data should have one canonical location, and each other use should refer to that same one. That way, it's impossible for copies to get out of sync.

Hence, when setting up legacy autoconfiguration systems, they should not require their own copies of the email settings, but load the settings from the canonical RFC-6186 data as-needed.

Outlook and Thunderbird

The two most commonly-used legacy autoconfiguration methods I'm aware of (and the two I actually wanted to support) are those of Microsoft's Outlook and Mozilla's Thunderbird.

Outlook's legacy autoconfiguration details can be found here, whereas Thunderbird's can be found here. They both involve fetching an XML file containing the settings from a given location on a website, but each uses a different XML schema, a different location, and sometimes a different website.

Fortunately, it's possible to generate XML files dynamically, while loading external information (like RFC-6186 records), without too much difficulty.
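To give a feel for what "generating the XML dynamically" means, here is a minimal Python sketch that builds a Thunderbird-style autoconfig document from a list of server settings. It is not emailautoconf's code. The element names follow my understanding of Mozilla's autoconfig format and should be checked against their documentation, and the servers argument stands in for whatever was loaded from the RFC-6186 records.

```python
import xml.etree.ElementTree as ET


def thunderbird_autoconfig(domain, servers):
    """Build a Thunderbird-style autoconfig document as a string.

    `servers` is a list of dicts with the (hypothetical) keys
    "type" ("imap", "pop3" or "smtp"), "host", "port" and "socket".
    """
    root = ET.Element("clientConfig", version="1.1")
    provider = ET.SubElement(root, "emailProvider", id=domain)
    ET.SubElement(provider, "domain").text = domain
    for server in servers:
        # SMTP is the outgoing server; everything else is for receiving mail.
        tag = "outgoingServer" if server["type"] == "smtp" else "incomingServer"
        node = ET.SubElement(provider, tag, type=server["type"])
        ET.SubElement(node, "hostname").text = server["host"]
        ET.SubElement(node, "port").text = str(server["port"])
        ET.SubElement(node, "socketType").text = server["socket"]
        # Placeholder the client replaces with the user's full address.
        ET.SubElement(node, "username").text = "%EMAILADDRESS%"
        ET.SubElement(node, "authentication").text = "password-cleartext"
    return ET.tostring(root, encoding="unicode")
```

The point is that the settings are passed in rather than written into the template, so there is still only one canonical copy of them.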

emailautoconf

Enter emailautoconf, which I put together to generate those files in that way.

If you own the domain example.com and have set up email for it, have configured your RFC-6186 records appropriately, and are running a website at https://example.com/ (note: no www.) with the Apache2 web server and PHP, then setting up emailautoconf can be as easy as grabbing a copy of the sources, installing it with sudo make install, and adding the following line to your Apache2 configuration file:

Include /usr/local/share/emailautoconf/apache2/main.conf

More information can be found in the README, including about how to configure it on the autoconfig and autodiscover subdomains which are also specified by those configuration systems.

Note that while emailautoconf currently only supports the Outlook and Thunderbird legacy autoconfiguration systems, and the Apache2 web server, that's only because those are all I needed right now. I'm open to supporting other systems or webservers, if anyone is interested in contributing patches.

Monday, 23 August

14:31

Spam Filtering on Postfix/Dovecot, pt 4: Per-user Adaptive filtering with sa-learn [Another grumpy sysadmin]

In part 3, I tried to get Dovecot with the dovecot-antispam plugin to enable per-user adaptive spam filtering with CRM114.

So, sometimes I have an idea about how I want to solve a problem, which is apparently different from how everyone else does it. That tends to be awkward, because all the documentation and write-ups other people have done aren't geared to my use case. This can be because I don't know what I'm doing, but when I figure that out, the way everyone else does it makes sense and I end up doing that. This time I tried not to have a strong opinion on how to set up dovecot-antispam, and instead let the documentation and examples guide me to the typical use case.

Unfortunately, that didn't work out.

Now I have to pick an adaptive filtering system that isn't CRM114, and make that work instead.

Choices, choices everywhere.

Looking through the Debian package repository for spam filters, it turns out that there are quite a few. I don't know much about most of them.

Fortunately, Debian has the Popularity Contest. It's an opt-in feature that allows Debian users to have their systems report which packages they have installed, so that data can be aggregated to get an idea for how popular each package is. For each spam filter I could find, I had a look at the popcon numbers, and which versions are present in the last three releases of Debian to get an idea of whether they're still under active development:

Package       Popcon  Bullseye  Buster    Jessie    Notes
bmf                6  0.9.4     0.9.4     0.9.4
bogofilter    43,823  1.2.5     1.2.4     1.2.4     Dependency of evolution-plugin-bogofilter (popcon 31,690), which is recommended by evolution (popcon 44,217)
bsfilter         494  1.0.19    1.0.19    1.0.19
crm114           128  20100106  20100106  20100106
ifile              8  1.3.9     1.3.9     1.3.9
mailfilter        30  0.8.7     0.8.6     0.8.6
qsf                8  -         1.2.7     1.2.7
scmail             4  1.3.4     1.3.4     1.3.4
spamassassin   7,496  3.4.6     3.4.2     3.4.2
spambayes         95  -         1.1       1.1
spamoracle        13  1.6       1.4       1.4
spamprobe        426  1.4       1.4       1.4
sylfilter        286  0.8       0.8       0.8

bogofilter tops the popularity list by a huge margin. So much so in fact that I had to look into why that might have been, and discovered that it works as a plugin with a couple of email clients. Notably, it is installed by default with Evolution, which is very popular and included as part of some of the Desktop Environments shipped by Debian, which accounts for a large majority of its popcon numbers.

As far as non-email-client installs go, bogofilter and spamassassin seem to be in roughly the same ballpark of popularity as each other, and far ahead of all the other adaptive spam filters - including being 50× more popular than CRM114. They have also both received an update for the recent Debian bullseye release. And I already have spamassassin installed.
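For a rough check of that ballpark, the arithmetic can be sketched in Python using the popcon numbers from the table. The big assumption, which popcon does not report directly, is that every evolution-plugin-bogofilter install also counts as a bogofilter install, so subtracting one from the other approximates the non-email-client installs.

```python
# Popcon figures copied from the table above.
popcon = {
    "bogofilter": 43_823,
    "evolution-plugin-bogofilter": 31_690,
    "spamassassin": 7_496,
    "crm114": 128,
}

# Assumption: installs pulled in by the Evolution plugin can simply be subtracted.
standalone_bogofilter = popcon["bogofilter"] - popcon["evolution-plugin-bogofilter"]

bogofilter_vs_spamassassin = standalone_bogofilter / popcon["spamassassin"]
spamassassin_vs_crm114 = popcon["spamassassin"] / popcon["crm114"]
bogofilter_vs_crm114 = standalone_bogofilter / popcon["crm114"]

print("standalone bogofilter installs: %d" % standalone_bogofilter)
print("bogofilter / spamassassin: %.1f" % bogofilter_vs_spamassassin)
print("spamassassin / crm114: %.1f" % spamassassin_vs_crm114)
print("bogofilter / crm114: %.1f" % bogofilter_vs_crm114)
```

Under that assumption the two leaders end up within a factor of two of each other, and both are well over 50 times ahead of CRM114.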

Setting up dovecot-antispam with SpamAssassin's sa-learn.

Although the pipe backend for dovecot-antispam is generic and needs to be told how to run the relevant training program, it's not actually that hard to do so. It also helps that there's a sample of all the required options in the man page, under Configuration. The entire configuration file I ended up with (/etc/dovecot/90-plugin-antispam) is:

plugin {
    ## Generic options
    antispam_backend = pipe

    antispam_trash = Trash
    antispam_spam = Junk

    ## Pipe plugin options
    antispam_pipe_program = /usr/bin/sa-learn
    antispam_pipe_program_args = --siteconfigpath=/etc/spamassassin/delivery;--local;--username=%u
    antispam_pipe_program_spam_arg = --spam
    antispam_pipe_program_notspam_arg = --ham

    antispam_pipe_tmpdir = /tmp
}

The only part that might need a bit of explanation there is the --siteconfigpath=/etc/spamassassin/delivery - because I want the local delivery spam options to be different from the global prefilter options. In particular, I want the global prefilter to have adaptive filtering disabled, whereas that's the whole point of the local delivery filter. But also, while I want the global prefilter to make use of online spam tests (AskDNS, SPF, etc...), I need --local to have the local delivery filter use local tests only - and you can't specify that in the config file!

Errata 2022-06-22:

This post originally had --configpath=/etc/spamassassin/delivery in the above - it should have been --siteconfigpath.
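To make it clearer what the plugin configuration above asks Dovecot to run, here is a Python sketch that assembles the resulting sa-learn command line. The assumptions are that the arguments are split on ";", that %u is replaced with the username, and that the spam or ham argument is appended last. That ordering is a guess based on the man page sample, not something checked against the plugin source.

```python
PROGRAM = "/usr/bin/sa-learn"
PROGRAM_ARGS = "--siteconfigpath=/etc/spamassassin/delivery;--local;--username=%u"
SPAM_ARG = "--spam"
NOTSPAM_ARG = "--ham"


def sa_learn_argv(username, is_spam):
    """Return the argument vector the pipe backend would (presumably) execute."""
    # Arguments are separated by ";" in the Dovecot configuration.
    args = [arg.replace("%u", username) for arg in PROGRAM_ARGS.split(";")]
    # Assumption: the spam/ham flag is appended after the fixed arguments.
    return [PROGRAM] + args + [SPAM_ARG if is_spam else NOTSPAM_ARG]


if __name__ == "__main__":
    print(" ".join(sa_learn_argv("alice", True)))
```

The email being moved into or out of Junk is then piped to that command on standard input.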

Are we nearly there yet?

There's still one piece missing from the puzzle. The dovecot-antispam plugin checks if email is moved into or out of the Junk folder, and runs the training program accordingly. But how does it check if incoming email is spam, and move it into the Junk folder?

It doesn't. You have to set that up separately.

Moving emails with sieve

Second part first - moving incoming spam into the Junk folder. This isn't too hard, if you already have sieve filtering enabled. You can configure a sieve_after directory and drop a sieve file in there to look for the SpamAssassin spam header and move the email appropriately. My /etc/dovecot/sieve-after/99-spamfilter.sieve script is just:

require ["fileinto","regex"];

if header :regex "X-Spam-Status" "^Yes"
{
    fileinto "Junk";
    stop;
}
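For anyone who doesn't read sieve, the same decision can be sketched in Python. This is only an illustration of the logic, not something Dovecot runs. The function name and the INBOX default are made up for the example, and the case-insensitive match reflects my understanding that sieve's default comparator ignores case.

```python
import re
from email import message_from_string


def destination_folder(raw_message):
    """Return "Junk" if SpamAssassin flagged the message, otherwise "INBOX"."""
    message = message_from_string(raw_message)
    status = message.get("X-Spam-Status", "")
    # Same test as the sieve script: the header value starts with "Yes".
    if re.search(r"^Yes", status, re.IGNORECASE):
        return "Junk"
    return "INBOX"
```

A message with no X-Spam-Status header at all (for example, one that was never scanned) falls through to the normal delivery path, just as it does in the sieve script.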

spamc/spamd, again.

The first part, checking if incoming email is spam with SpamAssassin, is trickier. Because of course it is.

The first wrinkle is that emails might (still) be larger than SpamAssassin's recommended limit of 500kB. Therefore you can't pass the emails to spamassassin, but have to pass them to spamc instead. The wrinkle turns into an actual problem when you realise that spamc has no way of passing configuration options (or a configuration path) to spamd. That means there's no way of getting spamc for per-user adaptive filtering to use different options than spamc for global rejection filtering.

The solution? You have to run two spamd daemons, with different configurations, and get the different invocations of spamc to talk to the right one.

As someone who didn't want to run one spamd daemon, this is getting annoying. But I'm far enough down this rabbit hole that I've kind of stopped caring, and don't really have the energy to go back and start evaluating yet another spam filter. So I set up a second daemon, with a different configuration file, and different command-line options (again, because some options can't be specified in the config file), listening on a different port, using a different PID file (because we really do want to run two instances at the same time), and that seems to be working.
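The difference between the two daemons boils down to a handful of command-line options, which can be sketched as a Python function that builds each spamd command line. The port numbers match the ones used in this series (783 is spamd's default, 785 is the one the delivery spamc connects to), but the PID file paths and the global configuration directory are hypothetical and only here for illustration.

```python
def spamd_argv(port, pidfile, siteconfigpath, local_only):
    """Build a spamd command line for one of the two instances."""
    argv = [
        "/usr/sbin/spamd",
        "--daemonize",
        "--port=%d" % port,
        "--pidfile=%s" % pidfile,
        "--siteconfigpath=%s" % siteconfigpath,
    ]
    if local_only:
        # Local tests only; this cannot be set from the configuration file.
        argv.append("--local")
    return argv


# Global rejection filter: online tests allowed. Paths are hypothetical.
global_spamd = spamd_argv(783, "/run/spamd-global.pid", "/etc/spamassassin/global", False)

# Per-user delivery filter: local tests only, separate port and PID file.
delivery_spamd = spamd_argv(785, "/run/spamd-delivery.pid", "/etc/spamassassin/delivery", True)
```

Each spamc invocation then only needs to be pointed at the right port to reach the right daemon.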

Making the parts finally fit together

The last part of the last problem is that Dovecot's delivery agent doesn't seem to have a way of passing incoming emails to another program like spamc to actually perform the spam check. And I only want to pass emails to spamc if they're about to be delivered to a local user - not if they're just passing through the email system. Fortunately, I can still fix that on the Postfix side by changing the delivery command from

mailbox_command = /usr/lib/dovecot/dovecot-lda -f "$SENDER" -a "$RECIPIENT"

to

mailbox_command = /usr/bin/spamc -p 785 -u "$USER" -e /usr/lib/dovecot/dovecot-lda -f "$SENDER" -a "$RECIPIENT"

That's it? Finally?

Pretty much. Yeah. I think I've finally got a setup that does what I originally wanted to do.

I have questions

This has felt like a tortuous and torturous path. Part of me thinks that this shouldn't be as hard as it has been. Spam filtering is a really important part of dealing with emails these days - and has been for a couple of decades. So why hasn't dealing with it been made more streamlined by now?

Part of me wonders if I'm just doing it completely wrong. That maybe there is a really straightforward way of setting this up that I've just missed entirely.

Even so, I have read some of the documentation for the tools I've worked with here, notably SpamAssassin, and I still have some big-picture questions:

  • If spamc is meant to be a drop-in replacement for spamassassin, why can't spamc take the same command-line options as spamassassin? Why doesn't spamc have e.g. a --local option?

  • If spamc is meant to be a drop-in replacement for spamassassin, why can't spamassassin take the same command-line options as spamc? Why doesn't spamassassin have e.g. a --max-size option?

  • If spamassassin has been written, by the developers, to be able to work with multiple predefined configurations, as evidenced by the existence of the --siteconfigpath option, why can't spamd work with multiple predefined configurations?

  • Why can't all options be specified (or at least have their defaults set) in predefined configuration files? e.g. there appears to be no config file equivalent of spamd's --local option to use local tests only, or its --nouser-config to prevent trying to load a per-user configuration.

  • Why hasn't the slow startup time been fixed? Python has __pycache__. Emacs has dump/unexec (for its sins). Most OpenGL implementations have a shader cache.

  • What's with sa-compile?

    sa-compile uses "re2c" to compile the site-wide parts of the SpamAssassin ruleset. No part of user_prefs or any files included from user_prefs can be built into the compiled set.

    This compiled set is then used by the "Mail::SpamAssassin::Plugin::Rule2XSBody" plugin to speed up SpamAssassin's operation

    Additionally, "sa-compile" will not restart "spamd"

    Congratulations, you've added the complexity of a cache, but without actually fixing the problem enough that you can eliminate any of the other awkward workarounds that are in place.

Look, I'm a software developer. I know that features don't just happen and someone has to write them, and that every proposed feature starts with minus 100 points. But still, SpamAssassin is 20 years old, and it's not like these are new features that no-one is quite sure how they'd work. These are features that either already exist, or that people have already put a fair amount of work into improving. They're just not consistently available, or quite finished yet.

How long does that last layer of polish take though?

Monday, 12 July

11:52

Spam Filtering on Postfix/Dovecot, pt 3: Per-user Adaptive filtering [Another grumpy sysadmin]

In part 2, I got Postfix to check for obvious spam and reject it altogether.

For emails that are only probably spam, I want to filter them into the user's Junk folder. Also, I want the system to track email that the user manually moves into and out of their Junk folder, and use that to make the filter better over time. This type of filtering was first proposed by Paul Graham in his 2002 essay A Plan for Spam.

One point mentioned in A Plan for Spam is:

each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters.

I have seen some suggest that Graham's point a) is incorrect, because adaptive filters become more effective the more email they see. Therefore, a shared adaptive filter which sees all the email on a system will learn quicker and be more effective than one that sees only one user's email. This does make sense in the general case (although I have not seen any hard numbers comparing the approaches) but I disagree with the idea of using a shared filter based on Graham's point b) that each user should have their own precise definition of spam.
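To show what "per-word probabilities" per user looks like in practice, here is a heavily simplified Python sketch loosely based on the scheme in A Plan for Spam. It is not how SpamAssassin's classifier works, and it leaves out several details from the essay (such as weighting good mail more heavily and only using the most interesting words). The class and variable names are made up for this example.

```python
import re
from collections import Counter, defaultdict


class UserFilter:
    """Per-word spam statistics for a single user."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.spam_count = 0
        self.ham_count = 0

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message (a simplification).
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, is_spam):
        words = self.tokens(text)
        if is_spam:
            self.spam_words.update(words)
            self.spam_count += 1
        else:
            self.ham_words.update(words)
            self.ham_count += 1

    def word_probability(self, word):
        bad, good = self.spam_words[word], self.ham_words[word]
        if bad + good == 0:
            return 0.4  # the essay's value for words never seen before
        bad_rate = bad / max(self.spam_count, 1)
        good_rate = good / max(self.ham_count, 1)
        # Clamp so that no single word is ever treated as absolute proof.
        return min(0.99, max(0.01, bad_rate / (bad_rate + good_rate)))

    def spam_probability(self, text):
        spammy = hammy = 1.0
        for word in self.tokens(text):
            p = self.word_probability(word)
            spammy *= p
            hammy *= 1.0 - p
        return spammy / (spammy + hammy)


# One independent filter per user, created on first use.
filters = defaultdict(UserFilter)
```

Because each user has their own counters, two users can train the same message in opposite directions and each get the result they asked for, which is point b) in a nutshell.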

Whose spam is it anyway?

For one thing, some people have a genuine interest in the things that other people might mark as spam. As I mentioned in part 1, I get a lot of buy-to-let spam, and I'd like my personal spam filter to bin all of it. But some people (even some honest, upstanding citizens) work in the buy-to-let industry, and would definitely not want email about that going to their Junk folder.

Another reason is that different people use spam filters in different ways. Some people sign up for mailing lists, on purpose, and then after some time simply become uninterested in those emails. Unfortunately, some of those mailing lists can be hard to unsubscribe from. Therefore some people "unsubscribe" from mailing lists by marking them as spam until those emails stop showing up in their Inbox.

There is an argument that this is "wrong". Those emails are not technically spam, because spam is unsolicited, and as the user signed up to that mailing list those emails don't qualify as spam. As a techie (and therefore a born pedant) I have some sympathy with that view.

On the other hand, it is the mark of a good tool that people adapt it to perform tasks unimagined by its original creator. If people create a workflow to manage their environment in a way that works for them, I don't think it's very helpful to tell them that they shouldn't do that. Firstly because it's generally ineffective (some people will do it anyway), but mostly because it gives rise to the kind of resentment users can have toward IT admins for rules that don't seem to benefit anyone, but still make their lives worse. At the very least, you should try to provide users with an alternative tool that solves their problems in a better way, before asking them to stop using the tool they already have in the way you don't like.

Therefore, if two people on the same system sign up to a mailing list, and one person wants to treat it as spam when the other doesn't, let them. Give each user their own spam filter.

Dovecot Antispam

While I'm using Postfix to transfer email to and from the internet (the Mail Transfer Agent), it's not responsible for delivering and managing email on the local system. For that I'm using Dovecot as both the local delivery agent and as the IMAP server which users connect to to read and manage their email.

Dovecot has a plugin, dovecot-antispam, which allows it to train a spam filter based upon users moving emails into and out of the Junk folder. This plugin has a few different backends for working with different spam filters.

  • pipe - Sends (pipes) email to a spam filter. This is a generic backend that can use any spam filter, but you need to tell dovecot-antispam what program to run to classify emails, what parameters to pass it, and how to tell the filter if an email is spam or ham. It's flexible, but requires some careful configuration.

  • spool2dir - This is a generic backend like "pipe", but instead of sending emails to the spam filter, it drops them into different directories for spam/ham. You then have to set up "some other system" to monitor those directories and run the spam filter on any emails that show up there. That has all the complexity of the "pipe" backend, and then some more on top, so I'm avoiding that one.

  • dspam - is a spam filter which is not packaged for Debian, and by the looks of the project was last updated in 2012. I'd have to build it myself from sources. This is not appealing.

  • CRM114 - "a system to examine incoming e-mail [...] and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires." It's packaged for Debian, and dovecot-antispam has been adapted to work specifically with it. Let's have a closer look at that.

CRM114 - the Controllable Regex Mutilator

(Named after The CRM 114 Discriminator from Dr. Strangelove.)

First stop after installation, the crm(1) man page:

CRM114 is a language designed to write filters in. It caters to filtering email, system log streams, html, and other marginally human-readable ASCII that may occasion to grace your computer.

...followed by a lot of information about how the CRM programming language works.

Huh. So, CRM114 isn't a spam filter. It's a programming language for writing many types of filters - including spam filters. Anyway, right at the bottom of the man page, it says:

This manpage describes the crm114 utility as it has been described by QUICKREF.txt, shipped with crm114-20040212-BlameJetlag.src.tar.gz. The DESCRIPTION section is copy-and-pasted from INTRO.txt as distributed with the same source tarball.

Let's track down and have a look at INTRO.txt:

If you are reading this to get information on how to install CRM114 as a mailfilter, you have the _wrong_ document.

But fear not, we _do_ have the document you want. The document you want if you want to know how to install CRM114 as a mailfilter is:

    CRM114_Mailfilter_HOWTO.txt

*sigh* Fine. From The CRM114 & Mailfilter HOWTO:

CRM114 is a *language* designed to write text filters and classifiers in. It makes it easy to tweak code.

Mailfilter is just _one_ of the possible filters; there are many more out there and if Mailfilter doesn't do what you want, it's easy to create one that does.

Mailreaver is another one of the filters, with different (and better, I hope) designs, that can use Mailtrainer (yet another filter) to build even better statistics files.

There are yet other filters written in CRM114; you can read all about them on the web page:

    crm114.sourceforge.net

(and if you create one, and want to share it, put it on a web page and send me an email so I can add a pointer.)

...followed by a list of the 8 (yes, 8) major steps to using CRM114 mailfilter/mailreaver.

I just wanted a spam filter. I got a language for writing spam filters, two different spam filters with different designs, a third filter to do something slightly different, a link to a web page with even more filters on it, and the option of not just configuring the filters I already have, but the offer of actually rewriting them, or even of writing my own, in a language I've never used before.

I'm definitely getting some choice overload vibes here.

CRM114 mailreaver

The Mailfilter HOWTO quoted above seems to suggest that mailreaver is the preferred CRM114 spam filter, and the dovecot-antispam Configuration documentation includes a reference to it too, so let's skip trying to evaluate all the different possible CRM114 filters for now, and just go with that and see where it leads.

Looking closely, mailreaver is an actual program. When installed, it's marked as executable and includes a shebang line that allows it to run like any other Unix script. The problem is, it's installed outside of the normal PATH, doesn't come with its own man page, and doesn't provide usage information if run as /usr/share/crm114/mailreaver --help. That does not inspire confidence.

Then, from reading the HOWTO, configuring it is really awkward.

  • There's no default configuration file installed anywhere under /etc, where configuration files are normally found. There is an example configuration file in /usr/share/crm114/mailfilter.cf, but you need to edit it before you can use it. The complication here is that you can't edit that version because your changes will get overwritten if the package gets updated for any reason. So you need to copy it yourself to somewhere like /etc and edit that version.

  • The configuration file syntax seems unique to CRM114/mailreaver and is just strange. Most configuration files would allow you to set an option, like adding spam headers to scanned emails, with something like add_headers=yes. The way to do this with mailreaver is :add_headers: /yes/.

  • At least, that's one configuration file. Along with the mailfilter.cf, there's also the rewrites.mfp configuration file, which is required, but can be empty, but if it's not empty its syntax looks like yourname@yourdomain.yourplace>->MyEmailAddress where ">->" is the "rewrite operator", whatever that is.

  • You also have to create the spam/ham statistics files, even though they also can be empty. mailreaver apparently can't create these files itself on first use if they don't exist, so when you create a new user on your mailserver, you have to create empty statistics files for them otherwise the filtering won't work.

  • The spam/ham statistics files have the extension .css, despite mailfilter being created around 2004, and Cascading Stylesheets (CSS) having been an important part of HTML4 when it was released in 1997, 7 years and a dot-com bubble previously. According to The CRM114 FAQ it's because they were originally "CRM Sparse Spectra" files, and I know it's a petty complaint, but reusing a well-established file extension still bothers me.

  • A lot of the documentation insists that the configuration files and the statistics files have to be in the same directory. This means you couldn't set up a global configuration with per-user filters. Looking a lot further down through the Mailfilter HOWTO, when it finally gives a reference for the mailreaver command options, it does say that there is a separate --config option for using a file other than mailfilter.cf as the config file - but implies that only the file name, rather than a full path to the config file, can be given.

    Looking back at the dovecot-antispam sample configuration for mailreaver, it looks like you can actually give a full path to a config file. Also, examining the mailreaver source code maybe backs that up, but the CRM114 language is unlike any other programming language I've ever seen, and at this point I'm fed up enough that I seriously can't be bothered to learn it sufficiently to figure it out for certain.

What-a-mess!

CRM114/mailreaver doesn't feel like a finished product to me.

It feels like a research project someone got working to scratch their own itch. Then, when it did what they needed, and they knew how to modify it to do anything else they wanted it to, they stopped improving it further. It seems that, despite its technical capabilities, no community formed around it to give it the push to become a genuinely usable bit of software, by people who aren't particularly interested in the technology behind how it does what it does. Normally that push might come from either direct contributions from that community, or just from feedback where multiple users all describe the same problems.

It looks like CRM114 is a programming language that has had fewer than a dozen programs written in it - ever. With this, I had a look around the project's release history, and I don't think it's been updated since 2009.

CRM114 may do a really good job of solving its author's problem, which is fine, but it's not going to work for me.

SpamAssassin's Bayesian Classifier to the rescue

Having been defeated by CRM114, I now need to find a different adaptive spam filter that I can plug into dovecot-antispam via its "pipe" backend.

Fortunately, SpamAssassin - which I already have installed and am using for definite global spam rejection - has such a filter in its Bayesian Classifier, sa-learn.

I'll look at that in part 4.

Monday, 21 June

14:34

Spam Filtering on Postfix/Dovecot, pt 2b: The Road Almost Taken [Another grumpy sysadmin]

In part 2, having rejected using a Postfix Content Filter to check for spam because it can't reject emails that are sufficiently spammy, I set up spam filtering with SpamAssassin/spamass-milter. However, that doesn't tell the whole story, and I ended up there by more of a roundabout route than that post lets on.

Spam filtering round 2 - Reject with spamass-milter?

I did get to the point of setting up spamass-milter. I was most of the way through configuring it when I re-checked the Postfix Milter documentation, and came across the following limitation:

When you use [some other Postfix feature], Milter applications have access only to the SMTP command information; they have no access to the message header or body, and cannot make modifications to the message or to the envelope.

...which kind of breaks spam filtering.

I mean, sure, you can do some spam detection on the SMTP command information, such as the sender address, but the data that's best for determining if an email is spam is the content of email itself. Without that, your spam classification just isn't going to be very good at all.

Combining that problem with the "2 daemons connected by another program" setup that I already wasn't very happy with, I decided to look into another approach.

Spam filtering round 3 - Before-Queue Filter with spampd

Looking further through the Postfix documentation, I realised that I wanted a Before-Queue Content Filter proxy. This is a program that Postfix passes the email to, which can filter or transform it as it wants, and then either tell Postfix to reject the email, or pass it back into the Postfix queue.

Fortunately, the Spam Proxy Daemon spampd exists, which uses SpamAssassin to do exactly that. Also:

spampd was initially designed as a content filter mechanism for use with the Postfix MTA.

...which is good. But also, from the spampd man page, under Installation it says:

Note that spampd replaces spamd from the SpamAssassin distribution in function. You do not need to run spamd in order for spampd to work.

...and that's great! I installed it, got it up and running, and started to configure it appropriately to reject unwanted emails. But then, looking further through the spampd man page, right at the bottom, under To Do it says:

Add configurable option for rejecting mail outright based on spam score.

So, while Postfix makes it possible for a proxy to reject emails, this particular one can't. Which is the only thing I wanted it for.

...well, that's just fantastic. *sigh*

Spam filtering round 4 - Before-Queue Filter with custom proxy

a.k.a. Fine, I'll do it myself

Having familiarised myself with how the proxy setup works with spampd, I thought I might be able to create one myself without too much effort.

According to my reading of the section How Postfix talks to the before-queue content filter, it sounded to me like Postfix just opens a connection to the proxy, and passes the whole email command queue and message to it in one long stream of data. Then it expects a single response, and for the proxy to pass the whole stream back into Postfix as-is if the email is accepted. That didn't actually sound too hard.

In particular, Linux has programs which mean you don't have to deal with the complexity of handling network connections between Postfix and the proxy yourself. There are some which will listen for incoming connections for you, and pass the data as if it were coming from a file. Traditionally this was done with one of the many implementations/replacements of inetd, but more recently this functionality is part of the systemd service manager, with systemd.socket activation. Then, there are programs that will accept data as if you were writing it to a file, and send it on to a network connection instead, such as the many implementations of netcat.

The plan was: Read the whole email command stream and message from Postfix, pass it to SpamAssassin, and then based on the spam score either reject it, or accept it and pass the whole thing back into Postfix again.

That's about 12 lines of shell script. No problem.

It always takes longer than you expect...

...even when you take into account Hofstadter's Law

-- Hofstadter's Law

Having got a (roughly) 12-line shell script prototype up and running, I got Postfix to connect to it, but it wasn't receiving any data. Playing around with some ideas, I tried writing the response immediately, and I started to be able to read the command stream. If I wrote more data back to the program that was sending it, I got more of the command stream.

I realised that Postfix actually wanted to talk the full SMTP protocol to the proxy.

This means sending a "service ready" message as soon as Postfix connects, and then acknowledging each command Postfix sends me before it will send another, until it finally sends the email data. This isn't really that hard (hurrah for text-based protocols!) but it was more complex than I was expecting. If I'd realised this at the start, I wouldn't have picked shell scripting as the language to do it in.
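The back-and-forth described above can be sketched as a small state machine. The Python below is not spamprox (which is a shell script), and it leaves out the networking and the step of passing accepted mail back to Postfix. It only shows the shape of the conversation: greet first, acknowledge every command, collect the data, then answer with a verdict. The class name and the is_spam callback are made up for this example.

```python
from typing import Callable, List, Optional


class MiniSmtpSession:
    """Just enough SMTP to accept one message from a client such as Postfix."""

    def __init__(self, is_spam: Callable[[str], bool]):
        self.is_spam = is_spam
        self.in_data = False
        self.body: List[str] = []

    def greeting(self) -> str:
        # The server speaks first; the client sends nothing until it sees this.
        return "220 localhost ESMTP sketch"

    def handle(self, line: str) -> Optional[str]:
        """Return the reply for one received line, or None if no reply is due."""
        if self.in_data:
            if line == ".":
                # A lone dot ends the message; now give the verdict.
                self.in_data = False
                message = "\n".join(self.body)
                self.body = []
                if self.is_spam(message):
                    return "550 5.7.1 Rejected as spam"
                return "250 2.0.0 Ok"
            # Undo SMTP dot-stuffing: a leading dot was added by the sender.
            self.body.append(line[1:] if line.startswith(".") else line)
            return None
        verb = line.split(" ", 1)[0].upper()
        if verb in ("EHLO", "HELO"):
            return "250 localhost"
        if verb in ("MAIL", "RCPT", "RSET", "NOOP"):
            return "250 2.0.0 Ok"
        if verb == "DATA":
            self.in_data = True
            return "354 End data with <CR><LF>.<CR><LF>"
        if verb == "QUIT":
            return "221 2.0.0 Bye"
        return "502 5.5.2 Command not recognised"
```

Feeding it EHLO, MAIL FROM, RCPT TO and DATA in turn produces one reply per command, which is exactly the behaviour the first prototype was missing.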

But by the time I'd figured all this out, I'd written the first 90% of the proxy anyway. Given that, I thought I might as well finish up the second 90% of the work.

The result was spamprox - The barely SMTP-capable spam proxy. It's kind of a bodge job, doesn't really understand what it's doing, and would probably fail badly if it had to deal with anything other than the exact commands Postfix normally sends. On the plus side, it doesn't have to deal with anything else, and for the one role it has, it works!

All the ways that other people had come up with to do spam filtering weren't good enough, but I had managed to create one that was.

A haughty spirit before a fall

Having got through all that, I started to write everything up. After lots of writing and editing and changing tone, as I was getting close to finishing, I went back through all the documentation so I could link to anything relevant and include quotes where appropriate. I dug up the Postfix Milter documentation to get the exact text of the "you can't rely on having the email contents in milters" limitation and found:

When you use the before-queue content filter for incoming SMTP mail (see SMTPD_PROXY_README), Milter applications have access only to the SMTP command information...

Wait, what?

The "[some other Postfix feature]" that I'd dismissed as being an internal filter or feature I might have wanted to enable at some point, turned out to be "an external proxy" which you might use for something like... a spam filter. In other words, the only reason a milter wouldn't have the data it needed for spam filtering would be that I had also set up a proxy to do spam filtering. Except I'd never need to set up both, so my fear about milters not working was completely unfounded. I just didn't have the knowledge I needed to understand that properly when I first read the docs.

I still wasn't entirely happy with the "2 daemons connected by another program" milter setup, but given that the alternative was a home-grown bodge job, written in a language that didn't turn out to be well-suited for the complexity of the task, I thought that maybe I ought to reconsider my approach.

The benefit of going on a wild goose chase...

...is that at least you get plenty of exercise.

-- Author unknown, paraphrased

Once I'd got past my reluctance to throw away the work I'd put into spamprox, and remembered that I'd prefer not to have to maintain my own custom parts in my email setup, that's when I decided to go back to the spamass-milter setup - as described in Part 2.

That decision also involved abandoning a lot of the blog write-up I'd already done for that work. The tone I'd gone with, of getting increasingly exasperated with the failure of one approach after another to try and make existing components work in what seemed to be the most obvious way, didn't make sense anymore. The run-up to getting the milter approach working (i.e. Part 2 of the series) didn't have enough failures to make the point, and once I'd realised that the chain of failures was due to my own misunderstanding, I needed to rewrite everything that came after (i.e. this post) to be a lot more humble.

Still, I learned quite a bit. So it wasn't a complete waste.

Anyway, going through 4 different approaches to spam filtering (including writing my own proxy as one of them), writing nearly all of it up before discovering my mistake, wavering over the decision to abandon that and go back to a previously dismissed approach, and then finding the motivation to rewrite the write-up, all accounts for a reasonable proportion of the delay between Part 1 and Part 2.

Hopefully Part 3 will be completed much more promptly. See you then...

Monday, 14 June

09:57

Spam Filtering on Postfix/Dovecot, pt 2: Rejection. [Another grumpy sysadmin]

In Part 1 I set up spam filtering for the Postfix mail server with SpamAssassin, according to the Debian SpamAssassin default instructions for doing so.

Keep spam off the system

I don't want to waste any more resources than absolutely necessary on spam. "Resources" includes processing time and disk space, but also human attention, like the time it takes to check through your spam folder if you suspect something may have ended up there. Therefore, I'd like to keep as much spam as possible off the system entirely.

If I receive a spam email, I have 3 options for not keeping it.

1. Throw it away

The simplest, but least polite option, is simply to delete any email that gets a high enough spam score. However, email is meant to be a reliable delivery service, as described in the email standards document RFC 5321 §6:

When the receiver-SMTP accepts a piece of mail (by sending a "250 OK" message in response to DATA), it is accepting responsibility for delivering or relaying the message. It must take this responsibility seriously. It MUST NOT lose the message for frivolous reasons, such as because the host later crashes or because of a predictable resource shortage.

[...]

As discussed in Section 7.8 and Section 7.9 below, dropping mail without notification of the sender is permitted in practice. However, it is extremely dangerous and violates a long tradition and community expectations that mail is either delivered or returned. If silent message-dropping is misused, it could easily undermine confidence in the reliability of the Internet's mail systems.

So, don't do that.

2. Bounce it

Again from standards doc RFC 5321 §6, linked above:

Utility and predictability of the Internet mail system requires that messages that can be delivered should be delivered, regardless of any syntax or other faults associated with those messages and regardless of their content. If they cannot be delivered, and cannot be rejected by the SMTP server during the SMTP transaction, they should be "bounced" (returned with non-delivery notification messages) as described above.

If you've ever received an email titled "Delivery failure" or similar, this is a bounce message.

The trouble with bounces is that spammers often use hijacked computers to send their emails, and lie about the return address. So if you've ever received an email titled "Delivery failure" for an email you never sent, this is because a spammer tried to send an email to someone, and put your email address as the sender. When the recipient's system bounced the email, the bounce notification got delivered to you.

As the Debian SpamAssassin Notes point out:

The problem is, spammers (and viruses) routinely forge the from address on the envelope. This means that if there is a bounce generated, it will go to this address, which can be randomly generated, or worse, an innocent third party.

Therefore, it is very important that your system doesn't generate a bounce.

Similarly, from the Postfix content filter instructions:

NOTE: in this time of mail worms and spam, it is a BAD IDEA to send known viruses or spam back to the sender, because that address is likely to be forged.

So, don't do that either.

3. Reject it

The third option is to reject the email without accepting it in the first place.

The way email works is that the sender's system connects to the recipient's system, and passes all the email data to it. As described above, at the end of this process the recipient's system says "OK", and at that point it takes responsibility for the email. However, it also has the option of saying "No" for some reason. For example, if the email is too large, or if the recipient's mailbox has reached its quota limit. (Remember mailbox quotas? They still exist in some places.)

Then it's the sending system's responsibility to handle the problem of the email not being delivered - and it knows where the email really came from.

This is the best way to handle emails you want to keep off your system.
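As a rough sketch of that final exchange, in Python: the receiving side picks its reply once it has seen the whole message, and the sending side decides what to do from the first digit of the reply code. The function names, the reply texts and the threshold of 5.0 are placeholders of mine, not anything Postfix or SpamAssassin defines.

```python
ACCEPT = "250 2.0.0 Ok"
REJECT = "550 5.7.1 Message rejected as spam"


def end_of_data_reply(score: float, threshold: float = 5.0) -> str:
    """Receiver side: choose the reply to send after the message data."""
    return REJECT if score >= threshold else ACCEPT


def sender_action(reply: str) -> str:
    """Sender side: the first digit of the reply says who owns the email now."""
    first_digit = reply[:1]
    if first_digit == "2":
        return "delivered"  # the receiver has taken responsibility
    if first_digit == "4":
        return "retry later"  # temporary failure, sender keeps the email
    if first_digit == "5":
        return "report failure to the real sender"  # permanent failure
    return "unknown reply"
```

With a rejection, no bounce is ever generated by the receiving system. Any failure report comes from the sending system, which is the one that knows who really submitted the email.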

The problem

The problem is that, as I was trying to figure out how to make the content filter tell Postfix to reject emails with a high enough spam score, I learned that Postfix's content filters are applied after the email has already been accepted by the mail server. At that point, it's too late to reject it. Therefore, I need to find a different way to configure SpamAssassin to work with Postfix.

Spam filtering round 2: Reject with spamass-milter

A milter (mail filter) is a type of external program that some mail servers can use to implement custom filtering mechanisms. Postfix has support for milters, and can use them early enough in the process pipeline that they can be used to reject emails.

Further, there is a milter wrapper for SpamAssassin, spamass-milter, that allows SpamAssassin to be used as one. Great!

We can solve any problem by introducing an extra level of indirection...

...except for the problem of too many levels of indirection.

-- The Fundamental theorem of software engineering

One thing about how milters work is that they cannot be run on-demand, but need to run as a daemon that the mail server connects to. You might remember from Part 1 that SpamAssassin is also best run as a daemon (spamd) so that its slow startup can be done only once, ahead of time. Given this, it would be reasonable to expect that spamass-milter would integrate SpamAssassin into itself to absorb this cost. Sadly, reasonable expectations often appear to be something of an unattainable luxury in this kind of endeavour.

Therefore, to use spamass-milter, you need to have both the spamass-milter and spamd daemons running all the time. Worse, spamass-milter can't even connect directly to spamd itself, but needs to run the external spamc program to do so on its behalf. So, rather than just have the mail server:

  • Run a program to check if an email is spam

Instead, it:

  • Connects to one daemon, which
  • Runs a program, which
  • Connects to a second daemon to check if the email is spam.

I realise that part of this is just a consequence of the way milters work, but still, it seems like a lot more moving parts than should be necessary.

It works on my machine, ship it!

Aesthetic considerations about the elegance of the solution aside, once the milter was set up as described in the Debian SpamAssassin Postfix Milter instructions, the system was classifying spam as before, but rejecting any sufficiently spammy emails before they were accepted. Hurrah!

After the work it took to get here, that's good enough for now.

But... we're not done yet. Stick around for Part 3! Well, don't stick around - given how long it took me to get this post out you might be in for a bit of a wait. Go do something else, and maybe check back in a week or so. There might even be a Part 2b first.

Friday, 19 February

14:11

Spam Filtering on Postfix/Dovecot, pt 1. [Another grumpy sysadmin]

I've been starting to get an annoying amount of spam here (with a big thank you to UK Buy-to-Lets and UK Property Offers, you utter spam artists), so it's time to set up some spam filtering. In particular, SpamAssassin.

I've used it on other servers before, but they've been running the Exim mail server. The nice thing about Exim is that it has a SpamAssassin plugin: you install it, set a single configuration option, and everything automagically Just Works™.

This system is running Postfix, which does not have any special integration with SpamAssassin. However, it does have the ability to run incoming emails through external filters in a couple of different ways, and SpamAssassin can be used as such a filter, so we can work with it that way. Fortunately there are a few tutorials on how to set this up out there, including the SpamAssassin Debian Wiki page.

The most straightforward way to integrate SpamAssassin into Postfix seems to be setting it up as a "content filter".

spamc and spamd.

Looking at the Debian wiki page, the way they recommend to set this up is by setting up the spamd daemon, and to have Postfix call spamc to connect to it to check for spamminess.

I've had a look at the spamd man page and the README, and the reason for the spamc/spamd setup is performance. SpamAssassin takes a long time to start up compared to the time it takes to classify an email, so by running a daemon which does all the startup ahead of time, the smaller client can connect to it and ask it to classify an email much more quickly.

However, because there ain't no such thing as a free lunch, this comes with a couple of costs. First, the daemon takes up a certain amount of memory all the time, not just when emails are being classified. Second, there are some small but non-zero security concerns involved (see the spamd README linked above for more info).

Further, email is not an instant messaging system. Email is a system designed to tolerate delays and bad connections, with mails sometimes being held on one system, waiting for a time when they can be forwarded to another. Therefore, email delivery is not a time-sensitive operation.

As a result of the above, and because I don't receive that much email on this system, and a general desire on my part to only run services that are strictly necessary, I would prefer to take the tradeoff of accepting a long startup time in exchange for not running the spamd service all the time. I therefore want to configure postfix to call spamassassin directly instead of spamc.

Running spamassassin directly.

Except. Looking at the spamassassin man page, it says:

Please note that SpamAssassin is not designed to scan large messages. Don't feed messages larger than about 500 KB to SpamAssassin, as this will consume a huge amount of memory.

Hmmm... how do I limit the size of the messages that spamassassin processes then? Looking through all the other documentation I can find... I can't. SpamAssassin apparently has no option to say "Do not scan messages above a given maximum size", and Postfix apparently has no way to send messages of different sizes to different content filters. How does the recommended configuration handle this? Well, the spamc man page shows that spamc has a -s/--max-size option to limit the maximum message size to pass to spamd. So spamc is the way to limit the maximum size of a message that is passed to spamassassin.

OK, does spamc have to access spamassassin through spamd? Is there any way to make it run spamassassin directly, if all the other criteria are met?

...No. (At least, not that I've found.)

So, despite the spamd README saying:

Then, configure your system to run spamd in the background, and where your mailer invokes 'spamassassin' instead invoke 'spamc'.

it appears that the original default setup "where your mailer invokes 'spamassassin'", is no longer feasible if you expect to routinely receive emails larger than 500kB. Which you really should, in this day and age. But given that the author of that README did their performance tests on a 400MHz K6-2 CPU (which was discontinued in 2003) with emails between around 10kB - 100kB in size, I guess I can't really blame the advice for not being particularly relevant 20 years later.

Go with the flow.

After all that, I am left with no option but to enable the spamd daemon and connect with spamc in the recommended fashion.⁰

In some ways, this is a good thing. The recommended configuration is the one that most users will be running (duh!) so it gets the most testing and will have the fewest surprises. Also, if I do run into a problem and need to ask around, I'm more likely to be able to find someone who has had the same problem and has already been told how to fix it (e.g. on Server Fault), or, if my problem is new, I'm more likely to be able to find someone with a similar setup to help me troubleshoot.

Fortunately, when I did that and sent myself an email with the Generic Test for Unsolicited Bulk Email (GTUBE) text in it, it came through with the following headers:

    X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on localhost
    X-Spam-Flag: YES
    X-Spam-Level: **************************************************
    X-Spam-Status: Yes, score=999.0 required=5.0 tests=ALL_TRUSTED,GTUBE
     autolearn=no autolearn_force=no version=3.4.2

which is sufficient to set up filters that would put spam in users' Spam folders, instead of their Inbox.
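As an illustration of how little a filter needs from those headers, here is a small Python sketch that pulls the score and threshold out of X-Spam-Status and picks a folder. The function names and the folder names are hypothetical. A real setup would more likely do this in the mail delivery agent's own filtering language than in a script like this.

```python
import re
from email import message_from_string

# A cut-down message carrying the headers shown above.
SAMPLE = """\
X-Spam-Flag: YES
X-Spam-Status: Yes, score=999.0 required=5.0 tests=ALL_TRUSTED,GTUBE
 autolearn=no autolearn_force=no version=3.4.2
Subject: test

body
"""

NUMBER = r"(-?\d+(?:\.\d+)?)"
STATUS_RE = re.compile(r"score=" + NUMBER + r"\s+required=" + NUMBER)


def spam_score(raw_message: str):
    """Return (score, required) from X-Spam-Status, or None if it is absent."""
    status = message_from_string(raw_message).get("X-Spam-Status", "")
    match = STATUS_RE.search(status)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def target_folder(raw_message: str) -> str:
    """Pick a delivery folder; unscanned mail goes to the inbox."""
    scores = spam_score(raw_message)
    if scores is None:
        return "INBOX"
    score, required = scores
    return "Spam" if score >= required else "INBOX"
```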

That's great! Unfortunately, it's not quite good enough. To be continued in part 2...


⁰ Actually, that's not entirely true. I could write my own helper program that reads its input and only invokes spamassassin if the input is small enough, or passes it straight to the next stage of processing if it's too large. It wouldn't even be that complicated. Probably. But adding my own programs into the email processing pipeline is taking customisation a bit too far for my liking at this stage.
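For the curious, a sketch of what such a helper might look like, in Python. This is untested against a real mail pipeline, and the names `filter_message`, `MAX_SCAN_BYTES` and `main` are mine. It assumes the scanner reads a message on standard input and writes the processed message to standard output, which is how the spamassassin command behaves by default.

```python
import subprocess
import sys

# The size limit suggested by the spamassassin man page.
MAX_SCAN_BYTES = 500 * 1024


def filter_message(message: bytes, scanner=("spamassassin",),
                   limit: int = MAX_SCAN_BYTES) -> bytes:
    """Scan small messages; pass oversized ones through untouched."""
    if len(message) > limit:
        return message
    # Feed the message to the scanner on stdin and collect its stdout.
    result = subprocess.run(list(scanner), input=message,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


def main() -> None:
    """Act as a stdin-to-stdout filter, as a content filter script would."""
    sys.stdout.buffer.write(filter_message(sys.stdin.buffer.read()))
```

A wrapper script would call `main()` and then hand the result to the next stage of processing. That last step is exactly the bit of custom plumbing I'd rather not own.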

Thursday, 14 June

23:14

Google leveraging search dominance to drive sign-ups [Another grumpy sysadmin]

Google has a webpage where you can tell Google about new websites that you think Google should start crawling and indexing. It's had this page for years - decades even at this point. It's at https://www.google.com/addurl.

At some point, Google made it so that you need to sign in to a Google account to do this.

Search engines rely on having as complete an index as possible. The more websites they index, the more useful they are, and the more users they'll get. If there are websites they don't index, users will go to another search engine to find them, and there's a real possibility they won't come back.

Imagine if DuckDuckGo did that. If you needed a DuckDuckGo account to tell DuckDuckGo about your new website, no-one would ever submit a new website to their index. They'd just not bother and DuckDuckGo would be worse off.

Google, on the other hand, knows that you need it more than it needs you.

Google is so ubiquitous, is so widely used, and has been so well verbed, that it knows that individual websites need to be on Google more than Google needs individual websites. At this point, if your website isn't on Google that's going to be seen as more of a problem with your website than it is a problem with Google. And Google knows this. So, if you want to add your website to Google's search engine you need a Google account. Which means you need to agree to Google's eleventy-billion page terms-of-service, which you're not going to read, and even if you did you've got no way to actually object to any of the clauses, and means you "voluntarily consent" to Google's capture, storage, processing and sharing of whatever data about you Google can get its mitts on, in whatever ways it thinks it can get away with.

Talk about leveraging market dominance and success in one area of business to strong-arm people in another.

Gits.

Thursday, 07 June

16:28

Apache Icons [Another grumpy sysadmin]

So, there I was, setting up the website, and I decided to add some icons to it to break up the text.

I tried setting up a symlink to make /icons on the webserver point to /usr/share/icons on my system, but that didn't work for some reason. I didn't want to waste too much time figuring out exactly what the problem was, so I tried adding an "Alias" directive for /icons to the Apache config, and that worked, so I moved on and decided to come back to the problem later.

The other night, I went back to it. I removed the "Alias" directive, leaving only the symlink in place, and tried to figure out what was happening. I kept getting HTTP 404 "Not Found" errors for the icons, and an HTTP 403 "Forbidden" error for the directory itself. Clearly, something was not right.

I checked that the symlink was correct - yes. I checked that the user that the webserver ran under had permissions to read the /usr/share/icons directory on the underlying filesystem - yes. I added a "Directory" directive for /usr/share/icons and ensured that the "Options" were set to follow symlinks, and to allow indexes (directory listings) - yes. Still not working.

So I checked everything again. Then I had a browse round StackExchange and its member sites for issues with symlinks or directory listings not working. Still nothing.

Eventually, I did what I probably should have done some time before, and checked the error log. There I found the significant clue: Cannot serve directory /usr/share/apache2/icons/.

It turns out that the default configuration of mod_autoindex, which is enabled by default in Apache installations on Debian, takes over the /icons URL to serve up the filetype icons in directory listings, from /usr/share/apache2/icons. Further, that directory is set to not serve directory listings, so no matter what options I put on the /usr/share/icons directory, it wasn't going to make any difference to the options for the directory that was actually being served up.

By moving my symlink from /icons to /img/icons, and updating all the links appropriately, suddenly everything worked again.

So, lesson learned - check the error log sooner next time.

But... it seems odd to me that a built-in Apache module would take over such an obvious and useful part of the root namespace on all websites for itself. Do the authors really think that no-one would want to use /icons on their own site? It's not just mod_autoindex, either. mod_info uses /server-info, mod_ldap uses /ldap-status, and mod_status uses /server-status. Granted, these locations aren't quite as valuable as /icons, and not all of those modules are enabled by default.

Still, I'd have thought that Apache would reserve something more obscure (like /x-apache/, or even /x-*) and require all modules to use a subset of this namespace instead.

Maybe I should submit a patch...

