Setting up your email, either on your phone, or on a traditional desktop mail client (yes, some of us still use those), used to be a real pain. You'd need to find the documentation from your email provider that told you which protocols they supported, which servers to connect to (that can be different for sending and receiving email), which port numbers to use, whether to use a secure connection (!), and then figure out where on your email client you needed to put all this information.
These days, it's much easier. There's a standard, RFC-6186, which email providers can use to publish the relevant settings as part of their DNS records. That way, email clients can look all of that information up just from the domain portion of your email address. Simples.
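As a rough illustration of what a client does with that, the sketch below derives the SRV record names that RFC 6186 defines from an email address. The function name is made up for this example, and it only builds the names: actually querying DNS needs a resolver library, which is left out here.

```python
# Service labels that RFC 6186 defines for mail submission and retrieval.
RFC6186_SERVICES = ("_submission", "_imap", "_imaps", "_pop3", "_pop3s")


def srv_names_for(address):
    """Return the SRV record names a client could query for this address.

    Hypothetical helper for illustration: it does not perform any DNS lookup.
    """
    # Everything after the last "@" is the domain portion of the address.
    _local, separator, domain = address.rpartition("@")
    if not separator or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return [f"{service}._tcp.{domain.lower()}" for service in RFC6186_SERVICES]


if __name__ == "__main__":
    for name in srv_names_for("someone@example.com"):
        print(name)
```

A client would then look up each of those names as an SRV record and use the host and port from whichever answers it prefers.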
So why am I writing about it?
Well, before RFC-6186 was created, different email clients came up with their own methods of autoconfiguration. Normally, this was done by publishing the details, in a format they invented, in a special place on a website associated with the email domain.
Now, while the recent versions of these email clients are mostly capable of automatically configuring themselves with RFC-6186 style information, old versions of those email clients still exist in the wild. Also, users who are unable (for whatever reason) to upgrade to newer versions of software - like email clients - are probably the ones you want to be able to help the most. Therefore, supporting these legacy autoconfiguration systems would be really useful.
One principle widely regarded as important, or a "best practice" in software development and system administration, is Don't Repeat Yourself. Briefly, if you have multiple copies of some code or data somewhere, then if they ever need to change it's possible that they won't all be changed together. Copies can get out of synchronisation. Therefore, each piece of code or data should have one canonical location, and each other use should refer to that same one. That way, it's impossible for copies to get out of sync.
Hence, when setting up legacy autoconfiguration systems, they should not require their own copies of the email settings, but load the settings from the canonical RFC-6186 data as-needed.
The two most commonly-used legacy autoconfiguration methods I'm aware of (and the two I wanted to actually support), are Microsoft's Outlook, and Mozilla's Thunderbird.
Outlook's legacy autoconfiguration details can be found here, whereas Thunderbird's can be found here. They both involve fetching an XML file containing the settings from a given location on a website, but each uses a different XML schema, a different location, and sometimes a different website.
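To give a feel for the "different location, and sometimes a different website" part, here's a small sketch that lists the URLs each client is usually described as trying. The paths are written from my understanding of the Thunderbird and Outlook documentation rather than copied from it, so treat them as assumptions and check the linked docs; the function name is hypothetical too.

```python
def legacy_autoconfig_urls(domain):
    """Candidate autoconfiguration URLs for a mail domain.

    Assumption: these are the commonly documented locations; the vendors'
    own documentation is the authority on the exact paths and lookup order.
    """
    return {
        "thunderbird": [
            f"https://autoconfig.{domain}/mail/config-v1.1.xml",
            f"https://{domain}/.well-known/autoconfig/mail/config-v1.1.xml",
        ],
        "outlook": [
            f"https://{domain}/autodiscover/autodiscover.xml",
            f"https://autodiscover.{domain}/autodiscover/autodiscover.xml",
        ],
    }


if __name__ == "__main__":
    for client, urls in legacy_autoconfig_urls("example.com").items():
        for url in urls:
            print(client, url)
```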
Fortunately, it's possible to generate XML files dynamically, while loading external information (like RFC-6186 records), without too much difficulty.
Enter emailautoconf, which I put together to generate those files in that way.
If you own the domain example.com and have set up email for it, have configured your RFC-6186 records appropriately, and are running a website at https://example.com/ (note: no www.) with the Apache2 web server and PHP, then setting up emailautoconf can be as easy as grabbing a copy of the sources, installing it with sudo make install, and adding the following line to your Apache2 configuration file:
Include /usr/local/share/emailautoconf/apache2/main.conf
More information can be found in the README, including how to configure it on the autoconfig and autodiscover subdomains which are also specified by those configuration systems.
Note that while emailautoconf currently only supports the Outlook and Thunderbird legacy autoconfiguration systems, and the Apache2 web server, that's only because those are all I needed right now. I'm open to supporting other systems or webservers, if anyone is interested in contributing patches.
posted at: 16:31 | path: / | permanent link to this entry
In part 3, I tried to get Dovecot with the dovecot-antispam plugin to enable per-user adaptive spam filtering with CRM114.
So, sometimes I have an idea about how I want to solve a problem, which is apparently different from how everyone else does it.
That tends to be awkward, because all the documentation and write-ups other people have done aren't geared to my use case.
This can be because I don't know what I'm doing, but when I figure that out, the way everyone else does it makes sense and I end up doing that.
This time I tried not to have a strong opinion on how to set up dovecot-antispam, and instead let the documentation and examples guide me to the typical use case.
Unfortunately, that didn't work out.
Now I have to pick an adaptive filtering system that isn't CRM114, and make that work instead.
Looking through the Debian package repository for spam filters, it turns out that there are quite a few. I don't know much about most of them.
Fortunately, Debian has the Popularity Contest. It's an opt-in feature that allows Debian users to have their systems report which packages they have installed, so that data can be aggregated to get an idea for how popular each package is. For each spam filter I could find, I had a look at the popcon numbers, and which versions are present in the last three releases of Debian to get an idea of whether they're still under active development:
Package | Popcon | Bullseye | Buster | Jessie | Notes |
---|---|---|---|---|---|
bmf | 6 | 0.9.4 | 0.9.4 | 0.9.4 | |
bogofilter | 43,823 | 1.2.5 | 1.2.4 | 1.2.4 | Dependency of evolution-plugin-bogofilter (popcon 31,690), which is recommended by evolution (popcon 44,217) |
bsfilter | 494 | 1.0.19 | 1.0.19 | 1.0.19 | |
crm114 | 128 | 20100106 | 20100106 | 20100106 | |
ifile | 8 | 1.3.9 | 1.3.9 | 1.3.9 | |
mailfilter | 30 | 0.8.7 | 0.8.6 | 0.8.6 | |
qsf | 8 | - | 1.2.7 | 1.2.7 | |
scmail | 4 | 1.3.4 | 1.3.4 | 1.3.4 | |
spamassassin | 7,496 | 3.4.6 | 3.4.2 | 3.4.2 | |
spambayes | 95 | - | 1.1 | 1.1 | |
spamoracle | 13 | 1.6 | 1.4 | 1.4 | |
spamprobe | 426 | 1.4 | 1.4 | 1.4 | |
sylfilter | 286 | 0.8 | 0.8 | 0.8 | |
bogofilter tops the popularity list by a huge margin. So much so in fact that I had to look into why that might have been, and discovered that it works as a plugin with a couple of email clients. Notably, it is installed by default with Evolution, which is very popular and included as part of some of the Desktop Environments shipped by Debian, which accounts for a large majority of its popcon numbers.
As far as non-email-client installs go, bogofilter and spamassassin seem to be in roughly the same ballpark of popularity as each other, and far ahead of all the other adaptive spam filters - including being 50× more popular than CRM114.
They have also both received an update for the recent Debian bullseye release.
And I already have spamassassin installed.
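For anyone who wants to check the arithmetic behind "roughly the same ballpark" and "50×", here's a quick sketch using the popcon figures from the table above. It assumes that every install of evolution-plugin-bogofilter is an email-client install and can simply be subtracted from bogofilter's total, which is only an approximation.

```python
# Popcon figures copied from the table above.
popcon = {
    "bogofilter": 43_823,
    "evolution-plugin-bogofilter": 31_690,
    "spamassassin": 7_496,
    "crm114": 128,
}

# Assumption: installs pulled in by Evolution's plugin are not deliberate
# spam-filter installs, so take them off bogofilter's total.
bogofilter_standalone = popcon["bogofilter"] - popcon["evolution-plugin-bogofilter"]

# How many times more popular each one is than CRM114.
ratio_bogofilter = bogofilter_standalone / popcon["crm114"]
ratio_spamassassin = popcon["spamassassin"] / popcon["crm114"]

print(f"bogofilter without Evolution: {bogofilter_standalone}")
print(f"bogofilter vs crm114: {ratio_bogofilter:.0f}x")
print(f"spamassassin vs crm114: {ratio_spamassassin:.0f}x")
```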
dovecot-antispam with SpamAssassin's sa-learn
Although the pipe backend for dovecot-antispam is generic and needs to be told how to run the relevant training program, it's not actually that hard to do so.
It also helps that there's a sample of all the required options in the man page, under Configuration.
The entire configuration file I ended up with (/etc/dovecot/90-plugin-antispam) is:
plugin {
    ## Generic options
    antispam_backend = pipe
    antispam_trash = Trash
    antispam_spam = Junk

    ## Pipe plugin options
    antispam_pipe_program = /usr/bin/sa-learn
    antispam_pipe_program_args = --siteconfigpath=/etc/spamassassin/delivery;--local;--username=%u
    antispam_pipe_program_spam_arg = --spam
    antispam_pipe_program_notspam_arg = --ham
    antispam_pipe_tmpdir = /tmp
}
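To make it a bit more concrete what the plugin ends up running with that configuration, here's a small Python sketch that assembles the sa-learn command line for one moved message. It's only my reading of how the pipe backend behaves: I'm assuming the args string is split on ";", that %u expands to the user name, and that the spam/ham argument is appended last. The function name is invented for the example.

```python
def sa_learn_argv(user, is_spam):
    """Build the argument list the pipe backend would run for one message.

    Assumptions (not checked against the plugin source): ";" separates
    arguments, "%u" is replaced by the user name, and the spam/notspam
    argument goes at the end. The message itself is fed in on stdin.
    """
    program = "/usr/bin/sa-learn"
    args = "--siteconfigpath=/etc/spamassassin/delivery;--local;--username=%u"
    class_arg = "--spam" if is_spam else "--ham"
    expanded = [arg.replace("%u", user) for arg in args.split(";")]
    return [program, *expanded, class_arg]


if __name__ == "__main__":
    # A message that "alice" (a made-up user) has just moved into Junk.
    print(" ".join(sa_learn_argv("alice", is_spam=True)))
```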
The only part that might need a bit of explanation there is the --siteconfigpath=/etc/spamassassin/delivery - because I want the local delivery spam options to be different from the global prefilter options.
In particular, I want the global prefilter to have adaptive filtering disabled, whereas that's the whole point of the local delivery filter.
But also, while I want the global prefilter to make use of online spam tests (AskDNS, SPF, etc...), I need --local to have the local delivery filter use local tests only - and you can't specify that in the config file!
This post originally had --configpath=/etc/spamassassin/delivery in the above - it should have been --siteconfigpath.
There's still one piece missing from the puzzle.
The dovecot-antispam plugin checks if email is moved into or out of the Junk folder, and runs the training program accordingly.
But how does it check if incoming email is spam, and move it into the Junk folder?
It doesn't. You have to set that up separately.
sieve
Second part first - moving incoming spam into the Junk folder.
This isn't too hard, if you already have sieve filtering enabled.
You can configure a sieve_after directory and drop a sieve file in there to look for the SpamAssassin spam header and move the email appropriately.
My /etc/dovecot/sieve-after/99-spamfilter.sieve script is just:
require ["fileinto","regex"];
if header :regex "X-Spam-Status" "^Yes" {
    fileinto "Junk";
    stop;
}
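If sieve isn't familiar, the same decision can be sketched in Python: look at the X-Spam-Status header and file the message into Junk if its value starts with "Yes". This is just an illustration of the logic, not something Dovecot runs; the function name and the "INBOX" fallback (standing in for sieve's implicit keep) are my own choices, and unlike sieve it only looks at the first X-Spam-Status header.

```python
import re
from email import message_from_string

# Same pattern as the sieve rule: the header value must start with "Yes".
SPAM_STATUS = re.compile(r"^Yes")


def target_folder(raw_message):
    """Return the folder a message would be filed into."""
    message = message_from_string(raw_message)
    status = message.get("X-Spam-Status", "")
    return "Junk" if SPAM_STATUS.search(status) else "INBOX"


if __name__ == "__main__":
    sample = "X-Spam-Status: Yes, score=12.3 required=5.0\nSubject: hi\n\nbody\n"
    print(target_folder(sample))
```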
spamc/spamd, again.
The first part, checking if incoming email is spam with SpamAssassin, is trickier. Because of course it is.
The first wrinkle is that emails might (still) be larger than SpamAssassin's recommended limit of 500kB.
Therefore you can't pass the emails to spamassassin, but have to pass them to spamc instead.
The wrinkle turns into an actual problem when you realise that spamc has no way of passing configuration options (or a configuration path) to spamd.
That means there's no way of getting spamc for per-user adaptive filtering to use different options than spamc for global rejection filtering.
The solution?
You have to run two spamd daemons, with different configurations, and get the different invocations of spamc to talk to the right one.
As someone who didn't want to run one spamd daemon, this is getting annoying.
But I'm far enough down this rabbit hole that I've kind of stopped caring, and don't really have the energy to go back and start evaluating yet another spam filter.
So I set up a second daemon, with a different configuration file, and different command-line options (again, because some options can't be specified in the config file), listening on a different port, using a different PID file (because we really do want to run two instances at the same time), and that seems to be working.
The last part of the last problem is that Dovecot's delivery agent doesn't seem to have a way of passing incoming emails to another program like spamc to actually perform the spam check.
And I only want to pass emails to spamc if they're about to be delivered to a local user - not if they're just passing through the email system.
Fortunately, I can still fix that on the Postfix side by changing the delivery command from
mailbox_command = /usr/lib/dovecot/dovecot-lda -f "$SENDER" -a "$RECIPIENT"
to
mailbox_command = /usr/bin/spamc -p 785 -u "$USER" -e /usr/lib/dovecot/dovecot-lda -f "$SENDER" -a "$RECIPIENT"
Pretty much. Yeah. I think I've finally got a setup that does what I originally wanted to do.
This has felt like a tortuous and torturous path. Part of me thinks that this shouldn't be as hard as it has been. Spam filtering is a really important part of dealing with emails these days - and has been for a couple of decades. So why hasn't dealing with it been made more streamlined by now?
Part of me wonders if I'm just doing it completely wrong. That maybe there is a really straightforward way of setting this up that I've just missed entirely.
Even so, I have read some of the documentation for the tools I've worked with here, notably SpamAssassin, and I still have some big-picture questions:
If spamc is meant to be a drop-in replacement for spamassassin, why can't spamc take the same command-line options as spamassassin?
Why doesn't spamc have e.g. a --local option?
If spamc is meant to be a drop-in replacement for spamassassin, why can't spamassassin take the same command-line options as spamc?
Why doesn't spamassassin have e.g. a --max-size option?
If spamassassin has been written, by the developers, to be able to work with multiple predefined configurations, as evidenced by the existence of the --siteconfigpath option, why can't spamd work with multiple predefined configurations?
Why can't all options be specified (or at least have their defaults set) in predefined configuration files?
e.g. there appears to be no config file equivalent of spamd's --local option to use local tests only, or its --nouser-config to prevent trying to load a per-user configuration.
Why hasn't the slow startup time been fixed? Python has __pycache__. Emacs has dump/unexec (for its sins). Most OpenGL implementations have a shader cache.
What's with sa-compile?
sa-compile uses "re2c" to compile the site-wide parts of the SpamAssassin ruleset. No part of user_prefs or any files included from user_prefs can be built into the compiled set.
This compiled set is then used by the "Mail::SpamAssassin::Plugin::Rule2XSBody" plugin to speed up SpamAssassin's operation
Additionally, "sa-compile" will not restart "spamd"
Congratulations, you've added the complexity of a cache, but without actually fixing the problem enough that you can eliminate any of the other awkward workarounds that are in place.
Look, I'm a software developer. I know that features don't just happen and someone has to write them, and that every proposed feature starts with minus 100 points. But still, SpamAssassin is 20 years old, and it's not like these are new features that no-one is quite sure how they'd work. These are features that either already exist, or that people have already put a fair amount of work into improving. They're just not consistently available, or quite finished yet.
How long does that last layer of polish take though?
posted at: 14:31 | path: / | permanent link to this entry
In part 2, I got Postfix to check for obvious spam and reject it altogether.
For emails that are only probably spam, I want to filter them into the user's Junk folder.
Also, I want the system to track email that the user manually moves into and out of their Junk folder, and use that to make the filter better over time.
This type of filtering was first proposed by Paul Graham in his 2002 essay A Plan for Spam.
One point mentioned in A Plan for Spam is:
each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters.
I have seen some suggest that Graham's point a) is incorrect, because adaptive filters become more effective the more email they see. Therefore, a shared adaptive filter which sees all the email on a system will learn more quickly and be more effective than one that sees only one user's email. This does make sense in the general case (although I have not seen any hard numbers comparing the approaches) but I disagree with the idea of using a shared filter based on Graham's point b) that each user should have their own precise definition of spam.
For one thing, some people have a genuine interest in the things that other people might mark as spam.
As I mentioned in part 1, I get a lot of buy-to-let spam, and I'd like my personal spam filter to bin all of it.
But some people (even some honest, upstanding citizens) work in the buy-to-let industry, and would definitely not want email about that going to their Junk folder.
Another reason is that different people use spam filters in different ways.
Some people sign up for mailing lists, on purpose, and then after some time simply become uninterested in those emails.
Unfortunately, some of those mailing lists can be hard to unsubscribe from.
Therefore some people "unsubscribe" from mailing lists by marking them as spam until those emails stop showing up in their Inbox.
There is an argument that this is "wrong". Those emails are not technically spam, because spam is unsolicited, and as the user signed up to that mailing list those emails don't qualify as spam. As a techie (and therefore a born pedant) I have some sympathy with that view.
On the other hand, it is the mark of a good tool that people adapt it to perform tasks unimagined by its original creator. If people create a workflow to manage their environment in a way that works for them, I don't think it's very helpful to tell them that they shouldn't do that. Firstly because it's generally ineffective (some people will do it anyway), but mostly because it gives rise to the kind of resentment users can have toward IT admins for rules that don't seem to benefit anyone, but still make their lives worse. At the very least, you should try to provide users with an alternative tool that solves their problems in a better way, before asking them to stop using the tool they already have in the way you don't like.
Therefore, if two people on the same system sign up to a mailing list, and one person wants to treat it as spam when the other doesn't, let them. Give each user their own spam filter.
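To show why per-user statistics matter, here's a deliberately simplified sketch in the spirit of A Plan for Spam: each user gets their own token counts, so the same message can score as spam for one person and ham for another. It is not Graham's exact algorithm (it skips his doubling of ham counts and minimum-occurrence rule), the class and variable names are invented, and the user names are placeholders.

```python
from collections import Counter, defaultdict
from math import prod


class UserFilter:
    """Per-user token statistics, loosely following A Plan for Spam."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_messages += 1

    def token_probability(self, token):
        spam = self.spam_tokens[token] / max(self.spam_messages, 1)
        ham = self.ham_tokens[token] / max(self.ham_messages, 1)
        if spam + ham == 0:
            return 0.4  # the value Graham suggests for never-seen words
        # Clamp so that no single token can be treated as absolute proof.
        return min(0.99, max(0.01, spam / (spam + ham)))

    def spam_probability(self, text):
        probs = [self.token_probability(t) for t in set(text.lower().split())]
        # Keep the 15 tokens whose probability is furthest from neutral.
        probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:15]
        spam = prod(probs)
        ham = prod(1 - p for p in probs)
        return spam / (spam + ham)


# One independent filter per user name.
filters = defaultdict(UserFilter)

if __name__ == "__main__":
    filters["alice"].train("weekly buy-to-let newsletter", is_spam=True)
    filters["bob"].train("weekly buy-to-let newsletter", is_spam=False)
    for user in ("alice", "bob"):
        print(user, filters[user].spam_probability("buy-to-let newsletter"))
```

With a single shared filter, those two training events would simply cancel each other out.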
While I'm using Postfix to transfer email to and from the internet (the Mail Transfer Agent), it's not responsible for delivering and managing email on the local system.
For that I'm using Dovecot as both the local delivery agent and as the IMAP server which users connect to to read and manage their email.
Dovecot has a plugin, dovecot-antispam, which allows it to train a spam filter based upon users moving emails into and out of the Junk folder.
This plugin has a few different backends for working with different spam filters.
pipe - Sends (pipes) email to a spam filter.
This is a generic backend that can use any spam filter, but you need to tell dovecot-antispam what program to run to classify emails, what parameters to pass it, and how to tell the filter if an email is spam or ham.
It's flexible, but requires some careful configuration.
spool2dir - This is a generic backend like "pipe", but instead of sending emails to the spam filter, it drops them into different directories for spam/ham. You then have to set up "some other system" to monitor those directories and run the spam filter on any emails that show up there. That has all the complexity of the "pipe" backend, and then some more on top, so I'm avoiding that one.
dspam - is a spam filter which is not packaged for Debian, and by the looks of the project was last updated in 2012. I'd have to build it myself from sources. This is not appealing.
CRM114 - "a system to examine incoming e-mail [...] and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires."
It's packaged for Debian, and dovecot-antispam has been adapted to work specifically with it.
Let's have a closer look at that.
(Named after The CRM 114 Discriminator from Dr. Strangelove.)
First stop after installation, the crm(1) man page:
CRM114 is a language designed to write filters in. It caters to filtering email, system log streams, html, and other marginally human-readable ASCII that may occasion to grace your computer.
...followed by a lot of information about how the CRM programming language works.
Huh. So, CRM114 isn't a spam filter. It's a programming language for writing many types of filters - including spam filters. Anyway, right at the bottom of the man page, it says:
This manpage describes the crm114 utility as it has been described by QUICKREF.txt, shipped with crm114-20040212-BlameJetlag.src.tar.gz. The DESCRIPTION section is copy-and-pasted from INTRO.txt as distributed with the same source tarball.
Let's track down and have a look at INTRO.txt:
If you are reading this to get information on how to install CRM114 as a mailfilter, you have the _wrong_ document.
But fear not, we _do_ have the document you want. The document you want if you want to know how to install CRM114 as a mailfilter is:
CRM114_Mailfilter_HOWTO.txt
*sigh* Fine. From The CRM114 & Mailfilter HOWTO:
CRM114 is a *language* designed to write text filters and classifiers in. It makes it easy to tweak code.
Mailfilter is just _one_ of the possible filters; there are many more out there and if Mailfilter doesn't do what you want, it's easy to create one that does.
Mailreaver is another one of the filters, with different (and better, I hope) designs, that can use Mailtrainer (yet another filter) to build even better statistics files.
There are yet other filters written in CRM114; you can read all about them on the web page:
crm114.sourceforge.net
(and if you create one, and want to share it, put it on a web page and send me an email so I can add a pointer.)
...followed by a list of the 8 (yes, 8) major steps to using CRM114 mailfilter/mailreaver.
I just wanted a spam filter. I got a language for writing spam filters, two different spam filters with different designs, a third filter to do something slightly different, a link to a web page with even more filters on it, and the option of not just configuring the filters I already have, but the offer of actually rewriting them, or even of writing my own, in a language I've never used before.
I'm definitely getting some choice overload vibes here.
mailreaver
The Mailfilter HOWTO quoted above seems to suggest that mailreaver is the preferred CRM114 spam filter, and the dovecot-antispam Configuration documentation includes a reference to it too, so let's skip trying to evaluate all the different possible CRM114 filters for now, and just go with that and see where it leads.
Looking closely, mailreaver is an actual program.
When installed, it's marked as executable and includes a shebang line that allows it to run like any other Unix script.
The problem is, it's installed outside of the normal PATH, doesn't come with its own man page, and doesn't provide usage information if run as /usr/share/crm114/mailreaver --help.
That does not inspire confidence.
Then, from reading the HOWTO, configuring it is really awkward.
There's no default configuration file installed anywhere under /etc, where configuration files are normally found.
There is an example configuration file in /usr/share/crm114/mailfilter.cf, but you need to edit it before you can use it.
The complication here is that you can't edit that version because your changes will get overwritten if the package gets updated for any reason.
So you need to copy it yourself to somewhere like /etc and edit that version.
The configuration file syntax seems unique to CRM114/mailreaver and is just strange.
Most configuration files would allow you to set an option, like adding spam headers to scanned emails, with something like add_headers=yes.
The way to do this with mailreaver is :add_headers: /yes/.
At least, that's one configuration file.
Along with the mailfilter.cf, there's also the rewrites.mfp configuration file, which is required, but can be empty, but if it's not empty its syntax looks like yourname@yourdomain.yourplace>->MyEmailAddress where ">->" is the "rewrite operator", whatever that is.
You also have to create the spam/ham statistics files, even though they also can be empty.
mailreaver apparently can't create these files itself on first use if they don't exist, so when you create a new user on your mailserver, you have to create empty statistics files for them otherwise the filtering won't work.
The spam/ham statistics files have the extension .css, despite mailfilter being created around 2004, and Cascading Stylesheets (CSS) having been an important part of HTML4 when it was released in 1997, 7 years and a dot-com bubble previously.
According to The CRM114 FAQ it's because they were originally "CRM Sparse Spectra" files, and I know it's a petty complaint, but reusing a well-established file extension still bothers me.
A lot of the documentation insists that the configuration files and the statistics files have to be in the same directory.
This means you couldn't set up a global configuration with per-user filters.
Looking a lot further down through the Mailfilter HOWTO, when it finally gives a reference for the mailreaver command options, it does say that there is a separate --config option for using a file other than mailfilter.cf as the config file - but implies that only the file name, rather than a full path to the config file, can be given.
Looking back at the dovecot-antispam sample configuration for mailreaver, it looks like you can actually give a full path to a config file.
Also, examining the mailreaver source code maybe backs that up, but the CRM114 language is unlike any other programming language I've ever seen, and at this point I'm fed up enough that I seriously can't be bothered to learn it sufficiently to figure it out for certain.
CRM114/mailreaver doesn't feel like a finished product to me.
It feels like a research project someone got working to scratch their own itch. Then, when it did what they needed, and they knew how to modify it to do anything else they wanted it to, they stopped improving it further. It seems that, despite its technical capabilities, no community formed around it to give it the push to become a genuinely usable bit of software, by people who aren't particularly interested in the technology behind how it does what it does. Normally that push might come from either direct contributions from that community, or just from feedback where multiple users all describe the same problems.
It looks like CRM114 is a programming language that has had fewer than a dozen programs written in it - ever. With that in mind, I had a look around the project's release history, and I don't think it's been updated since 2009.
CRM114 may do a really good job of solving its author's problem, which is fine, but it's not going to work for me.
Having been defeated by CRM114, I now need to find a different adaptive spam filter that I can plug into dovecot-antispam via its "pipe" backend.
Fortunately, SpamAssassin - which I already have installed and am using for definite global spam rejection - has such a filter in its Bayesian Classifier, sa-learn.
I'll look at that in part 4.
posted at: 11:52 | path: / | permanent link to this entry
In part 2, having rejected using a Postfix Content Filter to check for spam because it can't reject emails that are sufficiently spammy, I set up spam filtering with SpamAssassin/spamass-milter. However, that doesn't tell the whole story, and I ended up there by more of a roundabout route than that post lets on.
spamass-milter?
I did get to the point of setting up spamass-milter.
I was most of the way through configuring it when I re-checked the Postfix Milter documentation, and came across the following limitation:
When you use [some other Postfix feature], Milter applications have access only to the SMTP command information; they have no access to the message header or body, and cannot make modifications to the message or to the envelope.
...which kind of breaks spam filtering.
I mean, sure, you can do some spam detection on the SMTP command information, such as the sender address, but the data that's best for determining if an email is spam is the content of the email itself. Without that, your spam classification just isn't going to be very good at all.
Combining that problem with the "2 daemons connected by another program" setup that I already wasn't very happy with, I decided to look into another approach.
spampd
Looking further through the Postfix documentation, I realised that I wanted a Before-Queue Content Filter proxy. This is a program that Postfix passes the email to, which can filter or transform it as it wants, and then either tell Postfix to reject the email, or pass it back into the Postfix queue.
Fortunately, the Spam Proxy Daemon spampd exists, which uses SpamAssassin to do exactly that. Also:
spampd was initially designed as a content filter mechanism for use with the Postfix MTA.
...which is good. But also, from the spampd man page, under Installation it says:
Note that spampd replaces spamd from the SpamAssassin distribution in function. You do not need to run spamd in order for spampd to work.
...and that's great!
I installed it, got it up and running, and started to configure it appropriately to reject unwanted emails.
But then, looking further through the spampd man page, right at the bottom, under To Do it says:
Add configurable option for rejecting mail outright based on spam score.
So, while Postfix makes it possible for a proxy to reject emails, this particular one can't. Which is the only thing I wanted it for.
...well, that's just fantastic. *sigh*
Having familiarised myself with how the proxy setup works with spampd, I thought I might be able to create one myself without too much effort.
According to my reading of the section How Postfix talks to the before-queue content filter, it sounded to me like Postfix just opens a connection to the proxy, and passes the whole email command queue and message to it in one long stream of data. Then it expects a single response, and for the proxy to pass the whole stream back into Postfix as-is if the email is accepted. That didn't actually sound too hard.
In particular, Linux has programs which mean you don't have to deal with the complexity of handling network connections between Postfix and the proxy yourself. There are some which will listen for incoming connections for you, and pass the data as if it were coming from a file. Traditionally this was done with one of the many implementations/replacements of inetd, but more recently this functionality is part of the systemd service manager, with systemd.socket activation. Then, there are programs that will accept data as if you were writing it to a file, and send it on to a network connection instead, such as the many implementations of netcat.
The plan was: Read the whole email command stream and message from Postfix, pass it to SpamAssassin, and then based on the spam score either reject it, or accept it and pass the whole thing back into Postfix again.
That's about 12 lines of shell script. No problem.
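For illustration only, the accept-or-reject step of that plan might look something like this in Python (this is not the shell script itself). It assumes SpamAssassin has already added its usual X-Spam-Status header, which carries a score=N field, and the function name, the threshold of 10.0 and the reply texts are placeholders I've picked for the example.

```python
import re

# SpamAssassin's X-Spam-Status header includes a "score=N" field.
SCORE_PATTERN = re.compile(r"\bscore=(-?\d+(?:\.\d+)?)")


def smtp_reply(spam_status, reject_at=10.0):
    """Pick an SMTP reply line based on the spam score.

    Hypothetical helper: the threshold and reply wording are illustrative.
    A header with no parsable score is accepted rather than rejected.
    """
    match = SCORE_PATTERN.search(spam_status)
    if match and float(match.group(1)) >= reject_at:
        return "550 5.7.1 Message rejected as spam"
    return "250 2.0.0 Ok"


if __name__ == "__main__":
    print(smtp_reply("Yes, score=15.2 required=5.0 tests=EXAMPLE_RULE"))
    print(smtp_reply("No, score=0.3 required=5.0 tests=none"))
```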
...even when you take into account Hofstadter's Law
Having got a (roughly) 12-line shell script prototype up and running, I got Postfix to connect to it, but it wasn't receiving any data. Playing around with some ideas, I tried writing the response immediately, and I started to be able to read the command stream. If I wrote more data back to the program that was sending it, I got more of the command stream.
I realised that Postfix actually wanted to talk the full SMTP protocol to the proxy.
This means sending a "service ready" message as soon as Postfix connects, and then acknowledging each command Postfix sends me before it will send another, until it finally sends the email data. This isn't really that hard (hurrah for text-based protocols!) but it was more complex than I was expecting. If I'd realised this at the start, I wouldn't have picked shell scripting as the language to do it in.
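Here's a minimal sketch of that conversation, in Python rather than shell, to show the shape of it: greet first, acknowledge every command, answer DATA with a 354, and collect lines until the lone ".". It only covers the receiving half (a real proxy also has to relay everything back into Postfix), it knows nothing about SMTP extensions, and the function name and the "proxy.example" host name are made up.

```python
import io


def receive_one_message(reader, writer):
    """Speak just enough SMTP to collect the commands and one message body."""
    commands, data_lines = [], []
    in_data = False
    # The "service ready" greeting has to go out before anything is read.
    writer.write("220 proxy.example ESMTP ready\r\n")
    for raw in reader:
        line = raw.rstrip("\r\n")
        if in_data:
            if line == ".":
                # A lone dot ends the message data.
                in_data = False
                writer.write("250 2.0.0 Ok\r\n")
            else:
                # Undo dot-stuffing: a leading "." was doubled by the sender.
                data_lines.append(line[1:] if line.startswith(".") else line)
            continue
        commands.append(line)
        verb = line.split(" ", 1)[0].upper()
        if verb in ("EHLO", "HELO"):
            writer.write("250 proxy.example\r\n")  # advertise no extensions
        elif verb == "DATA":
            in_data = True
            writer.write("354 End data with <CR><LF>.<CR><LF>\r\n")
        elif verb == "QUIT":
            writer.write("221 2.0.0 Bye\r\n")
            break
        else:
            # Acknowledge everything else so the sender moves on.
            writer.write("250 2.0.0 Ok\r\n")
    return commands, data_lines


if __name__ == "__main__":
    session = io.StringIO(
        "EHLO mx.example\r\nMAIL FROM:<a@example.com>\r\n"
        "RCPT TO:<b@example.com>\r\nDATA\r\nSubject: hi\r\n\r\nhello\r\n.\r\nQUIT\r\n"
    )
    replies = io.StringIO()
    print(receive_one_message(session, replies))
    print(replies.getvalue())
```

In practice the reader and writer would be the two directions of the connection from Postfix, and each reply would need flushing before the next command arrives.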
But by the time I'd figured all this out, I'd written the first 90% of the proxy anyway. Given that, I thought I might as well finish up the second 90% of the work.
The result was spamprox - The barely SMTP-capable spam proxy. It's kind of a bodge job, doesn't really understand what it's doing, and would probably fail badly if it had to deal with anything other than the exact commands Postfix normally sends. On the plus side, it doesn't have to deal with anything else, and for the one role it has, it works!
All the ways that other people had come up with to do spam filtering weren't good enough, but I had managed to create one that was.
Having got through all that, I started to write everything up. After lots of writing and editing and changing tone, as I was getting close to finishing, I went back through all the documentation so I could link to anything relevant and include quotes where appropriate. I dug up the Postfix Milter documentation to get the exact text of the "you can't rely on having the email contents in milters" limitation and found:
When you use the before-queue content filter for incoming SMTP mail (see SMTPD_PROXY_README), Milter applications have access only to the SMTP command information...
Wait, what?
The "[some other Postfix feature]" that I'd dismissed as an internal filter or feature I might have wanted to enable at some point turned out to be "an external proxy", which you might use for something like... a spam filter. In other words, the only reason a milter wouldn't have the data it needed for spam filtering would be that I had also set up a proxy to do spam filtering. But I'd never need to set up both, so my fear about milters not working was completely unfounded. I just didn't have the knowledge I needed to understand that properly when I first read the docs.
I still wasn't entirely happy with the "2 daemons connected by another program" milter setup, but given that the alternative was a home-grown bodge job, written in a language that didn't turn out to be well-suited for the complexity of the task, I thought that maybe I ought to reconsider my approach.
...is that at least you get plenty of exercise.
-- Author unknown, paraphrased
Once I'd got past my reluctance to throw away the work I'd put into spamprox, and remembered that I'd prefer not to have to maintain my own custom parts in my email setup, that's when I decided to go back to the spamass-milter setup - as described in Part 2.
That decision also involved abandoning a lot of the blog write-up I'd already done for that work. The tone I'd gone with, of getting increasingly exasperated with the failure of one approach after another to try and make existing components work in what seemed to be the most obvious way, didn't make sense anymore. The run-up to getting the milter approach working (i.e. Part 2 of the series) didn't have enough failures to make the point, and once I'd realised that the chain of failures was due to my own misunderstanding, I needed to rewrite everything that came after (i.e. this post) to be a lot more humble.
Still, I learned quite a bit. So it wasn't a complete waste.
Anyway, going through 4 different approaches to spam filtering (including writing my own proxy for one of them), writing nearly all of it up before discovering my mistake, wavering over the decision to abandon that and go back to a previously dismissed approach, and then finding the motivation to rewrite the write-up accounts for a reasonable proportion of the delay between Part 1 and Part 2.
Hopefully Part 3 will be completed much more promptly. See you then...
posted at: 14:34 | path: / | permanent link to this entry
In Part 1 I set up spam filtering for the Postfix mail server with SpamAssassin, according to the Debian SpamAssassin default instructions for doing so.
I don't want to waste any more resources than absolutely necessary on spam. "Resources" includes processing time and disk space, but also human attention, like the time it takes to check through your spam folder if you suspect something may have ended up there. Therefore, I'd like to keep as much spam as possible off the system entirely.
If I receive a spam email, I have 3 options for not keeping it.
The simplest, but least polite option, is simply to delete any email that gets a high enough spam score. However, email is meant to be a reliable delivery service, as described in the email standards document RFC 5321 §6:
When the receiver-SMTP accepts a piece of mail (by sending a "250 OK" message in response to DATA), it is accepting responsibility for delivering or relaying the message. It must take this responsibility seriously. It MUST NOT lose the message for frivolous reasons, such as because the host later crashes or because of a predictable resource shortage.
[...]
As discussed in Section 7.8 and Section 7.9 below, dropping mail without notification of the sender is permitted in practice. However, it is extremely dangerous and violates a long tradition and community expectations that mail is either delivered or returned. If silent message-dropping is misused, it could easily undermine confidence in the reliability of the Internet's mail systems.
So, don't do that.
The second option is to bounce the email. Again from standards doc RFC 5321 §6, linked above:
Utility and predictability of the Internet mail system requires that messages that can be delivered should be delivered, regardless of any syntax or other faults associated with those messages and regardless of their content. If they cannot be delivered, and cannot be rejected by the SMTP server during the SMTP transaction, they should be "bounced" (returned with non-delivery notification messages) as described above.
If you've ever received an email titled "Delivery failure" or similar, this is a bounce message.
The trouble with bounces is that spammers often use hijacked computers to send their emails, and lie about the return address. So if you've ever received an email titled "Delivery failure" for an email you never sent, this is because a spammer tried to send an email to someone, and put your email address as the sender. When the recipient's system bounced the email, the bounce notification got delivered to you.
As the Debian SpamAssassin Notes point out:
The problem is, spammers (and viruses) routinely forge the from address on the envelope. This means that if there is a bounce generated, it will go to this address, which can be randomly generated, or worse, an innocent third party.
Therefore, it is very important that your system doesn't generate a bounce.
Similarly, from the Postfix content filter instructions:
NOTE: in this time of mail worms and spam, it is a BAD IDEA to send known viruses or spam back to the sender, because that address is likely to be forged.
So, don't do that either.
The third option is to reject the email without accepting it in the first place.
The way email works is that the sender's system connects to the recipient's system, and passes all the email data to it. As described above, at the end of this process the recipient's system says "OK", and at that point it takes responsibility for the email. However, it also has the option of saying "No" for some reason. For example, if the email is too large, or if the recipient's mailbox has reached its quota limit. (Remember mailbox quotas? They still exist in some places.)
Then it's the sending system's responsibility to handle the problem of the email not being delivered - and it knows where the email really came from.
This is the best way to handle emails you want to keep off your system.
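As a minimal sketch of that decision (in Python, with a made-up threshold and made-up reply texts, not anything taken from Postfix or SpamAssassin), the receiving system picks its final reply while the sender is still connected and waiting for an answer:

```python
# Hypothetical cut-off: scores at or above this are refused outright.
REJECT_THRESHOLD = 15.0


def end_of_data_reply(spam_score, threshold=REJECT_THRESHOLD):
    """Choose the reply sent once the email data has been received.

    A 5xx reply means "No": responsibility for the email stays with
    the sending system. A 250 reply means "OK": we now own the email.
    """
    if spam_score >= threshold:
        return "550 5.7.1 Message rejected as spam"
    return "250 2.0.0 Ok"


print(end_of_data_reply(22.4))  # refused during the SMTP transaction
print(end_of_data_reply(1.3))   # accepted for delivery
```

Because the refusal happens inside the SMTP transaction, no bounce message is ever generated by the receiving side.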
The problem is that, as I was trying to figure out how to make the content filter tell Postfix to reject emails with a high enough spam score, I learned that Postfix's content filters are applied after the email has already been accepted by the mail server. At that point, it's too late to reject it. Therefore, I need to find a different way to configure SpamAssassin to work with Postfix.
A milter (mail filter) is a type of external program that some mail servers can use to implement custom filtering mechanisms. Postfix has support for milters, and can use them early enough in the process pipeline that they can be used to reject emails.
Further, there is a milter wrapper for SpamAssassin, spamass-milter, that allows SpamAssassin to be used as one. Great!
...except for the problem of too many levels of indirection.
One thing about how milters work is that they cannot be run on-demand, but need to run as a daemon that the mail server connects to.
You might remember from Part 1 that SpamAssassin is also best run as a daemon (spamd) so that its slow startup can be done only once, ahead of time.
Given this, it would be reasonable to expect that spamass-milter would integrate SpamAssassin into itself to absorb this cost.
Sadly, reasonable expectations often appear to be something of an unattainable luxury in this kind of endeavour.
Therefore, to use spamass-milter, you need to have both the spamass-milter and spamd daemons running all the time.
Worse, spamass-milter can't even connect directly to spamd itself, but needs to run the external spamc program to do so on its behalf.
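So, rather than just having the mail server connect to SpamAssassin directly, it instead connects to the spamass-milter daemon, which runs spamc, which in turn connects to the spamd daemon.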
I realise that part of this is just a consequence of the way milters work, but still, it seems like a lot more moving parts than should be necessary.
Aesthetic considerations about the elegance of the solution aside, once the milter was set up as described in the Debian SpamAssassin Postfix Milter instructions, the system was classifying spam as before, but rejecting any sufficiently spammy emails before they were accepted. Hurrah!
After the work it took to get here, that's good enough for now.
But... we're not done yet. Stick around for Part 3! Well, don't stick around - given how long it took me to get this post out you might be in for a bit of a wait. Go do something else, and maybe check back in a week or so. There might even be a Part 2b first.
posted at: 09:57 | path: / | permanent link to this entry