In part 2, having rejected using a Postfix Content Filter to check for spam because it can't reject emails that are sufficiently spammy, I set up spam filtering with SpamAssassin/spamass-milter. However, that doesn't tell the whole story, and I ended up there by more of a roundabout route than that post lets on.
spamass-milter
I did get to the point of setting up spamass-milter.
I was most of the way through configuring it when I re-checked the Postfix Milter documentation, and came across the following limitation:
When you use [some other Postfix feature], Milter applications have access only to the SMTP command information; they have no access to the message header or body, and cannot make modifications to the message or to the envelope.
...which kind of breaks spam filtering.
I mean, sure, you can do some spam detection on the SMTP command information, such as the sender address, but the data that's best for determining if an email is spam is the content of email itself. Without that, your spam classification just isn't going to be very good at all.
Combining that problem with the "2 daemons connected by another program" setup that I already wasn't very happy with, I decided to look into another approach.
spampd
Looking further through the Postfix documentation, I realised that I wanted a Before-Queue Content Filter proxy. This is a program that Postfix passes the email to, which can filter or transform it as it wants, and then either tell Postfix to reject the email, or pass it back into the Postfix queue.
Fortunately, the Spam Proxy Daemon spampd exists, which uses SpamAssassin to do exactly that. Also:
spampd was initially designed as a content filter mechanism for use with the Postfix MTA.
...which is good. But also, from the spampd man page, under Installation it says:
Note that spampd replaces spamd from the SpamAssassin distribution in function. You do not need to run spamd in order for spampd to work.
...and that's great!
I installed it, got it up and running, and started to configure it appropriately to reject unwanted emails.
But then, looking further through the spampd man page, right at the bottom, under To Do it says:
Add configurable option for rejecting mail outright based on spam score.
So, while Postfix makes it possible for a proxy to reject emails, this particular one can't. Which is the only thing I wanted it for.
...well, that's just fantastic. *sigh*
Having familiarised myself with how the proxy setup works with spampd, I thought I might be able to create one myself without too much effort.
According to my reading of the section How Postfix talks to the before-queue content filter, it sounded to me like Postfix just opens a connection to the proxy, and passes the whole email command queue and message to it in one long stream of data. Then it expects a single response, and for the proxy to pass the whole stream back into Postfix as-is if the email is accepted. That didn't actually sound too hard.
In particular, Linux has programs which mean you don't have to deal with the complexity of handling network connections between Postfix and the proxy yourself. There are some which will listen for incoming connections for you, and pass the data as if it were coming from a file. Traditionally this was done with one of the many implementations/replacements of inetd, but more recently this functionality is part of the systemd service manager, with systemd.socket activation. Then, there are programs that will accept data as if you were writing it to a file, and send it on to a network connection instead, such as the many implementations of netcat.
The plan was: Read the whole email command stream and message from Postfix, pass it to SpamAssassin, and then based on the spam score either reject it, or accept it and pass the whole thing back into Postfix again.
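The "based on the spam score" step of that plan can be sketched roughly as below. This is an illustration in Python, not the shell script I actually wrote. It assumes SpamAssassin has already added its usual X-Spam-Status header to the message; the function names, the rejection threshold of 10.0 and the reply texts are all made up for the example.

```python
import re

# Matches e.g. "X-Spam-Status: Yes, score=12.5 required=5.0 tests=..."
# Assumption: the score appears on the first line of the header.
_SCORE = re.compile(
    r"^X-Spam-Status:.*?\bscore=(-?\d+(?:\.\d+)?)",
    re.MULTILINE | re.IGNORECASE,
)


def spam_score(headers):
    """Return the SpamAssassin score found in the headers, or None."""
    match = _SCORE.search(headers)
    return float(match.group(1)) if match else None


def verdict(headers, reject_at=10.0):
    """Pick an SMTP reply line: reject at or above the (hypothetical) threshold."""
    score = spam_score(headers)
    if score is not None and score >= reject_at:
        return "550 5.7.1 Message rejected as spam"
    return "250 2.0.0 Ok"
```

A message with no X-Spam-Status header at all is accepted here, which seemed the safer default for a sketch.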
That's about 12 lines of shell script. No problem.
...even when you take into account Hofstadter's Law
Having got a (roughly) 12-line shell script prototype up and running, I got Postfix to connect to it, but it wasn't receiving any data. Playing around with some ideas, I tried writing the response immediately, and I started to be able to read the command stream. If I wrote more data back to the program that was sending it, I got more of the command stream.
I realised that Postfix actually wanted to talk the full SMTP protocol to the proxy.
This means sending a "service ready" message as soon as Postfix connects, and then acknowledging each command Postfix sends before it will send another, until it finally sends the email data. This isn't really that hard (hurrah for text-based protocols!) but it was more complex than I was expecting. If I'd realised this at the start, I wouldn't have picked shell scripting as the language to do it in.
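To give an idea of the shape of that dialogue, here is a minimal sketch in Python (not the shell script itself, and not how spamprox is actually written). It sends the greeting first, acknowledges every command, switches into message-collecting mode on DATA, and stops collecting at the lone "." line. The names `serve` and `judge` are invented for the example, `judge` stands in for whatever decides the final reply, and passing the accepted message back into Postfix is left out entirely.

```python
def serve(inp, out, judge=lambda message: "250 2.0.0 Ok"):
    """Speak just enough SMTP to collect one message; return its lines.

    inp and out are text streams. judge is a hypothetical callback that
    takes the collected message lines and returns the final reply line.
    """
    def reply(text):
        out.write(text + "\r\n")
        out.flush()  # the client waits for each reply before continuing

    reply("220 localhost ESMTP sketch ready")  # nothing arrives before this
    message = []
    in_data = False
    for raw in inp:
        line = raw.rstrip("\r\n")
        if in_data:
            if line == ".":  # a lone dot ends the message
                in_data = False
                reply(judge(message))
            else:
                # A leading dot was doubled by the sender; strip one.
                message.append(line[1:] if line.startswith(".") else line)
            continue
        verb = line.split(" ", 1)[0].upper()
        if verb == "DATA":
            in_data = True
            reply("354 End data with <CR><LF>.<CR><LF>")
        elif verb == "QUIT":
            reply("221 2.0.0 Bye")
            break
        else:
            reply("250 Ok")  # acknowledge everything else without looking
    return message
```

Run from something inetd-like that hands the connection over as standard input and output, this would presumably be called as `serve(sys.stdin, sys.stdout)`. Like the real thing, it acknowledges commands without understanding them.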
But by the time I'd figured all this out, I'd written the first 90% of the proxy anyway. Given that, I thought I might as well finish up the second 90% of the work.
The result was spamprox - The barely SMTP-capable spam proxy. It's kind of a bodge job, doesn't really understand what it's doing, and would probably fail badly if it had to deal with anything other than the exact commands Postfix normally sends. On the plus side, it doesn't have to deal with anything else, and for the one role it has, it works!
All the ways that other people had come up with to do spam filtering weren't good enough, but I had managed to create one that was.
Having got through all that, I started to write everything up. After lots of writing and editing and changing tone, as I was getting close to finishing, I went back through all the documentation so I could link to anything relevant and include quotes where appropriate. I dug up the Postfix Milter documentation to get the exact text of the "you can't rely on having the email contents in milters" limitation and found:
When you use the before-queue content filter for incoming SMTP mail (see SMTPD_PROXY_README), Milter applications have access only to the SMTP command information...
Wait, what?
The "[some other Postfix feature]" that I'd dismissed as some internal filter or feature I might have wanted to enable at some point turned out to be "an external proxy", which you might use for something like... a spam filter. In other words, the only reason a milter wouldn't have the data it needed for spam filtering would be that I had also set up a proxy to do spam filtering. Except I'd never need to set up both, so my fear about milters not working was completely unfounded. I just didn't have the knowledge I needed to understand that properly when I first read the docs.
I still wasn't entirely happy with the "2 daemons connected by another program" milter setup, but given that the alternative was a home-grown bodge job, written in a language that didn't turn out to be well-suited for the complexity of the task, I thought that maybe I ought to reconsider my approach.
...is that at least you get plenty of exercise.
-- Author unknown, paraphrased
Once I'd got past my reluctance to throw away the work I'd put into spamprox, and remembered that I'd prefer not to have to maintain my own custom parts in my email setup, I decided to go back to the spamass-milter setup, as described in Part 2.
That decision also involved abandoning a lot of the blog write-up I'd already done for that work. The tone I'd gone with, of getting increasingly exasperated with the failure of one approach after another to try and make existing components work in what seemed to be the most obvious way, didn't make sense anymore. The run-up to getting the milter approach working (i.e. Part 2 of the series) didn't have enough failures to make the point, and once I'd realised that the chain of failures was due to my own misunderstanding, I needed to rewrite everything that came after (i.e. this post) to be a lot more humble.
Still, I learned quite a bit. So it wasn't a complete waste.
Anyway, going through 4 different approaches to spam filtering (including writing my own proxy for one of them), writing nearly all of it up before discovering my mistake, wavering over the decision to abandon that and go back to a previously dismissed approach, and then finding the motivation to rewrite the write-up, accounts for a reasonable proportion of the delay between Part 1 and Part 2.
Hopefully Part 3 will be completed much more promptly. See you then...