Dovecot dspam integration

Attention!

The information on this page is historic: the dspam plugin has been superseded by the more generic dovecot antispam plugin.

Here's how I integrate dspam into dovecot.

Rationale

We all want spam filtering, and I want it on my server so that my webmail is already filtered. I get approximately 1400 spam mails per month, plus many more that never make it past my SMTP front lines. I use dspam because it has a much smaller footprint than SpamAssassin, and because it catches almost all of my spam with almost no false positives.

Although I'm using dspam, almost everything I describe here should apply to any other spam filtering system as well.

References

dovecot: http://www.dovecot.org/
dspam: http://dspam.nuclearelephant.com/
DMT: http://annexia.ca/Members/Dustin/Scripts/dmt/
my first message to the dovecot mailing list: http://news.gmane.org/find-root.php?message_id=%3c41BF00D6.60604%40sipsolutions.net%3e

dovecot & dspam

why integrate dovecot and dspam?

Here are the reasons why I think that dovecot should (via a plugin) be able to handle spam specially.

People have suggested that things like DMT work well enough for them. Well, they don't for me. DMT is a system that needs 6 folders to work properly! Maybe not all of those need to be visible to the end user, but still. Don't 6 folders sound like a bit too much to you? They do to me; I want to get away with just a single one. And it's possible: I've been doing it that way for a long time now.

top level description

I have written a dovecot plugin that watches a special folder, which I'll call SPAM from now on. When the MTA (exim here) delivers a message to the user, it first runs it through the spam classifier, in my case dspam. If the message is classified as spam, it is delivered to the SPAM folder instead of going through whatever filter file the user may normally have (my system uses maildrop).

Now at this point we have:

everything classified as spam in SPAM
everything else wherever the user wants it (mailing list folders etc., I for one have very elaborate filtering rules)

Obviously this isn't enough because our spam scanner needs training. We'll occasionally have false positives and false negatives.

With something like DMT, you have to move false positives into a HamTrain folder, and false negatives into a SpamTrain folder. I, on the other hand, want those mails in whatever folders I choose: my mail filters were never applied to mail classified as spam, and a false positive could well have come through a mailing list.

Now this is the point where my dovecot plugin comes into play. Instead of moving mail into special folders, the user has two actions available:

moving it out of the SPAM folder and
moving it into the SPAM folder.

The dovecot plugin watches for these actions (and additionally prohibits APPENDs to the SPAM folder, more for technical reasons than anything else). Depending on which of the two actions the user performed, it tells dspam that it made an error and needs to re-classify the message. The user can now move the message directly into whatever folder she chooses, and it all works. Almost magic.
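The decision the plugin has to make for each copy or append can be sketched as follows. This is only an illustration in Python, not the plugin's actual C code; the names `classify_move` and `Retrain` are hypothetical, and the folder name SPAM is taken from the description above.

```python
from enum import Enum
from typing import Optional

SPAM_FOLDER = "SPAM"  # folder name used in the text; adjust to your setup


class Retrain(Enum):
    """What the classifier should be told about a moved message."""
    AS_SPAM = "spam"       # false negative: user moved it into SPAM
    AS_HAM = "innocent"    # false positive: user moved it out of SPAM


def classify_move(source: Optional[str], dest: str) -> Optional[Retrain]:
    """Decide whether a copy/move should trigger retraining.

    `source` is None for an APPEND, i.e. a message uploaded by the
    client that has no source folder (and so no known classification).
    """
    if source is None:
        if dest == SPAM_FOLDER:
            # Hypothetical error type; the plugin simply refuses the APPEND.
            raise PermissionError("APPEND to the SPAM folder is not allowed")
        return None
    if dest == SPAM_FOLDER and source != SPAM_FOLDER:
        return Retrain.AS_SPAM
    if source == SPAM_FOLDER and dest != SPAM_FOLDER:
        return Retrain.AS_HAM
    # Moves that never touch SPAM (or stay inside it) need no retraining.
    return None
```

Moves between two ordinary folders return `None`, so the user's normal mail handling is left untouched.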

scaling better

When I first suggested this, I was told that it wouldn't scale. I believe it can be made to scale just as well as DMT or any other home-grown system that runs the re-learning as a cron job.

If you trigger training based on a mail copy, what happens when someone dumps 400 emails into a folder all at once? What happens when 30 people do this all at the same time? It might not suit a smaller system at peak hours to have this done. -- Tom Allison (on the dovecot mailing list)

Well, here's my answer: instead of instantly re-learning the messages, we implement the DMT approach behind the scenes. When the user moves a mail out of the SPAM folder, we hardlink it into a special RelearnHam folder; when it is moved into the SPAM folder, we hardlink it into a RelearnSpam folder. At night, when the system is supposed to be doing less, a job processes all the mails in those Relearn* folders (for all users), depending on which folder they are in. With dspam, this gets even better: it suffices to store only the signatures of those mis-classified mails, for example in a special database. After re-learning, the queued messages are purged.

Actually, this is not quite correct. As Tom Allison pointed out, if the user revisits her decision about the classification before the cron job has run, this could lead to inconsistencies. Therefore, if a mail is linked into the RelearnHam folder, it must be unlinked from RelearnSpam (and vice versa).
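The signature-only variant of this queue can be sketched with a small database. The following is a minimal illustration, assuming SQLite as the "special database"; the table name `relearn` and the function names are hypothetical. Using the signature as primary key means a later decision replaces an earlier, opposite one, which corresponds to unlinking from the other Relearn* folder.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) the relearn queue."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS relearn ("
        " signature TEXT PRIMARY KEY,"
        " target TEXT NOT NULL CHECK (target IN ('spam', 'innocent')))"
    )
    return db


def enqueue(db: sqlite3.Connection, signature: str, target: str) -> None:
    """Queue a signature for relearning as 'spam' or 'innocent'.

    INSERT OR REPLACE drops any earlier entry for the same signature,
    so a message is never queued for both classes at once.
    """
    with db:
        db.execute(
            "INSERT OR REPLACE INTO relearn (signature, target) VALUES (?, ?)",
            (signature, target),
        )


def drain(db: sqlite3.Connection) -> list:
    """Return all queued (signature, target) pairs and purge the queue.

    This is what the nightly job would call before feeding the
    signatures to the classifier.
    """
    with db:
        rows = db.execute(
            "SELECT signature, target FROM relearn ORDER BY signature"
        ).fetchall()
        db.execute("DELETE FROM relearn")
    return rows
```

A user who moves a mail into SPAM and then back out before the nightly run ends up with a single `innocent` entry for that signature.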

technical description

This section is obviously more specific to dspam than the rest.

We can link the dovecot plugin with libdspam. Apparently, contrary to what I said before, libdspam also reads the configuration file(s), so this should be enough. If using mysql, this would also take advantage of the config extension (though this is only useful when having the dspam cgi installed).

The plugin would have to look something like this (pseudo) code.

implementation

I decided that hardlinking into special folders was too much work (especially with removing the mails again!), so for now my plugin calls the dspam client directly. Dspam behaves oddly in that it exits with code 0 even if something goes wrong and only prints an error message. Until that is fixed, I have added code to intercept the output (any output is treated as an error).

Here's the code for dovecot 1.0 beta 7 through rc7. I have tested with beta 7, beta 8, rc2 and rc7, but I expect the internal API won't change before 1.0 is released. I do, however, recommend recompiling the code for each dovecot update, as internal fields and structures may not be binary compatible.
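The "any output is an error" workaround can be illustrated like this. It is a sketch in Python rather than the plugin's C code, and the function names are hypothetical. The dspam options used in `build_dspam_args` (`--source=error`, `--class=`, `--signature=`, `--user`) are an assumption based on the dspam command line and should be checked against your dspam version.

```python
import subprocess


def build_dspam_args(binary: str, user: str, signature: str, as_spam: bool) -> list:
    """Build a retraining command line (option names are an assumption)."""
    cls = "spam" if as_spam else "innocent"
    return [
        binary,
        "--source=error",
        f"--class={cls}",
        f"--signature={signature}",
        "--user",
        user,
    ]


def run_checked(argv: list) -> None:
    """Run a command and fail if it exits non-zero OR prints anything.

    The exit code alone is not trusted, so stdout and stderr are
    captured together and any non-empty output counts as a failure.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    output = result.stdout.strip()
    if result.returncode != 0 or output:
        raise RuntimeError(
            f"{argv[0]} failed (exit {result.returncode}): {output}"
        )
```

A call such as `run_checked(build_dspam_args("dspam", "alice", signature, True))` would then raise as soon as dspam prints anything, even with exit code 0.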

To use this plugin, you need to configure dspam to

deliver spam into the SPAM folder for each user (or change the code)
add an X-DSPAM-Signature header line with the signature. Body or attachment signatures do not work, because the plugin is written to extract the signature from that header. It should be trivial to change this, but it seemed cleaner this way. If you want (or already have) a different header line name, just change the #define at the top of the file.
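To show why only a header signature is picked up, here is a minimal sketch of header-only extraction using Python's standard email parser. It is not the plugin's code; `extract_signature` is a hypothetical name, and `SIGNATURE_HEADER` plays the role of the #define mentioned above.

```python
from email.parser import BytesParser
from typing import Optional

SIGNATURE_HEADER = "X-DSPAM-Signature"  # change here for a different header name


def extract_signature(raw_message: bytes) -> Optional[str]:
    """Return the dspam signature from the message headers, or None.

    Only the header block is parsed, so a signature that dspam wrote
    into the body or an attachment is deliberately not found.
    """
    headers = BytesParser().parsebytes(raw_message, headersonly=True)
    value = headers.get(SIGNATURE_HEADER)
    return str(value).strip() if value else None
```

A message without the header yields `None`, in which case there is nothing to hand to dspam for retraining.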

code license

dovecot plugin for dspam

Copyright (C) 2004-2006  Johannes Berg

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License Version 2 as
published by the Free Software Foundation.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA

questions?

If you have any questions, please send them to the dovecot mailing list: http://dovecot.org/cgi-bin/mailman/listinfo/dovecot. Make sure to send a copy (Cc) to me, as I no longer read the dovecot mailing list frequently.

additional tools

Along with the plugin, I use a script to automatically delete all seen mail in my SPAM folder that is older than three weeks. The script is also available. I invoke it from cron with

/usr/local/bin/cleanspam --seen --expunge 21
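The following is not the cleanspam script, only a rough sketch of what such a cleanup might look like. It assumes the SPAM folder is stored as a Maildir, that "seen" corresponds to the Maildir `S` flag, and that a message's age can be taken from the file's modification time; the function name `clean_seen` is hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def clean_seen(maildir: str, max_age_days: int, now: Optional[float] = None) -> list:
    """Delete seen messages older than `max_age_days` from a Maildir.

    Only the cur/ subdirectory is examined, since messages in new/
    cannot have been seen yet. Returns the names of the removed files.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted((Path(maildir) / "cur").iterdir()):
        # Maildir file names end in ":2,<flags>"; "S" marks a seen message.
        _, sep, flags = entry.name.rpartition(":2,")
        if not sep or "S" not in flags:
            continue
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Calling `clean_seen("/path/to/SPAM", 21)` would mirror the three-week limit used in the cron invocation above.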