Per user configurable spam killer with Postfix and SpamAssassin™

Mikko Pikarinen <goblet@goblet.net>, October 2003

Preface

We have a mail server with a little more than 2000 virtual users and some quite active mailing lists run by Mailman. The main purpose of the server is to serve the members of an association: each member gets an address@ourdomain, and mail arriving at that address is forwarded to the member's real mail address.

The share of spam in the mail coming to our server is shocking. We receive about 40000-50000 emails per week, and fewer than 10000 of them are not spam. About 15000-25000 are blocked by ordb, dsbl, rfc-ignorant.org and our own access lists. The rest get through and are checked by SpamAssassin, which tags less than one third of them as not spam.

Some people wanted all their mail, even spam, forwarded to them and merely tagged by SpamAssassin. Some wanted everything that even smells of spam killed before it could fill their real mailboxes. Others wanted only mail that is certainly spam sent to /dev/null. So we needed a spam killer that could be configured per recipient address.

This document assumes that you have some knowledge of how Postfix, SpamAssassin and Procmail work. It helps if you understand Perl code too.


Software

We are using the following software (versions at the time this document was written)
The platform is Debian GNU/Linux, but all of this should be possible on any Unix® or lookalike.

The configuration of Postfix

We set smtpd_recipient_limit = 1000 in Postfix's main.cf. That's enough for our use: our largest mailing list has about 500 recipients, and Mailman with its default configuration sends to at most 500 recipients at a time.

The master.cf is configured like this (non-default settings shown only). The hostname is "qsp" as you can see.

qsp:smtp  inet  n       -       n       -       -       smtpd
  -o content_filter=filter:dummy
  -o cleanup_service_name=pre-cleanup
localhost:smtp inet  n  -       n       -       -       smtpd
pre-cleanup unix n      -       n       -       0       cleanup
  -o alias_maps=
  -o virtual_maps=
cleanup   unix  n       -       n       -       0       cleanup
filter    unix  -       n       n       -       -       pipe
   flags=Rq user=spamd argv=/usr/bin/procmail -Y -m /etc/procmailrcs/master.rc ${sender} ${recipient}

So every mail coming from the internet is sent to the content filter "filter". Its cleanup service is the special "pre-cleanup", which has alias_maps and virtual_maps emptied, so the recipient is still the virtual address when the mail reaches the filter. Because of that we can match on the virtual address in the filter, not the real address.

The service "filter" pipes the mail to procmail with the envelope sender and the recipient(s) as arguments.

Only mail coming from the internet is sent to the filter; outgoing mail is not. That is primarily because of Mailman, which sends all its mail through a socket on localhost:25.

You should read the FILTER_README file from the Postfix distribution package to understand this configuration completely.


The configuration of Procmail

As seen above, the mail is sent to procmail with the parameters -Y -m master.rc ${sender} ${recipient}
So procmail gets the envelope sender as its first argument and the recipients as arguments 2..n. The mail is then checked by SpamAssassin using spamc, which connects to spamd running in the background. Consult the SpamAssassin manual for more info.
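To make the argument layout concrete, here is a minimal Python sketch of how the argument list splits into the envelope sender and the recipients. It is an illustration only and not part of our setup. The function name split_filter_args is hypothetical; it just mirrors what FROM="<$1>" and SHIFT=1 do in master.rc.

```python
def split_filter_args(args):
    """Split [sender, rcpt1, rcpt2, ...] the way master.rc does.

    Hypothetical helper for illustration: the first argument is the
    envelope sender, everything after it is a recipient.
    """
    if not args:
        raise ValueError("expected at least the envelope sender")
    sender = args[0]
    recipients = list(args[1:])
    # master.rc wraps the sender in angle brackets: FROM="<$1>"
    return "<" + sender + ">", recipients
```

Called with ["someone@example.com", "foo@ourdomain", "bar@ourdomain"], the sketch returns the bracketed sender and a list holding the two recipients.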

Our spamd is started from an init script like this:
spamd -d -u spamd -r /var/run/spamd.pid --socketpath=/var/run/spamd.sock

You should read the procmailrc(5), spamd(1) and spamc(1) manpages to understand this completely.

master.rc

SHELL=/bin/sh
DROPPRIVS=YES
# We must have space for 1000 addresses.
LINEBUF=32768
# Set sendmailflags just to be sure.
SENDMAILFLAGS="-oi"
SPAMC="/usr/local/bin/spamc"

# Set the variable FROM to the envelope sender
# and remove it from $@
FROM="<$1>"
SHIFT=1

# Filter through spamd which is started to use
# unix domain socket, not TCP port.
:0f
|$SPAMC -f -U /var/run/spamd.sock

# If we get at least five points, we will
# switch to another script spamkill.rc
:0
* ^X-Spam-Level: \*\*\*\*\*
{
  SWITCHRC="/etc/procmailrcs/spamkill.rc"
}

# Wasn't spam so forward to all recipients.
:0
! -f $FROM "$@"
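The condition on X-Spam-Level works because SpamAssassin writes one asterisk per point, so a pattern of five asterisks matches any level of five or more. The Python sketch below illustrates that idea only; the function names are hypothetical and the code is not part of our configuration.

```python
import re

# Same pattern as the procmail condition: five literal asterisks right
# after the header name. A header with more stars still matches.
# IGNORECASE mirrors procmail, which ignores case by default.
FIVE_STARS = re.compile(r"^X-Spam-Level: \*\*\*\*\*",
                        re.MULTILINE | re.IGNORECASE)
ANY_STARS = re.compile(r"^X-Spam-Level: (\**)",
                       re.MULTILINE | re.IGNORECASE)


def is_spam_candidate(headers):
    """True if the header block carries at least five stars."""
    return FIVE_STARS.search(headers) is not None


def spam_level(headers):
    """Integer spam level: the number of stars, 0 if no header."""
    match = ANY_STARS.search(headers)
    return len(match.group(1)) if match else 0
```

A header of "X-Spam-Level: ******" is a candidate with level 6, while "X-Spam-Level: ****" is not a candidate.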

spamkill.rc

If the mail is spam (in SpamAssassin's opinion), we send the recipient list to our Perl script filter_recipients. It returns the recipients who allow spam of that level to be delivered to them, or who don't exist in the database. Those who gave us permission to kill spam are removed from the recipient list before the mail is forwarded.

# Set the logfile...
LOGFILE="/var/log/spamkill/spamkill"
# and umask so that others can read the log
UMASK=022
# We don't want everything logged...
# so logabstract and verbose are off
LOGABSTRACT="no"
VERBOSE="no"
FORMAIL="/usr/bin/formail"
FILTERPL="/usr/sbin/filter_recipients"

# Get the date for logging
DATE=`date +%Y%m%d-%H%M%S`
# Extract the X-Spam-Level header
XSPAMLEVEL=`$FORMAIL -zxX-Spam-Level`

# Send the spam level and recipient list to our script
# and get the new recipient list.
:0 Wi
RECIPIENTS=|$FILTERPL "$XSPAMLEVEL" "$@"

# Save the original recipient list for logging. Note that
# $@ is expanded only when in an argument list to a program
:0 Wi
OLDRECIPIENTS=|echo "$@"

# Clear the variable $@ because ! sends to $@
SHIFT=1000

# Do the logging. Notice the newlines!
LOG="$DATE $$ Level: $XSPAMLEVEL From: $FROM
$DATE $$ Orig-To: $OLDRECIPIENTS
"

# If we have something in variable RECIPIENTS
# (any pattern that should match a valid recipient)
:0
* RECIPIENTS ?? @ourdomain
{
# Then log it
# Notice the newline and dquote on next line again
LOG="$DATE $$ Sent-To: $RECIPIENTS
"

# Forward the mail
:0
! -f $FROM $RECIPIENTS
}

# No recipients left, log it
LOG="$DATE $$ Sent-To: /dev/null
"

# And send there where all spam belongs to
:0
/dev/null
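The LOG assignments in spamkill.rc write one Level/From line, one Orig-To line and one Sent-To line for each spam. If you want a quick count of how many spams were killed outright and how many were still forwarded to somebody, a rough Python sketch like the following could read the log. The function name summarize and the line pattern are assumptions derived from the LOG strings, not a tool we actually run.

```python
import re
from collections import Counter

# "<date> <pid> Sent-To: <recipients or /dev/null>", as written by the
# LOG assignments in spamkill.rc (pattern assumed from those strings).
SENT_TO = re.compile(r"^(\S+) (\d+) Sent-To: (.*)$")


def summarize(log_lines):
    """Count spams that were killed vs. forwarded to someone."""
    counts = Counter()
    for line in log_lines:
        match = SENT_TO.match(line.rstrip("\n"))
        if not match:
            continue  # Level/From and Orig-To lines are skipped
        target = match.group(3).strip()
        if target == "/dev/null":
            counts["killed"] += 1
        else:
            counts["forwarded"] += 1
    return counts
```

It can be fed an open file object, for example summarize(open("/var/log/spamkill/spamkill")), since iterating over a file yields its lines.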

The brain of the system

A Perl script called filter_recipients

We have a map file that lists addresses together with the minimum spam level at which their mail can be sent to /dev/null. The map file is built with the command
postmap hash:/etc/postfix/spamkill
and the source map file contains something like this:

foo@ourdomain 5
bar@ourdomain 7
baz@ourdomain 8

So user foo allows us to kill mail when its level is at least 5, bar when it's at least 7, and so on. Recipient addresses that aren't in the map will get all spam. Notice that we work with integer spam levels here. In our case the map file, like the virtual user map, is created and maintained by a web-based application, but that's another story :)

The recipient list is given to the perl script "filter_recipients" and it removes those recipients who allow us to kill spam automatically.
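The decision rule can be sketched in a few lines of Python. This is only an illustration of the rule described above: a plain dict stands in for the postmap'ed map file, address extensions are left out, and the function signature is an assumption made for the example.

```python
def filter_recipients(stars, recipients, thresholds):
    """Return the recipients who should still receive this spam.

    stars      -- the X-Spam-Level value, e.g. "*****"
    recipients -- list of recipient addresses
    thresholds -- dict of address -> minimum level at which mail may
                  be killed (stand-in for the real map file)
    """
    level = len(stars)  # one star per point
    kept = []
    for addr in recipients:
        addr = addr.lower()
        limit = thresholds.get(addr)
        # Keep the recipient if they are not in the map at all, or if
        # the spam level is still below their kill threshold.
        if limit is None or limit > level:
            kept.append(addr)
    return kept
```

With the example map above and a level of five stars, foo is dropped while bar, baz and any address missing from the map are kept.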

The script:
#!/usr/bin/perl

# Real men always use strict :-)
use strict;
use DB_File;

# Set the map db here
my $mapdb = "/etc/postfix/spamkill.db";
# Get the spam level from the first argument
# (the number of stars in the X-Spam-Level value)
my $spamlevel = length(shift @ARGV);

# Open the db file for reading
#   or just return the list if error in opening
tie(my %db, 'DB_File', $mapdb, O_RDONLY, 0664, $DB_HASH)
  or print "@ARGV\n" and exit(1);

# Go through the recipient list
# and print only those who allow
# spam to be delivered on this
# level or who are not in the db at all
foreach my $addr (@ARGV) {
  $addr = lc($addr);
  # Strip any +extension and add the trailing NUL used in the map keys
  (my $test = $addr) =~ s/\+[^@]+//;
  $test .= chr(0);
  # Test for a missing entry first so we never compare an undefined value
  if (!defined $db{$test} or $db{$test} > $spamlevel) {
    print "$addr ";
  }
}

# Untie the db,
# print a newline
# and exit
untie(%db);
print "\n";
exit(0);
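The lookup key deserves a closer look: the script lowercases the address, strips a +extension from the local part and appends a NUL byte before looking it up, which suggests that the keys in our map file are stored with a terminating NUL. Here is a small Python sketch of that normalization; the function name map_key is hypothetical and the code only illustrates the three steps.

```python
import re


def map_key(addr):
    """Build the lookup key the way filter_recipients does.

    Illustrative only: lowercase the address, drop the first
    "+extension" before the "@", then append the NUL byte that the
    Perl script adds with chr(0).
    """
    addr = addr.lower()
    key = re.sub(r"\+[^@]+", "", addr, count=1)
    return key + "\0"
```

So both foo@ourdomain and Foo+lists@OurDomain end up with the same key, and a user's threshold also applies to their extended addresses.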

About the performance

No problems, at least with this amount of mail. We get more than a thousand spams per day. Our server is an AMD Athlon™ XP 2400+ running at 2 GHz with 512 MB of memory. On average the CPU is about 99% idle, although there is a web server running on the same machine.

I tested the Perl script with a map file containing 1000 addresses with random levels, and gave it 1000 addresses on the argument list. On that machine the execution took between 0.02 and 0.03 seconds, so it really is fast enough for us. It should be possible to rewrite the program in C so that it would be some milliseconds faster.

If you have to scan, e.g., millions of mails every day, this may not be the right solution. But I really don't know; I haven't tested it with bigger volumes. I hope this information can still be at least partly useful in that kind of situation.