Spam statistics from my email server

I use SpamAssassin on my email server to flag spam sent to my addresses. It runs a series of checks on each message and adds them up into a spam score. If the score reaches a user-defined threshold, SpamAssassin adds a header marking the message as spam, and Dovecot then files it into a spam folder instead of my inbox. It does a pretty good job but needs tuning now and then. I wanted to see whether I could change my threshold from the default (5.0) without getting too many false positives or negatives. To find out, I had to collect some stats from my messages.

Here’s an example of headers added by SpamAssassin to a spam message:
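The header that matters for the stats is X-Spam-Status, which carries the score and the threshold as `score=` and `required=` fields. Here is a minimal Python sketch for pulling those values out of a message, assuming that usual layout (older SpamAssassin versions wrote `hits=` instead of `score=`). The sample message is made up for illustration and is not one of my real messages; `spam_status` is just a name I picked.

```python
import re
from email import message_from_string

# Made-up message for illustration only; real headers list many more tests.
SAMPLE = (
    "X-Spam-Flag: YES\n"
    "X-Spam-Status: Yes, score=28.3 required=5.0 tests=BAYES_99,HTML_MESSAGE\n"
    "\tautolearn=no\n"
    "Subject: illustrative only\n"
    "\n"
    "body\n"
)

STATUS_RE = re.compile(
    r"^(Yes|No),\s+score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)"
)


def spam_status(raw_message):
    """Return (is_spam, score, required) from X-Spam-Status, or None if absent."""
    status = message_from_string(raw_message).get("X-Spam-Status")
    if status is None:
        return None
    match = STATUS_RE.search(status)
    if match is None:
        return None
    return match.group(1) == "Yes", float(match.group(2)), float(match.group(3))


print(spam_status(SAMPLE))
```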

Given those headers, I can list the spam scores with a simple chained command, first in the Junk folder and then in the normal inbox folders:
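The same collection step can be sketched in Python rather than as a shell pipeline. This is a rough equivalent, not the command I actually ran. It assumes a Maildir layout, where each folder keeps one message per file under `cur/` and `new/`; the function names and the paths in the usage note are hypothetical.

```python
import re
from pathlib import Path

# Matches the score on the X-Spam-Status header line.
SCORE_RE = re.compile(
    rb"^X-Spam-Status: (?:Yes|No),\s+score=(-?\d+(?:\.\d+)?)", re.MULTILINE
)


def collect_scores(folder):
    """Return the spam score of every message in one Maildir folder."""
    scores = []
    for sub in ("cur", "new"):  # assumption: Maildir layout
        directory = Path(folder) / sub
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            if not path.is_file():
                continue
            # Only search the header block (everything before the first blank line).
            headers = re.split(rb"\r?\n\r?\n", path.read_bytes(), maxsplit=1)[0]
            match = SCORE_RE.search(headers)
            if match:
                scores.append(float(match.group(1)))
    return scores


def write_scores(folder, out_file):
    """Write one score per line to out_file and return how many were written."""
    scores = collect_scores(folder)
    Path(out_file).write_text("".join(f"{score}\n" for score in scores))
    return len(scores)
```

Calling something like `write_scores("Maildir/.Junk", "spam.txt")` and `write_scores("Maildir", "ham.txt")` would then produce the two score files, with the paths adjusted to wherever the mail actually lives.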

This gives me two files, one with the spam scores and one with the ham scores. Then I just plotted them with matplotlib:
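A plot along these lines can be sketched as follows. This is not my exact script. It assumes one score per line in each file, and the file names, bin count and function names are placeholders. The colours follow the description of the plot: red for spam, green for ham.

```python
from pathlib import Path


def load_scores(path):
    """Read one float per line, ignoring blank lines."""
    return [float(token) for token in Path(path).read_text().split()]


def plot_scores(spam, ham, threshold=5.0, out_file="spam_scores.png"):
    """Draw stacked histograms of spam and ham scores with the threshold marked."""
    # Imported here so load_scores stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(
        [spam, ham],
        bins=50,
        stacked=True,
        color=["red", "green"],
        label=["Spam (Junk folder)", "Ham (inbox)"],
    )
    ax.axvline(threshold, color="black", linestyle="--",
               label=f"threshold = {threshold}")
    ax.set_xlabel("SpamAssassin score")
    ax.set_ylabel("Messages")
    ax.legend()
    fig.savefig(out_file)
    plt.close(fig)


# Hypothetical usage, with the two score files in the current directory:
# plot_scores(load_scores("spam.txt"), load_scores("ham.txt"))
```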

Results:

All the red below the threshold is spam that gets into my inbox, where I have to file it manually, so that’s no good. I can retrain SpamAssassin’s Bayes filter, and that improves things for a while. For now, though, it’s clear that I shouldn’t lower the threshold much, because I have ham scoring all the way up to 5.0 (most of it encrypted Facebook notifications).
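One way to put numbers on that trade-off is to count, for each candidate threshold, how much spam would slip under it and how much ham would land at or above it. This is a small sketch. The scores in it are invented for illustration and are not my real data, and it assumes a message is flagged when its score is at or above the threshold.

```python
def misfiled_counts(spam, ham, threshold):
    """Return (spam below threshold, ham at or above threshold)."""
    missed_spam = sum(1 for score in spam if score < threshold)
    flagged_ham = sum(1 for score in ham if score >= threshold)
    return missed_spam, flagged_ham


# Invented scores, for illustration only.
spam_scores = [3.1, 4.8, 7.2, 12.0, 28.3]
ham_scores = [-1.0, 0.2, 2.5, 4.9]

for threshold in (3.0, 4.0, 5.0, 6.0):
    missed, flagged = misfiled_counts(spam_scores, ham_scores, threshold)
    print(f"threshold {threshold}: {missed} spam missed, {flagged} ham flagged")
```

Lowering the threshold shrinks the first count and grows the second, which is why ham sitting just under 5.0 leaves so little room to move.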

At first I had a few outliers in the green with very high spam scores: messages I had clearly just forgotten to move to the Junk folder. I found them with a series of commands:

First, print out the exact spam score of the violators:
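In Python, picking out the offending scores from the ham list could look like this. It is a sketch: `outliers` is a hypothetical helper, the default cutoff of 5.0 is only the default threshold mentioned above, and the sample list is invented apart from the 28.3.

```python
def outliers(scores, cutoff=5.0):
    """Return the scores at or above cutoff, highest first."""
    return sorted((score for score in scores if score >= cutoff), reverse=True)


# Invented ham scores, with one message that clearly belongs in Junk.
print(outliers([0.2, -1.0, 28.3, 4.9, 2.5]))
```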

That gave a 28.3. Then I can just do a simple grep to find the exact message and refile it correctly:
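The grep step has a rough Python counterpart too. It is again a sketch that assumes a Maildir layout with `cur/` and `new/` subdirectories; `find_by_score` and the path in the usage note are hypothetical. The score is passed as a string so that it is matched exactly as it is written in the header.

```python
import re
from pathlib import Path


def find_by_score(folder, score):
    """Return message files whose X-Spam-Status line contains score=<score>."""
    needle = f"score={score} ".encode()  # trailing space so 28.3 does not match 28.35
    hits = []
    for sub in ("cur", "new"):  # assumption: Maildir layout
        directory = Path(folder) / sub
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            if not path.is_file():
                continue
            # Restrict the search to the header block.
            headers = re.split(rb"\r?\n\r?\n", path.read_bytes(), maxsplit=1)[0]
            for line in headers.splitlines():
                if line.startswith(b"X-Spam-Status:") and needle in line:
                    hits.append(path)
                    break
    return hits
```

Something like `find_by_score("Maildir", "28.3")` would return the paths of the matching messages, which can then be moved to the Junk folder.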

After retraining, I’ll see if I can get ham with lower scores.

Ah yes, running your own email server is fun!
