I use SpamAssassin on my e-mail server to flag spam messages that come to my addresses. It runs a series of checks on each message and computes a spam score. If the score reaches a user-defined threshold, it adds a header marking the message as spam, and Dovecot then files it into a spam folder instead of my inbox. It does a pretty good job but needs tuning sometimes. I wanted to see whether I could change my threshold from the default (5.0) without getting too many false positives or negatives. To do that, I’d have to collect some stats from my messages.
Here’s an example of headers added by SpamAssassin to a spam message:
```
X-Spam-Flag: YES
X-Spam-Level: *****
X-Spam-Status: Yes, score=5.4 required=5.0 tests=BAYES_50,DIET_1,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HTML_IMAGE_ONLY_16,HTML_MESSAGE,
	HTML_SHORT_LINK_IMG_2,MIME_HTML_ONLY,PYZOR_CHECK,SUBJECT_DIET,URIBL_BLOCKED
	autolearn=no autolearn_force=no version=3.4.0
X-Spam-Report:
	*  0.0 DIET_1 BODY: Lose Weight Spam
	*  1.5 SUBJECT_DIET Subject talks about losing pounds
	*  0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked.
	*      See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block
	*      for more information.
	*      [URIs: juiceenewsdaily.com]
	*  1.1 HTML_IMAGE_ONLY_16 BODY: HTML: images with 1200-1600 bytes of words
	*  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
	*      [score: 0.4932]
	*  0.0 HTML_MESSAGE BODY: HTML included in message
	*  0.7 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
	* -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature
	*  1.4 PYZOR_CHECK Listed in Pyzor (http://pyzor.sf.net/)
	* -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's
	*      domain
	*  0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily
	*      valid
	*  0.0 HTML_SHORT_LINK_IMG_2 HTML is very short with a linked image
```
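To see how the final score comes about, here is a small sanity check in Python: the per-rule values listed in X-Spam-Report should add up to the score in X-Spam-Status. The numbers are copied from the headers above. The names `RULE_SCORES` and `REQUIRED` are just mine for illustration, and I'm assuming a message is flagged once the total reaches the required value.

```python
# Per-rule scores copied from the X-Spam-Report header above.
RULE_SCORES = {
    'DIET_1': 0.0,
    'SUBJECT_DIET': 1.5,
    'URIBL_BLOCKED': 0.0,
    'HTML_IMAGE_ONLY_16': 1.1,
    'BAYES_50': 0.8,
    'HTML_MESSAGE': 0.0,
    'MIME_HTML_ONLY': 0.7,
    'DKIM_VALID': -0.1,
    'PYZOR_CHECK': 1.4,
    'DKIM_VALID_AU': -0.1,
    'DKIM_SIGNED': 0.1,
    'HTML_SHORT_LINK_IMG_2': 0.0,
}
REQUIRED = 5.0  # the "required=" value from X-Spam-Status

# Round to one decimal to hide floating-point noise in the sum.
total = round(sum(RULE_SCORES.values()), 1)
is_spam = total >= REQUIRED  # assumption: flagged when total reaches REQUIRED
print(total, is_spam)
```

For this message the rules add up to 5.4, matching `score=5.4`, which clears the 5.0 threshold.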
Given those headers, I can list the spam scores with a simple chained command, run first in the Junk folder and then in the normal inbox folders:
```
grep X-Spam-Status: * | awk '{print $3}' | awk -F'=' '{print $2}' > /home/nick/trash_ratings.txt
```
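If the awk chain feels brittle, the same extraction can be sketched in Python with a regular expression. This is only an illustration, not what I actually ran: `SCORE_RE` and `extract_score` are hypothetical names, and unlike the grep above it only looks at the header block of each message.

```python
import re

# Matches the "score=" field on an X-Spam-Status header line.
SCORE_RE = re.compile(r'^X-Spam-Status:.*?\bscore=(-?\d+(?:\.\d+)?)')


def extract_score(message_text):
    """Return the SpamAssassin score from raw message text, or None."""
    for line in message_text.splitlines():
        if line == '':
            break  # a blank line ends the headers
        match = SCORE_RE.match(line)
        if match:
            return float(match.group(1))
    return None


# A trimmed-down sample message using the header shown earlier.
sample = (
    'X-Spam-Flag: YES\n'
    'X-Spam-Status: Yes, score=5.4 required=5.0 tests=BAYES_50,DIET_1\n'
    '\n'
    'body text\n'
)
print(extract_score(sample))
```

Messages without the header come back as `None`, so they can be skipped instead of producing the occasional bad value.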
This gives files that have Spam and Ham ratings. Then I just plotted them with matplotlib:
```python
"""Plot some info about email server SpamAssassin scores."""

import matplotlib.pyplot as plt
import numpy as np

FILES = ['spam_ratings.txt', 'trash_ratings.txt']


def read(files):
    """Read one score per line from each file into a list of arrays."""
    data = []
    for fname in files:
        print('Reading ' + str(fname) + '.')
        with open(fname) as f:
            vals = []
            for line in f:
                try:
                    vals.append(float(line))
                except ValueError:
                    # occasional bad values show up
                    pass
            data.append(np.array(vals))
    return data


def plot(data):
    """Overlay normalized histograms of the spam and ham scores."""
    spam, ham = data
    # density=True replaces the older normed=1, which newer
    # Matplotlib releases no longer accept.
    plt.hist(spam, 50, density=True, facecolor='red',
             alpha=0.75, label='Spam')
    plt.hist(ham, 50, density=True, facecolor='green',
             alpha=0.75, label='Ham')
    plt.grid(color='0.90')
    plt.xlabel('SpamAssassin Score')
    plt.ylabel('Frequency')
    plt.title('Spam Scores of my E-mail')
    plt.legend()
    ax = plt.gca()
    ax.axvline(x=5)  # vertical line at the spam threshold
    ax.annotate('Spam Threshold = 5', xy=(5, 0.2), xytext=(0.6, 0.5),
                textcoords='axes fraction',
                arrowprops=dict(facecolor='black', shrink=0.05, width=2))
    plt.show()


if __name__ == '__main__':
    arrays = read(FILES)
    plot(arrays)
```
Results:
All the red below the threshold gets into my inbox and I have to file it manually, so that’s no good. I can retrain SpamAssassin to do better with its Bayes filters and that improves things for a while, but for now it’s clear that I probably shouldn’t reduce the threshold much because I do have ham coming in all the way up to 5.0 (most of which are encrypted Facebook notifications).
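One way to put numbers on that trade-off is to count, for each candidate threshold, the fraction of spam that would slip through and the fraction of ham that would get flagged. The sketch below is only that, a sketch: `error_rates` is a hypothetical helper, the scores are made up for illustration rather than taken from my mail, and it assumes a message is flagged when its score is at or above the threshold.

```python
import numpy as np


def error_rates(spam, ham, threshold):
    """Return (missed_spam_fraction, flagged_ham_fraction) at a threshold.

    A message counts as flagged when its score is >= threshold.
    """
    spam = np.asarray(spam, dtype=float)
    ham = np.asarray(ham, dtype=float)
    missed_spam = float(np.mean(spam < threshold))   # false negatives
    flagged_ham = float(np.mean(ham >= threshold))   # false positives
    return missed_spam, flagged_ham


# Made-up scores purely for illustration, not real data.
spam_scores = [2.1, 4.8, 5.4, 7.9, 12.0]
ham_scores = [-1.9, 0.0, 1.2, 3.3, 4.9]

for threshold in (3.0, 4.0, 5.0):
    missed, flagged = error_rates(spam_scores, ham_scores, threshold)
    print(threshold, missed, flagged)
```

With real data, the arrays returned by `read(FILES)` could be passed in directly in place of the made-up lists.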
I had a few outliers in the green at first with very high spam scores that I clearly had just forgotten to put in the Junk folder. I found them with a series of commands:
First, print out the exact spam score of the violators:
```
grep X-Spam-Status: * | awk '{print $3}' | awk -F'=' '{print $2}' | awk '$1 > 5 {print ;}'
```
That gave a 28.3. Then I can just do a simple grep to find the exact message and refile it correctly:
```
grep "score=28.3" *
```
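The two steps could also be folded into one pass that reports the file name next to each offending score. Here is a rough Python sketch of that idea. `find_high_scores` is a hypothetical helper, it assumes one message per file in a flat folder, and like grep it searches the whole file rather than just the headers.

```python
import re
from pathlib import Path

# Matches the "score=" field on any X-Spam-Status line in the file.
SCORE_RE = re.compile(
    r'^X-Spam-Status:.*?\bscore=(-?\d+(?:\.\d+)?)', re.MULTILINE)


def find_high_scores(folder, threshold=5.0):
    """Return (filename, score) pairs for messages scoring above threshold."""
    hits = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        match = SCORE_RE.search(path.read_text(errors='replace'))
        if match and float(match.group(1)) > threshold:
            hits.append((path.name, float(match.group(1))))
    return hits
```

Calling `find_high_scores('.')` from inside an inbox folder would list the messages that probably belong in Junk.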
After retraining, I’ll see if I can get ham with lower scores.
Ah yes, running your own email server is fun!