{"id":1323,"date":"2017-05-29T16:11:41","date_gmt":"2017-05-29T23:11:41","guid":{"rendered":"https:\/\/partofthething.com\/thoughts\/?p=1323"},"modified":"2017-07-07T20:16:12","modified_gmt":"2017-07-08T03:16:12","slug":"spam-statistics-from-my-email-server","status":"publish","type":"post","link":"https:\/\/partofthething.com\/thoughts\/spam-statistics-from-my-email-server\/","title":{"rendered":"Spam statistics from my email server"},"content":{"rendered":"<p>I use SpamAssassin on my e-mail server to flag spam messages that come to my addresses. It uses a series of checks on each message and determines a Spam Score. If the Score is above a user-defined threshold, it adds a header that says that it is spam. Then dovecot files it away into a spam folder instead of my inbox. It does a pretty good job but requires tuning sometimes. I wanted to see if I could change my threshold from the default (5.0) without getting too many false positives or negatives. To do that, I&#8217;d have to collect some stats from my messages.<\/p>\n<p><!--more--><\/p>\n<p>Here&#8217;s an example of headers added by SpamAssassin to a spam message:<\/p>\n<pre class=\"\">X-Spam-Flag: YES\r\nX-Spam-Level: *****\r\nX-Spam-Status: Yes, score=5.4 required=5.0 tests=BAYES_50,DIET_1,DKIM_SIGNED,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 DKIM_VALID,DKIM_VALID_AU,HTML_IMAGE_ONLY_16,HTML_MESSAGE,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 HTML_SHORT_LINK_IMG_2,MIME_HTML_ONLY,PYZOR_CHECK,SUBJECT_DIET,URIBL_BLOCKED\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 autolearn=no autolearn_force=no version=3.4.0\r\nX-Spam-Report:\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.0 DIET_1 BODY: Lose Weight Spam\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 1.5 SUBJECT_DIET Subject talks about losing pounds\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked.\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
*\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 See http:\/\/wiki.apache.org\/spamassassin\/DnsBlocklists#dnsbl-block\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0\u00a0\u00a0\u00a0\u00a0 for more information.\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0\u00a0\u00a0\u00a0\u00a0 [URIs: juiceenewsdaily.com]\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 1.1 HTML_IMAGE_ONLY_16 BODY: HTML: images with 1200-1600 bytes of words\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0\u00a0\u00a0\u00a0\u00a0 [score: 0.4932]\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.0 HTML_MESSAGE BODY: HTML included in message\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.7 MIME_HTML_ONLY BODY: Message only has text\/html MIME parts\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 * -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 1.4 PYZOR_CHECK Listed in Pyzor (http:\/\/pyzor.sf.net\/)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 * -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 domain\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0\u00a0\u00a0\u00a0\u00a0 valid\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 *\u00a0 0.0 HTML_SHORT_LINK_IMG_2 HTML is very short with a linked image\r\n\r\n\r\n<\/pre>\n<p>Given those headers, I can list the spam scores with a simple chained command, run first in the Junk folder and then in the normal inbox folders:<\/p>\n<pre class=\"\">grep X-Spam-Status: * | awk '{print $3}' | awk -F'=' '{print $2}' &gt; \/home\/nick\/trash_ratings.txt<\/pre>\n<p>This gives me one file of spam scores and one of ham scores. 
Then I just plotted them with matplotlib:<\/p>\n<pre class=\"lang:python decode:true \" title=\"Email spam histogram plotter\">\"\"\"Plot some info about email server SpamAssassin scores.\"\"\"\r\n\r\nimport matplotlib.pyplot as plt\r\nimport numpy as np\r\n\r\n# Order matters: plot() expects the spam scores first, then the ham scores.\r\nFILES = ['spam_ratings.txt', 'trash_ratings.txt']\r\n\r\n\r\ndef read(files):\r\n    \"\"\"Read one score per line from each file and return a list of arrays.\"\"\"\r\n    data = []\r\n    for fname in files:\r\n        print('Reading ' + str(fname) + '.')\r\n        with open(fname) as f:\r\n            vals = []\r\n            for line in f:\r\n                try:\r\n                    vals.append(float(line))\r\n                except ValueError:  # occasional bad values show up\r\n                    pass\r\n            data.append(np.array(vals))\r\n    return data\r\n\r\n\r\ndef plot(data):\r\n    \"\"\"Plot normalized histograms of the spam and ham scores.\"\"\"\r\n    spam, ham = data\r\n    # density=True normalizes each histogram (it replaces the older normed argument).\r\n    plt.hist(spam, 50, density=True,\r\n             facecolor='red', alpha=0.75, label='Spam')\r\n    plt.hist(ham, 50, density=True,\r\n             facecolor='green', alpha=0.75, label='Ham')\r\n    plt.grid(color='0.90')\r\n    plt.xlabel('SpamAssassin Score')\r\n    plt.ylabel('Frequency')\r\n    plt.title('Spam Scores of my E-mail')\r\n    plt.legend()\r\n    ax = plt.gca()\r\n    ax.axvline(x=5)  # mark the default spam threshold\r\n    ax.annotate('Spam Threshold = 5', xy=(5, 0.2), xytext=(0.6, 0.5),\r\n                textcoords='axes fraction',\r\n                arrowprops=dict(facecolor='black', shrink=0.05, width=2))\r\n    plt.show()\r\n\r\n\r\nif __name__ == '__main__':\r\n    arrays = read(FILES)\r\n    plot(arrays)\r\n<\/pre>\n<p>Results:<\/p>\n<p><a href=\"https:\/\/partofthething.com\/thoughts\/wp-content\/uploads\/plot.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1325\" src=\"https:\/\/partofthething.com\/thoughts\/wp-content\/uploads\/plot.png\" alt=\"\" width=\"640\" height=\"480\" 
srcset=\"https:\/\/partofthething.com\/thoughts\/wp-content\/uploads\/plot.png 640w, https:\/\/partofthething.com\/thoughts\/wp-content\/uploads\/plot-300x225.png 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a>All the red below the threshold gets into my inbox and I have to file it manually, so that&#8217;s no good. I can retrain SpamAssassin to do better with its Bayes filters and that improves things for a while, but for now it&#8217;s clear that I probably shouldn&#8217;t reduce the threshold much because I do have ham coming in all the way up to 5.0 (most of which are encrypted Facebook notifications).<\/p>\n<p>I had a few outliers in the green at first with very high spam scores that I clearly had just forgotten to put in the Junk folder. I found them with a series of commands:<\/p>\n<p>First, print out the exact spam score of the violators:<\/p>\n<pre class=\"\">grep X-Spam-Status: * | awk '{print $3}' | awk -F'=' '{print $2}'\u00a0 | awk '$1 &gt; 5 {print ;}'\r\n<\/pre>\n<p>That gave a 28.3. Then I can just do a simple grep to find the exact message and refile it correctly:<\/p>\n<pre class=\"\">grep \"score=28.3\"\u00a0 *<\/pre>\n<p>After retraining, I&#8217;ll see if I can get ham with lower scores.<\/p>\n<p>Ah yes, running your own email server is fun!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I use SpamAssassin on my e-mail server to flag spam messages that come to my addresses. It uses a series of checks on each message and determines a Spam Score. If the Score is above a user-defined threshold, it adds a header that says that it is spam. 
Then dovecot files it away into a &hellip; <a href=\"https:\/\/partofthething.com\/thoughts\/spam-statistics-from-my-email-server\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Spam statistics from my email server<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[3],"tags":[],"class_list":["post-1323","post","type-post","status-publish","format-standard","hentry","category-computers"],"_links":{"self":[{"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/posts\/1323","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/comments?post=1323"}],"version-history":[{"count":3,"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/posts\/1323\/revisions"}],"predecessor-version":[{"id":1388,"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/posts\/1323\/revisions\/1388"}],"wp:attachment":[{"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/media?parent=1323"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/categories?post=1323"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/partofthething.com\/thoughts\/wp-json\/wp\/v2\/tags?post=1323"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}