Simple command-line plain-text spam or ham classifier

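I have a whole bunch of saved database entries that are spam. I'd like to be able to pipe the text of each one to spamassassin or a similar tool and get back a score for how likely it is to be spam, but without the whole machine-learning-from-a-mailbox setup, or even running a mail server. Everything I've found seems heavily biased toward email, rather than being a simple stdin > process > stdout type of thing.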

Something written in a scripting language would be fine, but I'd prefer something that works out of the box on a CentOS machine. Any help appreciated.

Funny you should mention spamassassin, because it has a mode that looks like exactly what you want (in this case, /tmp/spammy contains a candidate message):

    [me@lory tmp]$ spamassassin < /tmp/spammy
    Oct 20 11:54:47.097 [19986] warn: netset: cannot include 127.0.0.1/32 as it has already been included
    From: "REDACTED" <redacted>
    To: REDACTED
    Subject: Pharmacy
    Date: 20 Oct 2014 02:22:04 +0100
    X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on lory.teaparty.net
    X-Spam-Flag: YES
    X-Spam-Level: *********
    X-Spam-Status: Yes, score=9.2 required=3.9 tests=BAYES_20,MISSING_MID,
            NO_RECEIVED,NO_RELAYS,TVD_SPACE_RATIO,URIBL_BLACK,URIBL_DBL_SPAM,
            URIBL_JP_SURBL,URIBL_SBL,URIBL_WS_SURBL autolearn=no version=3.3.1
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="----------=_5444E9FB.89EA3D9F"

    This is a multi-part message in MIME format.

    ------------=_5444E9FB.89EA3D9F
    Content-Type: text/plain; charset=iso-8859-1
    Content-Disposition: inline
    Content-Transfer-Encoding: 8bit

    Spam detection software, running on the system "lory.teaparty.net", has
    identified this incoming email as possible spam. The original message
    has been attached to this so you can view it (if it isn't spam) or label
    similar future email. If you have any questions, see
    the administrator of that system for details.

    Content preview: Good medicines special http://canadiantabletstore.com/ [...]

    Content analysis details: (9.2 points, 3.9 required)

     pts rule name              description
    ---- ---------------------- --------------------------------------------------
     2.5 URIBL_DBL_SPAM         Contains a spam URL listed in the DBL blocklist
                                [URIs: canadiantabletstore.com]
     1.7 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                                [URIs: canadiantabletstore.com]
     1.6 URIBL_WS_SURBL         Contains an URL listed in the WS SURBL blocklist
                                [URIs: canadiantabletstore.com]
     1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                                [URIs: canadiantabletstore.com]
    -0.0 NO_RELAYS              Informational: message was not relayed via SMTP
     1.6 URIBL_SBL              Contains an URL's NS IP listed in the SBL blocklist
                                [URIs: canadiantabletstore.com]
    -0.0 BAYES_20               BODY: Bayes spam probability is 5 to 20%
                                [score: 0.1750]
     0.5 MISSING_MID            Missing Message-Id: header
    -0.0 NO_RECEIVED            Informational: message has no Received headers
     0.0 TVD_SPACE_RATIO        TVD_SPACE_RATIO

    ------------=_5444E9FB.89EA3D9F
    Content-Type: message/rfc822; x-spam-type=original
    Content-Description: original message before SpamAssassin
    Content-Disposition: inline
    Content-Transfer-Encoding: 8bit

    Date: 20 Oct 2014 02:22:04 +0100
    From: "REDACTED" <REDACTED>
    To: REDACTED
    Subject: Pharmacy

    Good medicines special http://canadiantabletstore.com/

    ------------=_5444E9FB.89EA3D9F--
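
If you want to run this over many database entries and keep only the score, something like the following might do as a starting point. It is a rough sketch, not a tested tool: it assumes `spamassassin` is on the PATH and prints an `X-Spam-Status` header in the format shown in the output above. The function names (`wrap_as_message`, `parse_spam_status`, `score_text`) and the placeholder headers are made up for illustration.

```python
import re
import subprocess
from email.message import EmailMessage

# Matches the verdict, score and threshold in a header such as:
#   X-Spam-Status: Yes, score=9.2 required=3.9 tests=...
STATUS_RE = re.compile(
    r"^X-Spam-Status:\s*(Yes|No),\s*"
    r"score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def wrap_as_message(text):
    """Wrap bare text in a minimal mail message.

    The header values are placeholders (assumptions), since the
    database entries have no headers of their own.
    """
    msg = EmailMessage()
    msg["From"] = "unknown@example.invalid"
    msg["To"] = "unknown@example.invalid"
    msg["Subject"] = "(none)"
    msg.set_content(text)
    return msg.as_string()


def parse_spam_status(output):
    """Return (is_spam, score, required), or None if no header is found."""
    match = STATUS_RE.search(output)
    if match is None:
        return None
    return (
        match.group(1) == "Yes",
        float(match.group(2)),
        float(match.group(3)),
    )


def score_text(text):
    """Pipe one entry through the spamassassin CLI and parse its verdict."""
    result = subprocess.run(
        ["spamassassin"],
        input=wrap_as_message(text),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # the report may not be valid UTF-8
        check=False,
    )
    return parse_spam_status(result.stdout)
```

Calling `score_text(entry)` for each entry should then give a `(is_spam, score, required)` tuple, or `None` if the header could not be found. Bear in mind that several of the rules that fired above (MISSING_MID, NO_RECEIVED, NO_RELAYS) are about headers rather than content, so scores for bare text with made-up headers are only a rough guide.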