[SimpleMail-usr] Optimise SPAM filter operation

  • From: Kulwant Bhogal <kulwant.bhogal@xxxxxxxxxxxxxx>
  • To: simplemail-usr@xxxxxxxxxxxxx
  • Date: 26 Dec 2003 18:32:18 +0000

Hello Sebastian,

Would it be possible to optimise the operation of the Statistical SPAM
filter a little? 

I have one idea for how this can be done. There are potentially loads more.

Quite often there are many duplicate SPAM messages in the SPAM folder
because they come from the same sources. Furthermore, the content from
these senders may change very little, or not at all, over a period of time.
Each time such a message is isolated, another essentially identical copy
is stored in the SPAM folder. For the purposes of regenerating the spam
dictionary, the duplicates serve no useful purpose as long as a count of
how many times those spam words have been received is maintained.

My idea, therefore, is to keep only one copy of duplicate emails in the SPAM
folder. To avoid a potentially computationally intensive duplicate-removal
algorithm, this can be achieved by generating a word-based checksum from the
content of each SPAM email. Emails with duplicate content will generate the
same checksum, thereby allowing a) the duplicate to be isolated and deleted
instead of being stored in the SPAM folder and b) the offending email to
still be used to bump up the SPAM word count in the SPAM dictionary.
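
To make the idea concrete, here is a rough sketch in Python, purely for
illustration. It is not SimpleMail's code: the names words, content_checksum
and SpamStore are made up, and the choice of SHA-1 and the simple word
splitting are assumptions on my part.

```python
import hashlib
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def words(body):
    """Lower-case the body and split it into words, dropping punctuation."""
    return WORD_RE.findall(body.lower())


def content_checksum(body):
    """Checksum of the normalised word sequence, so layout changes are ignored."""
    normalised = " ".join(words(body))
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


class SpamStore:
    """Hypothetical stand-in for the SPAM folder plus the spam dictionary."""

    def __init__(self):
        self.folder = {}              # checksum -> the single stored copy
        self.word_counts = Counter()  # spam word -> times received

    def add_spam(self, body):
        """Always count the words; store the message only if it is new.

        Returns True if the message was stored, False if it was a duplicate.
        """
        self.word_counts.update(words(body))
        key = content_checksum(body)
        if key in self.folder:
            return False
        self.folder[key] = body
        return True
```

Because the word counts are updated before the duplicate check, a repeated
message still raises the counts even though it is never stored. If the
dictionary were regenerated from the folder alone, a per-checksum receive
count would presumably have to be stored alongside each kept message.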

Does this idea sound implementable? 

Keep up the excellent work. 

Kind regards, 

Kulwant

__________________________________________________________________________
SimpleMail mailing list   -   //www.freelists.org/list/simplemail-usr
Listserver help.: mailto:simplemail-usr-request@xxxxxxxxxxxxx?Subject=HELP
Unsub....: mailto:simplemail-usr-request@xxxxxxxxxxxxx?Subject=UNSUBSCRIBE
