I needed to investigate a signature in an ndb file to discover why a message was filtered. I wrote a small script to translate an ndb database into a human-readable form (hex to text). Because I am not sure whether this list allows file attachments, I put the script inline at the end of this mail. I translated the Sanesecurity files and noticed that some signatures are duplicated. For example, Sanesecurity.Junk.5795 to Sanesecurity.Junk.5802 (8 signatures) are all the same.
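The core of the translation is plain hex decoding. For readers without Rebol, here is a minimal Python sketch of the same idea (the function name is mine, and it deliberately ignores the wildcard syntax that the full script below handles):

```python
import binascii

def hex_to_text(hexsig):
    """Decode one hex-encoded ndb signature body into readable text.
    A rough sketch only: it does not handle wildcards such as {n},
    (aa|bb) alternatives or *, which a full translator must treat
    separately."""
    return binascii.unhexlify(hexsig).decode("latin-1")

# "687474703a2f2f" is the hex encoding of "http://"
print(hex_to_text("687474703a2f2f"))
```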
I wrote a second script to find duplicated signatures. While testing files I realized there is another type of semi-duplicate, where one signature begins with another one. For example, Sanesecurity.Junk.316 is (rot-13) "1 zbagu-1 vapu" and Sanesecurity.Junk.317 is "1 zbagu-1 vapu uggc://" (http:// appended to the first pattern). My script detects this type of duplication as well.
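The detection idea is simple: sort the patterns, then a signature is redundant when it equals or begins with the previously kept pattern (whatever the longer pattern matches, the shorter one matches too). A minimal Python sketch of that logic, using the decoded patterns from the example above:

```python
def find_prefix_duplicates(sigs):
    """sigs: list of (name, pattern) tuples.
    After sorting by pattern, a prefix sorts directly before its
    extensions, so one linear pass finds every signature that equals
    or merely extends an earlier kept pattern."""
    dups = []
    last_pat = None
    for name, pat in sorted(sigs, key=lambda s: s[1]):
        if last_pat is not None and pat.startswith(last_pat):
            dups.append(name)       # exact duplicate or prefix-extension
        else:
            last_pat = pat          # keep this pattern as the new base
    return dups

sigs = [
    ("Junk.316", "1 month-1 inch"),
    ("Junk.317", "1 month-1 inch http://"),  # extends Junk.316
    ("Junk.5795", "free offer"),             # pattern text made up
    ("Junk.5796", "free offer"),             # exact duplicate
]
print(find_prefix_duplicates(sigs))
```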
I applied my script to the ndb files with the following results:

junk.ndb: 375 duplicates from 7571 signatures removed.
lott.ndb: 168 duplicates from 2031 signatures removed.
msrbl-spam: 200 duplicates from 2568 signatures removed.
phish.ndb: 321 duplicates from 9534 signatures removed.
scam.ndb: 307 duplicates from 9380 signatures removed.
spear.ndb: 3 duplicates from 2446 signatures removed.

Of course I can present the full logs, with the names of all removed signatures. However, I realised that a few duplicates are of different types. For example, Sanesecurity.Phishing.Cur.127 and Sanesecurity.Phishing.Cur.128 both contain the (rot-13) url-substring "pbk.arg:81/hcqngr/vaqrk.cuc", but one is of type 3 (normalised HTML) and the other of type 4 (mail file). This could be intentional, provided one is for plain-text mail and the other for some encoded HTML mail (which would then not be detected by the type 4 signature). I simply don't know what kind of HTML normalisation ClamAV does.
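For reference, an ndb record has the form MalwareName:TargetType:Offset:HexSignature, and the type numbers above come from the TargetType field (0 = any file, 3 = normalised HTML, 4 = mail file). A minimal Python sketch of splitting a record (the record in the example is made up):

```python
def parse_ndb_line(line):
    """Split one ndb record into its four fields.
    Format: MalwareName:TargetType:Offset:HexSignature.
    maxsplit=3 keeps any ':' inside the hex/offset part intact."""
    name, ttype, offset, hexsig = line.split(":", 3)
    return name, int(ttype), offset, hexsig

print(parse_ndb_line("Example.Sig:4:*:deadbeef"))
```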
A few cross-type duplicates are between types 4 and 0 (mail file/any file), and I think these ones are simply mistakes. However, removing cross-type duplicates could change the functionality of the ndb files. The script can easily be changed to distinguish cross-type duplicates and leave them in the database. After applying this change, 3 logs changed:
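In Python terms, distinguishing cross-type duplicates just means putting the target type into the comparison key, so a pattern only shadows another pattern of the same type. A sketch of that variant (not the actual Rebol change, and the sample data is made up):

```python
def find_prefix_duplicates_by_type(sigs):
    """sigs: list of (name, target_type, pattern) tuples.
    Sorting by (type, pattern) groups each target type together, so
    a pattern can only shadow a same-type pattern and cross-type
    duplicates survive in the database."""
    dups = []
    last = None                       # (type, pattern) of last kept record
    for name, ttype, pat in sorted(sigs, key=lambda s: (s[1], s[2])):
        if last is not None and last[0] == ttype and pat.startswith(last[1]):
            dups.append(name)         # duplicate within the same type only
        else:
            last = (ttype, pat)
    return dups

sigs = [
    ("A", 4, "spam"),   # mail-file signature
    ("B", 0, "spam"),   # same pattern, any-file type: kept
    ("C", 4, "spam"),   # same pattern, same type as A: removed
]
print(find_prefix_duplicates_by_type(sigs))
```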
msrbl-spam: 194 duplicates from 2574 signatures removed (6 cross-type 4/0 left).
scam.ndb: 303 duplicates from 9384 signatures removed (4 cross-type 4/0 left).
phish.ndb: 269 duplicates from 9586 signatures removed (many 3/4 and a few 4/0 cross-types left).
I think only phish.ndb contains intentional cross-type duplicates, and the 4/0 cross-type duplicates in msrbl and scam are mistakes. Let's see what Steve will say.
If you are interested in my results, you need the Rebol interpreter; the scripts are written in Rebol/Core (thanks to the power of Rebol's parse dialect, they are short and fast). You can get it at http://www.rebol.com/platforms.html. No installation is needed. The scripts are not fully optimized, but I tried to make them easy to understand for people without Rebol skills.
I think the best way would be optimization at the source, in Sanesecurity, done by Steve. Steve, what do you think? Maybe 1000 signatures is not too much, but still, the files will be a few percent smaller without any loss of functionality.
Piotr Kubala

Here is the script to translate an ndb file. Save everything between the ------- lines to ndb2txt.r and run: rebol ndb2txt.r filename.ndb. The script writes the translated database into filename.txt. The script does not translate "hex[x-y]hex" patterns; I did not find them in the Sanesecurity files.
------------------------------------------------------------------------------
REBOL [
    Purpose: {Translate ndb file into human-readable form}
]

file: to-rebol-file any [system/script/args %junk.ndb]
hex: charset "0123456789ABCDEFabcdef"
trans: make string! length? file

foreach line read/lines file [
    parse line [
        ; copy the name:type:offset: prefix unchanged
        copy nfo 3 thru ":" (append trans nfo)
        some [
            ; plain hex run -> decode to text
            copy sig some hex (append trans to-string load join "#{" [sig "}"])
            ; {n} and {n-m} wildcards are copied as-is
            | "{" copy sig thru "}" (append trans join "{" sig)
            ; (aa|bb) alternatives -> decode each hex byte
            | "(" copy sig 2 skip (append trans dehex join "(%" sig)
              some ["|" copy sig 2 skip (append trans dehex join "|%" sig)]
              ")" (append trans ")")
            ; * wildcard -> mark it as {*}
            | "*" (append trans "{*}")
        ]
    ]
    append trans newline
]

write replace file %.ndb %.txt trans
------------------------------------------------------------------------------

The second script optimizes an ndb file by removing duplicates. Save it to remdupl.r. Usage (no wildcards): rebol remdupl.r file1.ndb file2.ndb file3.ndb ... The script generates log files file1.log file2.log ... and writes the optimized files to file1.new file2.new ... If you want to distinguish cross-type signatures (recommended), remove the 2 occurrences of the number 3 from the script. I decided to present it in this form because it is easier to remove 2 characters from the script than to explain where to insert them.
------------------------------------------------------------------------------
REBOL [
    Purpose: {Remove duplicates from ndb files}
]

foreach file parse any [system/script/args %junk.ndb] "" [
    sigs: make block! 2000
    duplicates: make hash! 200
    log: make string! 2000

    ; split each line into the name:type:offset: prefix and the hex body
    foreach line ndb: read/lines file: to-rebol-file file [
        parse line [copy nfo 3 thru ":" copy sig to end]
        repend sigs [nfo sig]
    ]

    ; sort the pairs by hex body so identical and prefix-related
    ; patterns become adjacent
    sort/skip/compare sigs 2 2

    lastsig: lastnfo: ""
    foreach [nfo sig] sigs [
        either find/match sig lastsig [
            ; sig equals or begins with the previously kept pattern
            append duplicates nfo
            repend log [lastnfo " = " nfo newline]
        ][
            lastnfo: nfo
            lastsig: sig
        ]
    ]

    ; drop the flagged lines and write the .new, .log and summary files
    remove-each line ndb [
        parse line [copy nfo 3 thru ":"]
        find duplicates nfo
    ]
    write/lines replace file %.ndb %.new ndb
    write replace file %.new %.log log
    write/append file join length? duplicates [
        " duplicates from " length? ndb " signatures removed."
    ]
]
------------------------------------------------------------------------------

There could be even more duplicates, where one signature is contained inside another (not at the beginning), but detecting those would be more time-consuming, so I gave up.
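For completeness, the more expensive containment check I skipped could look like this in Python (a naive O(n²) sketch with made-up sample data; exact and prefix duplicates are already handled by the script above):

```python
def find_substring_redundant(sigs):
    """sigs: list of (name, pattern) tuples.
    Flag a signature as redundant when another, strictly shorter
    pattern occurs anywhere inside it: whatever the longer pattern
    matches, the shorter one matches too.  Quadratic in the number
    of signatures, which is why it gets slow on large databases."""
    redundant = []
    for name, pat in sigs:
        for other_name, other_pat in sigs:
            if other_name == name:
                continue
            if len(other_pat) < len(pat) and other_pat in pat:
                redundant.append(name)
                break
    return redundant

sigs = [
    ("A", "abc"),
    ("B", "xabcx"),   # contains A's pattern in the middle -> redundant
    ("C", "zzz"),
]
print(find_substring_redundant(sigs))
```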