[sanesecurity] Optimizing ndb files

  • From: PiK <pik256@xxxxxxxxx>
  • To: sanesecurity@xxxxxxxxxxxxx
  • Date: Fri, 30 Jan 2009 17:18:08 +0100

I needed to investigate a signature in an ndb file to discover why a message was
being filtered.

I wrote a small script to translate an ndb database into human-readable form (hex
to text). Because I am not sure whether this list allows file attachments, I have
put the script inline at the end of this mail. I translated the Sanesecurity files
and noticed that some signatures are duplicated. For example, Sanesecurity.Junk.5795
to Sanesecurity.Junk.5802 (8 signatures) are all the same.
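
For readers unfamiliar with the format: each ndb line has the form
MalwareName:TargetType:Offset:HexSignature, so an entry looks like this
(hypothetical, with a made-up hex body):

  Sanesecurity.Junk.5795:4:*:64656d6f

Two entries are exact duplicates when their hex bodies are identical, even
though their names differ.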

I wrote a second script to find duplicated signatures. While testing the files I
realized there is another type of semi-duplicate, where one signature begins with
another. For example, Sanesecurity.Junk.316 is (rot-13) "1 zbagu-1 vapu" and
Sanesecurity.Junk.317 is "1 zbagu-1 vapu uggc://" (http:// appended to the first
pattern). Since the shorter pattern already matches everything the longer one does,
the longer one is redundant. My script detects this type of duplication as well.
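
Detection is a simple prefix test: in Rebol, find/match returns the position
after the match when its second argument is a prefix of the first, and none
otherwise. For example:

  find/match "1 zbagu-1 vapu uggc://" "1 zbagu-1 vapu"   ; => " uggc://" (truthy)
  find/match "1 zbagu-1 vapu" "1 zbagu-1 vapu uggc://"   ; => none

The second script below uses exactly this, after sorting the signatures, to mark
the longer entry as redundant.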

I applied my script to the ndb files, with the following results:

junk.ndb: 375 duplicates from 7571 signatures removed.
lott.ndb: 168 duplicates from 2031 signatures removed.
msrbl-spam: 200 duplicates from 2568 signatures removed.
phish.ndb: 321 duplicates from 9534 signatures removed.
scam.ndb: 307 duplicates from 9380 signatures removed.
spear.ndb: 3 duplicates from 2446 signatures removed.

Of course I can provide the full logs, with the names of all removed signatures.

However, I realised that a few duplicates span different signature types. For
example, Sanesecurity.Phishing.Cur.127 and Sanesecurity.Phishing.Cur.128 both
contain the (rot-13) URL substring "pbk.arg:81/hcqngr/vaqrk.cuc", but one is of
type 3 (normalised HTML) and the other of type 4 (mail file). This could be
intentional, if one is meant for plain-text mail and the other for HTML mail
encoded in a way that escapes the type 4 signature. I simply don't know what kind
of HTML normalisation ClamAV does.
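
For reference, the target type numbers in ndb files (from the ClamAV signature
documentation, as far as I know) are:

  0 = any file
  1 = Portable Executable
  2 = OLE2 component
  3 = normalised HTML
  4 = mail file
  5 = graphics
  6 = ELF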

A few cross-type duplicates are between types 4 and 0 (mail file/any file), and I
think these are simply mistakes. However, removing cross-type duplicates could
change the functionality of the ndb files. The script can easily be changed to
distinguish cross-type duplicates and leave them in the database. After applying
this change, 3 logs changed:

msrbl-spam: 194 duplicates from 2574 signatures removed. (left 6 4/0 cross-type)
scam.ndb: 303 duplicates from 9384 signatures removed. (left 4 4/0 cross-type)
phish.ndb: 269 duplicates from 9586 signatures removed. (left many 3/4 and a few 4/0 cross-type)

I think only phish.ndb contains intentional cross-type duplicates, and the 4/0
cross-type duplicates in msrbl and scam are mistakes. Let's see what Steve says.

If you want to reproduce my results you need the Rebol interpreter; the scripts
are written in Rebol/Core (thanks to the power of Rebol's parse, they are short
and fast). You can get it at http://www.rebol.com/platforms.html. No installation
is needed. The scripts are not fully optimized, but I tried to make them easy to
understand for people without Rebol skills.

I think the best way would be optimization at the source, at Sanesecurity, done by
Steve. Steve, what do you think? Maybe 1000 signatures is not that much, but in
any case the files would be a few percent smaller without any loss of
functionality.

Piotr Kubala



Here is the script to translate an ndb file.
Save everything between the ------- lines to ndb2txt.r and run:
rebol ndb2txt.r filename.ndb
The script writes the translated database to filename.txt.

The script does not translate "hex[x-y]hex" patterns; I did not find any in the
Sanesecurity files.
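
As an illustration, here is a hypothetical entry (made up for this mail, not a
real signature) and its translation:

  Example.Sig:3:*:68656c6c6f*776f726c64(41|42)

becomes

  Example.Sig:3:*:hello{*}world(A|B)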

------------------------------------------------------------------------------
REBOL [
  Purpose: {Translate ndb file into human-readable form}
]

file: to-rebol-file any [system/script/args %junk.ndb]
hex: charset "0123456789ABCDEFabcdef"
trans: make string! length? file

foreach line read/lines file [
  parse line [
    ; keep "Name:Type:Offset:" (up to the third colon) as-is
    copy nfo 3 thru ":" (append trans nfo)
    some [
      ; run of hex digits: decode to text via a binary literal
      copy sig some hex (append trans to-string load join "#{" [sig "}"])
      ; {n}, {n-}, {-n} jumps: copy through unchanged
      | "{" copy sig thru "}" (append trans join "{" sig)
      ; (xx|yy|...) alternations: decode each hex pair with dehex
      | "(" copy sig 2 skip (append trans dehex join "(%" sig)
        some [
          "|" copy sig 2 skip (append trans dehex join "|%" sig)
        ] ")" (append trans ")")
      ; * wildcard: write it as {*} so it stands out
      | "*" (append trans "{*}")
    ]
  ]
  append trans newline
]
write replace file %.ndb %.txt trans
--------------------------------------------------------------------------------

The second script optimizes an ndb file by removing duplicates. Save it to
remdupl.r.
Usage (no wildcards):
rebol remdupl.r file1.ndb file2.ndb file3.ndb ...
The script generates log files file1.log file2.log ...
and writes the optimized files to file1.new file2.new ...
If you want to distinguish cross-type signatures (recommended), remove the 2
occurrences of the number 3 from the script (the 3 in each of the two
"copy nfo 3 thru" parse rules). I decided to present it in this form because it
is easier to remove 2 characters from the script than to explain where to insert
them.
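
For clarity (my annotation, not part of the script itself): after removing the
two 3s, the collection rule becomes

  parse line [copy nfo thru ":" copy sig to end]

so nfo holds only "Name:" and sig begins with the type digit, which makes the
prefix comparison type-aware.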

--------------------------------------------------------------------------------
REBOL [
  Purpose: {Remove duplicates from ndb files}
]

; one pass per file name given on the command line
foreach file parse any [system/script/args %junk.ndb] "" [
  sigs: make block! 2000
  duplicates: make hash! 200
  log: make string! 2000
  ; collect [nfo sig] pairs: nfo is "Name:Type:Offset:",
  ; sig is the hex signature body
  foreach line ndb: read/lines file: to-rebol-file file [
    parse line [copy nfo 3 thru ":" copy sig to end]
    repend sigs [nfo sig]
  ]
  ; sort the pairs by signature body so duplicates
  ; (and prefixes) become neighbours
  sort/skip/compare sigs 2 2
  lastsig: lastnfo: ""
  foreach [nfo sig] sigs [
    ; if sig starts with the previous, shorter signature,
    ; it is redundant: the shorter one already matches it
    either find/match sig lastsig [
      append duplicates nfo
      repend log [lastnfo " = " nfo newline]
    ][
      lastnfo: nfo
      lastsig: sig
    ]
  ]
  ; drop every line whose prefix was marked as a duplicate
  remove-each line ndb [
    parse line [copy nfo 3 thru ":"]
    find duplicates nfo
  ]
  write/lines replace file %.ndb %.new ndb
  write replace file %.new %.log log
  ; append the summary line to the log file
  write/append file join length? duplicates [
    " duplicates from " length? ndb " signatures removed."
  ]
]
--------------------------------------------------------------------------------

There could be even more duplicates, where one signature is included inside
another (not at the beginning), but detecting those would be more time-consuming,
so I gave up.
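
A naive check would compare every pair of signature bodies with find, which is
quadratic in the number of signatures; a sketch (untested, reusing the sigs block
from the script above):

  foreach [nfo1 sig1] sigs [
    foreach [nfo2 sig2] sigs [
      ; report when one body occurs anywhere inside another
      if all [nfo1 <> nfo2  find sig2 sig1] [
        print [nfo1 "is contained in" nfo2]
      ]
    ]
  ]

For ~10000 signatures that is on the order of 10^8 substring searches, which is
why I did not pursue it.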

