[antispam-f] Re: RegEx Testing

  • From: Jeremy C B Nicoll <Jeremy@xxxxxxxxxxxxxxxx>
  • To: antispam@xxxxxxxxxxxxx
  • Date: Mon, 06 Nov 2006 17:36:23 +0000 (GMT)

In article <4e819f7af7freelists@xxxxxxxxxxxxxxxx>,
   Martin <freelists@xxxxxxxxxxxxxxxx> wrote:
> Now that AS v1.59a8 has RegEx support, I wondered if anyone was
> interested in a little application I wrote to help test Regular
> Expressions in a standalone way before committing them to a Rules
> file? (... which I needed when helping to add RegEx support to AS!)

> It is currently in Beta status, but seems reasonably stable now, and
> I would be interested in any comments before I formally release it.

> If anyone would like a copy, please email me direct.

<brain dump on>

I would like a copy, but I'll have no opportunity to play with it in
the next few days.

For a long time I had Pluto's regex support in use, but after a while I
turned a lot of my regex rules off.  Why?  Because I wasn't completely
convinced that they made filter checks faster.  The suspected reason
will also affect people using regex in AntiSpam (unless there's an
option in the way that AS calls regex that defeats this).  The problem
is that when a person sees a pattern like

  A.*B.*

(meaning an A followed by many characters followed by a B followed by
other stuff) and considers whether that would match a string like

 A fkhfjg fgfj ghfkgfjgjfg kjgfgfgjfgjfjgfjgfg B z

they can see at a glance that the answer is yes.  But if the test
string was
 
 A B A B A lkgdgf gfjglfgklfgfkgkfgfkgfgkfkgfgfgklfgjfg A B lfjglk B a

regex will first decide that the string matches the initial

 A B

then have another look and decide it matches: A B A B, then have
another look, and so on until it sees that almost the entire string
matches.  That's because it looks for the longest match.  If the whole
regular expression is complicated it's just possible that the regex
module will take a long time to find the longest possible match rather
than simply say "yes, there's A match" and not try to find a better
one.  For both Pluto's use and AS's use, users don't care what the
matching expression is, just that something matched...
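To make the point concrete, here is a minimal sketch using Python's "re"
module.  That is an assumption on my part: it is a backtracking engine and
may not behave like the regex code in Pluto or AS.  It shows that the
greedy pattern ends up claiming the whole test string, while a caller who
only wants a yes/no answer can stop at the first B.  The helper name
rule_matches is hypothetical.

```python
import re

text = "A B A B A lkgdgf gfjglfgklfgfkgkfgfkgfgkfkgfgfgklfgjfg A B lfjglk B a"

# Greedy form: each .* grabs as much as it can, so the match that is
# finally reported stretches to the end of the string.
greedy = re.search(r"A.*B.*", text)

# Non-greedy form with the trailing .* dropped: stops at the first B.
lazy = re.search(r"A.*?B", text)

print(repr(greedy.group(0)))
print(repr(lazy.group(0)))

def rule_matches(pattern, subject):
    """Yes/no test: the caller never looks at what was matched."""
    return re.search(pattern, subject) is not None

print(rule_matches(r"A.*?B", text))
```

Both patterns give the same yes/no answer for a filter rule; they differ
only in how much of the string the engine has to account for.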

OTOH the test strings being compared with a regular expression should
be short (both in Pluto and here) so you'd wonder why that would really
matter - surely it wouldn't take long to find a long subexpression
rather than a short one.

Well, I dunno.  A problem (I thought) with the implementation of regex in
Pluto was that there was no way to trace its use of regex, and in
particular no way to timestamp those traces.  I found that once in a
while a debatch of data would be incredibly slow.  I thought it might (in
Pluto's case, probably not applicable here) happen when a "body" regex
test examined, character by character, the encoded representation of an
attached jpeg or other binary file... then a test looking for, say, a
munged drug name might have many hundreds of thousands of characters of
picture data to look at.  But I couldn't prove where the time was being
wasted.
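The kind of tracing I was missing could look something like the sketch
below.  It is Python, not anything Pluto or AS actually provides, and the
names timed_search, max_len and the sample rule are all hypothetical.  It
records how long each regex test took and how much text it was given, and
optionally caps the amount of body text a rule is allowed to examine.

```python
import re
import time

def timed_search(compiled, subject, log, max_len=None):
    """Run one regex test, recording pattern, subject length, time and result."""
    if max_len is not None:
        subject = subject[:max_len]   # cap how much text the rule may scan
    start = time.perf_counter()
    matched = compiled.search(subject) is not None
    elapsed = time.perf_counter() - start
    log.append((compiled.pattern, len(subject), elapsed, matched))
    return matched

log = []
rule = re.compile(r"p[i1!]lls", re.IGNORECASE)   # stand-in for a "munged name" rule

# Stand-in for a body containing a large encoded attachment.
body = "QUJD" * 50000 + " P1LLS"

timed_search(rule, body, log)                  # scans the whole body
timed_search(rule, body, log, max_len=4096)    # scans only the first 4096 characters

for pattern, length, elapsed, matched in log:
    print(pattern, length, f"{elapsed:.6f}s", matched)
```

Note that the cap is a trade-off: in this example the capped test no
longer sees the text at the end of the body, so it reports no match.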

Pluto (I think) compiled regex tests once and then applied them.
Another possibility is that once in a while a compiled expression got
corrupted so it meant something else and the regex pattern-match code
then set off on a wild goose chase.  Or maybe it set off with an ok
pattern but missed the end of the search string and tried to match all
of memory.

Whatever, I'd suggest caution!

Also (to Frank) are you setting things up so that regex tests look at
the pure incoming headers, or those that have already been folded to
all upper or all lower case?  If the stuff being tested is all one or
other case, is there anything to stop people using the extended regex
stuff that says, say, match an "a" or an "A" - because obviously
that's a waste of effort.  Certainly a pattern explicitly containing e.g.
"[aA]" would be a waste.  I think that with Pluto JSD set things up so
that whatever a user coded as a pattern was invisibly extended that
way, and that caused a problem too: a user could define a - say - 55
character search pattern which became much longer than that when Pluto
expanded it - and that overflowed the buffer JSD had allowed for the
user-supplied pattern to live in.
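A rough illustration of how much such an expansion grows a pattern, again
in Python.  The function expand_case is hypothetical; I don't know how
Pluto actually did its expansion.  Each letter becomes a four-character
class, so a 55-letter pattern becomes 220 characters.  Asking the engine
for a case-insensitive match leaves the pattern at its original length.

```python
import re

def expand_case(literal):
    """Naively turn each letter into a two-letter class, e.g. 'a' -> '[aA]'."""
    out = []
    for ch in literal:
        if ch.lower() != ch.upper():
            out.append("[" + ch.lower() + ch.upper() + "]")
        else:
            out.append(re.escape(ch))   # non-letters are kept as literals
    return "".join(out)

user_pattern = "x" * 55                 # a 55-character user-supplied pattern
expanded = expand_case(user_pattern)

print(len(user_pattern), len(expanded))

# Alternative: leave the pattern alone and ask for a case-insensitive match.
insensitive = re.compile(user_pattern, re.IGNORECASE)

subject = "Xx" * 30                     # 60 characters of mixed case
print(re.search(expanded, subject) is not None)
print(insensitive.search(subject) is not None)
```

Either approach matches the mixed-case subject; only the first one needs
a buffer four times the size of what the user typed.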
 
I'm sure you already know that mixed case patterns need to be supported.

-- 
Jeremy C B Nicoll, Edinburgh, Scotland - my opinions are my own.
