[yunqa.de] Re: Access violation calling DIPerlRegEx.Match

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Sat, 29 May 2010 10:42:39 +0200

At 18:24 28.05.2010, Jim Bretti wrote:

>On the utf8 recommendation, are you saying I need to use coUTF8 if my 
>subject/match string contains ordinal values > 127?

Yes, UTF-8 is the recommended way for DIRegEx to handle characters outside the 
US-ASCII range, that is, any character with a code point greater than 127.

In non-UTF-8 mode, DIRegEx interprets characters according to the ISO-8859-1 
(Latin-1) codepage, which covers the first 256 Unicode code points (0 to 255).
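To make the difference concrete, here is a minimal console sketch using only 
the standard Delphi RTL (UTF8Encode, no DIRegEx calls): a code point above 127 
is a single byte in Latin-1 but needs two bytes in UTF-8, which is where the 
extra encoding and character-counting work comes from.

  program Utf8ByteCount;

  {$APPTYPE CONSOLE}

  var
    W: WideString;
    U: UTF8String;
  begin
    { 'Café': the last character, U+00E9, lies above code point 127. }
    W := 'Caf' + WideChar($00E9);

    { UTF8Encode is part of the standard RTL. }
    U := UTF8Encode(W);

    WriteLn('Characters : ', Length(W));  { 4 characters }
    WriteLn('UTF-8 bytes: ', Length(U));  { 5 bytes: U+00E9 takes two bytes in UTF-8 }
  end.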

You can use the set_locale() function to switch to a single-byte locale 
supported by the target Windows operating system. This rebuilds the internal 
character comparison and character type tables, but does not affect the Unicode 
Character Properties (UCP), which you are using in your pattern.
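Purely as an illustration, a hedged sketch of what that could look like. The 
unit name, the exact signature of set_locale(), and the locale string format 
are assumptions on my part, so please check the DIRegEx documentation and demos 
for the real declaration:

  program LocaleSketch;

  {$APPTYPE CONSOLE}

  uses
    DIRegEx;  { assumed unit name for set_locale() }

  begin
    { Hypothetical call, signature assumed: rebuild the internal character
      comparison and character type tables for a single-byte Windows locale.
      The locale string is only an example; \p{...} (UCP) matching in
      patterns is not affected by this. }
    set_locale('German_Germany.1252');
  end.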

>I'm using the unicode / utf8 options only when necessary since I seem to be 
>getting better performance when I don't go through the utf8 encoding and 
>character counting.

Binary (non-UTF-8) matching is obviously faster since it can work on a much 
smaller character range. But then, simple string comparisons are usually faster 
than regular expressions. This is a little like comparing apples and pears.

For text data, I would not go the extra mile of supporting separate UTF-8 and 
non-UTF-8 versions. It adds extra complexity to the code for just a minor 
performance benefit, given today's processor speeds. Most texts are Unicode 
these days already, and those that are not are likely to become so in the future.

For binary data, non-UTF-8 matching is obviously the best choice. But then one 
would not expect to apply Unicode Character Properties to binary data, after 
all. ;-)

Ralf  

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa


