[yunqa.de] Re: DIRegEx - translating utf8 match info to utf16 - yunqa

From: Delphi Inspiration <delphi@xxxxxxxx>
To: yunqa@xxxxxxxxxxxxx
Date: Fri, 16 Nov 2007 17:31:51 +0100

Hello Jim Bretti,

>I noticed that I can write my source widestrings to memory streams, and use
>TDiRegExSearchStream_Utf16LE instead of TDIPerlRegEx.  The SearchNext method
>for the stream regex seems to return what I need ...

Yes, this certainly works.

>is this the best way to handle?

Please note that TDIRegExSearchStream_Utf16LE uses the DFA matching algorithm 
instead of the Perl one. While mostly similar, DFA differs from Perl in some 
important aspects: the greedy or ungreedy nature of repetition quantifiers is 
not relevant, there is no substring capturing and no support for 
backreferences, plus a few others. To see the full list of differences, search 
the DIRegEx help for "pcrematching".

I also believe that there is a performance trade-off between 
TDIRegExSearchStream_Utf16LE and the BufCountUtf8Chars approach I suggested in 
my other answer. The DFA algorithm is somewhat slower than Perl, and 
TDIRegExSearchStream_Utf16LE has some overhead for stream handling and 
on-the-fly character conversion. On the other hand, it only needs to perform a 
single scan through the subject string.

I therefore expect that TDIRegExSearchStream_Utf16LE will be faster if

* the subject string is rather long.
* positions for multiple matches need to be determined.

For single matches in short strings I expect the BufCountUtf8Chars approach to 
be faster.
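
To illustrate where the extra scan comes from: a match position reported as a 
UTF-8 byte offset can only be translated to a UTF-16 position by walking over 
all bytes in front of it. The following Python sketch is an illustration only, 
not DIRegEx code. The function name utf8_offset_to_utf16 is hypothetical, and 
the sketch assumes well-formed UTF-8 and an offset that falls on a character 
boundary.

```python
def utf8_offset_to_utf16(data: bytes, byte_offset: int) -> int:
    """Translate a byte offset in UTF-8 data to a UTF-16 code unit offset.

    Hypothetical helper for illustration. Assumes well-formed UTF-8 and an
    offset that lies on a character boundary.
    """
    units = 0
    for b in data[:byte_offset]:
        if b & 0xC0 == 0x80:
            # Continuation byte: part of a character already counted.
            continue
        # A lead byte of 0xF0 or above starts a 4-byte sequence, which
        # needs a surrogate pair (two code units) in UTF-16.
        units += 2 if b >= 0xF0 else 1
    return units


# Example: locate "b" by byte offset, then translate the offset.
subject = "a\u00e9\u20ac\U0001F600b"
data = subject.encode("utf-8")
byte_pos = data.find(b"b")
utf16_pos = utf8_offset_to_utf16(data, byte_pos)
```

Each call rescans the subject from its start, so the cost grows with both the 
length of the subject and the number of matches. That is why the single-pass 
stream search can come out ahead for long subjects with many matches.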

Ralf

PS: My apologies for not answering sooner, I have been sick for a few days. 

