[yunqa.de] Re: DIRegEx ansi_mbtowc

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Thu, 18 Oct 2007 13:45:09 +0200

>1) MultiByteToWideChar cchWideChar: Specifies the size, in wide characters, of 
>the buffer pointed to by the lpWideCharStr parameter. If this value is zero, 
>the function returns the required buffer size, in wide characters, and makes 
>no use of the lpWideCharStr buffer. 
> 
>But you pass there the SizeOf(c) which is **2**.

You are right, it should be **1** in both ansi_mbtowc and oem_mbtowc. Please 
correct your version of DIRegEx_SearchStream.pas. I have done this as well and 
the fix will be available with the next update of DIRegEx.

My tests show that this bug does not affect search results for single-byte 
character sets, as it only tries to read one char beyond the last character of 
the buffered stream. This might explain why the problem has gone unnoticed so 
far. With multi-byte encodings there is a small chance that the search fails on 
the last character, but this should be very rare indeed.

>2) Why not to call MultiByteToWideChar for the whole string, not for 
>individual chars? It may be faster.

It would certainly be faster to apply MultiByteToWideChar to the whole string. 
However, TDIRegExSearchStream_Enc is all about not loading entire 
strings (huge files) into memory at once but in small blocks only. Since we do 
not know where block boundaries fall relative to character boundaries, the 
decoding function reads exactly one character at a time.
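
To illustrate the block boundary problem, here is a minimal sketch in Python. It 
is not the DIRegEx code: it swaps in the incremental decoder from Python's 
standard codecs module, and the function name decode_in_blocks and the block 
size are made up for the example. The stream is read in small blocks, and the 
decoder holds back any incomplete trailing bytes until the next block arrives, 
so characters come out one at a time even when a block ends in the middle of a 
multi-byte sequence.

```python
import codecs
import io


def decode_in_blocks(stream, encoding, block_size):
    """Yield decoded characters one by one while reading small byte blocks.

    Hypothetical helper for illustration only, not part of DIRegEx.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # The decoder buffers a trailing partial character internally and
        # completes it once the following block supplies the missing bytes.
        for ch in decoder.decode(block):
            yield ch
    # final=True raises UnicodeDecodeError if the stream ends mid-character.
    for ch in decoder.decode(b"", final=True):
        yield ch


text = "naïve 日本語"
data = text.encode("utf-8")

# A block size of 3 bytes splits several multi-byte characters across blocks.
chars = list(decode_in_blocks(io.BytesIO(data), "utf-8", block_size=3))

# Decoding the first block on its own fails, because it ends inside "ï".
try:
    data[:3].decode("utf-8")
    naive_block_decode_ok = True
except UnicodeDecodeError:
    naive_block_decode_ok = False
```

Decoding each block independently breaks as soon as a character straddles two 
blocks, which is why a streaming decoder has to track character boundaries 
itself instead of converting whole buffers in one call.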

Btw: Major character conversion libraries function along the same principles. 
As a welcome side effect, TDIRegExSearchStream_Enc can use all character sets 
and converters in DIConverters, the Delphi implementation of the iconv library.

Ralf 

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa


