[yunqa.de] Re: unable to make the extract text demo working / charset issue?

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Tue, 25 Jan 2011 10:31:41 +0100

On 24.01.2011 11:53, Laurent Breysse wrote:

> Do you mean that, for Unicode versions of Delphi, as long as the
> SourceBuffer of the parser contains a Unicode string and the ReadMethod
> is also set to Unicode (UTF-16 LE), the CharSet plugin should never be
> used?

Yes and no. It depends on how you load your HTML into the UnicodeString.
Let me try to explain using an example:

Think of some HTML in UTF-8 encoding. If you store it in a
WideString or UnicodeString, in theory three things can happen:

1. The UTF-8 sequences are properly converted to UTF-16 so that multiple
UTF-8 bytes are converted to their UTF-16 equivalents:

   'ä': C3 A4 (8-bit) -> 00E4 (16-bit)

2. The individual UTF-8 bytes are "moved" to the UnicodeString without
conversion. In this case, the UnicodeString is just (mis)used as a
storage container for a properly encoded UTF-8 text:

   'ä': C3 A4 (8-bit) -> C3A4 (16-bit)

3. The individual UTF-8 bytes are just "expanded" to 16 bit but not
converted:

   'ä': C3 A4 (8-bit) -> 00C3 00A4 (16-bit)
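The three cases can be reproduced at the byte level. The following is a minimal sketch in Python rather than Delphi, since it only illustrates the 16-bit code units and does not touch any DIHtmlParser API. The helper name utf16_units is made up for this example, and case 3 assumes that the bytes are widened as if they were Latin-1:

```python
import struct

# 'ä' encoded as UTF-8: the two bytes C3 A4.
utf8_bytes = "\u00e4".encode("utf-8")

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# Case 1: proper conversion from UTF-8 to UTF-16.
case1 = utf16_units(utf8_bytes.decode("utf-8"))

# Case 2: the UTF-8 bytes are packed pairwise into 16-bit units without
# conversion (read high byte first here, to match the notation above).
case2 = list(struct.unpack(">%dH" % (len(utf8_bytes) // 2), utf8_bytes))

# Case 3: each byte is widened to 16 bits on its own, without conversion
# (assumption: this is equivalent to decoding the bytes as Latin-1).
case3 = utf16_units(utf8_bytes.decode("latin-1"))

for name, units in (("1", case1), ("2", case2), ("3", case3)):
    print("case", name, " ".join("%04X" % u for u in units))
```

Only case 1 yields the single code unit 00E4 that a UTF-16 reader expects for 'ä'.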

All three examples must be handled differently:

1. Set ReadMethods to UTF_16_LE and do not use TDIHtmlCharSetPlugin.

2. Set ReadMethods to UTF_8 and use TDIHtmlCharSetPlugin.

3. This cannot be handled because the UnicodeString is neither proper
UTF-8 nor UTF-16.

> As long as I get all my HTTP strings from 'Unicode providers' (for
> example, myunicodestring := idHTTP.get(aurl), or
> TStringList.LoadFromFile(afile)), can I simply forget about the
> TDIHtmlCharSetPlugin?

No. You have to think about how Delphi stores the text in the
UnicodeString. Does it apply character conversion as in example 1? Does
it apply the correct character set?

By default, Delphi's automatic string conversion takes only a few
character encodings into consideration: UTF-16, UTF-8, and the user's
default code page. (Un)fortunately, HTML allows for many more
encodings which Delphi's converter will not recognize. This can easily
result in conversion errors.
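To see why a wrong guess matters, here is a small sketch (again in Python, purely as an illustration of the byte-level effect and not of Delphi's converter). It takes 'ä' encoded as ISO-8859-1, one of the many encodings HTML permits, and decodes the same byte under three different assumptions. The code pages are chosen only as examples:

```python
# 'ä' encoded as ISO-8859-1 is the single byte E4.
latin1_bytes = "\u00e4".encode("iso-8859-1")

# A Western European default code page happens to give the right result.
as_cp1252 = latin1_bytes.decode("cp1252")

# A Cyrillic default code page silently yields a different letter.
as_cp1251 = latin1_bytes.decode("cp1251")

# Treating the byte as UTF-8 fails outright: E4 starts a multi-byte
# sequence that is never completed.
try:
    as_utf8 = latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None

print(repr(as_cp1252), repr(as_cp1251), repr(as_utf8))
```

The same input byte is correct, silently wrong, or an error, depending only on which encoding the converter assumes.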

Unless you know exactly how Delphi handles its behind-the-scenes string
conversion, I suggest that you avoid Delphi's string types as storage
containers for HTML. Sadly, this is not possible for the ExtractText
demo because it relies on TMemo for HTML input.

But if you receive your HTML from sources other than a string (for
example, the Internet or storage media), you are best advised to load it
into any type of TStream and assign the stream to

  TDIHtmlParser.SourceStream

The advantage of TStream is that it never converts character encodings
(well, except for TStringStream). It is also fast, flexible and
supported by major Internet components like Indy, Synapse, and others.

Ralf
_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa


