On 24.01.2011 11:53, Laurent Breysse wrote:

> Do you mean that, for unicode versions of delphi, as long as the
> SourceBuffer of the parser contains a unicode string, and the ReadMethod
> is also set to unicode (utf-16_le), the ChartSet plugin should never be
> used?

Yes and no. It depends on how you load your HTML into the UnicodeString. Let me try to explain using an example:

Think of some HTML in UTF-8 encoding. If you store it in a WideString or UnicodeString, in theory three things can happen:

1. The UTF-8 sequences are properly converted to UTF-16, so that each multi-byte UTF-8 sequence becomes its UTF-16 equivalent:

   'ä': C3 A4 (8-bit) -> 00E4 (16-bit)

2. The individual UTF-8 bytes are "moved" to the UnicodeString without conversion. In this case, the UnicodeString is just (mis)used as a storage container for properly encoded UTF-8 text:

   'ä': C3 A4 (8-bit) -> C3A4 (16-bit)

3. The individual UTF-8 bytes are just "expanded" to 16 bit but not converted:

   'ä': C3 A4 (8-bit) -> 00C3 00A4 (16-bit)

All three cases must be handled differently:

1. Set ReadMethods to UTF_16_LE and do not use TDIHtmlCharSetPlugin.
2. Set ReadMethods to UTF_8 and use TDIHtmlCharSetPlugin.
3. This cannot be handled because the UnicodeString is neither proper UTF-8 nor proper UTF-16.

> As far as I get all my HTTP strings from 'unicode providers' (for ex,
> myunicodestring := idHTTP.get(aurl), or
> TStringList.LoadFromFile(afile)), can I simply forget about the
> TDiHtmlChartsetPlugin?

No. You have to think about how Delphi stores the text in the UnicodeString. Does it apply character conversion as in case 1? Does it apply the correct character set?

By default, Delphi's automatic string conversion takes just a few character encodings into consideration: UTF-16, UTF-8, plus the user's default code page. (Un)fortunately, HTML allows for many more encodings, which Delphi's converter will not recognize. This can easily result in conversion errors.
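To make the three cases above concrete, here is a small sketch that reproduces them at the byte level. It is written in Python rather than Delphi purely for illustration, and it says nothing about how Delphi or DIHtmlParser work internally. The helper name code_units is made up for this example, and "expanding" each byte to 16 bits is modelled as a Latin-1 decode, which is an assumption that happens to give the same 16-bit values.

```python
def code_units(text: str) -> list[str]:
    """Return the UTF-16 code units of text as 4-digit hex strings.

    Hypothetical helper for this illustration only.
    """
    raw = text.encode("utf-16-be")
    return [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]


# The UTF-8 encoding of 'ä' (U+00E4) is the two bytes C3 A4.
utf8_bytes = "\u00e4".encode("utf-8")

# Case 1: the bytes are decoded as UTF-8, giving one proper UTF-16 unit.
converted = utf8_bytes.decode("utf-8")

# Case 2: the bytes are moved unchanged, so both share one 16-bit slot.
moved = utf8_bytes.hex().upper()

# Case 3: each byte is widened to 16 bits without any conversion
# (modelled here as a Latin-1 decode, which maps byte N to U+00NN).
expanded = utf8_bytes.decode("latin-1")

print("case 1:", code_units(converted))  # ['00E4']
print("case 2:", moved)                  # C3A4
print("case 3:", code_units(expanded))   # ['00C3', '00A4']
```

In case 3 the string is valid UTF-16, but it spells the two characters 'Ã¤' instead of 'ä', which is why neither read method describes it correctly.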

Unless you know exactly how Delphi handles its behind-the-scenes string conversion, I suggest avoiding Delphi's string types as storage containers for HTML. Sadly, this is not possible for the ExtractText demo because it relies on TMemo for HTML input.

But if you receive your HTML from a source other than a string (for example the Internet or storage media), you are best advised to load it into any type of TStream and assign the stream to TDIHtmlParser.SourceStream.

The advantage of TStream is that it never converts character encodings (well, except for TStringStream). It is also fast, flexible, and supported by major Internet components like Indy, Synapse, and others.

Ralf

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa