[yunqa.de] Re: unable to make the extract text demo work / charset issue?

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Sun, 23 Jan 2011 15:13:58 +0100

On 23.01.2011 11:05, Laurent Breysse wrote:

> I've started evaluating the DiHTMLParser with Delphi 2010 Ent
>
> When using the extract text demo, the parsed text is not correct: the
> raw html code is displayed with white space separators, one for each
> displayed character
>
> If I clear the charset attribute of the pasted html code (i.e. replacing
> "charset=UTF-8_or_any_other_charset" with "charset="), the text is
> correctly extracted
>
> I've tried with various html pages using different charsets (google home
> page, ebay, ...), and also directly imported the html content using the
> indy http component (memoHtmlInput.text := idhttp.get(myurl)) with no
> change.
>
> What am I doing wrong with this demo?

You are not doing anything wrong with the DIHtmlParser ExtractText demo.
The problem is Delphi 2010's automatic Unicode conversion when text
is pasted into the Unicode TMemo.

The ExtractText demo is affected because a Delphi 2010 TMemo handles
all text as Unicode (UTF-16), including any ANSI text pasted into it. So
the HTML passed to the function HtmlExtractText() is always Unicode.
This generally works very well with DIHtmlParser, as the proper
ReadMethod (Read_UTF_16_LE) is passed along as well.

The problem only shows if the HTML contains an explicit character
encoding declaration, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the demo sees this, it switches to UTF-8 decoding as instructed.
Unfortunately, the TMemo has already converted the UTF-8 chars to Unicode
chars, which are now decoded as if they were still UTF-8. This inevitably
leads to the kind of problems you have described.
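
The symptom can be reproduced outside Delphi. The following Python sketch
is only an illustration, not DIHtmlParser code, and its variable names are
made up for the example. It encodes a short HTML snippet as UTF-16 LE, as
a Unicode control would hold it, and then reads those bytes as UTF-8, as a
parser would after trusting the charset declaration.

```python
# Illustration only: this is not DIHtmlParser code.
html = '<p>Hi</p>'

# A Unicode control holds the text as UTF-16 LE:
# two bytes per ASCII character, the second one being 0x00.
utf16_bytes = html.encode('utf-16-le')

# Trusting a "charset=utf-8" declaration means reading those bytes as UTF-8.
# Each 0x00 byte then decodes to a separate NUL character.
misread = utf16_bytes.decode('utf-8')

# If the NULs are displayed as blanks, every character appears
# to be followed by a space, and the tags are no longer recognizable.
shown = misread.replace('\x00', ' ')
print(repr(shown))
```

The misread string is twice as long as the original, which matches the
"one separator per displayed character" effect described above.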

The solution is to not apply automatic character set recognition in
Unicode Delphis. I have changed the code accordingly and attached it to
this message. CharSet evaluation is now turned off for all Unicode
controls and has become an option for ANSI builds (pre Delphi 2009 with
no TntUnicode defined).
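
The idea behind this change can be sketched in a few lines of Python.
This is not the attached Delphi code: the helper to_text(), its regular
expression, and the cp1252 default are assumptions made for the
illustration. Text that is already Unicode is returned unchanged, and a
declared charset is honored only when the input is raw bytes.

```python
import re

# Hypothetical helper for illustration; not part of DIHtmlParser.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def to_text(html, default='cp1252'):
    """Return HTML as text, honoring a declared charset only for raw bytes."""
    if isinstance(html, str):
        # Already decoded (the Unicode control case):
        # any charset declaration in the markup is ignored.
        return html
    match = META_CHARSET.search(html)
    charset = match.group(1).decode('ascii') if match else default
    try:
        return html.decode(charset, errors='replace')
    except LookupError:
        # Unknown charset name: fall back to the assumed default.
        return html.decode(default, errors='replace')
```

Called with a string, to_text() never looks at the meta tag; called with
bytes, it decodes them with the declared charset or the default.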

Ralf
