[yunqa.de] Re: unable to make the extract text demo work / charset issue?

  • From: Laurent Breysse <lbreysse@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Mon, 24 Jan 2011 11:53:32 +0100

On 23/01/2011 15:13, Delphi Inspiration wrote:
On 23.01.2011 11:05, Laurent Breysse wrote:
I've started evaluating DIHtmlParser with Delphi 2010 Enterprise.

When using the extract text demo, the parsed text is not correct: the
raw HTML code is displayed with a whitespace separator after each
displayed character.

If I clear the charset attribute of the pasted HTML code (i.e. replacing
"charset=UTF-8_or_any_other_charset" with "charset="), the text is
correctly extracted.

I've tried with various HTML pages using different charsets (Google home
page, eBay, ...), and also directly imported the HTML content using the
Indy HTTP component (memoHtmlInput.text := idhttp.get(myurl)), with no
change in the result.

What am I doing wrong with this demo?
You are not doing anything wrong with the DIHtmlParser ExtractText demo,
the problem is with Delphi 2010's automatic Unicode conversion when text
is pasted into the Unicode TMemo.

The ExtractText demo is hit by the fact that a Delphi 2010 TMemo handles
all text as Unicode, including any ANSI text pasted into it. So the HTML
which is passed to the function HtmlExtractText() is always Unicode.
This generally works very well with DIHtmlParser as the proper
ReadMethod (Read_UTF_16_LE) is passed along as well.

The problem only shows if the HTML contains an explicit character
encoding in the form of, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the demo sees this, it switches to UTF-8 decoding as instructed.
Unfortunately, the TMemo has already converted the UTF-8 chars to
Unicode (UTF-16) chars, which are now decoded a second time as if they
were still UTF-8. This inevitably leads to the kind of problems you have
described.
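
The effect described above can be reproduced outside Delphi. The
following is a minimal Python sketch, not DIHtmlParser code: it assumes
the memo hands over UTF-16LE data and that the decoder then reads those
bytes as UTF-8. The names html, raw and misread are illustrative only.

```python
# Illustration only: mimics reading UTF-16LE memo content as if it were UTF-8.
html = '<meta charset="utf-8"><p>Hi</p>'

# What a Unicode control holds in memory: two bytes per ASCII character.
raw = html.encode("utf-16-le")

# Misreading those bytes as UTF-8 turns every second byte into a NUL
# character, which a control may render as a blank after each character.
misread = raw.decode("utf-8")

print(repr(misread[:12]))
print(len(misread) == 2 * len(html))
```

Because the NUL characters also split up every tag name, a parser fed
this text would likely no longer recognize the markup, which would
explain why the raw HTML shows up in the extracted text.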

The solution is to not apply automatic character set recognition for
Unicode Delphis. I have changed the code accordingly and attached it to
this message. CharSet evaluation is now turned off for all Unicode
controls and has turned into an option for ANSI (pre-Delphi 2009 with no
TntUnicode defined).
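
The idea behind this change can be sketched as follows. This is a
hypothetical Python illustration, not the attached Delphi demo code: the
helper to_text, the META_CHARSET pattern and the cp1252 default are all
assumptions made for the example. A declared charset is honoured only
when the input is still raw bytes, and ignored when the text has already
been decoded.

```python
import re

# Hypothetical helper, not part of DIHtmlParser.
# Finds a declared charset such as: charset=utf-8
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
                          re.IGNORECASE)

def to_text(html, default="cp1252"):
    """Return HTML as text, honouring a declared charset only for raw bytes."""
    if isinstance(html, str):
        # Already decoded (the Unicode control case): ignore the declaration.
        return html
    match = META_CHARSET.search(html)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return html.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or invalid bytes: fall back to the default.
        return html.decode(default, errors="replace")

page = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"><p>\u00e9</p>')
print(to_text(page) == page)                  # text input passes through
print(to_text(page.encode("utf-8")) == page)  # bytes are decoded once
```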

Ralf

Ralf, thanks for your answer and the fixed demo form.

Do you mean that, for Unicode versions of Delphi, as long as the SourceBuffer of the parser contains a Unicode string and the ReadMethod is also set to Unicode (UTF-16 LE), the CharSet plugin should never be used? As long as I get all my HTTP strings from 'Unicode providers' (for example, myunicodestring := idHTTP.get(aurl), or TStringList.LoadFromFile(afile)), can I simply forget about the TDIHtmlCharSetPlugin?

Thanks,
LB



