On 23/01/2011 15:13, Delphi Inspiration wrote:
On 23.01.2011 11:05, Laurent Breysse wrote:

I've started evaluating the DiHTMLParser with Delphi 2010 Ent.

When using the extract text demo, the parsed text is not correct: the raw HTML code is displayed with whitespace separators, one after each displayed character.

If I clear the charset attribute of the pasted HTML code (i.e. replacing "charset=UTF-8_or_any_other_charset" with "charset="), the text is extracted correctly.

I've tried various HTML pages using different charsets (Google home page, eBay, ...), and also imported the HTML content directly using the Indy HTTP component (memoHtmlInput.text := idhttp.get(myurl)), with no change.

What am I doing wrong with this demo?

You are not doing anything wrong with the DIHtmlParser ExtractText demo. The problem is with Delphi 2010's automatic Unicode conversion when text is pasted into the Unicode TMemo.

The ExtractText demo is hit by the fact that a Delphi 2010 TMemo handles all text as Unicode, including any ANSI text pasted into it. So the HTML passed to the function HtmlExtractText() is always Unicode. This generally works very well with DIHtmlParser, as the proper ReadMethod (Read_UTF_16_LE) is passed along as well.

The problem only shows if the HTML contains an explicit character encoding in the form of, for example:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the demo sees this, it switches to UTF-8 decoding as instructed. Unfortunately, the TMemo has already converted the UTF-8 chars to Unicode chars, which are now decoded as if they were UTF-8. This inevitably leads to the kind of problems you have described.

The solution is to not apply automatic character set recognition for Unicode Delphis. I have changed the code accordingly and attached it to this message. CharSet evaluation is now turned off for all Unicode controls and has turned into an option for ANSI (pre Delphi 2009 with no TntUnicode defined).

Ralf
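The mismatch described above can be reproduced at the byte level. The sketch below is written in Python rather than Delphi and does not use DIHtmlParser at all; it only assumes that the symptom comes from a UTF-16 LE buffer (what a Unicode TMemo hands over) being read as UTF-8 because of the charset meta tag. The variable names are illustrative.

```python
# Illustration only: plain Python, not DIHtmlParser code.
html = '<p>Hi</p>'

# A Unicode Delphi string (and therefore TMemo.Text) is stored as UTF-16 LE.
utf16_buffer = html.encode('utf-16-le')

# Correct: read the buffer with the encoding it actually has.
read_correctly = utf16_buffer.decode('utf-16-le')

# Wrong: trust "charset=utf-8" and read the very same bytes as UTF-8.
# Each ASCII character is followed by a 0x00 byte, which decodes to NUL.
read_as_utf8 = utf16_buffer.decode('utf-8')

# Many controls render NUL as a blank, so show it as a space here.
shown = read_as_utf8.replace('\x00', ' ')

print(repr(read_correctly))
print(repr(shown))
```

With NUL rendered as a blank, every character of the markup appears followed by a separator, and the tags are no longer recognizable as tags, which matches the "raw HTML with whitespace separators" symptom reported for the demo.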
Ralf, thanks for your answer and the fixed demo form.

Do you mean that, for Unicode versions of Delphi, as long as the SourceBuffer of the parser contains a Unicode string and the ReadMethod is also set to Unicode (UTF-16 LE), the CharSet plugin should never be used?

As long as I get all my HTTP strings from 'Unicode providers' (for example, myunicodestring := idHTTP.get(aurl), or TStringList.LoadFromFile(afile)), can I simply forget about the TDiHtmlChartsetPlugin?

Thanks,
LB

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa