[yunqa.de] Re: DITidy to process www.163.com

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Sat, 22 Aug 2009 11:06:56 +0200

At 07:18 20.08.2009, Bear Xu wrote:

>1. I just copy the source code of the page in IE and paste it to Delphi TMemo, 
>and use it to call Tidy functions (to process Unicode html you provided to me 
>last time)

Copying and pasting into Delphi's TMemo can easily trigger an implicit
character conversion. In Delphi 2009 especially, we may not always want TMemo
to convert our 8-bit text to Unicode, because the text must be passed on
unchanged for further processing. Developers should be well aware of this when
writing applications that do charset-specific parsing, such as HTML and XML.
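
As a rough illustration of such an implicit conversion (Python rather than
Delphi; the function name and the cp1252 code page are assumptions made for
this sketch, not what TMemo actually does internally):

```python
# Hypothetical illustration only; this is neither DITidy nor VCL code.

def implicit_conversion(raw: bytes, assumed_codepage: str) -> bytes:
    """Mimic a control that decodes 8-bit input with a guessed code page
    and hands the text back as Unicode (serialized here as UTF-8)."""
    return raw.decode(assumed_codepage).encode("utf-8")

# Two Chinese characters as a GB2312-encoded page would carry them.
original = "网易".encode("gb2312")

# Passed on unchanged, the bytes still decode correctly as GB2312.
assert original.decode("gb2312") == "网易"

# A round trip under the wrong code page yields a different byte sequence,
# which a GB2312-aware parser can no longer interpret as intended.
converted = implicit_conversion(original, "cp1252")
print(original != converted)
```

Keeping the raw bytes in a byte-oriented container until they reach the parser
avoids this kind of silent change.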

>2. I run your sourcecode, the same result:
> 
>all of the end tag is wrong!! ==>
> 
><\/table>
><\/center><\/div>
><\/div>
>
></div> == became ==><\/div>
>do not know why ?

Tidy has known problems with '/script>' inside of script element contents, for 
example:

  http://sourceforge.net/tracker/?func=detail&aid=2712780&group_id=27659&atid=390963

Your HTML contains a script element with the following content:

  document.write("<script type='text/javascript' src='http://61.135.253.47/ipquery'><" + "/script>");

Unfortunately, Tidy does not recognize that '/script' is part of a quoted
string and gets confused by it. It fails to determine the correct end of the
script contents and outputs the subsequent HTML closing tags as script
contents. While cleaning this up, it escapes the forward slashes as '\/', as
recommended by the HTML specification:

  http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.2
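
The escaping step can be sketched as follows. This is a Python approximation
of the rule in the note linked above, not Tidy's actual code; the function
name is made up for the example:

```python
import re

def escape_etago(script_text: str) -> str:
    """Hide every '</' that is followed by a letter by writing it as '<\\/',
    the way HTML 4.01 suggests for text inside a script element."""
    return re.sub(r"</(?=[A-Za-z])", lambda match: "<\\/", script_text)

# Closing tags that were wrongly treated as script contents ...
leaked = "</table>\n</center></div>\n</div>"

# ... come out with escaped slashes, as in the reported output.
print(escape_etago(leaked))
```

Applied to closing tags that were mistaken for script text, this yields
exactly the kind of '<\/div>' output reported.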

Unfortunately, many Tidy problems related to script contents are still 
unsolved. I will update the DITidy port as soon as fixes become available.

>3. when I pass unicode html source code to Tidy, will it to check the Meta 
>charset setings  in head section?
>I think it should not check that, or it is processing a html file.

According to the HTML specs, META charset information should change the 
parser's character decoding regardless of its initial setting.
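
To show why this matters for text that has already been converted, here is a
small Python sketch. The regular expression is a simplification and the
function name is hypothetical; Tidy's own detection works differently:

```python
import re
from typing import Optional

# Simplified pattern: finds charset=... inside a META tag.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)

def declared_charset(html: bytes) -> Optional[str]:
    """Return the charset named in a META element, or None if absent."""
    match = _META_CHARSET.search(html)
    return match.group(1).decode("ascii").lower() if match else None

page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    "<div>网易</div>"
)

# The page text was converted to UTF-8, but the META element was not updated.
utf8_bytes = page.encode("utf-8")
print(declared_charset(utf8_bytes))

# A parser that follows the declaration decodes the bytes incorrectly.
print(utf8_bytes.decode("gb2312", errors="replace") == page)
```

If the HTML was converted to another encoding before parsing, the META
declaration should be updated or removed so that it matches the actual bytes.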

>4. I continue test! 
>   Remove sourcecode piece by piece, finally I found :
>
>If there is no head, the all output will be empty
> 
>and it is caused by "</tbody></form></table>"
> 
>for example:
>==================================
><div> Test Content
></tbody></form></table>
></div>
> ==================================
>tidy it , the result is empty

For me it returns 

<html>
<head>
<title></title>
</head>
<body>
<div>Test Content</div>
</body>
</html>

>if I removed </form>, it will output "<div>Test Content</div>"

I get

<html>
<head>
<title></title>
</head>
<body>
<div>Test Content</div>
</body>
</html>

Both look fine to me.

>please have a check for Tidy SourceCode, thanks

Unfortunately, the latest changes in the Tidy source tree do not remedy the 
problems discussed here.

Ralf 



