[yunqa.de] Re: DITidy to process www.163.com

  • From: Bear Xu <bear.xy@xxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Tue, 25 Aug 2009 14:31:33 +0800

Dear Ralf,

Please see the attachment sent to delphi@xxxxxxxx for my source code.
I can remove the scripts from the HTML file first. I always use idHTTP to fetch
the HTML source, remove the scripts, and then tidy it via the Unicode function
you provided. The core problem is that if there is a </form> before </table>
etc., the result comes out empty.
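
A minimal sketch of the script-removal step described above, written in Python
rather than Delphi because the attached source is not reproduced here. The
helper name strip_scripts and the regular expression are assumptions made for
illustration; they are not part of DITidy or of the attached code.

```python
import re

# Matches one whole script element: from "<script ...>" up to the first
# literal "</script>" end tag, across newlines and in any letter case.
SCRIPT_ELEMENT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html: str) -> str:
    """Return html with every script element removed (hypothetical helper)."""
    return SCRIPT_ELEMENT.sub("", html)


# The inner end tag is split as '<" + "/script>', as in the 163.com page,
# so the pattern only stops at the real closing tag of the outer element.
sample = (
    "<div>before</div>"
    "<script>document.write(\"<script src='x.js'><\" + \"/script>\");</script>"
    "<div>after</div>"
)
cleaned = strip_scripts(sample)
print(cleaned)
```

This only works when the script text never contains an unsplit "</script>";
a script that does contain one would be cut short at that point.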

If you fix this, tidying the 163.com source will be no problem.
I use DITidy 2.0 with Delphi 2009 on Vista SP2.

Thank you for your kind help.

Bear


On Sat, Aug 22, 2009 at 5:06 PM, Delphi Inspiration <delphi@xxxxxxxx> wrote:

> At 07:18 20.08.2009, Bear Xu wrote:
>
> >1. I just copied the source code of the page in IE, pasted it into a Delphi
> >TMemo, and used it to call the Tidy functions (to process Unicode HTML, which
> >you provided to me last time).
>
> Copying / pasting to Delphi's TMemo can easily result in an implicit
> character conversion. Especially for Delphi 2009, we may not always want
> TMemo to convert our 8-bit text to Unicode because we must pass it on
> unchanged to further processing. Developers should be very aware of this
> when writing applications for char-set specific parsing like HTML and XML.
>
> >2. I ran your source code and got the same result:
> >
> >all of the end tags are wrong!! ==>
> >
> ><\/table>
> ><\/center><\/div>
> ><\/div>
> >
> ></div> == became ==> <\/div>
> >I do not know why.
>
> Tidy has known problems with '/script>' inside of script element contents,
> for example:
>
>
> http://sourceforge.net/tracker/?func=detail&aid=2712780&group_id=27659&atid=390963
>
> Your HTML contains a script element with the following content:
>
>  document.write("<script type='text/javascript' src='http://61.135.253.47/ipquery'><" + "/script>");
>
> Unfortunately, Tidy does not recognize that '/script' is part of a quoted
> string but gets all confused about it. It fails to determine the correct end
> of the script contents and outputs subsequent HTML closing tags as script
> contents. Cleaning this up leads it to escape the forward slashes with '\/'
> according to the HTML specification:
>
>  http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.2
>
> Unfortunately, many Tidy problems related to script contents are still
> unsolved. I will update the DITidy port as soon as fixes become available.
>
> >3. When I pass Unicode HTML source code to Tidy, will it check the META
> >charset settings in the head section?
> >I think it should not check that, or it is processing an HTML file.
>
> According to the HTML specs, META charset information should change the
> parser's character decoding regardless of its initial setting.
>
> >4. I continued testing!
> >   Removing the source code piece by piece, I finally found:
> >
> >If there is no head, all of the output will be empty
> >
> >and it is caused by "</tbody></form></table>"
> >
> >for example:
> >==================================
> ><div> Test Content
> ></tbody></form></table>
> ></div>
> > ==================================
> >Tidy it, and the result is empty.
>
> For me it returns
>
> <html>
> <head>
> <title></title>
> </head>
> <body>
> <div>Test Content</div>
> </body>
> </html>
>
> >If I remove </form>, it outputs "<div>Test Content</div>"
>
> I get
>
> <html>
> <head>
> <title></title>
> </head>
> <body>
> <div>Test Content</div>
> </body>
> </html>
>
> Both look fine to me.
>
> >Please check the Tidy source code, thanks.
>
> Unfortunately, the latest changes in the Tidy source tree do not remedy the
> problems discussed here.
>
> Ralf
>
> _______________________________________________
> Delphi Inspiration mailing list
> yunqa@xxxxxxxxxxxxx
> //www.freelists.org/list/yunqa