[yunqa.de] Re: DITidy to process www.163.com

  • From: Bear Xu <bear.xy@xxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Sat, 29 Aug 2009 21:24:10 +0800

Hi Ralf,

Is there any news on the question from my last email?
How can I fix that?

thanks,

Bear

On Tue, Aug 25, 2009 at 2:31 PM, Bear Xu <bear.xy@xxxxxxxxx> wrote:

> Dear Ralf,
>
> My source code is in the attachment I sent to delphi@xxxxxxxx.
>
> I can remove the scripts from the HTML file first. I always use idHTTP to get
> the HTML source, remove the scripts, then tidy it via the Unicode function you
> provided. The core problem is that if there is a </form> before table etc.,
> the result comes out empty.
>
> If you fix this, tidying the 163.com source will be no problem.
> I use DITidy 2.0 with Delphi 2009 on Vista SP2.
>
> Thank you for your kind help.
>
> Bear
>
>
> On Sat, Aug 22, 2009 at 5:06 PM, Delphi Inspiration <delphi@xxxxxxxx> wrote:
>
>> At 07:18 20.08.2009, Bear Xu wrote:
>>
>> >1. I just copied the source code of the page in IE and pasted it into a
>> >Delphi TMemo, then used it to call the Tidy functions (the ones for
>> >processing Unicode HTML that you provided last time).
>>
>> Copying / pasting to Delphi's TMemo can easily result in an implicit
>> character conversion. Especially with Delphi 2009, we may not always want
>> TMemo to convert our 8-bit text to Unicode, because we must pass it on
>> unchanged to further processing. Developers should be very aware of this
>> when writing applications for charset-specific parsing like HTML and XML.
>>
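The conversion hazard described above can be illustrated outside Delphi. The following is a minimal sketch in Python, not TMemo's actual behaviour: the sample text, the GB2312 source charset and the Latin-1 "wrong guess" are all assumptions chosen for illustration.

```python
# Hypothetical sample: two Chinese characters encoded as GB2312
# (the charset is an assumption made for this illustration).
raw = "网易".encode("gb2312")

# Decoding with the wrong 8-bit charset succeeds silently but yields other text.
wrong = raw.decode("latin-1")

# Re-encoding that text as UTF-8 no longer reproduces the downloaded bytes.
damaged = wrong.encode("utf-8")

# Keeping the data as bytes passes it on unchanged.
untouched = bytes(raw)

print(damaged == raw)    # False
print(untouched == raw)  # True
```

The point is only that a text control which decodes on paste may hand different data to the parser than what was downloaded; keeping the bytes as bytes avoids that.
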
>> >2. I ran your source code, with the same result:
>> >
>> >all of the end tags are wrong!! ==>
>> >
>> ><\/table>
>> ><\/center><\/div>
>> ><\/div>
>> >
>> ></div> became <\/div>
>> >I do not know why.
>>
>> Tidy has known problems with '/script>' inside of script element contents,
>> for example:
>>
>>
>> http://sourceforge.net/tracker/?func=detail&aid=2712780&group_id=27659&atid=390963
>>
>> Your HTML contains a script element with the following content:
>>
>>  document.write("<script type='text/javascript' src='http://61.135.253.47/ipquery'><" + "/script>");
>>
>> Unfortunately, Tidy does not recognize that '/script' is part of a quoted
>> string and gets confused by it. It fails to determine the correct end of
>> the script contents and outputs the subsequent HTML closing tags as script
>> contents. Cleaning this up leads it to escape the forward slashes with '\/'
>> according to the HTML specification:
>>
>>  http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.2
>>
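A minimal sketch, in Python, of the escaping rule just mentioned, to show why the closing tags come out as <\/div> once they are treated as script content. The function name escape_etago is hypothetical and is not part of Tidy or DITidy.

```python
def escape_etago(script_text: str) -> str:
    """Escape every "</" so it cannot be read as the end of the script element."""
    return script_text.replace("</", "<\\/")

# The closing tags that were mistaken for script content in the report above.
swallowed = "</table>\n</center></div>\n</div>"
print(escape_etago(swallowed))
```

Applied to real JavaScript this escaping is harmless, because "\/" inside a string literal is just "/". Applied to HTML that was mistaken for script content, it produces the broken end tags shown above.
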
>> Unfortunately, many Tidy problems related to script contents are still
>> unsolved. I will update the DITidy port as soon as fixes become available.
>>
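Until such fixes exist, removing script elements before tidying is one possible workaround. The following is a minimal sketch in Python rather than Delphi; strip_scripts and the sample page are hypothetical, and the regular expression assumes that no script string contains a literal "</script".

```python
import re

# Matches one script element, from its opening tag to the first literal "</script>".
# Assumption: "</script" never appears verbatim inside a script string.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

def strip_scripts(html: str) -> str:
    """Return html with every script element removed."""
    return SCRIPT_RE.sub("", html)

# Hypothetical page wrapped around the document.write call quoted above.
page = (
    "<div>before</div>"
    "<script>document.write(\"<script type='text/javascript' "
    "src='http://61.135.253.47/ipquery'><\" + \"/script>\");</script>"
    "<div>after</div>"
)
cleaned = strip_scripts(page)
print(cleaned)
```

Because the page splits the end tag into "<" + "/script>", the pattern skips over it and stops at the real </script>, so the whole element is removed before the markup reaches the tidy step.
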
>> >3. When I pass Unicode HTML source code to Tidy, will it check the
>> >META charset settings in the head section?
>> >I think it should not check that, or it is processing an HTML file.
>>
>> According to the HTML specs, META charset information should change the
>> parser's character decoding regardless of its initial setting.
>>
>> >4. I continued testing!
>> >   Removing the source code piece by piece, I finally found:
>> >
>> >If there is no head, all of the output will be empty,
>> >
>> >and it is caused by "</tbody></form></table>"
>> >
>> >for example:
>> >==================================
>> ><div> Test Content
>> ></tbody></form></table>
>> ></div>
>> >==================================
>> >Tidy it, and the result is empty.
>>
>> For me it returns
>>
>> <html>
>> <head>
>> <title></title>
>> </head>
>> <body>
>> <div>Test Content</div>
>> </body>
>> </html>
>>
>> >If I remove </form>, it outputs "<div>Test Content</div>"
>>
>> I get
>>
>> <html>
>> <head>
>> <title></title>
>> </head>
>> <body>
>> <div>Test Content</div>
>> </body>
>> </html>
>>
>> Both look fine to me.
>>
>> >Please check the Tidy source code, thanks.
>>
>> Unfortunately, the latest changes in the Tidy source tree do not remedy
>> the problems discussed here.
>>
>> Ralf
>>
>> _______________________________________________
>> Delphi Inspiration mailing list
>> yunqa@xxxxxxxxxxxxx
>> //www.freelists.org/list/yunqa
>>
>
