Hi Ralf,

What about the question in my last email? How can this be fixed?

Thanks,
Bear

On Tue, Aug 25, 2009 at 2:31 PM, Bear Xu <bear.xy@xxxxxxxxx> wrote:

> Dear Ralf,
>
> See the attachment sent to delphi@xxxxxxxx for my source code.
>
> I can remove the scripts from the HTML file first. I always use idHTTP to
> get the HTML source, remove the scripts, then tidy it via the Unicode
> function you provided. The core problem is that if there is a </form>
> before table etc., the result is empty.
>
> If you fix this, tidying the 163.com source will be no problem.
> I use DITidy 2.0 with Delphi 2009 on Vista SP2.
>
> Thank you for your kind help.
>
> Bear
>
>
> On Sat, Aug 22, 2009 at 5:06 PM, Delphi Inspiration <delphi@xxxxxxxx> wrote:
>
>> At 07:18 20.08.2009, Bear Xu wrote:
>>
>> >1. I just copied the source code of the page in IE and pasted it into a
>> Delphi TMemo, and used it to call the Tidy functions (the ones for
>> processing Unicode HTML that you provided to me last time)
>>
>> Copying / pasting to Delphi's TMemo can easily result in an implicit
>> character conversion. Especially for Delphi 2009, we may not always want
>> TMemo to convert our 8-bit text to Unicode because we must pass it on
>> unchanged to further processing. Developers should be very aware of this
>> when writing applications for char-set specific parsing like HTML and XML.
>>
>> >2. I ran your source code, with the same result:
>> >
>> >all of the end tags are wrong!! ==>
>> >
>> ><\/table>
>> ><\/center><\/div>
>> ><\/div>
>> >
>> ></div> == became ==><\/div>
>> >I do not know why.
>>
>> Tidy has known problems with '/script>' inside of script element contents,
>> for example:
>>
>> http://sourceforge.net/tracker/?func=detail&aid=2712780&group_id=27659&atid=390963
>>
>> Your HTML contains a script element with the following content:
>>
>> document.write("<script type='text/javascript' src='
>> http://61.135.253.47/ipquery'><" + "/script>");
>>
>> Unfortunately, Tidy does not recognize that '/script' is part of a quoted
>> string but gets all confused about it.
>> It fails to determine the correct end of the script contents and outputs
>> subsequent HTML closing tags as script contents. Cleaning this up leads it
>> to escape the forward slashes with '\/' according to the HTML
>> specification:
>>
>> http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.2
>>
>> Unfortunately, many Tidy problems related to script contents are still
>> unsolved. I will update the DITidy port as soon as fixes become available.
>>
>> >3. When I pass Unicode HTML source code to Tidy, will it check the
>> META charset settings in the head section?
>> >I think it should not check that, or it is processing an HTML file.
>>
>> According to the HTML specs, META charset information should change the
>> parser's character decoding regardless of its initial setting.
>>
>> >4. I continued testing!
>> > Removing the source code piece by piece, I finally found:
>> >
>> >If there is no head, all of the output will be empty
>> >
>> >and it is caused by "</tbody></form></table>"
>> >
>> >for example:
>> >==================================
>> ><div> Test Content
>> ></tbody></form></table>
>> ></div>
>> > ==================================
>> >tidy it, and the result is empty
>>
>> For me it returns
>>
>> <html>
>> <head>
>> <title></title>
>> </head>
>> <body>
>> <div>Test Content</div>
>> </body>
>> </html>
>>
>> >if I remove </form>, it will output "<div>Test Content</div>"
>>
>> I get
>>
>> <html>
>> <head>
>> <title></title>
>> </head>
>> <body>
>> <div>Test Content</div>
>> </body>
>> </html>
>>
>> Both look fine to me.
>>
>> >please check the Tidy source code, thanks
>>
>> Unfortunately, the latest changes in the Tidy source tree do not remedy
>> the problems discussed here.
>>
>> Ralf
>>
>> _______________________________________________
>> Delphi Inspiration mailing list
>> yunqa@xxxxxxxxxxxxx
>> //www.freelists.org/list/yunqa
>>
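
For reference, the workaround mentioned in this thread (removing script elements before handing the HTML to Tidy) and the '\/' escaping from HTML 4.01 appendix B.3.2 could be sketched roughly as below. This is only an illustration in Python, not DITidy or Tidy code: the names strip_scripts and escape_end_tags are hypothetical, and the regular expression assumes that the literal text "</script" never occurs inside a script body (which holds for the document.write example above, because it splits the tag into "<" + "/script>").

```python
import re

# Matches one whole script element, case-insensitively and across line breaks.
# Assumption: "</script" does not appear literally inside the script body.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Return html with every script element (tags and body) removed."""
    return SCRIPT_RE.sub("", html)


def escape_end_tags(script_body: str) -> str:
    """Put a backslash between '<' and '/', as HTML 4.01 B.3.2 suggests."""
    return script_body.replace("</", "<\\/")


if __name__ == "__main__":
    page = (
        "<div>before</div>"
        "<script type='text/javascript'>"
        "document.write(\"<script src='x.js'><\" + \"/script>\");"
        "</script>"
        "<div>after</div>"
    )
    # The script element is dropped; the surrounding markup is untouched.
    print(strip_scripts(page))      # <div>before</div><div>after</div>
    # This is the transformation seen in the broken output quoted above.
    print(escape_end_tags("</div>"))  # <\/div>
```

A regular expression is not a full HTML parser, so this is only suitable as a pre-cleaning step for pages where the assumption above holds.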