[yunqa.de] Re: DiHtmlParser problems and questions

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Thu, 08 Mar 2012 19:57:11 +0100

On 08.03.2012 00:50, Max Terentiev wrote:

> 1. My app uses DiHtmlParser to bulk-parse thousands of HTML pages.
> Some pages contain small errors in their HTML code and the parser
> can't handle them!
> 
> For example:
> 
> <a href="http://some.url><form action="add.php" method=post>
> 
> In the example above, the closing quote is missing after some.url, so
> ParseNextPiece() will NOT find the next <form> tag because it thinks
> it is still inside the href string!

The <a href="...> tag is missing the quotation mark which terminates the
href attribute value. This is required by the HTML specification and
DIHtmlParser needs it to determine the end of the attribute value.

> Can I get the parser to look for a closing > or /> inside attribute
> strings? That should help handle missing quotes!

We could do this, but then attribute values which legitimately contain
such characters (scripts, for example) would be terminated prematurely.
Accepting non-standard syntax would unfortunately introduce new
ambiguities.
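
To illustrate the trade-off, here is a minimal sketch in JavaScript. It is
not DIHtmlParser code: readQuotedValue and its lenient flag are
hypothetical names made up for this example. The function reads a
double-quoted attribute value and, in lenient mode, also stops at ">".

```javascript
// Hypothetical helper, NOT part of DIHtmlParser: reads a double-quoted
// attribute value whose opening quote is at index `start`.
// In lenient mode it also stops at '>' to recover from a missing quote.
function readQuotedValue(html, start, lenient) {
  let i = start + 1; // skip the opening quote
  while (i < html.length) {
    const c = html[i];
    if (c === '"') break;             // regular end of the value
    if (lenient && c === '>') break;  // non-standard recovery rule
    i++;
  }
  return html.slice(start + 1, i);
}

// The broken markup from the question: closing quote missing after the URL.
const broken = '<a href="http://some.url><form action="add.php" method=post>';
// Valid markup whose attribute value legitimately contains '>'.
const valid = '<a onclick="if (a > b) go()">';

const brokenLenient = readQuotedValue(broken, broken.indexOf('"'), true);
const validLenient = readQuotedValue(valid, valid.indexOf('"'), true);
const validStrict = readQuotedValue(valid, valid.indexOf('"'), false);
```

In this sketch the lenient rule recovers the URL from the broken tag, but
it also cuts the valid onclick value off after "if (a ", which is the
premature termination described above.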

> 2. Some pages make the parser go crazy! See the attached HTML file. In
> this example the parser processes the first 13 lines of HTML code
> successfully, but right after the lines
> 
> <script language="JavaScript"> function proverka1()
> 
> ParseNextPiece goes to the end of the HTML file (line 240) and skips
> all tags between line 13 and line 240. Why?

This is caused by a bug in DIHtmlParser's JavaScript parser. Right now
it does not recognize the end of the regular expression in line 19:

  if(d.add_url.url.value.search(/^.+\.ru[/]{0,1}.*$/i)==-1)

I will provide a fix with the next version. I also sent it to you off-list.
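
For background, a "/" inside a character class such as [/] does not end a
JavaScript regular expression literal, so a scanner has to track escapes
and character classes to find the real closing "/". Whether this is the
exact cause inside DIHtmlParser is an assumption on my part for this
sketch. The helper regexLiteralEnd below is a hypothetical name and is not
the actual fix.

```javascript
// Hypothetical helper, NOT DIHtmlParser code: returns the index just past
// the closing '/' of a regex literal whose opening '/' is at `start`.
// Flags after the closing '/' are not consumed. Returns -1 if unterminated.
function regexLiteralEnd(src, start) {
  let inClass = false;
  for (let i = start + 1; i < src.length; i++) {
    const c = src[i];
    if (c === '\\') { i++; continue; }  // skip the escaped character
    if (c === '\n') return -1;          // a literal cannot span lines
    if (inClass) {
      if (c === ']') inClass = false;   // leaving the character class
    } else if (c === '[') {
      inClass = true;                   // '/' is literal until ']'
    } else if (c === '/') {
      return i + 1;                     // the real end of the literal
    }
  }
  return -1;
}

// Line 19 of the attached file ('\\' in the string is a single backslash).
const line = 'if(d.add_url.url.value.search(/^.+\\.ru[/]{0,1}.*$/i)==-1)';
const start = line.indexOf('/');
const literal = line.slice(start, regexLiteralEnd(line, start));
// A naive search for the next '/' stops inside the character class instead.
const naive = line.slice(start, line.indexOf('/', start + 1) + 1);
```

Here the class-aware scan returns the whole pattern, while the naive
search ends at the "/" inside [/] and leaves the rest of the line to be
misread.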

Ralf