[yunqa.de] Re: DIHTMLParser and msdn

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Fri, 10 Oct 2008 12:12:54 +0200

Rael Bauer wrote:

>I could not find any attached file to previous message?

I sent the message twice; the 2nd posting contains the attached file. Other 
than that, it is identical to the 1st posting, where I forgot the attachment. 
Sorry about that.

Important note: Attached files do not make it to the public archives. In case 
you missed retrieving the message via e-mail, you can temporarily download it 
from here (the file will eventually be deleted without further notice):

  http://www.yunqa.de/delphi/downloads/DIHtmlParser_WebDownload.zip

>>These changes solve almost all the problems you experienced. However, since 
>>the 
>>demo does not execute the many JavaScript scripts involved in rendering the 
>>above page, it does not download and save dynamically loaded content.
> 
>Ok. Do you have any information on how to achieve this? I realise this is not 
>the responsibility of DIHTMParser, however, I imagine a common use of 
>DIHTMLParser is to download a complete webpage, so perhaps you have some 
>information on this topic, or maybe there is even a plug-in for this?

The answer depends on how you define "complete webpage". With the fixed 
version of the DIHtmlParser_WebDownload demo, the web page downloads completely 
as specified by the HTML contents. On the other hand, it is still incomplete 
because DIHtmlParser does not run the JavaScript which generates extra content 
by downloading additional information.

As another example, think of HTML pages which embed streaming media. They, 
too, download completely but do not play the audio/video because of missing 
plugins and online access.

Downloading web pages for 100% identical offline viewing requires at 
least:

1. A CSS parser to extract files linked via CSS (see the sketch after this 
list).

2. A JavaScript engine to process an HTML DOM structure and download additional 
information if necessary.

3. Analyzer code for installed browser plugins to extract links hidden in 
applets.

4. Possibly more?
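
For step 1, here is a minimal sketch of a CSS link extractor in plain Delphi. 
It is my own illustration, not part of DIHtmlParser: the function name is 
invented, and the naive scan ignores CSS comments and escape sequences:

  uses
    SysUtils, StrUtils, Classes;

  { Collect the targets of url(...) tokens from a CSS text.
    Optional single or double quotes around the URL are stripped. }
  procedure ExtractCssUrls(const Css: string; Urls: TStrings);
  var
    Lower, Url: string;
    P, Q: Integer;
  begin
    Lower := LowerCase(Css);
    P := PosEx('url(', Lower, 1);
    while P > 0 do
    begin
      Q := PosEx(')', Css, P + 4);
      if Q = 0 then
        Break;
      Url := Trim(Copy(Css, P + 4, Q - P - 4));
      if (Length(Url) >= 2) and ((Url[1] = '''') or (Url[1] = '"')) then
        Url := Copy(Url, 2, Length(Url) - 2);
      if Url <> '' then
        Urls.Add(Url);
      P := PosEx('url(', Lower, Q + 1);
    end;
  end;

Run it over every downloaded .css file and every <style> block, resolve the 
collected URLs against the base URL, and download them like the other linked 
files in the demo.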

Downloading as browsers do is far beyond what DIHtmlParser is capable of. 
Dedicated web crawler applications should do the job, but I did not find one 
which performs steps 2 and 3 as needed for your page. I tested HTTrack (which 
is supposed to parse JavaScript for URLs), but it did not do any better than 
DIHtmlParser. I expect others to fail just as well.

In short, you have a long way to go to download your page exactly as browsers 
do. IMHO, DIHtmlParser already does a pretty good job and the demo can serve as 
a starting point for more elaborate code. Running JavaScript seems necessary 
for your page, but I do not know of a Delphi-compatible JavaScript engine.

My best bet is to just use File -> Save As... from your favourite browser. To 
automate this, crawler extensions might exist which make use of the browser's 
advanced JavaScript and DOM features. Other than that, I am afraid, you are 
pretty much stuck with what you already have.
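
If the IE engine is acceptable, the Save As step can at least be automated 
from Delphi with the TWebBrowser control. A rough sketch, assuming a form 
with a TWebBrowser named WebBrowser1; the wait loop is deliberately crude:

  uses
    Forms, ActiveX, SHDocVw;

  { Let the browser engine run the page's JavaScript, then invoke
    its own Save As command (shows the standard save dialog). }
  procedure TForm1.SavePageAs(const Url: string);
  begin
    WebBrowser1.Navigate(Url);
    while WebBrowser1.ReadyState <> READYSTATE_COMPLETE do
      Application.ProcessMessages;
    WebBrowser1.ExecWB(OLECMDID_SAVEAS, OLECMDEXECOPT_PROMPTUSER);
  end;

This saves what the IE engine rendered, so script-generated content is 
included as far as IE handles the page; it is still not a general crawler.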

Ralf 

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa


