[pskmail] New screen scraper

  • From: Rein Couperus <rein@xxxxxxxxxxxx>
  • To: pskmail@xxxxxxxxxxxxx
  • Date: Sat, 14 Aug 2010 13:58:39 +0200 (CEST)

I noticed that most of the links providing content for PI4TUE and others were 
dead.
That prompted me to write a better web scraper, which is now available at 
http://hermes.esrac.ele.tue.nl/pskmail/utililities

The files are scraper.pl and scraper.cfg.

The new scraper works like the URL downloads in the pskmail client: you can 
define a *begin* and an *end* word to select a range of lines, and also the 
start and end column within each line.
So you can cut a rectangular text field from anywhere on the web page, getting 
rid of links, banners, ads and nav columns.
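
To illustrate the idea, here is a minimal sketch in Python (not the Perl code 
of scraper.pl itself). The function name cut_field and the exact semantics are 
assumptions: the block starts at the first line containing the begin word, 
stops before the next line containing the end word, and columns are counted 
from 0 with the end column excluded. The real script may count differently.

```python
def cut_field(text, begin_word, end_word, begin_col=None, end_col=None):
    """Cut a rectangular block of text out of a page.

    Assumed semantics (not taken from scraper.pl): output starts at the
    first line containing begin_word, stops before the next line containing
    end_word, and columns are 0-based with end_col exclusive.
    """
    block = []
    inside = False
    for line in text.splitlines():
        if not inside:
            if begin_word not in line:
                continue  # still above the block
            inside = True
        elif end_word in line:
            break  # reached the bottom of the block
        # Keep only the requested column range of this line.
        block.append(line[begin_col:end_col].rstrip())
    return block


# Made-up page text with a navigation column on the left.
page = "\n".join([
    "NAV | Home",
    "NAV | Weer vandaag: zon",
    "NAV | Morgen: regen",
    "NAV | Uitleg",
    "footer",
])
print(cut_field(page, "Weer", "Uitleg", 6, 60))
```

With the made-up page above, the call keeps the two forecast lines and drops 
the "NAV | " column, the header line and the footer.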

The downloads are now defined in a config file called scraper.cfg, which could 
look like:
danish_sea_areas,http://www.dmi.dk/eng/print/index/forecasts/forecast_for_sea_areas.htm,Forecast,http
dutch_wx,http://www.knmi.nl/waarschuwingen_en_verwachtingen/,Weer,Uitleg,3,60

Each line contains: filename, URL, begin word, end word, begin column, end column.
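
A line in this format could be read as in the following Python sketch. It is 
only an illustration, not how scraper.pl does it: the function name 
parse_config_line and the dictionary keys are hypothetical, the two column 
fields are assumed to be optional (the first example line has none), and the 
URL is assumed to contain no commas.

```python
def parse_config_line(line):
    """Split one scraper.cfg-style line into named fields.

    Assumptions: fields are comma-separated, the URL contains no commas,
    and the two column fields may be left out.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 4:
        raise ValueError("expected at least filename, url, begin word, end word")
    filename, url, begin_word, end_word = fields[:4]
    # Column fields are converted to integers when present, else None.
    begin_col = int(fields[4]) if len(fields) > 4 and fields[4] else None
    end_col = int(fields[5]) if len(fields) > 5 and fields[5] else None
    return {
        "filename": filename,
        "url": url,
        "begin_word": begin_word,
        "end_word": end_word,
        "begin_col": begin_col,
        "end_col": end_col,
    }


entry = parse_config_line(
    "dutch_wx,http://www.knmi.nl/waarschuwingen_en_verwachtingen/,Weer,Uitleg,3,60"
)
print(entry["filename"], entry["begin_word"], entry["end_word"],
      entry["begin_col"], entry["end_col"])
```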

Just start scraper.pl periodically with a cron job...

As an example, look at 
http://www.knmi.nl/waarschuwingen_en_verwachtingen/index.html ;
after scraping, this looks like the attached file...

73,

Rein PA0R


http://pa0r.blogspirit.com
