[Linux-Anyway] Re: On monitoring Web pages....

  • From: Horror Vacui <horrorvacui@xxxxxxx>
  • To: Linux-Anyway@xxxxxxxxxxxxx
  • Date: Sat, 17 Apr 2004 15:47:29 +0200

On Sat, 17 Apr 2004 02:57:35 +0200
Horror wrote:

> 
> By the way: while browsing through "The Perl Cookbook" in the
> bookstore today, I saw a script doing something similar (download HTML
> source, extract string and print) - it'd be easy to write one that
> stores the source and compares it periodically (cron). Would that do
> the trick, or am I missing something?
>  

OK, I wrote a semi-functional script to do this (attached at the end of
the message). If you're interested, you can try it from the command line,
or have it run by cron. It's utterly simple, but it may work if the pages
you're tracking are simple too.
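For the cron route, a crontab entry along these lines would run the check hourly (the path /usr/local/bin/trackweb.pl and the URL are just placeholders - adjust to wherever you save the script):

```
0 * * * *  /usr/local/bin/trackweb.pl http://example.org/index.html
```

Cron mails the script's stdout to the crontab's owner by default, so even without the sendmail block below you'd get the "has changed" message delivered, provided local mail works.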

It's my third (or thereabouts) perl script ever, so if anyone more
proficient at perl scripting (sideglance at Godwin) starts laughing at
the sight of it, well, have a good laugh ;)

Cheers

-- 
Horror Vacui

Registered Linux user #257714

Go get yourself... counted: http://counter.li.org/
- and keep following the GNU.


--------the script---------------------------
#!/usr/bin/perl

use strict;
use warnings;

########################################################################
# Variables to customise the script:
# Directory to store the cache files, make sure it exists and is writable:
my $cachedir = "/var/trackweb";
# Mail address to notify:
my $mailto = 'meph@xxxxxxxxxxx';
########################################################################

my @pagesource;
my @oldpagesource;

unless ( @ARGV )
{
        print <<EOF;
Usage:
$0 <URL list>
- where <URL list> is a whitespace-separated list of one or more URLs
you want to track.
EOF
        exit 1;
}

# Do some moves for each URL given as an argument to the script:
foreach ( @ARGV )
{

# Reset the arrays, so a missing cache for one URL doesn't silently
# reuse the previous URL's contents:
        @pagesource = ();
        @oldpagesource = ();

# Derive a safe cache file name from the URL - the slashes and colons in
# the URL would otherwise be taken as path components:
        ( my $cachefile = $_ ) =~ s{[^\w.-]}{_}g;
        $cachefile = "$cachedir/$cachefile.cache";

# First, fetch the source with w3m and store it in the @pagesource
# array. The list form of open keeps the URL away from the shell:
        open ( WEBSITE, "-|", "w3m", "-dump_source", $_ )
                or die "can't run w3m:\n".$!;
        @pagesource = <WEBSITE>;
        close WEBSITE;

# Check if there's a file containing cached source of the page, store it
# in the @oldpagesource array
        if ( -f $cachefile )
        {
                open ( CACHEFILE_IN, "<", $cachefile )
                        or die "can't read file $cachefile:\n".$!;
                @oldpagesource = <CACHEFILE_IN>;
                close CACHEFILE_IN;
        }

# This is a bad thing: we're only checking if the page source is
# identical to the cached source - if the page sports dynamic content,
# we're screwed. This can be fixed (to check only certain parts of it,
# say, disregarding the rest) if you know what you're looking for.
# Note that comparing two arrays with eq would only compare their
# lengths (eq forces scalar context), so join the lines into strings
# first:
        unless ( join( "", @oldpagesource ) eq join( "", @pagesource ) )
        {
                if ( -f $cachefile )
                {

# Here comes the notification: pipe the message to sendmail. I can't
# test this myself (sendmail conflicts with qmail on my gentoo), so it
# stays commented out and the script just prints a message for now -
# uncomment and correct the following block if you know better:
#
#                       open ( MAIL, "|-", "sendmail", $mailto )
#                               or die "can't run sendmail:\n".$!;
#                       print MAIL "Subject: Website $_ changed\n\n";
#                       print MAIL "The website $_ has changed\n";
#                       close MAIL;
                        print "The website $_ has changed\n";
                }
                open ( CACHEFILE_OUT, ">", $cachefile )
                        or die "can't write file $cachefile:\n".$!;
                print CACHEFILE_OUT @pagesource;
                close CACHEFILE_OUT;
        }
}
To unsubscribe send e-mail with the word unsubscribe in the body to:
Linux-Anyway-Request@xxxxxxxxxxxxx?body=unsubscribe