[JA] Saving Web Pages to Hard Drive

  • From: thepccat@xxxxxxxx
  • To: internet-l@xxxxxxxxxxxxx, juno_accmail@xxxxxxxxxxxxx
  • Date: Mon, 15 Jul 2002 08:54:32 -0700

I am still learning [aren't we all?]. This time I learned the hard way
that saving a webpage as a .mht "Web archive, single file" is a bad idea,
since it can only be opened by the same major version of IE that created
it [how rotten is that?]. I'd like to get smart on saving web pages to my
hard drive compactly, simply, and without losing information.

OK, we see stuff on the Internet, and want to save it to our HD, because
we want it right here for use, and because the way things are going,
much is "here today, gone tomorrow." The goal of saving, naturally, is to
preserve the information -- but doing that is neither easy nor obvious
[at least to me].

[1] I think the most compact means is to copy the portion of the web
page, and paste to a text editor, then copy and paste the URL so it is
also saved. You get all the text and the source URL. However, here you
lose the formatting, including [often] spaces between paragraphs [which
you can manually add back]. If you have a text editor like NoteTab Lite,
you can press F5 and insert a time/date stamp, in case you make future
revisions to the document. Relative html links stand out because they
have no "http://www.someplace.com" stuff, but start right out with
"/folder/subfolder/item.html" so you know the true URL is
"http://www.someplace.com/folder/subfolder/item.html".
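
Working out the true URL of a relative link by hand, as described above,
can also be done mechanically. Here is a minimal sketch in Python using
the standard library's urljoin; the page address and link names are made
up for illustration, and absolute_url is just a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_url(page_url: str, link: str) -> str:
    """Resolve a link found on a page against that page's own URL."""
    return urljoin(page_url, link)


# Hypothetical address of the page the text was copied from.
page = "http://www.someplace.com/news/index.html"

# A link starting with "/" is relative to the site root.
print(absolute_url(page, "/folder/subfolder/item.html"))
# A bare file name is relative to the page's own folder.
print(absolute_url(page, "item2.html"))
# A link that is already absolute comes back unchanged.
print(absolute_url(page, "http://www.elsewhere.org/a.html"))
```

This only works if you saved the page's URL along with the text, which
is one more reason to paste it in.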

[2] Next most compact might be to do the same as above, except paste
either to WordPad or Word [in my case Word 97, your mileage may vary
among versions]. You save some of the formatting but lose anything which
is a graphic.
* WordPad preserves the data and some of the formatting, rendering the
links with the URL included [however they are not "live": clicking on
them does not launch the URL -- you have to copy and paste it into your
browser]. Saving WordPad as .doc does not retain centering, whereas
saving as .rtf does -- which you choose is up to you; for me, centering
on the page is a hindrance.
* Pasting into Word and saving as either .doc or .rtf saves the most
formatting, including making the links "live" -- you cannot see the URL,
but if you click on a link your browser tries to fetch it. To "see" the
URL you right click on the link and choose Hyperlink>Edit Hyperlink. I
prefer WordPad saved as .doc for my purposes.

[3] Using IE and choosing Save As "Web Page, HTML only" will save the
formatting and miss any graphics [sometimes big text is provided as
graphics]. However, you cannot see the URL, and I don't know a simple way
to add it to the source; does anyone? I believe if the HTML page uses a
relative reference to a URL, then the URL is corrupted because it
becomes relative to your folder on your HD rather than relative to the
URL of the web site.
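
One possible way to add the URL to the saved source is sketched below in
Python. It is only a sketch under my own assumptions: tag_with_source and
tag_file are hypothetical names, and the "saved from" comment format is
invented here. It writes the address into the file as an HTML comment
and also adds a <base href> tag, which tells the browser to resolve
relative links against the original site rather than your HD folder.

```python
import re
from html import escape
from pathlib import Path


def tag_with_source(html: str, url: str) -> str:
    """Return html with a source comment and a <base href> tag added."""
    note = (f"\n<!-- saved from: {url} -->"
            f'\n<base href="{escape(url, quote=True)}">')
    # Find the opening <head> tag (any case), but not e.g. <header>.
    match = re.search(r"<head(\s[^>]*)?>", html, flags=re.IGNORECASE)
    if match:
        return html[:match.end()] + note + html[match.end():]
    # Fallback when the page has no <head>: put the note at the very top.
    return note.lstrip("\n") + "\n" + html


def tag_file(path: str, url: str) -> None:
    """Rewrite a saved page in place, recording where it came from."""
    page = Path(path)
    # latin-1 maps every byte to one character, so bytes round-trip intact.
    text = page.read_text(encoding="latin-1")
    page.write_text(tag_with_source(text, url), encoding="latin-1")
```

Usage would be something like tag_file("item.htm", "http://..."), with
the address copied from the browser at the time of saving. Work on a
copy first; a URL containing "--" would also upset the comment.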

[4] Using IE and choosing Save As "Web Archive, Single File" .mht will
sometimes fail and sometimes succeed, and there may be problems with
relative URL references per above, but you do get the graphics in one
file. 
* HOWEVER, try upgrading from IE 5 to IE 6, and you will find that the
HTML is not rendered and you essentially are viewing the source -- this
means that all that good information you tried to save is like a
needle:
-----------------
Jackass JoeJoe's Eventlog Dump [http://jackass.arsware.org/eld.shtml]
Eventlog Dump is a simple batch utily that reads a remote NT/2K eventlog
and dumps it into a comma delimited, tab delimited, or XML file. Read
more...[http://www.megapathdsl.net/~yandl/]
-----------------
hidden in:
~~~~~~~~~~~~~
<BR>
<BR>
<CENTER><HR COLOR="#efefef" WIDTH="100%" SIZE="1" NOSHADE></CENTER>
<BR>

<FONT COLOR="#000000" FACE="verdana, geneva, arial, sans-serif" SIZE="2">
<CENTER><A HREF="http://jackass.arsware.org/eld.shtml"
TARGET="_new"><B>Jackass JoeJoe's Eventlog Dump</B></A></CENTER>
</FONT>
Eventlog Dump is a simple batch utily that reads a remote NT/2K eventlog
and dumps it into a comma delimited, tab delimited, or XML file. <A
HREF="http://jackass.arsware.org/eld.shtml" TARGET="_new"><B>Read
more...</B></A>

<BR>
<BR>
<CENTER><HR COLOR="#efefef" WIDTH="100%" SIZE="1" NOSHADE></CENTER>
<BR>
~~~~~~~~~~~~
* Of course, one *could* display this code in NoteTab Lite and choose
Modify>Strip HTML Tags> [either remove all tags, or preserve URL's] --
then the graphics are still blocks of nonsense text, and the text is
somewhat cluttered, but it is better than what you had before. What
p****s me off is that I did not know that upgrading IE would render my
neat "single file web archives" back into gibberish!
<rant, rave... ahh!> 
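
For anyone without NoteTab, the same kind of tag stripping with the URLs
preserved can be sketched in Python with the standard html.parser
module. This is my own rough approximation, not a description of how
NoteTab does it; TextWithLinks and strip_tags are hypothetical names.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect page text, appending each link's URL in [brackets]."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag in ("br", "p", "hr"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.parts.append(f" [{self._href}]")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text of html with tags removed and link URLs kept."""
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Tidy up: squeeze runs of spaces and drop lines left empty.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Fed the HTML above, it should give back something close to the "needle"
version. It does not decode the encoded graphics inside a .mht file, and
any script or style text in the page would come through as clutter.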

[5] A final choice for Save As in IE is "Web Page Complete" .htm, .html.
This saves, in the same directory, the html page and a folder with the
same name containing all the graphics. This is pretty nice [and large],
except you still don't have the originating URL, and the folder could get
lost later if you are not aware it needs to be kept with the html file. I
wonder if this choice would also have the problem of corrupting relative
html links in the page. I suppose one could save a text file with the
same name, containing the URL. Is there any simple means to insert the
URL dependably into the html file?
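
The idea of a text file with the same name holding the URL is easy to
automate. A minimal sketch in Python follows; write_url_sidecar is a
hypothetical helper, and the "Source:" and "Saved:" line layout is just
my own choice.

```python
from datetime import datetime
from pathlib import Path


def write_url_sidecar(html_path, url, when=None):
    """Write <name>.txt next to a saved page, recording its source URL."""
    page = Path(html_path)
    # Same folder and base name as the page, with a .txt extension.
    sidecar = page.with_suffix(".txt")
    stamp = (when or datetime.now()).strftime("%Y-%m-%d %H:%M")
    sidecar.write_text(f"Source: {url}\nSaved: {stamp}\n", encoding="utf-8")
    return sidecar
```

For a page saved as item.htm this writes item.txt beside it. Like the
graphics folder, the text file can still get separated from the page
later, so this is a convenience rather than a dependable fix.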

[6] One could choose to save web pages offline through IE, but I have not
tried this. I wonder if one could move these from folder to folder [for
example to archive the information] and preserve the relative links?

[TO CONCLUDE] A prime goal of using the Internet is to find and save
information. Some used to feel that once the page was there, it would
*always* be there. This assumption has not been valid for some time, and
we packrats have been storing up bits and pieces right and left. Does
anyone have suggestions on better ways to save Webpage information [to
capture all the information, and have it immune to destruction when one
upgrades one's &$^$#% Browser]? I'm all eyes!

thepccat
