[yunqa.de] parsing web pages without a charset tag

  • From: Rael Bauer <rael.bauer@xxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Wed, 9 Jun 2010 19:20:13 +0200

Hi,

My understanding is that with the HTMLParser, the ReadMethod (of the Parser)
and the WriteMethod (of the WriterPlugin) always need to be set up.

If a web page does not define a charset (in the meta tag), this leads to
problems in the parsed output. In my experience, some such pages are
ISO-8859-1 while others are UTF-8 (there may be other encodings as well, but
these seem to be the most common).

1. Do you have any recommendations on how to handle this situation?

2a. It would be nice if the HTMLParser could somehow detect which encoding is
in use (I have no problem with the HTMLParser parsing the stream twice...; a
rough sketch of the kind of detection I have in mind follows below), or

2b. It would also be very useful if the HTMLParser had a "no touch" mode: it
would still use the defined Read/Write methods to interpret the parsed tags,
but it would leave the page content itself (anything outside the HTML tags)
untouched, simply reading it and writing it back without any modification.
This would significantly help bridge the gap between Latin-1 and UTF-8
encoded pages, since I imagine the HTML tag content usually differs little
(if at all) between the two charsets; it is only the page content that is
likely to use the parts of the character sets where they differ.
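
For 2a, the detection could be as simple as the standalone check sketched
below. This is only an illustration in plain Object Pascal (the function name
is made up and no DIHtmlParser calls are involved): scan the raw bytes once,
and if every byte above $7F fits into a valid UTF-8 sequence, treat the page
as UTF-8; otherwise fall back to ISO-8859-1.

program SniffCharset;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Rough heuristic: True if the byte stream decodes as well-formed UTF-8.
// Pages that fail this test would be assumed to be ISO-8859-1.
// (Overlong sequences etc. are not rejected; this is only a sketch.)
function LooksLikeUtf8(const Bytes: array of Byte): Boolean;
var
  i, Remaining: Integer;
  b: Byte;
begin
  Result := False;
  Remaining := 0;
  for i := 0 to High(Bytes) do
  begin
    b := Bytes[i];
    if Remaining > 0 then
    begin
      // Continuation bytes must match 10xxxxxx.
      if (b and $C0) <> $80 then Exit;
      Dec(Remaining);
    end
    else if b >= $80 then
    begin
      // The lead byte tells us how many continuation bytes follow.
      if (b and $E0) = $C0 then Remaining := 1
      else if (b and $F0) = $E0 then Remaining := 2
      else if (b and $F8) = $F0 then Remaining := 3
      else Exit; // not a legal UTF-8 lead byte
    end;
  end;
  // A truncated trailing sequence also disqualifies the stream.
  Result := Remaining = 0;
end;

var
  Stream: TMemoryStream;
  Bytes: array of Byte;
begin
  // Load the raw page bytes (here from a file; a downloaded stream
  // would work the same way).
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(ParamStr(1));
    SetLength(Bytes, Stream.Size);
    if Length(Bytes) > 0 then
      Move(Stream.Memory^, Bytes[0], Length(Bytes));
  finally
    Stream.Free;
  end;

  if LooksLikeUtf8(Bytes) then
    Writeln('probably UTF-8')
  else
    Writeln('probably ISO-8859-1 (or another single-byte charset)');
end.

The outcome of such a check could then be used to pick the appropriate
ReadMethod before parsing.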

Thanks
Rael
