[yunqa.de] Re: DIHtmlParser and Entities

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Thu, 09 Jul 2009 21:23:25 +0200

At 17:44 09.07.2009, Mike Dixon wrote:

>I'm trying to use the DIHtmlParser, DIHtmlCasePlugin, and DIHtmlWriterPlugin
>to simply convert HTML tags, attributes, etc to lowercase - all of that
>works fine.
>
>My problem is that entities are being converted to their single character
>equivalents.

This is by no means a problem in HTML. According to the specification, 
characters and the entities that reference them are equivalent in HTML. 
DIHtmlParser decodes entities to the characters they represent, so the page 
renders the same when viewed with web browsers.

>In other words, they're being decoded and I don't want them to be.

Whether you want this or not is a separate matter. To approach it, it helps 
to understand how DIHtmlParser writes out text characters. In 
TDIHtmlWriterPlugin, this is a 3-step process:

1. Try to represent the character with the current character encoding.

2. If 1. fails (because the character is not available),
   try to escape it with a registered named entity.

3. If 2. fails as well (because there is no such named entity), 
   use a numeric entity instead. This will finally succeed because
   all chars can be represented through numeric entities.

This algorithm was deliberately chosen because it produces the shortest 
possible output.
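
The three steps above can be sketched as follows. This is only an 
illustration in Python, not DIHtmlParser code: the function write_char is a 
hypothetical name, and Python's html.entities.codepoint2name table (the HTML 4 
named entities) stands in for the registered encoding entities. Escaping of 
markup characters such as '&' and '<' is left out to keep the sketch short.

```python
from html.entities import codepoint2name


def write_char(ch, encoding, named_entities=codepoint2name):
    """Hypothetical stand-in for the three-step character output."""
    # Step 1: emit the character itself if the target encoding has it.
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        pass
    # Step 2: otherwise fall back to a registered named entity, if any.
    name = named_entities.get(ord(ch))
    if name is not None:
        return "&" + name + ";"
    # Step 3: a numeric entity can represent every character.
    return "&#%d;" % ord(ch)
```

Under these assumptions, write_char("\u00a9", "iso-8859-1") returns the 
character unchanged, while write_char("\u00a9", "ascii") falls through to 
step 2 and returns the named entity.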

>For example, &#169; is converted to C.

Corretly so (more precisely, it is the (c) copyright sign we are talking 
about). This happens because #169 is part of the HTML default character set 
ISO-8851-1, as well as of many others. Hence step 1. above immediately 
succeeds, producing the shortest possible output of one byte length only.

To suppress outputting the (c) copyright sign character, you must pick a 
character encoding which does not contain #169. The character set with the 
smallest number of characters is US-ASCII. If you choose this, 
TDIHtmlWriterPlugin will output '&copy;' instead of the (c) copyright sign 
character according to step 2. above.

Again, you might not like '&copy;' but prefer '&#169;'. In this case, do not 
register '&copy;' as an encoding entity. This causes step 2. to fail and step 
3. to output the numeric entity. You can remove selected named entities from 
the encoding registry with the UnRegisterEncodingEntity procedure.
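
To illustrate the effect of both choices (encoding and entity registry), here 
is a small Python simulation. It does not use or mirror the DIHtmlParser API: 
the class EncodingEntities and its unregister method are hypothetical 
stand-ins for the encoding registry and for UnRegisterEncodingEntity, and the 
entity table is again Python's html.entities.codepoint2name.

```python
from html.entities import codepoint2name


class EncodingEntities:
    """Hypothetical registry of named entities used when writing text."""

    def __init__(self):
        # Private copy, so removing an entry does not affect the stdlib table.
        self.names = dict(codepoint2name)

    def unregister(self, name):
        # Drop every code point that maps to this entity name.
        self.names = {cp: n for cp, n in self.names.items() if n != name}

    def escape(self, text, encoding):
        out = []
        for ch in text:
            try:
                ch.encode(encoding)      # step 1: character fits the encoding
                out.append(ch)
            except UnicodeEncodeError:
                name = self.names.get(ord(ch))
                if name:                 # step 2: named entity is registered
                    out.append("&%s;" % name)
                else:                    # step 3: numeric entity
                    out.append("&#%d;" % ord(ch))
        return "".join(out)


registry = EncodingEntities()
with_name = registry.escape("\u00a9 2009", "ascii")     # uses '&copy;'
registry.unregister("copy")
numeric = registry.escape("\u00a9 2009", "ascii")       # uses '&#169;'
unchanged = registry.escape("\u00a9 2009", "iso-8859-1")  # character kept
```

With US-ASCII as the target, the copyright sign comes out as the named entity 
until that name is removed from the registry, after which the numeric entity 
is written. With ISO-8859-1 the character is written as is in either case.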

Ralf 

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa


