[yunqa.de] Re: DIHtmlParser and Entities

  • From: "Mike Dixon" <mike@xxxxxxxxxxx>
  • To: <yunqa@xxxxxxxxxxxxx>
  • Date: Thu, 9 Jul 2009 15:29:39 -0500

Thanks, I'll give that a try. &copy; instead of &#169; would be just fine. 

> -----Original Message-----
> From: yunqa-bounce@xxxxxxxxxxxxx 
> [mailto:yunqa-bounce@xxxxxxxxxxxxx] On Behalf Of Delphi Inspiration
> Sent: Thursday, July 09, 2009 2:23 PM
> To: yunqa@xxxxxxxxxxxxx
> Subject: [yunqa.de] Re: DIHtmlParser and Entities
> 
> At 17:44 09.07.2009, Mike Dixon wrote:
> 
> >I'm trying to use the DIHtmlParser, DIHtmlCasePlugin, and 
> >DIHtmlWriterPlugin to simply convert HTML tags, attributes, etc to 
> >lowercase - all of that works fine.
> >
> >My problem is that entities are being converted to their single 
> >character equivalents.
> 
> This is by no means a problem in HTML. According to the 
> specification, characters and entities are equivalent in HTML. 
> DIHtmlParser takes care to decode entities to their correct 
> character representation so they will result in the same page 
> rendering when viewed with web browsers.
> 
> >In other words, they're being decoded and I don't want them to be.
> 
> Wanting or not wanting is another issue. To approach this, it 
> helps to understand how DIHtmlParser writes out text 
> characters. In TDIHtmlWriterPlugin, this is a 3-step process:
> 
> 1. Try to represent the character with the current character encoding.
> 
> 2. If 1. fails (because the character is not available),
>    try to escape it with a registered named entity.
> 
> 3. If 2. fails as well (because there is no such named entity), 
>    use a numeric entity instead. This will finally succeed because
>    all chars can be represented through numeric entities.
> 
> This algorithm was deliberately chosen because it produces 
> the shortest possible output.
> 
> >For example, &#169; is converted to C.
> 
> Correctly so (more precisely, it is the (c) copyright sign we 
> are talking about). This happens because #169 is part of the 
> HTML default character set ISO-8859-1, as well as of many 
> others. Hence step 1. above immediately succeeds, producing 
> the shortest possible output, a single byte.
> 
> To suppress outputting the (c) copyright sign character, you 
> must pick a character encoding which does not contain #169. 
> The character set with the smallest number of characters is 
> US-ASCII. If you choose this, TDIHtmlWriterPlugin will output 
> '&copy;' instead of the (c) copyright sign character 
> according to step 2. above.
> 
> Again, you might not like '&copy;' but prefer '&#169;'. In 
> this case, do not register '&copy;' as an encoding entity. 
> This causes step 2. to fail and step 3. to output the numeric 
> entity. You can remove selected named entities from the 
> encoding registry with the UnRegisterEncodingEntity procedure.
> 
> Ralf 
> 
> _______________________________________________
> Delphi Inspiration mailing list
> yunqa@xxxxxxxxxxxxx
> //www.freelists.org/list/yunqa
> 
> 
> 
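For readers who want to see the three-step writing process from the quoted message in action, here is a minimal sketch in Python. It is not Delphi and it does not use the DIHtmlParser API: the function name write_char and its parameters are hypothetical, and Python's HTML 4 entity table (html.entities.codepoint2name) merely stands in for the registry of encoding entities. Passing a table without the 'copy' entry imitates the effect described for UnRegisterEncodingEntity; it is not how the component implements it.

```python
from html.entities import codepoint2name


def write_char(ch, encoding="iso-8859-1", named_entities=codepoint2name):
    """Return the text a writer following the three steps would emit for ch.

    Hypothetical helper for illustration only. Markup-significant
    characters such as '<' and '&' are not treated specially here.
    """
    # Step 1: emit the character itself if the target encoding contains it.
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        pass

    # Step 2: otherwise try a registered named entity.
    name = named_entities.get(ord(ch))
    if name is not None:
        return f"&{name};"

    # Step 3: a numeric entity can represent any character.
    return f"&#{ord(ch)};"


copyright_sign = "\u00a9"

# ISO-8859-1 contains #169, so step 1 succeeds and the raw character is kept.
latin1_result = write_char(copyright_sign)

# US-ASCII does not contain #169, so step 2 yields the named entity.
ascii_result = write_char(copyright_sign, "us-ascii")

# With 'copy' removed from the entity table, step 3 yields the numeric entity.
without_copy = {cp: n for cp, n in codepoint2name.items() if n != "copy"}
numeric_result = write_char(copyright_sign, "us-ascii", without_copy)

print(ascii_result, numeric_result)
```

The choice of encoding decides whether step 1 succeeds, and the contents of the entity table decide between step 2 and step 3, which matches the two knobs the quoted message describes.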



