Thanks, I'll give that a try. &copy; instead of &#169; would be just fine.

> -----Original Message-----
> From: yunqa-bounce@xxxxxxxxxxxxx
> [mailto:yunqa-bounce@xxxxxxxxxxxxx] On Behalf Of Delphi Inspiration
> Sent: Thursday, July 09, 2009 2:23 PM
> To: yunqa@xxxxxxxxxxxxx
> Subject: [yunqa.de] Re: DIHtmlParser and Entities
>
> At 17:44 09.07.2009, Mike Dixon wrote:
>
> >I'm trying to use the DIHtmlParser, DIHtmlCasePlugin, and
> >DIHtmlWriterPlugin to simply convert HTML tags, attributes, etc. to
> >lowercase - all of that works fine.
> >
> >My problem is that entities are being converted to their single
> >character equivalents.
>
> This is by no means a problem in HTML. According to the
> specification, characters and entities are equivalent in HTML.
> DIHtmlParser takes care to decode entities to their correct
> character representation so they will result in the same page
> rendering when viewed with web browsers.
>
> >In other words, they're being decoded and I don't want them to be.
>
> Wanting or not wanting is another issue. To approach this, it
> helps to understand how DIHtmlParser writes out text
> characters. In TDIHtmlWriterPlugin, this is a 3-step process:
>
> 1. Try to represent the character with the current character encoding.
>
> 2. If 1. fails (because the character is not available),
> try to escape it with a registered named entity.
>
> 3. If 2. fails as well (because there is no such named entity),
> use a numeric entity instead. This will finally succeed because
> all chars can be represented through numeric entities.
>
> This algorithm was deliberately chosen because it produces
> the shortest possible output.
>
> >For example, &#169; is converted to C.
>
> Correctly so (more precisely, it is the (c) copyright sign we
> are talking about). This happens because #169 is part of the
> HTML default character set ISO-8859-1, as well as of many
> others. Hence step 1. above immediately succeeds, producing
> the shortest possible output of one byte length only.
>
> To suppress outputting the (c) copyright sign character, you
> must pick a character encoding which does not contain #169.
> The character set with the smallest number of characters is
> US-ASCII. If you choose this, TDIHtmlWriterPlugin will output
> '&copy;' instead of the (c) copyright sign character
> according to step 2. above.
>
> Again, you might not like '&copy;' but prefer '&#169;'. In
> this case, do not register '&copy;' as an encoding entity.
> This causes step 2. to fail and step 3. to output the numeric
> entity. You can remove selected named entities from the
> encoding registry with the UnRegisterEncodingEntity procedure.
>
> Ralf
>
> _______________________________________________
> Delphi Inspiration mailing list
> yunqa@xxxxxxxxxxxxx
> //www.freelists.org/list/yunqa
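
For anyone who wants to see the 3-step write-out from Ralf's reply in isolation, here is a minimal sketch in Python. It is not DIHtmlParser code and does not use its API: the function names and the `named` parameter are made up for illustration, and Python's standard `html.entities.codepoint2name` table stands in for the registry of named encoding entities. It also ignores markup characters such as `&` and `<`, which a real writer would have to escape separately.

```python
from html.entities import codepoint2name


def escape_char(ch, encoding, named=codepoint2name):
    """Write one character using the three-step fallback (illustrative only)."""
    # Step 1: keep the character if the target encoding can represent it.
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        pass
    # Step 2: otherwise use a named entity, if one is registered.
    name = named.get(ord(ch))
    if name is not None:
        return "&" + name + ";"
    # Step 3: otherwise use a numeric entity, which is always possible.
    return "&#%d;" % ord(ch)


def escape_text(text, encoding, named=codepoint2name):
    """Apply escape_char to every character of text."""
    return "".join(escape_char(ch, encoding, named) for ch in text)


copyright_sign = "\u00a9"  # the (c) copyright sign, #169

# ISO-8859-1 contains #169, so step 1 succeeds and the character stays as is.
latin1_out = escape_text(copyright_sign, "iso-8859-1")

# US-ASCII does not contain #169, so step 2 produces the named entity.
ascii_out = escape_text(copyright_sign, "ascii")  # "&copy;"

# With "copy" removed from the table (the analogue of unregistering it),
# step 2 fails and step 3 produces the numeric entity.
no_copy = {cp: n for cp, n in codepoint2name.items() if n != "copy"}
numeric_out = escape_text(copyright_sign, "ascii", no_copy)  # "&#169;"
```

The last two calls mirror the two suggestions in the reply: switching the output encoding to US-ASCII gives the named entity, and removing the named entity from the table gives the numeric one.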