[yunqa.de] Re: DIHtmlParser and Entities

  • From: "Mike Dixon" <mike@xxxxxxxxxxx>
  • To: <yunqa@xxxxxxxxxxxxx>
  • Date: Thu, 9 Jul 2009 15:52:08 -0500

I added a CharSet Plugin and assigned it to my parser.

I added the following line:

DIHtmlCharSetPlugin.RegisterCharSet_US_ASCII;

And I still have the problem.

The use of this parser might be very familiar to you, but for someone who is
just learning it, it's pretty confusing. For example, there is a
DIHtmlParser.ClearDecodingEntities method, but no CharSet Clear method.

I guess I'll have to put together a sample program. 

> -----Original Message-----
> From: yunqa-bounce@xxxxxxxxxxxxx 
> [mailto:yunqa-bounce@xxxxxxxxxxxxx] On Behalf Of Mike Dixon
> Sent: Thursday, July 09, 2009 3:30 PM
> To: yunqa@xxxxxxxxxxxxx
> Subject: [yunqa.de] Re: DIHtmlParser and Entities
> 
> Thanks, I'll give that a try. &copy; instead of &#169; would 
> be just fine. 
> 
> > -----Original Message-----
> > From: yunqa-bounce@xxxxxxxxxxxxx
> > [mailto:yunqa-bounce@xxxxxxxxxxxxx] On Behalf Of Delphi Inspiration
> > Sent: Thursday, July 09, 2009 2:23 PM
> > To: yunqa@xxxxxxxxxxxxx
> > Subject: [yunqa.de] Re: DIHtmlParser and Entities
> > 
> > At 17:44 09.07.2009, Mike Dixon wrote:
> > 
> > >I'm trying to use the DIHtmlParser, DIHtmlCasePlugin, and 
> > >DIHtmlWriterPlugin to simply convert HTML tags, attributes, etc to 
> > >lowercase - all of that works fine.
> > >
> > >My problem is that entities are being converted to their single 
> > >character equivalents.
> > 
> > This is by no means a problem in HTML. According to the specification,
> > characters and entities are equivalent in HTML. DIHtmlParser takes
> > care to decode entities to their correct character representation so
> > they will result in the same page rendering when viewed with web
> > browsers.
> > 
> > >In other words, they're being decoded and I don't want them to be.
> > 
> > Wanting or not wanting is another issue. To approach this, it helps to
> > understand how DIHtmlParser writes out text characters. In
> > TDIHtmlWriterPlugin, this is a 3-step process:
> > 
> > 1. Try to represent the character with the current character encoding.
> > 
> > 2. If 1. fails (because the character is not available),
> >    try to escape it with a registered named entity.
> > 
> > 3. If 2. fails as well (because there is no such named entity), 
> >    use a numeric entity instead. This will finally succeed because
> >    all chars can be represented through numeric entities.
> > 
> > This algorithm was deliberately chosen because it produces the
> > shortest possible output.
> > 
> > >For example, &#169; is converted to C.
> > 
> > Correctly so (more precisely, it is the (c) copyright sign we are
> > talking about). This happens because #169 is part of the HTML default
> > character set ISO-8859-1, as well as of many others. Hence step 1.
> > above immediately succeeds, producing the shortest possible output of
> > one byte length only.
> > 
> > To suppress outputting the (c) copyright sign character, you must pick
> > a character encoding which does not contain #169. The character set
> > with the smallest number of characters is US-ASCII. If you choose
> > this, TDIHtmlWriterPlugin will output '&copy;' instead of the (c)
> > copyright sign character according to step 2. above.
> > 
> > Again, you might not like '&copy;' but prefer '&#169;'. In this case,
> > do not register '&copy;' as an encoding entity. This causes step 2. to
> > fail and step 3. to output the numeric entity. You can remove selected
> > named entities from the encoding registry with the
> > UnRegisterEncodingEntity procedure.
> > 
> > Ralf
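
The three-step write-out process and the two workarounds described above
can be sketched roughly as follows. This is a language-neutral
illustration in Python, not DIHtmlParser code: NAMED_ENTITIES, write_char,
and write_text are hypothetical names, and the dictionary merely stands in
for the registry of named encoding entities.

```python
from typing import Dict

# Hypothetical stand-in for the registry of named encoding entities.
NAMED_ENTITIES: Dict[str, str] = {"\u00a9": "copy", "\u00e9": "eacute"}


def write_char(ch: str, charset: str, entities: Dict[str, str]) -> str:
    """Return the output form of one character using the 3-step fallback."""
    # Step 1: try to represent the character in the current encoding.
    try:
        ch.encode(charset)
        return ch
    except UnicodeEncodeError:
        pass
    # Step 2: fall back to a registered named entity, if there is one.
    name = entities.get(ch)
    if name is not None:
        return "&" + name + ";"
    # Step 3: a numeric entity can represent any character.
    return "&#%d;" % ord(ch)


def write_text(text: str, charset: str, entities: Dict[str, str]) -> str:
    """Apply write_char to every character of text."""
    return "".join(write_char(c, charset, entities) for c in text)


# ISO-8859-1 contains #169, so step 1 succeeds and the character is kept.
latin1_result = write_text("\u00a9 2009", "iso-8859-1", NAMED_ENTITIES)

# US-ASCII lacks #169, so step 2 emits the named entity: "&copy; 2009"
ascii_result = write_text("\u00a9 2009", "us-ascii", NAMED_ENTITIES)

# With "copy" removed from the registry, step 3 applies: "&#169; 2009"
without_copy = {c: n for c, n in NAMED_ENTITIES.items() if n != "copy"}
numeric_result = write_text("\u00a9 2009", "us-ascii", without_copy)
```

In this sketch, picking a smaller character set corresponds to changing
the charset argument, and unregistering an entity corresponds to removing
its entry from the dictionary.
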
> > 
> > _______________________________________________
> > Delphi Inspiration mailing list
> > yunqa@xxxxxxxxxxxxx
> > //www.freelists.org/list/yunqa