[yunqa.de] Re: DiHtmlParser: How to change Html file charset without modifying html tags

  • From: "Max Terentiev" <maxteren@xxxxxxxxx>
  • To: <yunqa@xxxxxxxxxxxxx>
  • Date: Tue, 3 Jun 2014 19:48:21 +0400

Hi,

Yes, DIUnicode_CharSet_Converter is what I was looking for.

Thank you very much!

---
With best regards, Max Terentiev.
Business Software Products.
AMS Development Team.
support@xxxxxxxxxx

-----Original Message-----
From: yunqa-bounce@xxxxxxxxxxxxx [mailto:yunqa-bounce@xxxxxxxxxxxxx] On
Behalf Of Delphi Inspiration
Sent: Tuesday, June 03, 2014 2:46 PM
To: yunqa@xxxxxxxxxxxxx
Subject: [yunqa.de] Re: DiHtmlParser: How to change Html file charset
without modifying html tags

On 03.06.2014 11:37, Max Terentiev wrote:

> DiHtmlParser changes the appearance of some HTML tags!
>
> For example, it changes < /BR> to <br> and so on. For my task it is very
> important to keep the original tag appearance as is.

DIHtmlParser should not change the HTML source semantically. By semantically
I mean that browsers should render the output HTML identically to the source
HTML.

Keeping this in mind, DIHtmlParser might change HTML slightly. Most often
this happens for tags: DIHtmlParser strips space characters between tag
attributes and might change the attribute quotation character if it finds a
more compact representation ("abc'def" instead of 'abc&apos;def').
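
To illustrate the idea of picking the more compact quotation character, here
is a minimal sketch. It is not DIHtmlParser's actual code: QuoteAttrValue is
a hypothetical helper written only for this example, and it uses nothing but
the standard RTL. It takes the decoded attribute value and wraps it in
whichever quote character needs no escaping.

```pascal
program QuoteDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of DIHtmlParser): wrap an attribute value
// in the quote character that requires no escaping inside the value.
function QuoteAttrValue(const Value: string): string;
begin
  if Pos('"', Value) = 0 then
    // No double quote inside: double quotes need no escaping.
    Result := '"' + Value + '"'
  else if Pos('''', Value) = 0 then
    // Contains double quotes but no apostrophe: use single quotes.
    Result := '''' + Value + ''''
  else
    // Both characters occur: fall back to escaping the double quotes.
    Result := '"' + StringReplace(Value, '"', '&quot;', [rfReplaceAll]) + '"';
end;

begin
  Writeln(QuoteAttrValue('abc''def'));  // value contains an apostrophe
  Writeln(QuoteAttrValue('say "hi"'));  // value contains double quotes
end.
```

Either quoting is equivalent for a browser, which is why such a change does
not count as a semantic modification.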

> Is it possible to tune DiHtmlParser/Writer/CharsetPlugin to not change the
> tags' appearance, not remove any extra spaces in my HTML code, and so on?
> I just need to change the charset and leave all tag appearance and HTML
> structure as is, without any changes.

Most important is to set:

   TDIHtmlParser.NormalizeWhiteSpace := False;

You can also experiment with:

   property TDIHtmlWriterPlugin.PredefinedEntities

Entity replacement is controlled by the entities registered. See

   function RegisterHtmlEncodingEntities;

You can use the DIHtmlParser_WriterPlugin demo project for testing these
properties.

Always keep in mind: Even with these tweaks, DIHtmlParser does not guarantee
that output code points are 100% identical to input. However, the documents
should be semantically equivalent in terms of the HTML specification.

Side note:

To simply change the character encoding of any text document without paying
attention to the HTML specifics, you can use DIUnicode. There is a demo
called DIUnicode_CharSet_Converter which does just that. However, the demo
does not look for character set markers within the document and does not
change them either.

But maybe this is just what you are looking for?
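
For readers without DIUnicode at hand, the same kind of plain re-encoding can
be sketched with the standard Delphi RTL. This is an assumption-laden
stand-in, not the DIUnicode demo: it relies on TEncoding and TFile from
System.SysUtils and System.IOUtils, ConvertFile is a hypothetical name, and
the code pages 1251 and 65001 are only example values. Like the demo, it
leaves any charset declaration inside the document untouched.

```pascal
program ConvertCharset;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper: read SrcName using SrcCodePage and write the same
// text to DstName using DstCodePage. The text itself is not inspected.
procedure ConvertFile(const SrcName, DstName: string;
  SrcCodePage, DstCodePage: Integer);
var
  SrcEnc, DstEnc: TEncoding;
  Text: string;
begin
  // Encodings obtained through GetEncoding are owned by the caller.
  SrcEnc := TEncoding.GetEncoding(SrcCodePage);
  try
    DstEnc := TEncoding.GetEncoding(DstCodePage);
    try
      Text := TFile.ReadAllText(SrcName, SrcEnc);
      TFile.WriteAllText(DstName, Text, DstEnc);
    finally
      DstEnc.Free;
    end;
  finally
    SrcEnc.Free;
  end;
end;

begin
  if ParamCount < 2 then
    Writeln('Usage: ConvertCharset <source file> <target file>')
  else
    // Example values only: Windows-1251 in, UTF-8 out.
    ConvertFile(ParamStr(1), ParamStr(2), 1251, 65001);
end.
```

Characters that have no representation in the target code page cannot
survive such a conversion, so a Unicode target such as UTF-8 is the safe
choice.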

Ralf
_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa