[yunqa.de] Re: Is there some convert limits or errors in dihtmlparser demo version?

  • From: coolspace <coolspace04@xxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Tue, 11 Sep 2012 10:52:17 -0400

Now the demo works if I turn off auto detection and select the GBK charset
for the src file. Thanks!
As for Firefox, a browser usually has several ways to determine the encoding
of an HTML page.
According to the W3C documents, the first is the charset parameter of the
Content-Type HTTP response header, which only applies to pages served online.
The second is the Content-Type declaration in a meta tag.
The last is auto detection based on universalchardet.
So maybe Firefox finds that the meta declaration is wrong and then falls back
to auto detection for the src file charset.
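
To make the order above concrete, here is a rough Python sketch of that precedence. It is only an illustration of the three steps, not Firefox's actual code: all function names are made up, and the trial decoding at the end is a crude stand-in for a real detector such as universalchardet.

```python
import re

# Hypothetical helpers for illustration; not taken from any browser or library.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)
META_RE = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                     re.IGNORECASE)


def charset_from_header(content_type):
    """Step 1: charset named in an HTTP Content-Type value, or None."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1) if match else None


def charset_from_meta(html_bytes):
    """Step 2: charset named in a meta tag near the start of the file, or None."""
    head = html_bytes[:1024].decode("ascii", errors="ignore")
    match = META_RE.search(head)
    return match.group(1) if match else None


def guess_by_trial(html_bytes, candidates=("utf-8", "gbk")):
    """Step 3 (stand-in for a detector): first candidate that decodes cleanly."""
    for name in candidates:
        try:
            html_bytes.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def pick_encoding(content_type, html_bytes):
    """Apply the three steps in order and return the first charset found."""
    return (charset_from_header(content_type)
            or charset_from_meta(html_bytes)
            or guess_by_trial(html_bytes))
```

For a local file there is no HTTP header, so `pick_encoding(None, data)` goes straight to the meta tag. This sketch trusts whatever the meta tag says; it does not model a browser second-guessing a wrong declaration.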

But I tested with Internet Explorer: it uses GB-2312 for this test file, yet
gives the right results. I don't know the reason.
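
One possible explanation, and this is only a guess, is that a browser may decode pages labelled GB2312 with a GBK decoder. The Python sketch below uses Python's built-in codecs (not DIHtmlParser) to show why that would work. The helper name `decode_lenient` is made up, and the sample character 镕 is assumed to exist in GBK but not in GB2312.

```python
# Illustration only: Python's built-in codecs, not DIHtmlParser.
common = "中文"    # characters that are present in GB2312
gbk_only = "镕"    # assumed to be in GBK but missing from GB2312

# Text limited to GB2312 characters has identical bytes in both encodings,
# so a GBK decoder reads correctly labelled GB2312 pages without any change.
assert common.encode("gb2312") == common.encode("gbk")

# Labels that name GB2312, normalized to lowercase without '-' or '_'.
GB2312_LABELS = ("gb2312", "euccn", "cngb", "csgb2312")


def decode_lenient(data, declared):
    """Decode with GBK when the page declares GB2312 (non-standard)."""
    name = declared.lower().replace("-", "").replace("_", "")
    if name in GB2312_LABELS:
        declared = "gbk"
    return data.decode(declared)
```

With this, a file that is really GBK but declares GB2312 still decodes, while a strict GB2312 decoder would fail on the GBK-only character.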

Anyway, thanks for your work.

On Tue, Sep 11, 2012 at 4:49 AM, Delphi Inspiration <delphi@xxxxxxxx> wrote:

> On 11.09.2012 02:43, coolspace wrote:
>
> > As GBK includes GB2312, all chars in GB2312 are also contained in
> > GBK. So, in order to deal with this kind of wrong meta charset
> > property, can we use the GBK decoder for both the GBK and GB2312 charsets?
>
> I am a bit reluctant to change this because it would be against the
> standards. I will dig into how Firefox handles this when I find some
> time.
>
> In the meantime, you can change DIHtmlCharSetPlugin.pas to register GBK
> for GB2312:
>
> procedure RegisterCharSet_GB_2312;
> begin
>   RegisterCharSet([
>     'EUC-CN',
>     'EUCCN',
>     'GB2312',
>     'CN-GB',
>     'csGB2312'
>     ], Read_ces_gbk); // Non-standard!
> end;
>
> Or call RegisterCharSet with Read_ces_gbk and it will override the
> previously registered character decoding(s).
>
> > BTW, with the CharSetConverter demo from htmlparser, even if I uncheck
> > auto charset and select GBK for the src file, the output result is still
> > wrong.
>
> The conversion was correct, but there was a bug in the CharSetConverter
> demo: It did not update the <meta http-equiv=Content-Type
> content="text/html;charset=UTF-8"/> if auto detection was turned off.
>
> I have updated the demo and also fixed a bug in TDIHtmlCharSetPlugin
> which reset the current character decoding to the default when it
> encountered a non-content-type meta tag (attached).
>
> I hope you find everything working well now.
>
> Ralf
>
