[yunqa.de] Re: DITidy to process www.163.com

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Mon, 17 Aug 2009 13:30:10 +0200

At 12:34 15.08.2009, Bear Xu wrote:

>I have sent the zip file to delphi@xxxxxxxxx

Thank you, I have received the file.

The HTML file is encoded in GB2312. This is specified in <meta 
http-equiv="Content-Type" content="text/html; charset=gb2312" />.

The code snippet you posted does not reveal how you load the file. It only 
shows that you are passing it as a UnicodeString. To do so, you must have 
converted it from GB2312 to UTF-16LE, but the snippet does not show how. 
However you do it, the GB2312 charset declaration will no longer match your 
HTML, which is now in UTF-16LE. This will most likely lead to severe problems 
and errors.
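
To illustrate the mismatch, here is a minimal sketch that does not use DITidy 
at all. It assumes Delphi 2009 or later on Windows, and it uses code page 936 
(GBK, a superset of GB2312) as a stand-in for GB2312. The program name and the 
BytesToHex helper are made up for this illustration. It decodes one GB2312 
character into a UnicodeString and prints the bytes before and after:

```pascal
program EncodingMismatchSketch; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  SysUtils;

{ Hypothetical helper: formats a byte array as space-separated hex pairs. }
function BytesToHex(const Bytes: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Bytes) do
    Result := Result + IntToHex(Bytes[I], 2) + ' ';
end;

var
  GB: TEncoding;
  Raw: TBytes;
  Wide: UnicodeString;
begin
  { Assumption: code page 936 is installed and stands in for GB2312. }
  GB := TEncoding.GetEncoding(936);
  try
    { The two bytes $D6 $D0 encode the character U+4E2D in GB2312. }
    SetLength(Raw, 2);
    Raw[0] := $D6;
    Raw[1] := $D0;

    { Decoding yields a UnicodeString, which is UTF-16LE in memory. }
    Wide := GB.GetString(Raw);

    WriteLn('Bytes as stored in the file: ', BytesToHex(Raw));
    WriteLn('Bytes after the conversion:  ',
      BytesToHex(TEncoding.Unicode.GetBytes(Wide)));
  finally
    GB.Free; // Encodings obtained from GetEncoding are owned by the caller.
  end;
end.
```

The two lines show different byte sequences for the same character, so any 
parser that takes charset=gb2312 at face value would misread the converted 
data.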

I suggest you use the tidyParseFile() function instead of tidyParseBuffer() to 
avoid string conversion problems. Please see the attached project for an 
example. It seems to work fine after a quick inspection, but the document is 
too lengthy for me to analyze in detail.

Ralf

{ DITidy test project for the www.163.com home page. Parses the file
  '163_com_HomePage.htm' with tidyParseFile() and writes the diagnostics and
  the cleaned HTML to the console.

  Visit the DITidy homepage for latest information and updates:

    http://www.yunqa.de/delphi/

  Copyright (c) 2007-2009 Ralf Junker, The Delphi Inspiration <delphi@xxxxxxxx>

------------------------------------------------------------------------------ }

program DITidy_Test_163;

{$APPTYPE CONSOLE}
{$I DI.inc}

uses
  FastMM4, Classes, DITidy;

var
  ResultCode: Integer;
  TidyHandle: TidyDoc;
  Target: TidyBuffer;
  ErrorBuffer: TidyBuffer;
begin

  TidyHandle := tidyCreate;

  { Remove generator meta tag (has no effect with DITidy demo). }
  if tidyOptSetBool(TidyHandle, TidyMark, 0) = 0 then
    begin
      WriteLn('Error setting option: "TidyMark"');
      Halt(1);
    end;

  { Some pretty-printing options. Error checking is omitted for brevity. }
  // tidyOptSetBool(TidyHandle, TidyXhtmlOut, 1); // Force XHTML.
  // tidyOptSetBool(TidyHandle, TidyBodyOnly, 1); // Output <BODY> content only.
  // tidyOptSetBool(TidyHandle, TidyHideEndTags, 1); // Hide </END> tags.

  ResultCode := tidySetErrorBuffer(TidyHandle, @ErrorBuffer);

  if ResultCode >= 0 then
    ResultCode := tidyParseFile(TidyHandle, '163_com_HomePage.htm');
  if ResultCode >= 0 then
    ResultCode := tidyCleanAndRepair(TidyHandle);
  if ResultCode >= 0 then
    ResultCode := tidyRunDiagnostics(TidyHandle);
  if ResultCode >= 1 then
    begin
      if tidyOptSetBool(TidyHandle, TidyForceOutput, 1) = 0 then
        begin
          WriteLn('Error setting option: "TidyForceOutput"');
          ExitCode := 1;
        end;
    end;

  if ResultCode >= 0 then
    ResultCode := tidySaveBuffer(TidyHandle, @Target);
  if ResultCode >= 0 then
    begin
      if ResultCode > 0 then // Warnings?
        begin
          WriteLn;
          WriteLn('Diagnostics:');
          WriteLn(PAnsiChar(ErrorBuffer.bp));
        end;
      WriteLn;
      WriteLn('Result HTML Output:');
      WriteLn;
      WriteLn(PAnsiChar(Target.bp));
    end
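  else
    begin
      WriteLn('A severe error (', ResultCode, ') occurred:');
      { Show whatever diagnostics Tidy collected before it failed. }
      if Assigned(ErrorBuffer.bp) then
        WriteLn(PAnsiChar(ErrorBuffer.bp));
    end;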

  if Assigned(Target.bp) then
    tidyBufFree(@Target);
  tidyBufFree(@ErrorBuffer);
  tidyRelease(TidyHandle);

  WriteLn('Done - Press ENTER to exit');
  ReadLn;
end.
