[yunqa.de] Re: Capitalize function in DIUnicode

  • From: Rolf Lampa <rolf.lampa@xxxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Thu, 03 Jul 2008 11:23:41 +0200



Delphi Inspiration wrote:
Hello Rolf Lampa,

I am looking into DIUnicode, which seems interesting for my heavy processing of MediaWiki content.

Did you also have a look at WikiTaxi? It includes a MediaWiki preprocessor, parser, and HTML generator. It also does UTF-8 case conversions, but these are used only rarely in Wikipedia sources, IIRC.
Yes, a MediaWiki preprocessor would also be of some interest to me, although my application is mainly aimed at performing text manipulations directly on the wiki text (I also produce wiki text for various tools, validate wiki text, log gross syntactical errors, and try to fix some of them automagically).

Exactly what does the preprocessor do? Does it handle ParserFunctions and the like? (By the way, I saw your note on TeX etc.) Magic words and ParserFunctions are important for me (I have implemented some of them). Further, is the MW preprocessor available separately, as a Delphi component?

<...>
In the following link (a discussion in the Borland \win32 NG) people gave me some hints and tips, but I'd like to have a very reliable and fast method, all in the same library.
http://preview.tinyurl.com/6clwcf

The solution suggested there is suboptimal because it asks you to convert entire strings before capitalization. You'd better do this on a character by character basis, where character here stands for "Unicode Code Point" (UCP).
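
To make the character-by-character idea concrete, here is a minimal sketch that decodes only the first code point of a UTF-8 string and reports how many bytes it occupies. DecodeFirstUcp is a hypothetical helper written for illustration; it is not part of DIConverters or DIUtils, and it does not reject overlong forms or surrogates.

```pascal
program FirstCodePoint;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper, not a DIConverters/DIUtils function.
// Decodes the UTF-8 sequence starting at S[1] into Ucp.
// Returns the number of bytes consumed, or 0 if S is empty,
// truncated, or starts with an invalid byte (Ucp is then undefined).
function DecodeFirstUcp(const S: AnsiString; out Ucp: LongWord): Integer;
var
  B: Byte;
  I, N: Integer;
begin
  Result := 0;
  Ucp := 0;
  if Length(S) = 0 then
    Exit;
  B := Ord(S[1]);
  // The lead byte tells us the sequence length and carries the top bits.
  if B < $80 then begin Ucp := B; N := 1; end
  else if (B and $E0) = $C0 then begin Ucp := B and $1F; N := 2; end
  else if (B and $F0) = $E0 then begin Ucp := B and $0F; N := 3; end
  else if (B and $F8) = $F0 then begin Ucp := B and $07; N := 4; end
  else
    Exit; // stray continuation byte or invalid lead byte
  if Length(S) < N then
    Exit; // sequence is cut off
  // Each continuation byte (10xxxxxx) contributes six more bits.
  for I := 2 to N do
  begin
    B := Ord(S[I]);
    if (B and $C0) <> $80 then
      Exit;
    Ucp := (Ucp shl 6) or (B and $3F);
  end;
  Result := N;
end;

var
  S: AnsiString;
  Ucp: LongWord;
  N: Integer;
begin
  // U+00E9 is the two bytes C3 A9 in UTF-8; an ASCII "x" follows.
  SetLength(S, 3);
  S[1] := AnsiChar($C3);
  S[2] := AnsiChar($A9);
  S[3] := 'x';
  N := DecodeFirstUcp(S, Ucp);
  WriteLn('bytes: ', N, ', code point: ', Ucp); // bytes: 2, code point: 233
end.
```

With the byte count known, only the first N bytes need to be replaced by the re-encoded capitalized code point; the rest of the string can be copied unchanged.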

My DIConverters library contains functions to read and write individual UCPs from and to multi-byte character sets and encodings, including UTF-8. You can combine this with DIUtils.CharToTitleW() which implements most of CaseFolding.txt (actually limited to WideChar, but sufficient for most applications).
Since my tool will probably also be used by people who set up MW mirrors of WP dumps in various languages, there's a risk that I lose bits on capitalization if I only use WideChar (although I have confirmed that the titles in enwiki are NOT losing bits using only WideChar).

Since I need CaseFolding capitalization only for the purpose of matching titles (when picking up links from text, like "#REDIRECT [[Nms:Title]]", "{{template calls}}", and other [[Links]]), it shouldn't be too costly to convert those short strings to UCS-4, then capitalize and convert back to UTF-8 again. I think.

== Generating sql ==

Oh, but now I recall that there's more to it than only the titles. I know for sure that Delphi's Utf8Decode/Encode loses bits on the texts in the WP dump. Namely, these functions often return empty strings when I apply them to the wiki text to produce the SQL tables from the XML dump. In any case, for the capitalization I need to convert only relatively short strings.

== Title repair ==

I also have a feature in my "MwProcessor" that lets people repair data dumps made with phpMyAdmin, which trashes the titles by converting UTF-8 titles from VARCHAR fields (the BLOB text fields are not trashed, though).

When you sketch the converter function, make sure to preallocate the result string. This avoids time-consuming memory reallocations and gives the best performance.
UTF-8 to UCS-4 would always give a predictable length, so that should be simple, but do I also need to preallocate internally in the readers/writers, and if so, how? (See the question marks ??? in the prototype code below.)
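
On the "predictable length" point, a small sketch of how the UCS-4 size could be computed up front: Length(aUtf8)*4 is a safe upper bound, while the exact size for well-formed input is four bytes per code point. CountUcps is a hypothetical helper for illustration only and says nothing about how the DIConverters readers/writers allocate internally.

```pascal
program Ucs4Size;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: counts code points in well-formed UTF-8 by
// counting every byte that is not a continuation byte (10xxxxxx).
function CountUcps(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // U+00E9 (bytes C3 A9) followed by an ASCII "x": 3 bytes, 2 code points.
  SetLength(S, 3);
  S[1] := AnsiChar($C3);
  S[2] := AnsiChar($A9);
  S[3] := 'x';
  WriteLn('exact UCS-4 bytes: ', CountUcps(S) * 4); // 8
  WriteLn('upper bound: ', Length(S) * 4);          // 12
end.
```

The exact count costs one extra pass over the input; for short titles the simpler upper bound followed by a final SetLength to trim is probably just as good.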

Regards,

// Rolf Lampa

(I have only just started coding this routine, so please don't laugh. I've never used your code libraries before, and I guess I have to try this and that before I get a grasp of character encodings...)

procedure Utf8Capitalize_Test(var sUtf8: String);
var
  sUcs4: String; // Yes, string

    function ConvertUtf8ToUCS4(const aUtf8: Utf8String): String;
    begin
      Utf8Reader.SourceBufferAsStrA := aUtf8;

      // prealloc :       
      SetLength(???, Length(aUtf8)*4);

      while Utf8Reader.ReadChar do
        UCS4Writer.WriteCharW(Utf8Reader.Char);
      Result := Utf8Reader.DataToStrA; // sUcs4 :=
    end;

    function ConvertUCS4ToUtf8(const aUcs4: String): String;
    begin
      // TODO: Use DIUnicode lib here again
      // ...
      Result := aUcs4;
    end;

    function GetFoldCodeForUCS4Char(aCharCode: LongInt; out FoldCode: LongInt): Boolean;
    begin
      // ... lookup code in CaseFolding.txt table
      // (placeholder: returns the code unchanged until the lookup exists)
      FoldCode := {GetFoldCode}(aCharCode);
      Result := FoldCode <> aCharCode;
    end;

var
  FoldCode, FirstCharCode: LongInt;
  B: Byte;
begin
  sUcs4 := ConvertUtf8ToUCS4(sUtf8); // Using DIUnicode
  if Length(sUcs4) < 4 then
    Exit; // empty input: there is no first character to capitalize

  // Byte-wise extract the encoding of the first UCS-4 character
  // (assumes big-endian byte order, most significant byte first)
  FirstCharCode := (Ord(sUcs4[1]) shl 24);
  FirstCharCode := (Ord(sUcs4[2]) shl 16) or FirstCharCode;
  FirstCharCode := (Ord(sUcs4[3]) shl 8 ) or FirstCharCode;
  FirstCharCode :=  Ord(sUcs4[4])         or FirstCharCode;

  // Look up the new code in the CaseFolding table, and modify only if the code is different
  // (method GetFoldCode... not implemented yet)
  
  if GetFoldCodeForUCS4Char(FirstCharCode, FoldCode) then
  begin
    // Byte-wise write the folded code back into the first UCS-4 character.
    // Shift first, then mask with $FF, so each step yields one whole byte.
    B := (FoldCode shr 24) and $FF;
    sUcs4[1] := Char( B );
    B := (FoldCode shr 16) and $FF;
    sUcs4[2] := Char( B );
    B := (FoldCode shr  8) and $FF;
    sUcs4[3] := Char( B );
    B := FoldCode and $FF;
    sUcs4[4] := Char( B );
    sUtf8 := ConvertUCS4ToUtf8(sUcs4); // Using DIUnicode
  end {
  else
    sUtf8 := not modified; }
end;
