[biblitfonts] Normalization Tester HTA

Folks ---

I think it's important that we think about normalization in the context
of the framework and algorithms that are defined by the standard.
Basically, I'm pretty sure we can get away with changing combining
classes for individual characters (or proposing new characters with sane
combining classes), but we will not get Unicode (or numerous vendors!)
to change the combining class concept or the algorithm used to implement
mark reordering.

Find attached an HTML Application that I'm using to interactively play
around with combining classes and Unicode normalization. The attached
ZIP file "Norm.zip" contains two files: Norm.hta and chardata.xml. Unzip
it so that the HTA and XML are in the same folder, then run the HTA. If
you have IE 5.5 or greater on a Windows box, it should work. It might
work on other platforms, but I can't be sure.

I. REFERENCE 

For your reference, here is what Unicode defines as the mark re-ordering
algorithm in a nutshell (for the full story, see
http://www.unicode.org/reports/tr15/#Decomposition). Each code point can
only have one combining class. Given a pair of adjacent marks on the
same base character, marks are swapped if the combining class of the
first mark is greater than the combining class of the second. Pairs with
the same combining class are obviously not swapped, thus preserving
their order. The pair swapping is done until there are no more possible
pairs to swap. 
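For anyone who wants to try this outside the HTA, here is a minimal Python sketch of the pair-swapping described above. It is not the code inside Norm.hta; the function name reorder_marks and the pluggable ccc parameter are made up for illustration. By default it uses the combining classes that ship with Python's unicodedata module.

```python
import unicodedata

def reorder_marks(text, ccc=unicodedata.combining):
    """Reorder combining marks by repeated adjacent-pair swapping.

    `ccc` maps a character to its combining class (0 = base character).
    Passing your own function lets you experiment with other classes.
    """
    chars = list(text)
    swapped = True
    while swapped:                      # repeat until no pair can be swapped
        swapped = False
        for i in range(len(chars) - 1):
            first, second = ccc(chars[i]), ccc(chars[i + 1])
            # Swap only if both are marks (class > 0) and the first class
            # is strictly greater; equal classes are left in input order.
            if first > second > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

BET, DAGESH, RAFE = "\u05D1", "\u05BC", "\u05BF"

# Rafe (class 23) typed before Dagesh (class 21) gets swapped.
print(reorder_marks(BET + RAFE + DAGESH) == BET + DAGESH + RAFE)  # True
```

With the standard classes, the result should match what unicodedata.normalize("NFD", ...) does to already-decomposed Hebrew.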

Thus, when characters are in the same class, that means that they cannot
be rearranged by normalization. So, if you want to preserve the input
ordering between mark A and mark B, they should be in the same class. If
you want to make sure a particular order is enforced, you put them in
different classes, and if you want A before B, the combining class
number for A should be lower than B. Lower combining class = closer to
the base glyph.
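A tiny illustration of that rule, using two hypothetical marks "A" and "B" on a single base character. The class tables are invented for the example. A stable sort on the class number gives the same result as the pair swapping, because neither ever reorders marks of equal class.

```python
def normalized_order(marks, classes):
    """Return the marks of one base character in normalized order.

    sorted() is stable, so marks with equal class keep their input order,
    which is exactly what the pair-swapping rule guarantees.
    """
    return sorted(marks, key=lambda mark: classes[mark])

same_class = {"A": 220, "B": 220}   # hypothetical: A and B share a class
diff_class = {"A": 220, "B": 230}   # hypothetical: A is lower than B

print(normalized_order(["B", "A"], same_class))  # ['B', 'A'] (input order kept)
print(normalized_order(["B", "A"], diff_class))  # ['A', 'B'] (A forced first)
```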

II. INTERFACE

The HTA doesn't have a very advanced interface (I just threw it
together, after all), but here are the basics:

(1) You can type the font and font-size at the upper-left corner of the
HTA.

(2) You can paste Hebrew into the text box at the upper-right corner of
the HTA. This Hebrew should be encoded in UTF-8 and should be decomposed
(most Unicode Hebrew already is). You can copy and paste Hebrew strings
from the OT text files that were posted to the list a while back.

(3) On the left is a table of all the Hebrew marks, with their name,
codepoint, the existing combining class as defined in the Unicode
standard, and (most interestingly) an input column where you can play
Unicode consortium and arbitrarily change the combining class for that
character. Note that I have given the existing Unicode combining class
so you can know how far away from the standard you are headed.

(4) When you've got the classes the way you think they should be, hit
the "Normalize" button, and the Hebrew text in the upper-right text box
is normalized using the combining class data you have defined on the
left. The normalized text then appears at the bottom-right corner of the
HTA, and the "before/after" table on the right middle is populated with
the names of all the characters. The before column is the order of
characters before normalization, and the after column shows the
characters in order after normalization. When a row in the before/after
table is light yellow, that means that the order of marks was changed
from input to output.

If you put in something other than Hebrew, you'll probably get a script
error. If you do something else unexpected, I'm sure it'll barf. :-)

III. THE COMBINING SCHEME GIVEN

The enclosed chardata.xml represents the following scheme:

(1) All cantillation and vowels above the letter are in class 230. 230
is the combining class that all above-the-letter marks are given in
Unicode.

(2) All cantillation and vowels below the letter are in class 220.
Unfortunately, this means that below-the-letter marks will always appear
before those above the letter in normalized text, which I find somewhat
illogical, but those numbers were chosen early on by Unicode, and are
utilized throughout the rest of the standard.

(3) Dagesh and Rafe are left with their current combining classes (21
and 23, respectively). In the few instances where they appear together,
their order is not important, so it does no harm that Dagesh will always
come before Rafe in normalized text.

(4) Meteg is in class 220 with all the other below-the-letter marks.
This will allow the greatest flexibility in putting in funky ordering.

(5) I changed Shin dot and Sin dot to low class numbers so they will end
up closer to the consonant.
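To see what item (4) buys us, here is a rough Python sketch (again, not the HTA's code) that compares the standard classes with a partial version of this scheme. Only Qamats and Meteg are overridden, as a stand-in for "all below-the-letter marks in 220". The names SCHEME_OVERRIDES, ccc_scheme and reorder are invented for the example, and the Shin dot/Sin dot values are deliberately left out.

```python
import unicodedata

BET, QAMATS, METEG = "\u05D1", "\u05B8", "\u05BD"

# Partial, illustrative override table: below-the-letter marks go to 220.
SCHEME_OVERRIDES = {QAMATS: 220, METEG: 220}

def ccc_scheme(ch):
    """Combining class under the scheme; falls back to the standard value."""
    return SCHEME_OVERRIDES.get(ch, unicodedata.combining(ch))

def reorder(text, ccc):
    """Stable-sort each run of marks by class (same result as pair swapping)."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:                 # a base character ends the run
            out.extend(sorted(run, key=ccc))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))     # flush marks at the end of the text
    return "".join(out)

meteg_first = BET + METEG + QAMATS

# Standard classes (Qamats 18, Meteg 22) move the Meteg after the vowel.
print(reorder(meteg_first, unicodedata.combining) == BET + QAMATS + METEG)  # True
# Under the scheme both are 220, so the typed order survives.
print(reorder(meteg_first, ccc_scheme) == meteg_first)                      # True
```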

I think this pretty well accommodates all of the changes necessary to
accomplish all of the extant renderings for marks in the Hebrew OT Bible
(including right, left, and center meteg) with the fewest changes to the
existing standard. (Paul, I do understand that the combining classes for
existing points can't actually be changed, but you will perhaps be
proposing some additional "Biblical Hebrew" versions of whichever points
need to be "changed"; when that time comes, we can exchange the existing
codepoints in our data one-to-one with the new codepoints and we will
have anticipated the new behavior of those points. I hope.)

This scheme does only minimal reordering of marks. But you'll see that
when you run the HTA.

You are hereby invited to come up with problem cases and/or tweak the
scheme to whatever *you* think is the best solution.

I will be posting a non-normalized Michigan-Claremont (Oxford) Hebrew OT
text encoded in UTF-8 shortly ... I don't think I'll get to it today;
perhaps tomorrow.

Humbly submitted,
Eli

-- 
Eli Evans, Text Preparation Manager
Libronix Corporation
http://www.libronix.com - eli@xxxxxxxxxxxx
Ec. 12:12
 
