[openbeostranslationkit] Text Translation

This is just some stuff I've been tinkering with for the future translation kit. I really could use some input...
I should have posted this some time ago, but for reasons unknown I forgot :/


----
Currently, translation of images (and many other data types) converts from a specific format to a generic format, and from that
generic format to another specific format. Thus translation from PNG to JPEG involves two steps:


1 - convert PNG to generic format
2 - convert generic format to JPEG

The reason for this two-step process is that a translator converting directly from PNG to JPEG would have to know
both the PNG format (to read it) and the JPEG format (to write it), and a separate translator would be needed for every pair of formats. With an intermediary format, each translator only has to know its own format and the generic one, which greatly simplifies the translation process.
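
To make the two-step idea concrete, here is a minimal sketch in plain C++. It does not use the real Translation Kit API: ToyRoster, Codec, GenericData and makeTaggedCodec are hypothetical names made up for illustration, and the "formats" are just strings with a tag prefix. The point is only that each format registers one reader and one writer, and any pair can then be translated through the generic format.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the generic intermediate format.
struct GenericData {
    std::string payload;
};

// One reader (specific -> generic) and one writer (generic -> specific) per format.
struct Codec {
    std::function<GenericData(const std::string&)> read;
    std::function<std::string(const GenericData&)> write;
};

// Toy codec (an assumption for this sketch): a "file" is "<TAG>:<payload>".
Codec makeTaggedCodec(const std::string& tag) {
    const std::string prefix = tag + ":";
    return Codec{
        [prefix](const std::string& in) {
            if (in.compare(0, prefix.size(), prefix) != 0)
                throw std::invalid_argument("input does not start with " + prefix);
            return GenericData{in.substr(prefix.size())};
        },
        [prefix](const GenericData& g) { return prefix + g.payload; }
    };
}

// Hypothetical roster: N formats need N codecs, not N*N direct converters.
class ToyRoster {
public:
    void add(const std::string& format, Codec codec) {
        codecs_[format] = std::move(codec);
    }

    std::string translate(const std::string& input,
                          const std::string& from,
                          const std::string& to) const {
        GenericData generic = find(from).read(input);  // step 1: specific -> generic
        return find(to).write(generic);                // step 2: generic -> specific
    }

private:
    const Codec& find(const std::string& format) const {
        auto it = codecs_.find(format);
        if (it == codecs_.end())
            throw std::invalid_argument("unknown format: " + format);
        return it->second;
    }

    std::map<std::string, Codec> codecs_;
};
```

After registering the two toy codecs with add("PNG", makeTaggedCodec("PNG")) and add("JPEG", makeTaggedCodec("JPEG")), a call to translate("PNG:pixels", "PNG", "JPEG") goes through both steps. Adding a third format means writing one more codec, not two more converters.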


The problem with Text Translation
By using a temporary format it is possible to simplify the whole translation process. Defining a temporary
format for images, and even sound, isn't a big problem* since the data can be described rather easily. Images are described like this:


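struct TranslatorBitmap {
 int32 magic;         // identifies the data as a translator bitmap
 BRect bounds;        // dimensions of the bitmap
 uint32 rowBytes;     // bytes per row of pixel data
 color_space colors;  // pixel format
 uint32 dataSize;     // size of the pixel data, in bytes
};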

and sound:

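struct TranslatorSound {
 int32 magic;       // identifies the data as translator sound
 uint32 channels;   // number of audio channels
 float sampleFreq;  // sample rate
 uint32 numFrames;  // number of sample frames
};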

The above data structures describe image and sound data, not any metadata. What this means in terms of text translation
is that we need to define a data format for text too. Currently the 'B_TRANSLATOR_TEXT' format is just defined as
plain old ASCII text. This fits nicely with how sound and image data are handled. However, unlike sound and images, text loses a great deal of information when it loses its metadata layer. With the metadata layer removed, only the bare text is left: all formatting is lost, and so are images and other embedded data.
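
A small sketch of what that loss looks like in practice. The function below is a deliberately naive, hypothetical toPlainText (not part of any kit) that reduces a marked-up fragment to bare text by dropping everything between '<' and '>'. It ignores entities, comments and CDATA; it only illustrates that differently formatted inputs collapse into the same plain text.

```cpp
#include <string>

// Naive reduction of marked-up text to plain text: every tag is dropped.
// Assumption for this sketch: '<' and '>' only ever delimit tags.
std::string toPlainText(const std::string& markup) {
    std::string text;
    bool insideTag = false;
    for (char c : markup) {
        if (c == '<') {
            insideTag = true;       // start of a tag: stop copying
        } else if (c == '>') {
            insideTag = false;      // end of a tag: resume copying
        } else if (!insideTag) {
            text += c;              // ordinary character outside any tag
        }
    }
    return text;
}
```

A paragraph with a bold word and a top-level heading with the same words give identical output, so a translator reading the plain text has no way to tell them apart or to restore the formatting.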


We therefore need to establish a generic format that is understandable by all translators. The current ASCII solution just isn't useful for this.

possible solutions:
 - Binary format
 - OpenOffice.org document format
 - XHTML (strict!)
 - own XML format

There is no single correct format, but I am leaning towards the XML formats, since that would let us write translators as ordinary binary add-ons and also create a generic XSLT-based translator.
I am a bit hesitant about using the OpenOffice.org format, since it is rather complex.
I actually prefer XHTML, since it is easy to understand and quite widespread.
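
To give a feel for the XHTML option, here is a rough sketch of what a translator writing the generic format might do. Everything in it is an assumption for illustration: TextRun, escapeXml and runsToXhtmlParagraph are hypothetical names, the model only knows bold and italic, and the output is a single paragraph fragment rather than a complete XHTML document.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory model: a run of text with simple formatting flags.
struct TextRun {
    std::string text;
    bool bold;
    bool italic;
};

// Escape the characters that are not allowed as-is in XML character data.
std::string escapeXml(const std::string& raw) {
    std::string out;
    for (char c : raw) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}

// Write the runs as one XHTML paragraph, keeping the formatting as markup.
// Tags are closed in reverse order of opening so the fragment stays well-formed.
std::string runsToXhtmlParagraph(const std::vector<TextRun>& runs) {
    std::string out = "<p>";
    for (const TextRun& run : runs) {
        if (run.bold)   out += "<strong>";
        if (run.italic) out += "<em>";
        out += escapeXml(run.text);
        if (run.italic) out += "</em>";
        if (run.bold)   out += "</strong>";
    }
    out += "</p>";
    return out;
}
```

Unlike the plain ASCII format, the formatting survives the trip through the generic format, and any translator (or an XSLT stylesheet) that understands the markup can pick it up again.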


need some more pros & cons..........

* By converting to a temporary data-only format, all metadata is lost (this is not a problem for most images and most sound)

