[openbeostranslationkit] Re: Structured Text Translation

Wow.  This is really unbelievable.  I was just thinking about this in 
the shower this morning.  As part of my preferences-based work I have been 
thinking about XML on BeOS.  It seemed natural to have an XMLTranslator.  I 
couldn't remember whether or not the translator kit had a primitive 
B_TRANSLATOR_STRUCTURED_TEXT type though.

Here's something that we could use perhaps.  It's just a rough thing I came up 
with so I'm not attached to the fields or the values for the various constants. 
 You get the point, I'm sure.

Andrew Bachmann

============================ TranslatorFormats.h

enum {
  ...
  B_TRANSLATOR_TEXT            = 'TEXT', /* B_ASCII_TYPE */
  B_TRANSLATOR_STRUCTURED_TEXT = 'XTXT', /* Structured text */
  ...
};

struct TranslatorStructuredText {
  int32 magic;     // B_TRANSLATOR_STRUCTURED_TEXT
  int32 charset;   // strongly recommend B_UTF8
  char escapeChar; // recommend B_UTF8_ESCAPE
  uint32 dataSize;
};

enum {
/* structured text file formats B_TRANSLATOR_STRUCTURED_TEXT */
  B_HTML_FORMAT = 'HTML',
  B_XML_FORMAT = 'XML ',
  ...
};

// this byte cannot be the leading byte of a UTF-8 character
// (10xxxxxx marks a continuation byte).
// see http://www.talisman.org/utf8.html for example
#define B_UTF8_ESCAPE 0xAA /* binary 10101010 */

// these bytes denote the structure
#define B_STRUCTURED_TEXT_PROPERTY_NAME  '$'
#define B_STRUCTURED_TEXT_PROPERTY_VALUE '='
#define B_STRUCTURED_TEXT_CHILD_BEGIN    '<'
#define B_STRUCTURED_TEXT_CHILD_END      '>'
#define B_STRUCTURED_TEXT_CONTENT        '!'

============================

The escape byte escapes only the immediately following byte.  As is usual, if 
there are two escape bytes in a row, rather than interpreting it as an escape 
sequence, it should be interpreted as the literal byte in the expected 
encoding.  In the case of UTF-8 this will (should) never happen because the 
escape character would be illegal if it literally occurred at that location. 
(Its value was chosen for this property.)
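
To make the escape rule concrete, here is a minimal sketch of a reader for the 
UTF-8 case.  The names Token, Utf8Length and Tokenize are hypothetical and not 
part of any existing kit.  It assumes the escape byte is only looked for at 
character boundaries, and that a dangling escape at the end of the stream is 
kept as literal data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed value of the proposed B_UTF8_ESCAPE (binary 10101010).
const unsigned char kEscape = 0xAA;

// One decoded unit: a structure marker such as '<' or '$', or a data byte.
struct Token {
    bool isStructure;
    unsigned char byte;
};

// Length of a UTF-8 sequence judged by its leading byte
// (1 for ASCII and for anything unexpected).
static std::size_t Utf8Length(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical helper: splits a raw stream into structure markers and
// literal data bytes.
std::vector<Token> Tokenize(const std::string& data)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < data.size()) {
        unsigned char c = static_cast<unsigned char>(data[i]);
        if (c == kEscape && i + 1 < data.size()) {
            unsigned char next = static_cast<unsigned char>(data[i + 1]);
            // escape + escape is a literal escape byte;
            // escape + anything else is a structure marker
            tokens.push_back(Token{next != kEscape, next});
            i += 2;  // the escape covers only the one following byte
            continue;
        }
        // Ordinary character: copy all of its bytes as data, so a 0xAA
        // continuation byte inside a character is never taken as an escape.
        std::size_t length = Utf8Length(c);
        for (std::size_t j = 0; j < length && i < data.size(); j++, i++)
            tokens.push_back(Token{false, static_cast<unsigned char>(data[i])});
    }
    assert(i == data.size());  // every byte was consumed exactly once
    return tokens;
}
```

Because continuation bytes are skipped along with their leading byte, plain 
UTF-8 text passes through with every token marked as data.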

The structure bytes were picked to try to enhance readability of the raw 
stream.  For example the file:

---------------------------- OBOSTranslatorKitRules.html
<html>
<head><title>OpenBeOS Translator Kit</title></head>
<body onLoad=doJavaScript();>
<h1>It Rules.</h1>
</body>
</html>
----------------------------

Original size = 124 bytes

Could be translated into the following struct and data by an HTMLTranslator:

struct TranslatorStructuredText t;

t.magic = B_HTML_FORMAT;
t.charset = B_UTF8;
t.escapeChar = B_UTF8_ESCAPE;
t.dataSize = 145; // includes the escape bytes and structure bytes

data: (please mentally translate _ into the B_UTF8_ESCAPE byte)

---------------------------- <internal format>
_<_$tag_=html_>
_<_$tag_=head_<_$tag_=title_!OpenBeOS Translator Kit_>_>
_<_$tag_=body_$onLoad_=doJavaScript();_<
_<_$tag_=h1_!It Rules._>
_>
_>
----------------------------
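
For the writing side, a translator might build such a stream roughly like 
this.  It is only a sketch: EncodeLeaf, AppendMarker and PropertyList are 
made-up names, the constants mirror the proposed header, and it assumes every 
structure byte (including the content marker) is preceded by the escape byte.  
It handles an element with properties and text but no children.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Local stand-ins for the constants in the proposed header.
const char kEscape        = '\xAA';  // B_UTF8_ESCAPE
const char kPropertyName  = '$';     // B_STRUCTURED_TEXT_PROPERTY_NAME
const char kPropertyValue = '=';     // B_STRUCTURED_TEXT_PROPERTY_VALUE
const char kChildBegin    = '<';     // B_STRUCTURED_TEXT_CHILD_BEGIN
const char kChildEnd      = '>';     // B_STRUCTURED_TEXT_CHILD_END
const char kContent       = '!';     // B_STRUCTURED_TEXT_CONTENT

typedef std::vector<std::pair<std::string, std::string> > PropertyList;

// Appends one structure marker, preceded by the escape byte.
static void AppendMarker(std::string& out, char marker)
{
    out += kEscape;
    out += marker;
}

// Appends one name/value pair as <escape>$name<escape>=value.
static void AppendProperty(std::string& out, const std::string& name,
                           const std::string& value)
{
    AppendMarker(out, kPropertyName);
    out += name;
    AppendMarker(out, kPropertyValue);
    out += value;
}

// Hypothetical helper: encodes an element that has properties and
// optional text content, but no child elements.
std::string EncodeLeaf(const std::string& tag, const PropertyList& properties,
                       const std::string& content)
{
    assert(!tag.empty());
    std::string out;
    AppendMarker(out, kChildBegin);
    AppendProperty(out, "tag", tag);  // the element name is stored as a property
    for (std::size_t i = 0; i < properties.size(); i++)
        AppendProperty(out, properties[i].first, properties[i].second);
    if (!content.empty()) {
        AppendMarker(out, kContent);
        out += content;
    }
    AppendMarker(out, kChildEnd);
    return out;
}
```

Calling EncodeLeaf("h1", PropertyList(), "It Rules.") gives 24 bytes: 9 bytes 
of text, 5 bytes of names and values, and 5 two-byte markers.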

In the case of UTF-8, an unstructured file is encoded directly, as one might 
expect (no additional byte cost).


