[liblouis-liblouisxml] Re: UCS2 and UCS4

From: Michael Whapples <mwhapples@xxxxxxx>
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Date: Wed, 07 Aug 2013 15:05:52 +0100

Thanks, I think I now understand.

So, just to check:

* If typeforms, hyphens, and positions are not used, no error would be
  caused (one would of course need to be careful about the length of
  the buffer, since it might not actually be the same as the length of
  the input string).
* If typeforms, hyphens, or positions are used, then these will
  potentially give wrong information. One could possibly fix this by
  finding characters which require more than 16 bits and correcting
  the arrays, but it might be simpler to find such characters before
  translation and raise an error then (i.e. not support those
  characters in UCS2 builds).
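
To make the two points above concrete, here is a minimal Python sketch.
The helper names utf16_unit_count and non_bmp_indices are made up for
illustration and are not part of liblouis or its bindings. It only
shows how the string length and the number of 16-bit units can differ,
and how the offending characters could be located before translation.

```python
def utf16_unit_count(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    # Code points above U+FFFF take a surrogate pair (two units).
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def non_bmp_indices(text: str) -> list:
    """Character indices of code points that need more than 16 bits."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


sample = "a\U0001F600b"          # U+1F600 lies outside the 16-bit range
print(len(sample))               # 3 characters
print(utf16_unit_count(sample))  # 4 sixteen-bit units
print(non_bmp_indices(sample))   # [1]
```

A caller could refuse to translate whenever non_bmp_indices returns a
non-empty list.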

It is a shame that there is no UCS2 encoding in some of these higher
level languages. If there were, encoding a string containing such
characters would raise an encoding exception. As it is, they may just
pass through when using UTF-16, which is why this question came up.
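
As a rough stand-in for the missing codec, something like the following
could be used in Python. The function name encode_ucs2_strict is
hypothetical, and the choice of native byte order is an assumption
about what the caller needs, not something taken from the liblouis
documentation.

```python
import sys


def encode_ucs2_strict(text: str) -> bytes:
    """Encode text as 16-bit units, rejecting anything above U+FFFF."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # Mimic what a real UCS2 codec would be expected to do.
            raise UnicodeEncodeError(
                "ucs2", text, i, i + 1, "code point above U+FFFF"
            )
    # Assumption: the consumer wants native byte order without a BOM.
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return text.encode(codec)
```

With this, encode_ucs2_strict("abc") returns 6 bytes, while a string
containing U+1F600 raises UnicodeEncodeError instead of silently
producing a surrogate pair.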


Michael Whapples

On 07/08/2013 13:16, Bert Frees wrote:


2013/8/6 Michael Whapples <mwhapples@xxxxxxx>

    That still does not fully answer my question.

    My main concern is that the 16-bit encoding of liblouis is always
    referred to as UCS2, which is a fixed-width encoding for 16-bit
    Unicode code points (i.e. characters between \x0000 and \xffff).
    UTF-16, on the other hand, while being based on 16-bit code units,
    is not fixed width, as it can represent characters up to \x10ffff
    by using surrogate pairs.

    For some details on what I am getting at, see
    http://en.wikipedia.org/wiki/UCS2

    So my question is: what happens if a UCS2 build of liblouis is
    passed one of these surrogate pairs, for a character between
    \xffff and \x10ffff?

    Python and Java, from what I can tell, do not seem to have a codec
    for UCS2, and the Wikipedia article seems to suggest that UTF-16
    supersedes UCS2 as of version 2.0 of the Unicode standard. Thus if
    I use the UTF-16 encoding to prepare inbuf, I could easily end up
    with one of these surrogate pairs.

    Is the use of UCS2 in liblouis terminology accurate (i.e. fixed
    width and not accepting surrogate pairs), or is the term UCS2 just
    used for historic or other reasons while the encoding is actually
    UTF-16?

    What will happen should I pass one of these surrogate pairs in inbuf?


Hi Michael,

I don't think it would work, because the parameters typeform,
inputPositions, outputPositions, hyphens, etc. all assume that each 2
bytes represent one character/position in the string. If we wanted to
handle UTF-16 encoded strings, we would need to do something similar to
what we do with UTF-8 encoded strings.
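
To illustrate the mismatch, here is a small Python sketch of mapping
16-bit units back to characters. It is not how liblouis handles UTF-8
internally. The function names units_to_chars and collapse_per_unit are
invented for this example, and keeping the value of the first unit of a
pair is an arbitrary choice.

```python
def units_to_chars(text):
    """For each UTF-16 code unit of text, the index of its character."""
    mapping = []
    for i, ch in enumerate(text):
        # A surrogate pair contributes two units for one character.
        mapping.extend([i] * (2 if ord(ch) > 0xFFFF else 1))
    return mapping


def collapse_per_unit(values, text):
    """Reduce a per-code-unit array to one entry per character.

    Keeps the value belonging to the first unit of each character.
    """
    out = []
    last = -1
    for unit_index, char_index in enumerate(units_to_chars(text)):
        if char_index != last:
            out.append(values[unit_index])
            last = char_index
    return out


print(units_to_chars("a\U0001F600b"))                   # [0, 1, 1, 2]
print(collapse_per_unit([0, 1, 1, 0], "a\U0001F600b"))  # [0, 1, 0]
```

An array such as typeform or hyphens that was filled in per 16-bit unit
would need this kind of correction before it lines up with the
characters of the original string.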

