[liblouis-liblouisxml] Re: UCS2 and UCS4

  • From: Michael Whapples <mwhapples@xxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Wed, 07 Aug 2013 15:05:52 +0100

Thanks, I think I now understand.

So, just to check:
* If typeforms, hyphens, and positions are not used, no error would occur (though one would need to be careful about the length of the buffer, which might not be the same as the length of the input string).
* If typeforms, hyphens, or positions are used, they may give wrong information. One could possibly fix this by finding characters that require more than 16 bits and correcting the arrays, but it might be simpler to find such characters before translation and raise an error then (i.e. not support those characters in UCS2 builds).
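As a rough illustration of the "correcting the arrays" option, here is a minimal Python sketch. The function name utf16_unit_to_char_index is hypothetical and not part of liblouis or its bindings. It only shows how the 16-bit unit count can differ from the character count, and how each unit index could be mapped back to a character index.

```python
def utf16_unit_to_char_index(text):
    """Map each 16-bit unit of the UTF-16 form of text to a character index.

    Hypothetical helper for illustration only; not a liblouis API.
    """
    mapping = []
    for char_index, ch in enumerate(text):
        # Characters above U+FFFF need a surrogate pair, i.e. two 16-bit units.
        units = 2 if ord(ch) > 0xFFFF else 1
        mapping.extend([char_index] * units)
    return mapping


text = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
mapping = utf16_unit_to_char_index(text)
print(len(text))     # number of characters
print(len(mapping))  # number of 16-bit units the buffer would need
print(mapping)       # unit index -> character index
```

The length of the returned list is the buffer length in 16-bit units, so any per-position array coming back from a UCS2 build could in principle be re-indexed through it.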

It is a shame that some of these higher-level languages have no UCS2 codec. If they did, encoding a string containing such characters would raise an encoding exception. That is why this question came up: with UTF-16, such characters may just pass through.
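For what it is worth, a strict check is easy to sketch in Python. The helper encode_ucs2 below is hypothetical (Python has no such codec), the "ucs2" string is only a label for the exception, and little-endian byte order is an assumption; a real caller would need to match the byte order of the liblouis build.

```python
def encode_ucs2(text):
    """Encode text as 16-bit units, rejecting anything outside the BMP.

    Hypothetical helper for illustration only; little-endian is assumed.
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # Mimic what a real UCS2 codec would do for such a character.
            raise UnicodeEncodeError(
                "ucs2", text, index, index + 1,
                "character outside the Basic Multilingual Plane",
            )
    # With no characters above U+FFFF, UTF-16 and UCS2 produce the same bytes.
    return text.encode("utf-16-le")


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, above U+FFFF
# The plain UTF-16 codec accepts it and emits a surrogate pair (two units).
print(len(clef.encode("utf-16-le")))
try:
    encode_ucs2(clef)
except UnicodeEncodeError as exc:
    print(exc.reason)
```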

Michael Whapples
On 07/08/2013 13:16, Bert Frees wrote:

2013/8/6 Michael Whapples <mwhapples@xxxxxxx>

    That still does not fully answer my question.

    My main concern is that the 16-bit encoding of liblouis is always
    referred to as UCS2, which is a fixed-width encoding for 16-bit
    Unicode code points (i.e. characters between \x0000 and \xffff).
    UTF-16, on the other hand, while based on 16-bit code units, is
    not fixed width, as it can represent characters up to \x10ffff by
    using surrogate pairs.

    For some details on what I am getting at, see
    http://en.wikipedia.org/wiki/UCS2

    So my question is: what happens if a UCS2 build of liblouis is
    passed one of these surrogate pairs for characters between \xffff
    and \x10ffff?

    Python and Java, from what I can tell, do not seem to have a codec
    for UCS2, and the Wikipedia article seems to suggest that UTF-16
    supersedes UCS2 as of version 2.0 of the Unicode standard. Thus if
    I use the UTF-16 encoding to prepare inbuf, I could easily end up
    with one of these surrogate pairs.

    Is the use of UCS2 in liblouis terminology accurate (i.e. fixed
    width and not accepting surrogate pairs), or is the term UCS2
    just used for historic or other reasons when the encoding is
    actually UTF-16?

    What will happen should I pass one of these surrogate pairs in inbuf?


Hi Michael,

I don't think it would work, because the parameters typeform, inputPositions, outputPositions, hyphens, etc. all assume that every 2 bytes represent one character/position in the string. If we wanted to handle UTF-16 encoded strings, we would need to do something similar to what we do with UTF-8 encoded strings.



