[liblouis-liblouisxml] Re: UCS2 and UCS4

  • From: Bert Frees <bertfrees@xxxxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Wed, 7 Aug 2013 14:16:44 +0200

2013/8/6 Michael Whapples <mwhapples@xxxxxxx>

> That still does not fully answer my question.
>
> My main concern is that the 16-bit encoding of liblouis is always
> referred to as UCS2, which is a fixed-width encoding for 16-bit Unicode code
> points (i.e. characters between \x0000 and \xffff). UTF-16, on the other hand,
> while being based on 16-bit code units, is not fixed width, as it can
> encode characters up to \x10ffff by using surrogate pairs.
>
> For some details on what I am getting at, see
> http://en.wikipedia.org/wiki/UCS2
>
> So my question is, what happens should a UCS2 build of liblouis be passed
> one of these surrogate pairs for characters between \x10000 and \x10ffff?
>
> Python and Java, from what I can tell, do not seem to have a codec for
> UCS2, and the Wikipedia article seems to suggest that UTF-16 superseded
> UCS2 in version 2.0 of the Unicode standard. Thus, if I use the UTF-16
> encoding to prepare inbuf, I could easily end up with one of these
> surrogate pairs.
>
> Is the use of UCS2 in liblouis terminology accurate (i.e. fixed width
> and not accepting surrogate pairs), or is the term UCS2 just used
> for historic or other reasons while the encoding is actually UTF-16?
>
> What will happen should I pass one of these surrogate pairs in inbuf?
>
>
Hi Michael,

I don't think it would work, because the parameters typeform,
inputPositions, outputPositions, hyphens, etc. all assume that every 2 bytes
represent one character/position in the string. If we wanted to handle
UTF-16 encoded strings, we would need to do something similar to what we do
with UTF-8 encoded strings.
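
To illustrate the position problem, here is a minimal Python 3 sketch. It
uses only the standard library and does not call liblouis at all; the helper
names utf16_units and is_ucs2_safe are made up for this example. It shows that
a character above \xffff takes two 16-bit units in UTF-16, so an array with
one slot per 2 bytes no longer lines up with the characters, and how one
might check a string before handing it to a 16-bit build.

```python
def utf16_units(text):
    """Return the 16-bit code units of text encoded as UTF-16 (no BOM)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_ucs2_safe(text):
    """True if every character fits in one 16-bit unit and none is a surrogate.

    Hypothetical pre-check, not part of liblouis.
    """
    return all(
        ord(ch) <= 0xFFFF and not (0xD800 <= ord(ch) <= 0xDFFF)
        for ch in text
    )


text = "a\U0001D11E"  # 'a' followed by U+1D11E, which lies above \xffff
units = utf16_units(text)

print(len(text), len(units))    # 2 characters, but 3 sixteen-bit units
print([hex(u) for u in units])  # ['0x61', '0xd834', '0xdd1e']
print(is_ucs2_safe(text))       # False
```

With such a check, a caller could reject or replace characters outside the
16-bit range before filling inbuf, so that one unit keeps corresponding to
one position.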
