2013/8/6 Michael Whapples <mwhapples@xxxxxxx>

> That still does not fully answer my question.
>
> My main concern is that all the time the 16-bit encoding of liblouis is
> referred to as UCS2, which is a fixed-width encoding for 16-bit Unicode code
> points (i.e. characters between \x0000 and \xffff). UTF-16, on the other
> hand, while being based on 16-bit code points, is not fixed width, as it can
> accept characters up to \x10ffff by using surrogate pairs.
>
> For some details on what I am getting at, maybe read
> http://en.wikipedia.org/wiki/UCS2
>
> So my question is, what happens should a UCS2 build of liblouis be passed
> one of these surrogate pairs for characters between \xffff and \x10ffff?
>
> Python and Java, from what I can tell, do not seem to have a codec for
> UCS2, and the Wikipedia article seems to suggest that UTF-16 supersedes
> UCS2 in version 2.0 of the Unicode standard. Thus if I use the UTF-16
> encoding to prepare inbuf, I could easily end up with one of these
> surrogate pairs.
>
> Is the use of UCS2 in liblouis terminology accurate (i.e. being fixed width
> and not accepting the surrogate pairs), or is the term UCS2 just used
> for historic or other reasons when it is actually UTF-16?
>
> What will happen should I pass one of these surrogate pairs in inbuf?

Hi Michael,

I don't think it would work, because the parameters typeform,
inputPositions, outputPositions, hyphens, etc. all assume that each 2 bytes
represent one character/position in the string. If we wanted to handle
UTF-16 encoded strings, we would need to do something similar to what we do
with UTF-8 encoded strings.
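
To make the "each 2 bytes is one position" assumption concrete, here is a small Python sketch (not part of liblouis; the helper names utf16_code_units and is_bmp_only are made up for illustration). It shows that a character above \xffff becomes two 16-bit units when encoded as UTF-16, and gives a check a caller might run before filling inbuf, assuming the caller simply wants to reject such text up front.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16 (little-endian, no BOM).

    Hypothetical helper for illustration only; it is not a liblouis function.
    """
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_bmp_only(text):
    """True if every character fits in one 16-bit unit (i.e. is at most U+FFFF).

    Hypothetical pre-flight check: text that passes encodes to exactly one
    16-bit unit per character, so character index and unit index coincide.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    bmp_text = "abc"
    clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range

    print([hex(u) for u in utf16_code_units(bmp_text)])
    print([hex(u) for u in utf16_code_units(clef)])  # two units: a surrogate pair
    print(is_bmp_only(bmp_text), is_bmp_only(clef))
```

For text that passes is_bmp_only, the number of 16-bit units equals the number of characters, which is the situation the position arrays assume. For the clef example, one character occupies two units, so any per-position array would have two slots for it.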