Thanks, I think I now understand. So, just to check:

* If typeforms, hyphens and positions are not used, no error would be caused (though one would need to be careful about the length of the buffer, as it might not be the same as the length of the input string).
* If typeforms, hyphens or positions are used, they may give wrong information. One could possibly fix this by finding the characters which require more than 16 bits and correcting the arrays, but it might be simpler to find such characters before translation and raise an error at that point (i.e. not support those characters in UCS2 builds).

It is a shame that some of these higher-level languages have no UCS2 codec. If they did, encoding a string containing such characters would raise an encoding exception. That is why this question came up: with UTF-16, such characters may simply pass through as surrogate pairs.
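
As a rough illustration of the "find such characters before translation and raise an error" idea, here is a minimal Python 3 sketch. The function name `encode_ucs2` is made up for this example and is not part of liblouis or its bindings, and the choice of `utf-16-le` is an assumption (a real binding would need whatever byte order the library's 16-bit character type uses).

```python
def encode_ucs2(text):
    """Encode text as 16-bit units, refusing anything UCS2 cannot hold.

    Hypothetical helper, not a liblouis function. Assumes Python 3,
    where iterating a str yields whole code points.
    """
    for index, ch in enumerate(text):
        code_point = ord(ch)
        # Above U+FFFF a character needs a surrogate pair in UTF-16;
        # U+D800..U+DFFF are the surrogate halves themselves.
        if code_point > 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(
                "character U+%04X at index %d cannot be represented in UCS2"
                % (code_point, index)
            )
    # Safe now: every character becomes exactly one 16-bit unit.
    # Little-endian is an assumption made for this sketch.
    return text.encode("utf-16-le")
```

With this check in place, the encoded buffer always holds exactly one 16-bit unit per input character, so any per-character arrays stay aligned with the string.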
Michael Whapples

On 07/08/2013 13:16, Bert Frees wrote:
2013/8/6 Michael Whapples <mwhapples@xxxxxxx>

That still does not fully answer my question. My main concern is that the 16-bit encoding of liblouis is always referred to as UCS2, which is a fixed-width encoding for 16-bit Unicode code points (i.e. characters between \x0000 and \xffff). UTF-16, on the other hand, while being based on 16-bit code units, is not fixed width, as it can represent characters up to \x10ffff by using surrogate pairs. For some details on what I am getting at, you may want to read http://en.wikipedia.org/wiki/UCS2

So my question is: what happens if a UCS2 build of liblouis is passed one of these surrogate pairs for characters between \xffff and \x10ffff? Python and Java, from what I can tell, do not seem to have a codec for UCS2, and the Wikipedia article seems to suggest that UTF-16 superseded UCS2 in version 2.0 of the Unicode standard. Thus, if I use the UTF-16 encoding to prepare inbuf, I could easily end up with one of these surrogate pairs.

Is the use of UCS2 in liblouis terminology accurate (i.e. fixed width and not accepting surrogate pairs), or is the term UCS2 used for historic or other reasons when the encoding is actually UTF-16? What will happen if I pass one of these surrogate pairs in inbuf?

Hi Michael,

I don't think it would work, because the parameters typeform, inputPositions, outputPositions, hyphens, etc. all assume that each 2 bytes represent one character/position in the string. If we wanted to handle UTF-16 encoded strings, we would need to do something similar to what we do with UTF-8 encoded strings.
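
To make the misalignment concrete, the following Python 3 sketch shows how a character outside the 16-bit range takes up two 16-bit units, so an index into the buffer no longer matches an index into the string. The helper `unit_to_char_index` is hypothetical, written only for this illustration; it is not how liblouis handles positions.

```python
def unit_to_char_index(text):
    """For each 16-bit UTF-16 unit of text, give the index of its character.

    Hypothetical helper for illustration only. Assumes Python 3, where
    len() and iteration work on whole code points.
    """
    mapping = []
    for char_index, ch in enumerate(text):
        # Characters above U+FFFF occupy two units (a surrogate pair).
        units = 2 if ord(ch) > 0xFFFF else 1
        mapping.extend([char_index] * units)
    return mapping


text = "a\U0001D11Eb"  # 'a', U+1D11E (needs a surrogate pair), 'b'
mapping = unit_to_char_index(text)

print(len(text))                          # 3 characters
print(len(text.encode("utf-16-le")) // 2) # 4 sixteen-bit units
print(mapping)                            # [0, 1, 1, 2]
```

Here the 'b' is character 2 in the string but unit 3 in the buffer, which is the kind of offset that per-position arrays would get wrong. A table like `mapping` is one possible way a caller could translate unit positions back to character positions.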