Another problem that came to mind is that your tables will also have to be
UTF-16 encoded, so some exotic characters will take 4 bytes (a surrogate
pair). But liblouis only allows 2 bytes per character definition in UCS2
mode.

2013/8/7 Michael Whapples <mwhapples@xxxxxxx>

> Thanks, I think I now understand.
>
> So to just check:
> * If not using typeforms, hyphens, or positions, there would be no error
> caused (one would of course need to be careful about the length of the
> buffer, it might not actually be the same as the length of the input
> string).
> * If typeforms, hyphens or positions are used then these potentially will
> give wrong information. One could possibly fix it by finding characters
> which require more than 16 bits and correcting the arrays, but it might
> just be simpler to find such characters before translation and raise an
> error then (i.e. not support the characters for UCS2 builds).
>
> It is a shame that there is no UCS2 encoding in some of these higher level
> languages. If there were, then encoding the string would raise an encoding
> exception for such characters, but that is why this question came up, as
> they may just pass through if using UTF-16.
>
> Michael Whapples
>
> On 07/08/2013 13:16, Bert Frees wrote:
>
> 2013/8/6 Michael Whapples <mwhapples@xxxxxxx>
>
>> [...]
>> What will happen should I pass one of these surrogate pairs in inbuf?
>
> Hi Michael,
>
> I don't think it would work because the parameters typeform,
> inputPositions, outputPositions, hyphens, etc. all assume each 2 bytes
> represent one character/position in the string. If we'd want to handle
> UTF-16 encoded strings, we would need to do something similar to what we do
> with UTF-8 encoded strings.
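
For illustration, the "find such characters before translation and raise an
error" idea from the thread could look roughly like the Python sketch below.
The helper names (find_non_bmp, encode_ucs2, utf16_units) are hypothetical and
are not part of liblouis or its bindings; the sketch only assumes that a UCS2
build expects exactly one 16-bit unit per character.

```python
def find_non_bmp(text):
    """Return (index, char) pairs for code points above U+FFFF.

    These are the characters that UTF-16 encodes as a surrogate pair,
    i.e. two 16-bit units instead of one.
    """
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def utf16_units(text):
    """Number of 16-bit units the text occupies once encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def encode_ucs2(text):
    """Encode text as little-endian 16-bit units, refusing non-BMP input.

    Hypothetical stand-in for the missing 'ucs2' codec: it raises instead
    of silently emitting surrogate pairs.
    """
    offenders = find_non_bmp(text)
    if offenders:
        index, ch = offenders[0]
        raise ValueError(
            f"U+{ord(ch):04X} at index {index} does not fit in one 16-bit unit"
        )
    return text.encode("utf-16-le")
```

With this, utf16_units(text) differs from len(text) exactly when a surrogate
pair is present, which is the buffer-length mismatch mentioned in the first
bullet, and encode_ucs2 rejects such input up front rather than letting the
typeform or position arrays drift out of step.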