[liblouis-liblouisxml] Re: UCS2 and UCS4

  • From: Bert Frees <bertfrees@xxxxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Wed, 7 Aug 2013 16:50:40 +0200

Another problem that came to mind is that your tables will also have to be
UTF-16 encoded, so some exotic characters (those above U+FFFF) will take
4 bytes, i.e. a surrogate pair. But liblouis only allows 2 bytes per
character definition in UCS2 mode.
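
To make the 2-byte versus 4-byte point concrete, here is a minimal sketch in Python of how one might count the 16-bit units a character needs and reject text that does not fit in one unit per character. The helper names (utf16_units, find_non_bmp, require_ucs2) are made up for illustration and are not part of liblouis or its bindings; the same check could be run over table text or over input strings.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def find_non_bmp(text):
    """Return (index, character) for every character above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def require_ucs2(text):
    """Hypothetical guard: raise ValueError if any character needs more
    than one 16-bit unit, otherwise return the text unchanged."""
    bad = find_non_bmp(text)
    if bad:
        i, ch = bad[0]
        raise ValueError(
            "U+%04X at index %d does not fit in 16 bits" % (ord(ch), i)
        )
    return text
```

Calling require_ucs2 on a string before handing it to a UCS2 build would turn the silent pass-through into an explicit error. Whether rejecting is preferable to some other handling is a design choice this sketch does not settle.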



2013/8/7 Michael Whapples <mwhapples@xxxxxxx>

>  Thanks, I think I now understand.
>
> So, just to check:
> * If not using typeforms, hyphens, or positions, there would be no error
> caused (one would of course need to be careful about the length of the
> buffer, it might not actually be the same as the length of the input
> string).
> * If typeforms, hyphens or positions are used then these potentially will
> give wrong information. One could possibly fix it by finding characters
> which require more than 16 bits and correcting the arrays, but it might
> just be simpler to find such characters before translation and raise an
> error then (i.e. not support those characters in UCS2 builds).
>
> It is a shame that some of these higher-level languages have no UCS2
> encoding. If they did, encoding a string containing such characters
> would raise an encoding exception. That is why this question came up:
> with UTF-16 they may just pass through.
>
> Michael Whapples
> On 07/08/2013 13:16, Bert Frees wrote:
>
>
>  2013/8/6 Michael Whapples <mwhapples@xxxxxxx>
>
>> [...]
>> What will happen should I pass one of these surrogate pairs in inbuf?
>>
>>
>  Hi Michael,
>
>  I don't think it would work because the parameters typeform,
> inputPositions, outputPositions, hyphens, etc. all assume each 2 bytes
> represent one character/position in the string. If we wanted to handle
> UTF-16 encoded strings, we would need to do something similar to what we do
> with UTF-8 encoded strings.
>
>
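
As an illustration of why the position arrays go wrong, and of what "correcting the arrays" might involve, here is a rough Python sketch that maps between character indices and 16-bit unit indices. It does not call liblouis at all, and the function names are hypothetical. It only shows the bookkeeping; it does not make translation of surrogate pairs meaningful in a UCS2 build.

```python
def utf16_offsets(text):
    """For each character, the index of its first UTF-16 code unit."""
    offsets = []
    unit = 0
    for ch in text:
        offsets.append(unit)
        # Characters above U+FFFF occupy two 16-bit units (a surrogate pair).
        unit += 2 if ord(ch) > 0xFFFF else 1
    return offsets


def unit_to_char_index(text):
    """For each UTF-16 code unit, the index of the character it belongs to."""
    mapping = []
    for i, ch in enumerate(text):
        mapping.extend([i] * (2 if ord(ch) > 0xFFFF else 1))
    return mapping


# "a", MUSICAL SYMBOL G CLEF (U+1D11E), "b": 3 characters, 4 code units.
sample = "a\U0001D11Eb"
char_to_unit = utf16_offsets(sample)
unit_to_char = unit_to_char_index(sample)
```

For the sample string, the buffer holds one more 16-bit unit than the string has characters, so any array that assumes one position per 2 bytes is shifted by one for everything after the clef.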
