[liblouis-liblouisxml] Re: UCS2 and UCS4

  • From: "John J. Boyer" <john.boyer@xxxxxxxxxxxxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Wed, 7 Aug 2013 08:32:26 -0500

Also, the definition of widechar changes depending on whether UCS is set
to 2 or 4 in configure: it is a 16-bit type for 2 and a 32-bit type for 4.
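
To illustrate what a fixed-width buffer of either size looks like from the
caller's side, here is a minimal Python sketch. The helper name
encode_fixed_width and its ucs_bytes parameter are made up for this
example and are not part of liblouis; the sketch only assumes that each
character must occupy exactly one unit of 2 or 4 bytes.

```python
import struct


def encode_fixed_width(text, ucs_bytes):
    """Pack text into fixed-width units of 2 or 4 bytes (native byte order).

    Hypothetical helper for illustration only; not a liblouis function.
    """
    if ucs_bytes not in (2, 4):
        raise ValueError("ucs_bytes must be 2 or 4")
    fmt = "H" if ucs_bytes == 2 else "I"
    limit = 0xFFFF if ucs_bytes == 2 else 0x10FFFF
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > limit:
            # A 2-byte unit cannot hold this character without a surrogate pair.
            raise ValueError("U+%04X does not fit in %d bytes" % (cp, ucs_bytes))
        units.append(cp)
    # "=" selects standard sizes: H is 2 bytes, I is 4 bytes.
    return struct.pack("=%d%s" % (len(units), fmt), *units)
```

The same three-character string takes 6 bytes with 2-byte units and 12
bytes with 4-byte units, so a buffer prepared for one build is not
usable with the other.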

John

On Wed, Aug 07, 2013 at 02:16:44PM +0200, Bert Frees wrote:
> 2013/8/6 Michael Whapples <mwhapples@xxxxxxx>
> 
> > That still does not fully answer my question.
> >
> > My main concern is that the 16-bit encoding of liblouis is always
> > referred to as UCS2, which is a fixed-width encoding for 16-bit Unicode code
> > points (i.e. characters between \x0000 and \xffff). UTF-16, on the other
> > hand, while based on 16-bit code units, is not fixed width, as it can
> > represent characters up to \x10ffff by using surrogate pairs.
> >
> > For some details on what I am getting at, see
> > http://en.wikipedia.org/wiki/UCS2
> >
> > So my question is: what happens if a UCS2 build of liblouis is passed
> > one of these surrogate pairs for a character above \xffff (up to \x10ffff)?
> >
> > Python and Java, from what I can tell, do not seem to have a codec for
> > UCS2, and the Wikipedia article seems to suggest that UTF-16 superseded
> > UCS2 in version 2.0 of the Unicode standard. Thus, if I use the UTF-16
> > encoding to prepare inbuf, I could easily end up with one of these
> > surrogate pairs.
> >
> > Is the use of UCS2 in liblouis terminology accurate (i.e. fixed width
> > and not accepting surrogate pairs), or is the term UCS2 just used for
> > historic or other reasons when the encoding is actually UTF-16?
> >
> > What will happen should I pass one of these surrogate pairs in inbuf?
> >
> >
> Hi Michael,
> 
> I don't think it would work, because the parameters typeform,
> inputPositions, outputPositions, hyphens, etc. all assume that each 2 bytes
> represent one character/position in the string. If we wanted to handle
> UTF-16 encoded strings, we would need to do something similar to what we do
> with UTF-8 encoded strings.
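
The mismatch described above can be checked before a buffer is handed to
a 2-byte build. The following is a small Python sketch; the function
names utf16_unit_count and find_non_bmp are made up for this example and
are not part of liblouis or its bindings. It only shows that a character
above \xffff takes two 16-bit units in UTF-16, so unit positions and
character positions stop lining up.

```python
def utf16_unit_count(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    # utf-16-le writes no byte order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2


def find_non_bmp(text):
    """Return (index, character) for each character above U+FFFF.

    Each such character becomes a surrogate pair in UTF-16.
    """
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


sample = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF, 'b'
chars = len(sample)               # 3 characters
units = utf16_unit_count(sample)  # 4 code units
offenders = find_non_bmp(sample)  # [(1, '\U0001D11E')]
```

If find_non_bmp returns anything, arrays that hold one entry per 16-bit
unit would need one more entry per listed character than there are
characters in the string.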

-- 
John J. Boyer; President, Chief Software Developer
Abilitiessoft, Inc.
http://www.abilitiessoft.com
Madison, Wisconsin USA
Developing software for people with disabilities

For a description of the software, to download it and links to
project pages go to http://www.abilitiessoft.com
