[haiku-development] Re: Font Encoding

  • From: Ingo Weinhold <ingo_weinhold@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Wed, 16 Feb 2011 12:12:52 +0100

On 2011-02-16 at 11:23:55 [+0100], François Revol <revol@xxxxxxx> wrote:
> On 16 Feb 2011 at 10:56, Ingo Weinhold wrote:
> >> 
> >> iconv (and ICU) need to open then close the context. With this API, we
> >> don't know when to close the context, which has two problems :
> > 
> > But that is, as Michael said, just a problem of our implementation. No one
> > forces us to use iconv or ICU to convert between UTF-* and UTF-8. We need
> > only 21 bits to represent a Unicode code point and have 32 state bits
> > available. So there should be sufficient space for the algorithm to cache
> > the not-yet-processed bits of the current/next character, which I believe
> > is all that's needed to convert between different Unicode encodings.
> 
> This doesn't solve the pending chars issue though...
> We discussed this some time ago but didn't find a solution.

Simply don't leave more than one char pending? I don't see a fundamental 
problem with that solution at least.
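
To illustrate the state idea, here is a minimal sketch of a chunked UTF-8
decoder that carries an incomplete character from one call to the next in a
single 32 bit state word. The function name, the bit layout of the state and
the use of std::vector are assumptions made up for this example; this is not
the existing convert_to_utf8()/convert_from_utf8() code, and input validation
is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical state layout (an assumption, not Haiku's actual one):
//   bits 0..20  : code point bits collected so far
//   bits 24..25 : number of continuation bytes still expected (0..3)
static const uint32_t kValueMask = 0x1FFFFF;
static const int kPendingShift = 24;

// Decodes one chunk of UTF-8 and appends every completed code point to
// "out". A sequence cut off at the end of the chunk is kept in *state and
// completed by the next call. Validation (overlong forms, surrogates,
// unexpected bytes inside a sequence) is omitted for brevity.
void
DecodeUtf8Chunk(const unsigned char* src, size_t length,
	std::vector<uint32_t>& out, uint32_t* state)
{
	assert(state != nullptr);
	uint32_t value = *state & kValueMask;
	uint32_t pending = (*state >> kPendingShift) & 0x3;

	for (size_t i = 0; i < length; i++) {
		unsigned char byte = src[i];
		if (pending > 0) {
			// continuation byte (10xxxxxx): append its six payload bits
			value = (value << 6) | (byte & 0x3F);
			if (--pending == 0)
				out.push_back(value);
		} else if (byte < 0x80) {
			out.push_back(byte);
		} else if ((byte & 0xE0) == 0xC0) {
			value = byte & 0x1F;
			pending = 1;
		} else if ((byte & 0xF0) == 0xE0) {
			value = byte & 0x0F;
			pending = 2;
		} else if ((byte & 0xF8) == 0xF0) {
			value = byte & 0x07;
			pending = 3;
		}
		// any other byte is not a valid lead byte and is skipped
	}

	// at most one character is ever pending, so it fits into the state
	*state = pending > 0 ? (value | (pending << kPendingShift)) : 0;
}
```

A caller passes the same state variable (initially 0) to every call; a state
of 0 after the last call means no character was left incomplete.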

> Maybe we could agree that calling the thing with a NULL input buffer means 
> close the context and flush the remaining chars ?

Requiring a final call with NULL input (at least when state is != 0) seems 
reasonable. Though a new three-phase API (init, convert, finish) with an 
arbitrarily complex context would be even more reasonable, I suppose.
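
For comparison, a three-phase interface could look roughly like the sketch
below, using UTF-7 output as the example. The class and method names are
hypothetical, the input is taken as UTF-16 code units to keep things short,
and the set of characters written directly is a simplified reading of RFC
2152. The point is only that the shifted sequence can stay open across
Convert() calls, because Finish() is guaranteed to close it.

```cpp
#include <cstdint>
#include <string>

static const char kUtf7Alphabet[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Characters written unencoded in this sketch: RFC 2152 "Set D" plus
// space, tab, CR and LF.
static bool
IsDirectChar(char16_t c)
{
	if (c >= 0x80)
		return false;
	if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
		|| (c >= '0' && c <= '9')) {
		return true;
	}
	return std::string("'(),-./:? \t\r\n").find(char(c))
		!= std::string::npos;
}

// Hypothetical three-phase UTF-7 encoder; none of these names are Haiku API.
class Utf7Encoder {
public:
	// Phase 1 (init): start with an empty context.
	Utf7Encoder()
		: fBits(0), fBitCount(0), fShifted(false)
	{
	}

	// Phase 2 (convert): may be called any number of times. A shifted
	// (base64) sequence is left open at the end of a call.
	void Convert(const std::u16string& in, std::string& out)
	{
		for (char16_t c : in) {
			if (c == '+') {
				Finish(out);
				out += "+-";
			} else if (IsDirectChar(c)) {
				Finish(out);
				out += char(c);
			} else {
				if (!fShifted) {
					out += '+';
					fShifted = true;
				}
				fBits = (fBits << 16) | c;
				fBitCount += 16;
				while (fBitCount >= 6) {
					fBitCount -= 6;
					out += kUtf7Alphabet[(fBits >> fBitCount) & 0x3F];
				}
			}
		}
	}

	// Phase 3 (finish): flush the pending bits and write the end marker.
	void Finish(std::string& out)
	{
		if (!fShifted)
			return;
		if (fBitCount > 0)
			out += kUtf7Alphabet[(fBits << (6 - fBitCount)) & 0x3F];
		out += '-';
		fBits = 0;
		fBitCount = 0;
		fShifted = false;
	}

private:
	uint32_t	fBits;		// bits not yet written as base64
	int			fBitCount;	// number of valid low bits in fBits
	bool		fShifted;	// inside a "+...-" sequence?
};
```

With this shape the output does not depend on how the input is split into
chunks, at the price of a context object that has to live across calls.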


On 2011-02-16 at 11:30:19 [+0100], pulkomandy <pulkomandy@xxxxxxxxxxxxxxxxx> 
wrote:
> > But that is, as Michael said, just a problem of our implementation. No one
> > forces us to use iconv or ICU to convert between UTF-* and UTF-8. We need
> > only 21 bits to represent a Unicode code point and have 32 state bits
> > available. So there should be sufficient space for the algorithm to cache
> > the not-yet-processed bits of the current/next character, which I believe
> > is all that's needed to convert between different Unicode encodings.
> 
> We need to know that a given call to convert_from_utf8 is the last one, so
> that we can insert an end marker in the resulting utf-7 string.
> I don't see how we can automagically guess that a given call to the
> function will be the last one. Unless we insert the end marker at each
> call, and then remove it on the next one.

Yes, I would always insert an end marker at the end of a call, if necessary.
Removing it in a later call is not possible, since the output has already
been handed back to the caller, but it is also not necessary. This is
potentially less space efficient, but would be correct at least. Space
efficiency is a concern only when very small input chunks are used, anyway.
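
A rough sketch of what "always insert an end marker" would mean for UTF-7:
each call closes any open shifted sequence before it returns, so no state has
to survive between calls. The function name is made up, the input is taken as
UTF-16 code units rather than UTF-8 to keep the example short, and the set of
directly written characters is a simplified reading of RFC 2152.

```cpp
#include <cstdint>
#include <string>

static const char kBase64Chars[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Characters written unencoded in this sketch: RFC 2152 "Set D" plus
// space, tab, CR and LF.
static bool
IsDirect(char16_t c)
{
	if (c >= 0x80)
		return false;
	if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
		|| (c >= '0' && c <= '9')) {
		return true;
	}
	return std::string("'(),-./:? \t\r\n").find(char(c))
		!= std::string::npos;
}

// Encodes one chunk as UTF-7. Hypothetical helper, not Haiku API. The
// chunk's output always ends outside of a shifted sequence, so the results
// of several calls can simply be concatenated.
std::string
EncodeUtf7Chunk(const std::u16string& in)
{
	std::string out;
	uint32_t bits = 0;		// bits not yet written as base64
	int bitCount = 0;		// number of valid low bits in "bits"
	bool shifted = false;	// inside a "+...-" sequence?

	// flushes the pending bits (zero padded) and writes the end marker
	auto closeShift = [&]() {
		if (!shifted)
			return;
		if (bitCount > 0)
			out += kBase64Chars[(bits << (6 - bitCount)) & 0x3F];
		out += '-';
		bits = 0;
		bitCount = 0;
		shifted = false;
	};

	for (char16_t c : in) {
		if (c == '+') {
			closeShift();
			out += "+-";
		} else if (IsDirect(c)) {
			closeShift();
			out += char(c);
		} else {
			if (!shifted) {
				out += '+';
				shifted = true;
			}
			bits = (bits << 16) | c;
			bitCount += 16;
			while (bitCount >= 6) {
				bitCount -= 6;
				out += kBase64Chars[(bits >> bitCount) & 0x3F];
			}
		}
	}

	closeShift();	// the end marker is inserted on every call
	return out;
}
```

Encoding U+2262 U+0391 in one call gives "+ImIDkQ-", while encoding the two
characters in separate calls gives "+ImI-" followed by "+A5E-": longer, but
still valid, and the overhead shrinks as the chunks get larger.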

So, yes, the API is not optimal, but AFAICT it should be possible to 
implement it to work correctly for pretty much all conversions.

CU, Ingo
