[haiku-development] Re: Font Encoding

  • From: pulkomandy <pulkomandy@xxxxxxxxxxxxxxxxx>
  • To: <haiku-development@xxxxxxxxxxxxx>
  • Date: Wed, 16 Feb 2011 11:30:19 +0100


> But that is, as Michael said, just a problem of our implementation. No one
> forces us to use iconv or ICU to convert between UTF-* and UTF-8. We need
> only 21 bits to represent a Unicode code point and have 32 state bits
> available. So there should be sufficient space for the algorithm to cache
> the not-yet-processed bits of the current/next character, which I believe
> is all that's needed to convert between different Unicode encodings.
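
To make the quoted idea concrete, here is a rough sketch of a chunked UTF-8 decoder that parks an unfinished character in a 32-bit state word. The function name utf8_decode_chunk and the bit layout (21 bits of partial code point, 2 bits of "continuation bytes still expected") are assumptions made up for illustration, not an existing Haiku API, and it does no validation of malformed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packing of the 32-bit state word:
   bits 0-20  : partial code point gathered so far
   bits 24-25 : number of continuation bytes still expected */
#define CP_MASK    0x1FFFFFu
#define NEED_SHIFT 24

/* Decodes one chunk of UTF-8 into code points. A sequence that is cut by
   the end of the chunk is saved in *state and resumed on the next call.
   Sketch only: malformed input is not detected. */
static size_t utf8_decode_chunk(uint32_t *state, const unsigned char *in,
                                size_t len, uint32_t *out)
{
    uint32_t cp = *state & CP_MASK;
    unsigned need = (*state >> NEED_SHIFT) & 3u;
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (need > 0) {                    /* continuation byte */
            cp = (cp << 6) | (b & 0x3Fu);
            need--;
        } else if (b < 0x80) {             /* single-byte character */
            cp = b;
        } else if ((b & 0xE0) == 0xC0) {   /* lead byte of 2-byte sequence */
            cp = b & 0x1Fu;
            need = 1;
        } else if ((b & 0xF0) == 0xE0) {   /* lead byte of 3-byte sequence */
            cp = b & 0x0Fu;
            need = 2;
        } else {                           /* assumed lead of 4-byte sequence */
            cp = b & 0x07u;
            need = 3;
        }
        if (need == 0)
            out[n++] = cp;
    }

    /* Park the unfinished character, if any, for the next call. */
    *state = need ? (cp | ((uint32_t)need << NEED_SHIFT)) : 0;
    return n;
}
```

Feeding the bytes 0x48 0xC3 in one call and 0xA9 in the next yields U+0048 from the first call and U+00E9 from the second, with the state word carrying the half-read character in between.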

We need to know that a given call to convert_from_utf8 is the last one, so
that we can insert an end marker in the resulting UTF-7 string.
I don't see how we can automagically guess that a given call to the
function will be the last one, unless we insert the end marker at each
call and then remove it on the next one.
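
To illustrate the problem, here is a simplified sketch of a chunked UTF-7 encoder. Everything in it is hypothetical (the utf7_state struct, the function names, taking UTF-16 code units as input instead of UTF-8, and treating all printable ASCII except '+' as directly encodable); it is only meant to show that a chunk can end with bits that have not been written yet and with no '-' end marker.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming state, not part of any existing API. */
typedef struct {
    uint32_t bits;    /* low 'nbits' bits are pending, not yet emitted */
    int      nbits;   /* number of pending bits: 0, 2 or 4 between chars */
    int      shifted; /* nonzero while inside a "+..." base64 run */
} utf7_state;

static const char kB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Simplification: printable ASCII other than '+' is written as-is. */
static int is_direct(uint16_t c)
{
    return c >= 0x20 && c < 0x7F && c != '+';
}

/* Leaves the base64 run: pads and emits leftover bits, then the '-'
   end marker. This is the step that needs to know the input is over. */
static size_t utf7_finish(utf7_state *st, char *out)
{
    size_t n = 0;
    if (st->shifted) {
        if (st->nbits > 0)
            out[n++] = kB64[(st->bits << (6 - st->nbits)) & 0x3F];
        out[n++] = '-';
    }
    st->bits = 0;
    st->nbits = 0;
    st->shifted = 0;
    return n;
}

/* Encodes one chunk of UTF-16 code units. The caller must provide an
   output buffer that is large enough; no bounds checking in this sketch. */
static size_t utf7_encode_chunk(utf7_state *st, const uint16_t *in,
                                size_t len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t c = in[i];
        if (is_direct(c)) {
            n += utf7_finish(st, out + n);  /* close any open run first */
            out[n++] = (char)c;
            continue;
        }
        if (!st->shifted) {
            out[n++] = '+';
            st->shifted = 1;
        }
        st->bits = (st->bits << 16) | c;
        st->nbits += 16;
        while (st->nbits >= 6) {
            st->nbits -= 6;
            out[n++] = kB64[(st->bits >> st->nbits) & 0x3F];
        }
    }
    return n;
}
```

Encoding the single code unit U+00A3 writes only "+AK" and leaves 4 bits pending; the remaining "M-" appears only once utf7_finish is called, which is exactly the call the current API has no place for.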

Sure, UTF-7 is a bit of a corner case, but there may be other similarly
strange encodings. For example, there may be more context-dependent stuff
like ligatures in some languages, where the output depends on the
characters after and before, and this may require undoing quite a lot of
the work done on the previous call (several chars).

If we really want to keep the ABI, the only solution I see is to let the
caller set a bit in the token to say "I'm done, this is the last part of
the string", and close the context then. But this changes how the API is
used anyway, as we'd end up with something like
convert_close(int32* token) { *token |= IM_DONE; }.
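
Spelled out a little more, the flag idea could look like the sketch below. The bit position chosen for IM_DONE and the helper token_is_last are assumptions for illustration only, and int32_t stands in for Haiku's int32.

```c
#include <stdint.h>

/* Hypothetical flag value: one bit of the token reserved for
   "this is the last part of the string". Bit 30 is an arbitrary choice
   that avoids the sign bit. */
#define IM_DONE ((int32_t)1 << 30)

/* Marks the token so that the next conversion call can tell it is the
   final one and flush or terminate its output. */
static void convert_close(int32_t *token)
{
    *token |= IM_DONE;
}

/* Hypothetical check a converter could do at the start of each call. */
static int token_is_last(int32_t token)
{
    return (token & IM_DONE) != 0;
}
```

The caller would still have to make one more conversion call after convert_close so the converter gets a chance to write its trailing output, which is why this is a change in API usage even though the function signatures stay the same.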

http://en.wikipedia.org/wiki/Context-sensitive_shaping

So, it may work without open/close for the most usual latin1<>utf8, but
likely not for more complex stuff.

-- 
Adrien.
