[haiku-development] Re: Font Encoding

  • From: Oliver Tappe <zooey@xxxxxxxxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Thu, 17 Feb 2011 09:49:20 +0100

Hi Michael,

On 2011-02-17 at 07:33:15 [+0100], Michael Pfeiffer 
<michael.w.pfeiffer@xxxxxxxxx> wrote:
> 
> Am 16.02.2011 um 13:32 schrieb Oliver Tappe:
[ ... ]
> > Substitution can't be deactivated, i.e. you have no chance to learn that
> > something could not be converted.
> 
> Agreed. That IS a restriction of that API. However, is this a major issue?
> How often do you need to know that? At least in real use cases this was
> no issue (again BePDF, PDF Writer, StyledEdit).

For Beam, this made me switch to iconv, as there was no way to tell whether a 
conversion from the supposed charset of a mail into UTF-8 had actually worked 
or not. With iconv, Beam can ask the API not to substitute but to fail 
instead, so the application has a chance to try different encodings until it 
has found one that works.
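
Roughly like this (a minimal sketch, not the actual Beam code; plain POSIX 
iconv): iconv() fails with EILSEQ on the first byte sequence that is invalid 
in the assumed source charset, instead of silently substituting something, so 
a wrong guess can be detected and the next candidate encoding tried:

#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Try to convert 'src' (srcLen bytes in 'charset') into UTF-8.
 * Returns 0 on success, -1 if the data doesn't fit the charset,
 * so the caller can retry with another encoding. */
static int
try_convert(const char* charset, char* src, size_t srcLen,
	char* dst, size_t dstLen)
{
	iconv_t cd = iconv_open("UTF-8", charset);
	if (cd == (iconv_t)-1)
		return -1;	/* charset not supported */

	size_t result = iconv(cd, &src, &srcLen, &dst, &dstLen);
	int failed = result == (size_t)-1 && errno == EILSEQ;
		/* EILSEQ: illegal byte sequence -- the charset guess was wrong */

	iconv_close(cd);
	return failed ? -1 : 0;
}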

> > The BeBook only mentions B_ERROR as negative status, so it's impossible to
> > tell whether the input or the output buffer was too small or any other
> > error occurred.
> 
> The function does NOT return B_ERROR if one of the buffers was too small,
> instead it tells in sourceLength, destLength how many chars have been
> read/written.

That's precisely the problem - in a streaming environment, it's quite 
possible that you have received a buffer that's so small (a couple of bytes) 
that it can't be converted, because it contains only an incomplete multibyte 
character sequence. According to our API documentation, you wouldn't be able 
to tell what the problem was: all you'd know is that both sourceLength and 
destLength are zero, so nothing was converted - but why?
This could be solved, of course, by defining specific error states for 
"source too small" and "destination too small". Maybe our actual 
implementation does that already, I don't know.
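
iconv, by comparison, distinguishes these cases through errno, so a streaming 
caller can react to each one specifically (a sketch of the checks, using the 
errno values documented for POSIX iconv()):

#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Push one chunk of a stream through an already opened converter 'cd'
 * and find out exactly why the conversion stopped. */
static void
convert_chunk(iconv_t cd, char* in, size_t inLeft, char* out, size_t outLeft)
{
	if (iconv(cd, &in, &inLeft, &out, &outLeft) == (size_t)-1) {
		switch (errno) {
			case EINVAL:
				/* input ends with an incomplete multibyte sequence:
				   carry the remaining 'inLeft' bytes over to the next chunk */
				break;
			case E2BIG:
				/* output buffer too small: flush it and call iconv() again */
				break;
			case EILSEQ:
				/* illegal byte sequence: the input (or the assumed
				   source charset) really is wrong */
				break;
		}
	}
	/* inLeft and outLeft now tell how much was left unconsumed/unfilled,
	   comparable to sourceLength/destLength in convert_from_utf8() */
}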

> > Apart from opening and closing the converters, there isn't any API that
> > allows iterating over the supported encodings (for instance in order to 
> > get
> > a list of encoding names that can be presented in a menu).
> 
> Agreed. An API is needed for that, but it does not make the current API
> broken.

Well, it does not make the functions convert_{to,from}_utf8() broken, but it 
indicates that the character encoding conversion API is incomplete.
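
For comparison, GNU libiconv at least offers iconvlist() for walking over all 
supported encodings - a non-standard and somewhat clumsy extension, but it 
shows the kind of functionality an application needs in order to build a 
charset menu (a sketch, assuming GNU libiconv):

#include <iconv.h>	/* GNU libiconv: iconvlist() is a non-standard extension */
#include <stdio.h>

/* Called once per supported encoding; 'names' lists all its aliases. */
static int
print_encoding(unsigned int namescount, const char* const* names, void* data)
{
	(void)data;
	if (namescount > 0)
		printf("%s\n", names[0]);	/* e.g. to fill a charset menu */
	return 0;						/* non-zero would stop the enumeration */
}

int
main(void)
{
	iconvlist(print_encoding, NULL);
	return 0;
}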

> > Additionally, there's no support for activating/deactivating
> > transliteration, for applying unicode canonicalization, etc.
> 
> I don't know what this is. Any pointers how this is related to
> encoding into utf8 and decoding from utf8?

Transliteration is the process of replacing characters that aren't available 
in the destination character encoding with character sequences that are, for 
instance replacing "ä" with "ae" when converting from UTF-8 to US-ASCII.
Unicode canonicalization deals with the problem that any Unicode encoding 
allows some character strings to be represented in different ways, e.g. an 
"ä" can be represented as the single character 'ä' or as an 'a' followed by a 
combining diacritical mark character. Canonicalization transforms these 
strings into a specific normalization form, so that visually identical 
strings have the same binary representation. This is especially useful when 
comparing strings, but it has an effect on conversion, too (as the 
destination charset may support 'ä' but not the combining diacritical mark).
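
With GNU iconv, for instance, transliteration has to be requested explicitly 
by appending //TRANSLIT to the target charset - a sketch (which exact 
replacement is chosen for 'ä' depends on the iconv implementation and the 
locale):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* "//TRANSLIT" asks iconv to approximate characters that don't exist
	   in the target charset instead of failing with EILSEQ. */
	iconv_t cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
	if (cd == (iconv_t)-1)
		return 1;

	char in[] = "K\xc3\xa4se";			/* "Käse" in UTF-8 */
	char out[16];
	char* inPtr = in;
	char* outPtr = out;
	size_t inLeft = strlen(in);
	size_t outLeft = sizeof(out) - 1;

	iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
	*outPtr = '\0';
	printf("%s\n", out);				/* "Kaese", "Kase" or similar */

	iconv_close(cd);
	return 0;
}

Note that a decomposed input ('a' followed by U+0308) would typically have to 
be normalized first (e.g. to NFC, via ICU) before such a mapping applies, 
which is where canonicalization and transliteration complement each other.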

> > That's why, from my POV, there's no doubt that we do need a new API for
> > conversion between different character encodings.
> 
> Besides the transliteration issue that I do not understand,
> I still think the API is not broken and works fine for the majority of use 
> cases.
> If a new API is needed for special cases, that's fine by me too.

Well, we need to stay compatible, so the old API has to be kept around 
anyway, but I seriously think we should come up with a new API and deprecate 
those UTF-8 helper functions.

cheers,
        Oliver
