[haiku-development] Re: Font Encoding

  • From: Michael Pfeiffer <michael.w.pfeiffer@xxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Thu, 17 Feb 2011 07:33:15 +0100

On 16.02.2011 at 13:32, Oliver Tappe wrote:

> 
> On 2011-02-16 at 10:56:08 [+0100], Ingo Weinhold <ingo_weinhold@xxxxxx> 
> wrote:
>> On 2011-02-16 at 10:30:30 [+0100], pulkomandy
>> <pulkomandy@xxxxxxxxxxxxxxxxx> wrote:
>>> On Wed, 16 Feb 2011 10:13:29 +0100, Ingo Weinhold <ingo_weinhold@xxxxxx>
>>> wrote:
>>>> On 2011-02-16 at 09:51:02 [+0100], pulkomandy
>>>> <pulkomandy@xxxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>> If you can implement it in a way that works in all cases, we'll
>>>>> accept the patch. But I know I can't do it without a context token
>>>>> for UTF-7 or UTF-16.
>>>> 
>>>> The convert_{from,to}_utf8() functions do have an "int32* state"
>>>> parameter. I'm not familiar with UTF-7, but for UTF-16 that
>>>> definitely suffices to store the first surrogate of a pair. Am I
>>>> missing something else?
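
For illustration, this is the calling pattern that the "int32* state"
parameter makes possible: a UTF-16 input cut right between the two halves
of a surrogate pair, converted in two calls that share one state variable.
This is an untested sketch; it assumes B_UNICODE_CONVERSION selects
big-endian UTF-16, and whether our implementation actually carries the
surrogate across the calls is exactly the question here.

#include <stdio.h>
#include <UTF8.h>

int main()
{
	// U+1D11E is the surrogate pair D834 DD1E in UTF-16BE;
	// the two halves arrive in separate buffers:
	const char part1[] = "\xD8\x34";
	const char part2[] = "\xDD\x1E";

	char dest[16];
	int32 state = 0;

	int32 srcLen = 2;
	int32 destLen = sizeof(dest);
	convert_to_utf8(B_UNICODE_CONVERSION, part1, &srcLen, dest,
		&destLen, &state);
	// Nothing is complete yet: the high surrogate has to be kept in
	// 'state', and destLen should come back as 0.

	srcLen = 2;
	destLen = sizeof(dest);
	convert_to_utf8(B_UNICODE_CONVERSION, part2, &srcLen, dest,
		&destLen, &state);
	// Now destLen should be 4: the UTF-8 bytes F0 9D 84 9E.
	printf("%ld bytes of UTF-8\n", (long)destLen);
	return 0;
}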
>>> 
>>> iconv (and ICU) need to open and then close the context. With this
>>> API, we don't know when to close the context, which has two problems:
>> 
>> But that is, as Michael said, just a problem of our implementation. No one
>> forces us to use iconv or ICU to convert between UTF-* and UTF-8. We need
>> only 21 bits to represent a Unicode code point and have 32 state bits
>> available. So there should be sufficient space for the algorithm to cache
>> the not-yet-processed bits of the current/next character, which I believe
>> is all that's needed to convert between different Unicode encodings.
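
To make that concrete: a pending UTF-16 high surrogate is a 16-bit value,
so it fits in the 32 state bits with room to spare. A converter could
cache it roughly like this (a simplified sketch with made-up names that
ignores unpaired surrogates and output bounds checking):

#include <stdint.h>

// Append the UTF-8 encoding of code point c to out, return byte count.
static int encode_utf8(uint32_t c, char* out)
{
	if (c < 0x80) {
		out[0] = (char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (char)(0xC0 | (c >> 6));
		out[1] = (char)(0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (char)(0xE0 | (c >> 12));
		out[1] = (char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (char)(0x80 | (c & 0x3F));
		return 3;
	}
	out[0] = (char)(0xF0 | (c >> 18));
	out[1] = (char)(0x80 | ((c >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((c >> 6) & 0x3F));
	out[3] = (char)(0x80 | (c & 0x3F));
	return 4;
}

// state == 0 means nothing is pending; otherwise it holds a high
// surrogate (0xD800..0xDBFF) left over from the previous chunk.
static void utf16_chunk_to_utf8(const uint16_t* src, int srcLen,
	char* dst, int* dstLen, int32_t* state)
{
	int out = 0;
	for (int i = 0; i < srcLen; i++) {
		uint16_t unit = src[i];
		if (*state != 0) {
			// Combine the cached high surrogate with this low one.
			uint32_t c = 0x10000
				+ (((uint32_t)*state - 0xD800) << 10) + (unit - 0xDC00);
			out += encode_utf8(c, dst + out);
			*state = 0;
		} else if (unit >= 0xD800 && unit <= 0xDBFF) {
			*state = unit;	// cache it until the next chunk arrives
		} else {
			out += encode_utf8(unit, dst + out);
		}
	}
	*dstLen = out;
}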
> 
> ICU doesn't need to open converters for algorithmic conversions (the ones
> between all the different Unicode representations and ASCII).
> 
> But trying anything in order to keep our API really doesn't make sense, as 
> it *is* borked:
> 
> status_t convert_to_utf8(uint32 sourceEncoding, const char* source,
>       int32* sourceLength, char* dest, int32* destLength, int32* state,
>       char substitute = B_SUBSTITUTE);
> 
> 'substitute' is a char, so one is rather limited in which substitute
> characters can be used (and incidentally, which encoding is this in?).
> 
> Substitution can't be deactivated, i.e. you have no chance to learn that 
> something could not be converted.
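
To spell the complaint out with a concrete (made-up) example: converting
"100 €" to ISO-8859-1, which has no euro sign, the best one can pass is a
single-byte stand-in, and the call reports success either way:

#include <UTF8.h>

const char* src = "100 \xE2\x82\xAC";	// "100 €" in UTF-8
int32 srcLen = 7;
char dest[16];
int32 destLen = sizeof(dest);
int32 state = 0;

convert_from_utf8(B_ISO1_CONVERSION, src, &srcLen, dest, &destLen,
	&state, '?');
// dest should now read "100 ?"; nothing tells the caller that a
// substitution happened, and a multi-byte replacement such as U+FFFD
// cannot be expressed in the single 'char substitute' parameter.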

Agreed. That IS a restriction of that API. However, is this a major issue?
How often do you need to know that? At least in real use cases this was
not an issue (again BePDF, PDF Writer, StyledEdit).

> The BeBook only mentions B_ERROR as negative status, so it's impossible to
> tell whether the input or the output buffer was too small or any other
> error occurred.

The function does NOT return B_ERROR if one of the buffers was too small;
instead it reports in sourceLength and destLength how many bytes have
been read and written.
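
That is what makes the streaming pattern work: call the function in a
loop with a small output buffer and advance by whatever the two length
parameters report back. A rough sketch (names made up):

#include <SupportDefs.h>
#include <UTF8.h>

// Convert an arbitrarily long UTF-8 text to ISO-8859-1 through a small
// fixed buffer, resuming where the previous call stopped.
void
ConvertAll(const char* src, int32 srcLen)
{
	int32 state = 0;
	while (srcLen > 0) {
		char chunk[64];
		int32 inLen = srcLen;            // in: available, out: consumed
		int32 outLen = sizeof(chunk);    // in: capacity, out: produced
		if (convert_from_utf8(B_ISO1_CONVERSION, src, &inLen, chunk,
				&outLen, &state) != B_OK || inLen == 0)
			break;
		// ... use chunk[0 .. outLen - 1] here ...
		src += inLen;
		srcLen -= inLen;
	}
}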

> Apart from opening and closing the converters, there isn't any API that 
> allows iterating over the supported encodings (for instance in order to get 
> a list of encoding names that can be presented in a menu).

Agreed. An API is needed for that, but its absence does not make the
current API broken.

> Additionally, there's no support for activating/deactivating
> transliteration, for applying Unicode canonicalization, etc.

I don't know what this is. Any pointers on how this is related to
encoding into and decoding from UTF-8?

> That's why, from my POV, there's no doubt that we do need a new API for
> conversion between different character encodings.

Besides the transliteration issue that I do not understand,
I still think the API is not broken and works fine for the majority of
use cases. If a new API is needed for special cases, that's fine by me too.

Bye,
Michael


