On 16.02.2011 at 13:32, Oliver Tappe wrote:

> On 2011-02-16 at 10:56:08 [+0100], Ingo Weinhold <ingo_weinhold@xxxxxx>
> wrote:
>> On 2011-02-16 at 10:30:30 [+0100], pulkomandy
>> <pulkomandy@xxxxxxxxxxxxxxxxx> wrote:
>>> On Wed, 16 Feb 2011 10:13:29 +0100, Ingo Weinhold <ingo_weinhold@xxxxxx>
>>> wrote:
>>>> On 2011-02-16 at 09:51:02 [+0100], pulkomandy
>>>> <pulkomandy@xxxxxxxxxxxxxxxxx> wrote:
>>>>> If you can implement it in a way that works in all cases, we'll
>>>>> accept the patch. But I know I can't do it without a context token
>>>>> for UTF-7 or UTF-16.
>>>>
>>>> The convert_{from,to}_utf8() functions do have an "int32* state"
>>>> parameter. I'm not familiar with UTF-7, but for UTF-16 that definitely
>>>> suffices to store the first surrogate of a pair. Am I missing
>>>> something else?
>>>
>>> iconv (and ICU) need to open and then close the context. With this API,
>>> we don't know when to close the context, which has two problems:
>>
>> But that is, as Michael said, just a problem of our implementation. No
>> one forces us to use iconv or ICU to convert between UTF-* and UTF-8. We
>> need only 21 bits to represent a Unicode code point and have 32 state
>> bits available. So there should be sufficient space for the algorithm to
>> cache the not-yet-processed bits of the current/next character, which I
>> believe is all that's needed to convert between different Unicode
>> encodings.
>
> ICU doesn't need to open converters for algorithmic conversions (the ones
> between all the different Unicode representations and ASCII).
>
> But trying anything in order to keep our API really doesn't make sense,
> as it *is* borked:
>
> status_t convert_to_utf8(uint32 sourceEncoding, const char* source,
>     int32* sourceLength, char* dest, int32* destLength, int32* state,
>     char substitute = B_SUBSTITUTE);
>
> 'substitute' is a char, so one is kind of limited in which substitute
> characters can be used (and incidentally, which encoding is this in?).
>
> Substitution can't be deactivated, i.e. you have no chance to learn that
> something could not be converted.

Agreed. That IS a restriction of that API. However, is this a major issue?
How often do you need to know that? At least in real use cases this was not
an issue (again: BePDF, PDF Writer, StyledEdit).

> The BeBook only mentions B_ERROR as negative status, so it's impossible
> to tell whether the input or the output buffer was too small or any other
> error occurred.

The function does NOT return B_ERROR if one of the buffers was too small;
instead it tells you in sourceLength and destLength how many characters
have been read and written.

> Apart from opening and closing the converters, there isn't any API that
> allows iterating over the supported encodings (for instance in order to
> get a list of encoding names that can be presented in a menu).

Agreed. An API is needed for that, but it does not make the current API
broken.

> Additionally, there's no support for activating/deactivating
> transliteration, for applying Unicode canonicalization, etc.

I don't know what this is. Any pointers on how this is related to encoding
into UTF-8 and decoding from UTF-8?

> That's why, from my POV, there's no doubt that we do need a new API for
> conversion between different character encodings.

Besides the transliteration issue that I do not understand, I still think
the API is not broken and works fine for the majority of use cases. If a
new API is needed for special cases, that's fine by me too.

Bye,
Michael
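
To make Ingo's point about the 32 state bits concrete, here is a minimal
sketch, not Haiku's actual implementation, of chunked UTF-16 to UTF-8
conversion where a single int32 of state is enough: the only information
that can span a chunk boundary is a pending high surrogate. The helper
names AppendUtf8 and Utf16ChunkToUtf8 are made up for this example:

#include <stdint.h>
#include <string>

// Appends the UTF-8 encoding of 'codePoint' to 'out'.
static void AppendUtf8(uint32_t codePoint, std::string& out)
{
    if (codePoint < 0x80) {
        out += (char)codePoint;
    } else if (codePoint < 0x800) {
        out += (char)(0xc0 | (codePoint >> 6));
        out += (char)(0x80 | (codePoint & 0x3f));
    } else if (codePoint < 0x10000) {
        out += (char)(0xe0 | (codePoint >> 12));
        out += (char)(0x80 | ((codePoint >> 6) & 0x3f));
        out += (char)(0x80 | (codePoint & 0x3f));
    } else {
        out += (char)(0xf0 | (codePoint >> 18));
        out += (char)(0x80 | ((codePoint >> 12) & 0x3f));
        out += (char)(0x80 | ((codePoint >> 6) & 0x3f));
        out += (char)(0x80 | (codePoint & 0x3f));
    }
}

// Converts one chunk of UTF-16 code units. '*state' must be 0 before the
// first call; between calls it carries a pending high surrogate, if any.
// (A complete converter would also flush a trailing surrogate at the very
// end of the input.)
void Utf16ChunkToUtf8(const uint16_t* source, int32_t sourceLength,
    std::string& dest, int32_t* state)
{
    for (int32_t i = 0; i < sourceLength; i++) {
        uint16_t unit = source[i];
        if (*state != 0) {
            // A high surrogate is pending from an earlier code unit,
            // possibly from a previous chunk.
            uint32_t high = (uint32_t)*state & 0xffff;
            *state = 0;
            if (unit >= 0xdc00 && unit < 0xe000) {
                // Valid low surrogate: combine the pair into a code point.
                uint32_t codePoint = 0x10000
                    + ((high - 0xd800) << 10) + (unit - 0xdc00);
                AppendUtf8(codePoint, dest);
                continue;
            }
            // Unpaired high surrogate: substitute, then reprocess 'unit'.
            AppendUtf8(0xfffd, dest);
        }
        if (unit >= 0xd800 && unit < 0xdc00) {
            // High surrogate: park it in the state until the next unit.
            *state = unit;
        } else if (unit >= 0xdc00 && unit < 0xe000) {
            AppendUtf8(0xfffd, dest);   // stray low surrogate
        } else {
            AppendUtf8(unit, dest);
        }
    }
}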
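
And a hedged usage sketch of the existing call, assuming only the
declaration quoted above: since the function reports through sourceLength
and destLength how much input it consumed and how much output it produced,
a caller can push arbitrarily long input through a small fixed buffer in a
loop. The wrapper name ConvertAllToUtf8 is hypothetical:

#include <String.h>
#include <UTF8.h>

status_t ConvertAllToUtf8(uint32 sourceEncoding, const char* source,
    int32 sourceLength, BString& result)
{
    int32 state = 0;
    char buffer[256];

    while (sourceLength > 0) {
        int32 srcLen = sourceLength;     // in: bytes available; out: consumed
        int32 destLen = sizeof(buffer);  // in: capacity; out: bytes written
        status_t status = convert_to_utf8(sourceEncoding, source, &srcLen,
            buffer, &destLen, &state);
        if (status != B_OK)
            return status;
        if (srcLen == 0 && destLen == 0)
            return B_ERROR;              // no progress; avoid looping forever

        result.Append(buffer, destLen);

        // If srcLen < sourceLength, the output buffer was too small; the
        // updated lengths tell us how far we got, so we simply loop on.
        source += srcLen;
        sourceLength -= srcLen;
    }
    return B_OK;
}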