Hi Michael,

On 2011-02-17 at 07:33:15 [+0100], Michael Pfeiffer
<michael.w.pfeiffer@xxxxxxxxx> wrote:
>
> On 16.02.2011 at 13:32, Oliver Tappe wrote:
[ ... ]
> > Substitution can't be deactivated, i.e. you have no chance to learn
> > that something could not be converted.
>
> Agreed. That IS a restriction of that API. However, is this a major
> issue? How often do you need to know that? At least in real use cases
> this was no issue (again BePDF, PDF Writer, StyledEdit).

For Beam, this made me switch to iconv, as there was no way to tell whether
a conversion from the supposed charset of a mail into UTF-8 actually worked
or not. With iconv, Beam can ask the API to not substitute but to fail
instead, so the application has a chance to try different encodings until
it has found one that works.

> > The BeBook only mentions B_ERROR as negative status, so it's
> > impossible to tell whether the input or the output buffer was too
> > small or any other error occurred.
>
> The function does NOT return B_ERROR if one of the buffers was too
> small; instead it tells in sourceLength, destLength how many chars have
> been read/written.

That's precisely the problem - in a streaming environment, it's quite
possible that you have received a buffer that's so small (a couple of
bytes) that it can't be converted, because it contains only an incomplete
multibyte character sequence. According to our API documentation, you
wouldn't be able to tell what the problem was; all you'd know is that both
sourceLength and destinationLength were zero, so nothing was converted -
but why?

This could be solved, of course, by defining specific error states for
"source too small" and "destination too small". Maybe our actual
implementation does that already, I don't know.

> > Apart from opening and closing the converters, there isn't any API
> > that allows iterating over the supported encodings (for instance in
> > order to get a list of encoding names that can be presented in a
> > menu).
>
> Agreed.
> An API is needed for that, but it does not make the current API broken.

Well, it does not make the functions convert_{to,from}_utf8() broken, but
it does indicate that the character encoding conversion API is incomplete.

> > Additionally, there's no support for activating/deactivating
> > transliteration, for applying unicode canonicalization, etc.
>
> I don't know what this is. Any pointers how this is related to
> encoding into utf8 and decoding from utf8?

Transliteration is the process of replacing characters that aren't
available in the destination character encoding with character sequences
that are, for instance replacing "ä" with "ae" when converting from UTF-8
to US-ASCII.

Unicode canonicalization is about dealing with the problem that any Unicode
encoding allows some character strings to be represented in different ways,
e.g. an "ä" can be represented as the single character 'ä' or as an 'a'
followed by a combining diacritical mark character. Canonicalization
transforms these strings into a specific format, such that visually
identical strings have the same binary representation. This is especially
useful when comparing strings, but it has an effect on conversion, too (as
the destination charset may support 'ä' but not the combining diacritical
mark).

> > That's why, from my POV, there's no doubt that we do need a new API
> > for conversion between different character encodings.
>
> Besides the transliteration issue that I do not understand,
> I still think the API is not broken and works fine for the majority of
> use cases. If a new API is needed for special cases, that's fine by me
> too.

Well, we need to stay compatible, so the old API has to be kept around
anyway, but I seriously think we should come up with a new API and
deprecate those UTF-8 helper functions.

cheers,
	Oliver