[interfacekit] Re: MAJOR UTF8 bugs ...
- From: "Erik Jakowatz" <erik@xxxxxxxxxxxxxx>
- To: interfacekit@xxxxxxxxxxxxx
- Date: Sat, 26 Jan 2002 22:19:41 -0800
>Ok, I don't want to annoy everybody with the same question but ...
That's what this list is for. =)
>I'm currently working on the BString class. And on R5 this class is filled
>with MAJOR bugs about UTF-8 encoded strings. Nothing's wrong when using
[snip]
>Here's a simple example of what I mean :
>
>BString string1 = "Steve ";
>BString string2 = "Vallée"; // Note the "é" character
>string1.Append( string2, 6 );
>printf("%s", string1.String() );
>
>will produce ... "Steve Vallé" (without the final "e", because the "é"
>requires 2 bytes)
I'm curious about something. If you assume that all your counts are in
bytes, rather than characters (which may be more than one byte), do
these "bugs" go away? The reason I ask this is because whether these
are bugs may largely be a matter of perspective. Sure, if you expect
string1.Append( string2, 6 );
to be counting six *characters* the behaviour is buggy. However, if you
expect that six to mean six *bytes* the behaviour is correct. Does this
make sense? BString's underlying assumption is that all counts are
bytes, and if one approaches it expecting it to count characters, a lot
of functionality doesn't work as expected.
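
To make the bytes-versus-characters point concrete, here is a minimal sketch. It uses std::string as a stand-in for BString, so append_bytes, utf8_char_count and vallee are hypothetical names for illustration only, not BString API. It assumes the text is valid UTF-8 and that the count passed to Append is a byte count, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for the byte-count reading of Append(source, count):
// copy at most `bytes` bytes of `src` onto the end of `dst`.
inline void append_bytes(std::string& dst, const std::string& src,
                         std::size_t bytes) {
    dst.append(src, 0, bytes);  // std::string clamps the count to src.size()
}

// Count UTF-8 characters by skipping continuation bytes (10xxxxxx).
// Assumes `s` holds valid UTF-8.
inline std::size_t utf8_char_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    assert(n <= s.size());  // never more characters than bytes
    return n;
}

// "Vallée" written with explicit UTF-8 bytes (0xC3 0xA9 is "é"),
// so the result does not depend on the source file's encoding.
inline std::string vallee() {
    return std::string("Vall\xC3\xA9") + "e";
}
```

With these, vallee() is 7 bytes but 6 characters, so appending 6 bytes of it to "Steve " stops right after the two-byte "é" and drops the final "e", which is the output Steve reported.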
>I know I can use simple #ifdef and produce 2 codes for each function. But
>exactly because of the tremendous size of BString class, I'm not very fond
>to double-job each single method's implementation.
I can certainly understand your feeling here.
>My opinion is: because of those many bugs, I'm 100% sure nobody ever used
>this class in the context of a localized program. It just makes no sense at
>all.
I'm of two minds here. On the one hand, what you are proposing results
in changing the underlying assumption of how BString works, which is a
risky thing to do. On the other hand, who could possibly be using
BString to *split* multibyte characters? I mean, what would be the
utility in that? It does occur to me, though, that a number of programs
may be explicitly taking the current behaviour into account when
dealing with multibyte text, and changing how BString interprets counts
may really mess those apps up.
Are there other UTF-8 related bugs that don't involve counts? If there
are, let us know what they are so we can make an intelligent decision.
If not, my feeling is that the class should be implemented as-is, with
more emphasis in the docs that everything but CountChars() deals in
bytes, not UTF-8 characters, and to be careful lest those multibyte
characters get chopped. =)
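
If the class does stay byte-based, a caller who thinks in characters can convert a character count to a byte count before calling a count-taking method. The sketch below shows one way to do that; utf8_prefix_bytes is a hypothetical helper, not something BString provides, and it assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes occupied by the first `chars`
// UTF-8 characters of `s`. If `s` has fewer characters than requested,
// the whole length in bytes is returned. Assumes valid UTF-8.
inline std::size_t utf8_prefix_bytes(const std::string& s, std::size_t chars) {
    std::size_t i = 0;
    while (i < s.size() && chars > 0) {
        ++i;  // consume the lead byte of the next character
        // consume any continuation bytes (10xxxxxx) that belong to it
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --chars;
    }
    return i;
}
```

For "Vallée", asking for 6 characters yields 7 bytes, so passing that result as the count keeps the whole name and never lands in the middle of a multibyte character.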
Thoughts, anyone?
e
Data is not information, and information is not knowledge: knowledge is
not understanding, and understanding is not wisdom.
- Philip Adams
- Follow-Ups:
- [interfacekit] Re: MAJOR UTF8 bugs ...
- From: Steve Vallée
- References:
- [interfacekit] MAJOR UTF8 bugs ...
- From: Steve Vallée