[interfacekit] Re: MAJOR UTF8 bugs ...

>Ok, I don't want to annoy everybody with the same question but ...

That's what this list is for. =)

>I'm currently working on the BString class.  And on R5 this class is filled
>with MAJOR bugs about UTF-8 encoded strings. Nothing's wrong when using

[snip]

>Here's a simple example of what I mean :
>
>BString  string1 = "Steve ";
>BString  string2 = "Vallée";   // Note the "é" character
>string1.Append( string2, 6 );
>printf("%s", string1.String() );
>
>will produce ...  "Steve Vallé"    (without the final "e", because the "é"
>requires 2 bytes)

I'm curious about something.  If you assume that all your counts are in 
bytes, rather than characters (which may be more than one byte), do 
these "bugs" go away?  The reason I ask this is because whether these 
are bugs may largely be a matter of perspective.  Sure, if you expect

string1.Append( string2, 6 );

to be counting six *characters* the behaviour is buggy.  However, if you 
expect that six to mean six *bytes* the behaviour is correct.  Does this 
make sense?  BString's underlying assumption is that all counts are 
bytes, and if one approaches it expecting it to count characters, a lot 
of functionality doesn't work as expected.
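
To make the byte-versus-character reading concrete, here is a minimal sketch that uses std::string instead of BString, so it builds on any platform. The names AppendBytes and CountUtf8Chars are made up for illustration and are not part of the BString API. The sketch assumes the text is valid UTF-8, where "é" is the two bytes 0xC3 0xA9.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 characters by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t CountUtf8Chars(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Appends at most `byteCount` bytes of `src` to a copy of `dst`.
// Hypothetical stand-in for a byte-counting Append(); not the BString API.
std::string AppendBytes(std::string dst, const std::string& src, std::size_t byteCount) {
    dst.append(src, 0, byteCount);
    return dst;
}

// Reproduces the "Steve Vallée" example with explicit UTF-8 bytes.
void CheckByteCountedAppend() {
    // "é" is the two bytes 0xC3 0xA9; the literal is split so the
    // following "e" is not read as part of the hex escape.
    const std::string name = "Vall\xC3\xA9" "e";
    assert(name.size() == 7);            // seven bytes...
    assert(CountUtf8Chars(name) == 6);   // ...but six characters

    const std::string result = AppendBytes("Steve ", name, 6);
    assert(result == "Steve Vall\xC3\xA9");  // the final "e" is cut off
}
```

Read as a byte count, the six covers "Vall" plus the two bytes of "é", which is exactly the output reported above.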

>I know I can use simple #ifdef and produce 2 codes for each function. But
>exactly because of the tremendous size of the BString class, I'm not very fond
>of doing the job twice for every single method implementation.

I can certainly understand your feeling here.

>My opinion is: because of those many bugs, I'm 100% sure nobody ever used
>this class in the context of a localized program. It just makes no sense at
>all.

I'm of two minds here.  On the one hand, what you are proposing results 
in changing the underlying assumption of how BString works, which is a 
risky thing to do.  On the other hand, who could possibly be using 
BString to *split* multibyte characters?  I mean, what would be the 
utility in that?  It does occur to me, though, that a number of programs 
may be explicitly taking the current behaviour into account when 
dealing with multibyte text, and changing how BString interprets counts 
may really mess those apps up.

Are there other UTF-8 related bugs that don't involve counts?  If there 
are, let us know what they are so we can make an intelligent decision.  
If not, my feeling is that the class should be implemented as-is, with 
more emphasis in the docs that everything but CountChars() deals in 
bytes, not UTF-8 characters, and to be careful lest those multibyte 
characters get chopped. =)
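
If the class does stay byte-based, a caller who thinks in characters has to convert the count first. The following is only a sketch of that idea, again with std::string standing in for BString. BytesForChars is a hypothetical helper, not an existing API, and it assumes valid UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many bytes the first `charCount`
// UTF-8 characters of `s` occupy. Assumes `s` is valid UTF-8.
std::size_t BytesForChars(const std::string& s, std::size_t charCount) {
    std::size_t i = 0;
    while (i < s.size() && charCount > 0) {
        ++i;  // step over the lead byte of this character
        // step over any continuation bytes (bit pattern 10xxxxxx)
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --charCount;
    }
    return i;
}

// Appends six *characters* by first converting the count to bytes.
void CheckCharCountedAppend() {
    const std::string name = "Vall\xC3\xA9" "e";  // "Vallée", 7 bytes
    assert(BytesForChars(name, 5) == 6);  // "Vallé" ends on a whole character
    assert(BytesForChars(name, 6) == 7);  // all six characters

    std::string result = "Steve ";
    result.append(name, 0, BytesForChars(name, 6));
    assert(result == "Steve Vall\xC3\xA9" "e");  // nothing chopped
}
```

The conversion walks the string once, so it costs time proportional to the number of bytes scanned, which is the price of keeping the byte-based methods unchanged.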

Thoughts, anyone?

e

Data is not information, and information is not knowledge: knowledge is 
not understanding, and understanding is not wisdom.
        - Philip Adams

