[haiku-appserver] Re: moreUTF8.h

  • From: "Axel Dörfler" <axeld@xxxxxxxxxxxxxxxx>
  • To: haiku-appserver@xxxxxxxxxxxxx
  • Date: Wed, 15 Jun 2005 20:58:17 +0200 CEST

"Stephan Assmus" <superstippi@xxxxxx> wrote:
> static inline bool
> IsInsideGlyph(uchar ch)
> {
>       return (ch & 0xC0) == 0x80;
> }
> 
> This code returns true for the following pattern, right?
> 
> 10?? ????

Exactly. Note that this only holds for the subsequent bytes of a glyph, 
not for the first one.

> This code...
> 
>       const char *ptr = text;
> 
>       do {
>               ptr++;
>       } while (IsInsideGlyph(*ptr));
>                               
>       return ptr - text;
> 
> ...increments the ptr once, then tests for IsInsideGlyph. Which will 
> return true in case only the first high bit is set. So how does this 
> work for three byte glyphs?
> 
> A three byte glyph looks like this (correct me if I'm wrong):
> 
> 1110 ????
> 110? ????
> 10?? ????

That's not correct: for bytes inside the glyph, the 10 prefix is always 
set, and only the other 6 bits carry character data. The number of 
leading 1 bits in the first byte determines the length of the character, 
so a three byte glyph is 1110xxxx followed by two 10xxxxxx bytes.
So the code looks okay, AFAICT.

> So when IsInsideGlyph tests the second byte, it would return false, 
> no? 
> Which means moreUTF8.h only works for 2 byte glyphs. Can someone 
> confirm? If my observation is correct, I'm going to fix the problem 
> with count_utf8_bytes() that I introduced in my last commit. If there 
> is a better way, speak up! :-)

Unless I am wrong, there is no need to do this :-)

Bye,
   Axel.
