"Stephan Assmus" <superstippi@xxxxxx> wrote: > static inline bool > IsInsideGlyph(uchar ch) > { > return (ch & 0xC0) == 0x80; > } > > This code returns true for the following pattern, right? > > 10?? ???? Exactly. Note, that this is only correct for the subsequent characters, not the first one. > This code... > > const char *ptr = text; > > do { > ptr++; > } while (IsInsideGlyph(*ptr)); > > return ptr - text; > > ...increments the ptr once, then tests for IsInsideGlyph. Which will > return true in case only the first high bit is set. So how does this > work for three byte glyphs? > > A three byte glyph looks like this (correct me if I'm wrong): > > 1110 ???? > 110? ???? > 10?? ???? That's not correct, for bytes inside the glyph, 10 is set always, only the other 6 bits are used for character data. The first 3 bits of the first byte determines the length of the character. So the code looks okay, AFAICT. > So when IsInsideGlyph tests the second byte, it would return false, > no? > Which means moreUTF8.h only works for 2 byte glyphs. Can someone > confirm? If my observation is correct, I'm going to fix the problem > with count_utf8_bytes() that I introduced in my last commit. If there > is a better way, speak up! :-) Unless I am wrong, there is no need to do this :-) Bye, Axel.