Hi, I just resubscribed to this list just to send out this information, so please read on. All software developers suffer from varying degrees of the Acute Premature Optimization Syndrome (APOS). Some people even say: "premature optimization is the root of all evil". Some days ago, BGA suggested to use this function inline uint32 utf8_char_len(uchar c) { return (((0xE5000000 >> (((c) >> 3) & 0x1E)) & 3) + 1); } Well, seems like it is able to compute the length (byte count) of one utf character if you feed it with the first byte. Nice thing, but should it really be used? I don't even understand why it works. BString::CountChars() used a loope like this one: for (; (*ptr & 0xc0) == 0x80; ptr++); to skip bytes that have bit 7 set, but bit 6 not set. These are the n+x bytes (except n) of the UTF-8 character that should not be counted if you want to get the number of characters in a string. BString is one of the classes that needs to be fast, and may need to be optimized later. However it already got optimized and now uses utf8_char_len(). I have to admit that the code looks a little bit nicer now. But was it really useful to optimize it? And since it already got optimized, would you consider to optimize it again? And who does all the testing? Was the time spend on optimizing it worth the result? Well, I didn't believe that the new code would be better than the old code, and wrote a small test. See yourself. The tests are run a couple of times on a string consisting mostly of UTF-8 characters, and also on a string containing no UTF-8 characters. Of cause, test result is dependent on hardware, and this is a result for x86, results on PPC might be different (if you have one, please try it and post the result) Functions are derived from the BString.cpp file in CVS. On my system, the old code is 30% to 60% faster. /* $ gcc -O0 utf8test.cpp $ utf8test CountCharsOld length UTF-8 = 636 CountCharsOld length ASCII = 630 CountCharsNew length UTF-8 = 636 CountCharsNew length ASCII = 630 50000 times CountCharsOld UTF-8 needed 338634 usec 50000 times CountCharsNew UTF-8 needed 1333298 usec 50000 times CountCharsOld ASCII needed 192420 usec 50000 times CountCharsNew ASCII needed 1317663 usec 50000 times CountCharsOld UTF-8 needed 334040 usec 50000 times CountCharsNew UTF-8 needed 1339184 usec 50000 times CountCharsOld ASCII needed 188263 usec 50000 times CountCharsNew ASCII needed 1320069 usec $ gcc -O3 utf8test.cpp $ utf8test CountCharsOld length UTF-8 = 636 CountCharsOld length ASCII = 630 CountCharsNew length UTF-8 = 636 CountCharsNew length ASCII = 630 50000 times CountCharsOld UTF-8 needed 220002 usec 50000 times CountCharsNew UTF-8 needed 327496 usec 50000 times CountCharsOld ASCII needed 109744 usec 50000 times CountCharsNew ASCII needed 324075 usec 50000 times CountCharsOld UTF-8 needed 229708 usec 50000 times CountCharsNew UTF-8 needed 332800 usec 50000 times CountCharsOld ASCII needed 111203 usec 50000 times CountCharsNew ASCII needed 327301 usec */ #include <OS.h> #include <stdio.h> int32 CountCharsNew(const char *ptr); int32 CountCharsOld(const char *ptr); const char *teststringUTF8 = "äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ" " äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ" " äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ" " äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ" " äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ" " äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ" " äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ äöüÄÖÜ"; const char *teststringASCII = "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. " "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. " "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. " "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. " "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. " "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. " "This is a not a UTF-8 String. This is a not a UTF-8 String. This is a not a UTF-8 String. "; int main() { int i; bigtime_t start; bigtime_t end; printf("CountCharsOld length UTF-8 = %ld\n", CountCharsOld(teststringUTF8)); printf("CountCharsNew length UTF-8 = %ld\n", CountCharsNew(teststringUTF8)); printf("CountCharsOld length ASCII = %ld\n", CountCharsOld(teststringASCII)); printf("CountCharsNew length ASCII = %ld\n", CountCharsNew(teststringASCII)); // first run both test a couple of times to initialize the cache for (i = 0; i < 50000; i++) { CountCharsOld(teststringUTF8); CountCharsNew(teststringUTF8); CountCharsOld(teststringASCII); CountCharsNew(teststringASCII); } // run the old test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsOld(teststringUTF8); end = system_time(); printf("50000 times CountCharsOld UTF-8 needed %9Ld usec\n", end - start); // run the new test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsNew(teststringUTF8); end = system_time(); printf("50000 times CountCharsNew UTF-8 needed %9Ld usec\n", end - start); // run the old test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsOld(teststringASCII); end = system_time(); printf("50000 times CountCharsOld ASCII needed %9Ld usec\n", end - start); // run the new test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsNew(teststringASCII); end = system_time(); printf("50000 times CountCharsNew ASCII needed %9Ld usec\n", end - start); // run the old test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsOld(teststringUTF8); end = system_time(); printf("50000 times CountCharsOld UTF-8 needed %9Ld usec\n", end - start); // run the new test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsNew(teststringUTF8); end = system_time(); printf("50000 times CountCharsNew UTF-8 needed %9Ld usec\n", end - start); // run the old test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsOld(teststringASCII); end = system_time(); printf("50000 times CountCharsOld ASCII needed %9Ld usec\n", end - start); // run the new test 50000 times start = system_time(); for (i = 0; i < 50000; i++) CountCharsNew(teststringASCII); end = system_time(); printf("50000 times CountCharsNew ASCII needed %9Ld usec\n", end - start); return 0; } inline uint32 utf8_char_len(uchar c) { return (((0xE5000000 >> (((c) >> 3) & 0x1E)) & 3) + 1); } int32 CountCharsNew(const char *ptr) { int32 count = 0; while (*ptr) { // Jump to next UTF8 character // ejaesler: BGA's nifty function ptr += utf8_char_len(*ptr); count++; } return count; } int32 CountCharsOld(const char *ptr) { int32 count = 0; while (*ptr++) { count++; // Jump to next UTF8 character for (; (*ptr & 0xc0) == 0x80; ptr++); } return count; } /* I really don't think that premature optimization is useful, and even if you test it and it's better, * the result on a different hardware may be worse. */