[gmpi] string encoding in the API (UTFs)

  • From: thockin@xxxxxxxxxx
  • To: gmpi@xxxxxxxxxxxxx
  • Date: Wed, 14 Dec 2005 00:10:03 -0800

First I'm going to make some comments, and then free-form discuss at the
end. :)


On Wed, Dec 14, 2005 at 11:34:25AM +1300, Jeff McClintock wrote:
> >Can we simply say that GMPI keeps all strings in UTF-8 and that conversion
> >is the responsibility of the host?
> 
> It's not that simple. You can't treat UTF-8 like ascii, it's a 
> multi-byte char set. strlen() etc will fail.

No, strlen will succeed - it will return the number of bytes in a string,
as expected.
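
To make the bytes-versus-characters point concrete, here is a minimal
sketch in C.  The helper name utf8_codepoints() is made up for this
example, and it assumes the input is already valid UTF-8 (it does no
validation).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count Unicode code points in a NUL-terminated
 * UTF-8 string.  Assumes valid UTF-8.  Continuation bytes have the bit
 * pattern 10xxxxxx; every other byte starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na\xC3\xAFve" (the word "naive" with a diaeresis on the
i), strlen() returns 6, the byte count, and utf8_codepoints() returns 5.
Both answers are right; they just answer different questions.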

>   Multi-byte string handling can be 10 to 100 times slower than the 
> fixed-size alternatives.

Baloney - back that up with numbers?  Further, Windows has to use
variable-length encoding too.

>   Even on Linux, It's better to use wchar_t internally, and use C 
> library functions to convert to external character encodings.  That 
> gives us UTF-8 support plus 'free' support for language-specific 
> extensions to ASCII like shift-JIS (Japan) or ISO 8859 (Europe).

This disagrees with my own experience and everything I can find to read.
Further, we really only want to support Unicode if possible.  If we start
supporting other character sets, then we have to start passing that info
around, too.

> I suggest we use wchar_t exclusively to provide the best cross-platform 
> support, the same plugin code on Linux and PC.  String manipulation is 
> simple and fast. No special cases.

Do you mean a 16-bit wchar_t or a 32-bit wchar_t?  On Linux I can find
implementations that use either, but 32 bits is the only size that holds
every Unicode code point in a single unit.  ISO C does not define the size
of wchar_t, but states it shall be as wide as necessary to hold the
largest character in the code sets of the locales that an implementation
supports.
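
Since the size is implementation-defined, a plugin could check it rather
than assume it.  A rough sketch follows; the function names are invented
for illustration, and the only library facts relied on are sizeof,
CHAR_BIT and WCHAR_MAX.

```c
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t, in bits, on the platform this is compiled for. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Non-zero if one wchar_t can hold any Unicode code point (the
 * highest is U+10FFFF), so no multi-unit sequences are needed. */
int wchar_holds_any_codepoint(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

On a platform with a 16-bit wchar_t the second function returns 0, which
means "wchar_t everywhere" does not by itself give fixed-width strings.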

Let's do a quick summary of what I understand.  Any experts, please
correct me if I am wrong.

Unicode is the de facto character set to support.  There are several
encodings of that character set, each with pros and cons.  Unicode
actually encodes "code points", so let's use that term.

UTF-8 is a variable-length encoding.  A single code point takes at least
8 bits and at most 32 bits.

UTF-16 is a variable-length encoding.  A single code point takes either
16 bits or 32 bits.

UTF-32 (or UCS-4) is a fixed-length encoding.  Every code point takes 32
bits.
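
The size rules above can be written down directly.  This is only a
sketch: the function names are mine, and the ranges are the standard
ones for code points up to U+10FFFF, with the surrogate range
U+D800..U+DFFF treated as not encodable.

```c
#include <stdint.h>

/* Non-zero if cp is a code point that can be encoded at all. */
static int is_scalar(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Bytes needed for one code point in UTF-8 (0 if not encodable). */
int utf8_bytes(uint32_t cp)
{
    if (!is_scalar(cp)) return 0;
    if (cp < 0x80)      return 1;   /* ASCII range */
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    return 4;
}

/* Bytes needed for one code point in UTF-16 (0 if not encodable). */
int utf16_bytes(uint32_t cp)
{
    if (!is_scalar(cp)) return 0;
    return cp < 0x10000 ? 2 : 4;    /* one unit, or a surrogate pair */
}

/* Bytes needed for one code point in UTF-32 (0 if not encodable). */
int utf32_bytes(uint32_t cp)
{
    return is_scalar(cp) ? 4 : 0;
}
```

For example, U+0041 ('A') needs 1, 2 and 4 bytes respectively; U+20AC
(the euro sign) needs 3, 2 and 4; and U+1D11E (musical symbol G clef)
needs 4 in all three encodings.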

As best I can tell, Windows and Mac both use UTF-16.  Before Unicode grew
past 64k code points, 16 bits was enough to cover everything in a
fixed-width encoding.  Now it has grown, and code points beyond that range
take two 16-bit units.  UTF-16 is not ASCII compatible and has byte-order
issues, but it covers the most common characters in 16 bits.

Linux and most UNIXes, by and large, use UTF-8.  UTF-8 is ASCII compatible
and has no byte-order issues, but requires 16 or 24 bits to cover the
most common characters in the world.
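
The byte-order and ASCII-compatibility points are easy to see with a
single character.  The sketch below serializes one 16-bit UTF-16 code
unit in either byte order; utf16_put() is a hypothetical helper, not
anything from an existing API.

```c
#include <stdint.h>

/* Hypothetical helper: write one UTF-16 code unit into out[] as two
 * bytes, big-endian if big_endian is non-zero, else little-endian. */
void utf16_put(uint16_t unit, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xFF);

    if (big_endian) {
        out[0] = hi;
        out[1] = lo;
    } else {
        out[0] = lo;
        out[1] = hi;
    }
}
```

For 'A' (U+0041) this gives the bytes 00 41 in big-endian order and
41 00 in little-endian order, so the reader has to know which one it is
getting.  Either way there is a zero byte in the middle of the text,
which is what breaks code that expects NUL-terminated ASCII.  In UTF-8
the same character is the single byte 41, identical to ASCII.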

It seems that nobody uses UTF-32 because it is fairly inefficient at
storage and the ability to randomly index into a string is not really that
useful.

So UTF-16 has the same main disadvantage as UTF-8: it's variable-length
encoded.  It has further issues over UTF-8: byte ordering and ASCII
incompatibility.  The only thing it has going for it is that Windows and
Mac use it.

UTF-8, on the other hand, is more efficient for English and European
languages, has no byte ordering problems, and is safe for existing ASCII
systems.  It also is the standard format for XML and the web.

Before we argue more, I should confess that most of what I know is either
fairly stale or I read it today. :)  Here are the best links I could find
on the topic.

Read up and let's compare thoughts?

ISO C Amendment 1:
        http://www.unix.org/version2/whatsnew/login_mse.html

C and C++ programming with Unicode:
        http://www.cprogramming.com/tutorial/unicode.html

UTF-8 and Unicode FAQ for Unix/Linux:
        http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

The Unicode Standard: A Technical Introduction:
        http://www.unicode.org/standard/principles.html

Forms of Unicode:
        http://icu.sourceforge.net/docs/papers/forms_of_unicode/

UTF-8, UTF-16, UTF-32 & BOM:
        http://www.unicode.org/faq/utf_bom.html

Introduction to Extended Characters (GNU C Library):
        http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html

Character Set Handling:
        http://www.linuxselfhelp.com/gnu/glibc/html_chapter/libc_6.html

