[haiku-development] BString and UTF-8
- From: Michael Bridgers <mibrid@xxxxxxx>
- To: haiku-development@xxxxxxxxxxxxx
- Date: Fri, 02 Dec 2011 06:58:10 -0500
I'm in the process of enhancing the BString class and am having trouble
understanding how to make changes to the Jamfiles to support the change.
I have read all of the Jam documentation I can find, but I still can't
figure out how to make the changes I need. Is there someone who can help me?
I have most of these changes working in a copy of BString, and I'm
trying to build Haiku with that copy so I can verify that the
changes are backward compatible.
Some of the things that I need the Jamfiles to do:
- Add additional include directories to the BString compile
- Add additional dependent libraries to the libbe.so link step
I am looking at several possible approaches to providing the
functionality in the BString class:
- Have the BString class use ICU directly
- Have the BString class use ICU through the LocaleBackend class
- Have the BString class use a combination of the above
The changes to the BString class fall into several categories:
- Making sure the BString class always holds valid UTF-8 strings
(Allowing invalid UTF-8 strings is a security risk, and it also makes
operations on existing strings difficult or impossible.)
- Making locale-sensitive methods respect the locale (such as case
conversion)
- Making the "Chars" methods work with all normalization forms of UTF-8
strings (Currently, the "Chars" methods operate on "code points". A
Unicode "character" can be one OR MORE code points.)
- Adding both "Chars" and "CodePoint" methods, as appropriate, so the
BString class has full functionality when used with the Locale Kit classes.
- Adding Unicode-character-aware regular expression support for "Find",
"Replace", and "Remove" functions (This would allow things such as:
"find a word that starts with a case insensitive 'c', has as the third
character either a lower case o-umlaut or an upper case 'M', and the
word length is between 5 and 7 characters".)
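
To illustrate the "code point" versus "character" distinction above, here
is a minimal sketch in plain C++. It does not use BString or ICU, and the
names CountCodePoints, kPrecomposed, and kDecomposed are hypothetical,
chosen only for this example. It counts code points in a string that is
assumed to be valid UTF-8 already:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a string ASSUMED to be valid UTF-8.
// Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            count++;
    }
    return count;
}

// Two encodings of the same user-perceived character, e with acute accent:
// precomposed U+00E9 (NFC form), 2 bytes, 1 code point
const std::string kPrecomposed = "\xC3\xA9";
// 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form),
// 3 bytes, 2 code points
const std::string kDecomposed = "e\xCC\x81";
```

CountCodePoints(kPrecomposed) is 1 and CountCodePoints(kDecomposed) is 2,
although both strings display as a single character. A method that counts
characters rather than code points would need grapheme cluster boundaries
(for example from ICU's BreakIterator) and not just lead bytes.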
I want to make it very clear that these changes will be backward
compatible. By definition, some BString behavior will change. But any
current operation that leaves the string as valid UTF-8 will behave
exactly as before. The behavior will differ only where an operation
would have created an invalid UTF-8 string. In addition, I will add
enhanced operations to the class.
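
As an example of an operation that would create an invalid UTF-8 string,
consider cutting a string at a byte offset that falls inside a multi-byte
sequence. The following is a minimal sketch of a well-formedness check in
plain C++. The function name IsValidUTF8 is hypothetical, and this is not
the BString or ICU implementation, only an illustration of the rules
(no overlong forms, no surrogates, nothing above U+10FFFF):

```cpp
#include <cstddef>
#include <string>

// Return true if the bytes form well-formed UTF-8.
bool IsValidUTF8(const std::string& bytes)
{
    const std::size_t length = bytes.size();
    std::size_t i = 0;
    while (i < length) {
        unsigned char lead = bytes[i];
        std::size_t extra;        // continuation bytes expected
        unsigned long codePoint;  // decoded value so far
        unsigned long minimum;    // smallest value allowed for this length
        if (lead < 0x80) {
            i++;                  // plain ASCII
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; codePoint = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; codePoint = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; codePoint = lead & 0x07; minimum = 0x10000;
        } else {
            return false;         // stray continuation or invalid lead byte
        }
        if (length - i <= extra)
            return false;         // sequence is cut off at end of string
        for (std::size_t k = 1; k <= extra; k++) {
            unsigned char next = bytes[i + k];
            if ((next & 0xC0) != 0x80)
                return false;     // expected a continuation byte
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        if (codePoint < minimum                             // overlong form
            || codePoint > 0x10FFFF                         // out of range
            || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) // surrogate
            return false;
        i += extra + 1;
    }
    return true;
}
```

With this check, "h\xC3\xA9llo" (h, e-acute, l, l, o) is valid, but its
first two bytes alone, "h\xC3", are not, because the cut lands in the
middle of the two-byte sequence. That is the kind of case where the new
behavior would differ from the current one.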
In addition to the changes to the BString class, I'm updating the
HaikuBook documentation. This will include both an API description for the
BString class and a document (to be included in the "overview" section
of the HaikuBook) describing UTF-8 in detail, with examples showing how
to write code using the BString in a locale-free manner.
Thanks for any help on this.
Michael Bridgers