[haiku-gsoc] Re: Do we ICU or do we not ICU?

  • From: "Axel Dörfler" <axeld@xxxxxxxxxxxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Tue, 12 May 2009 10:48:32 +0200 CEST

PulkoMandy <pulkomandy@xxxxxxxxx> wrote:
> First, for the current code in the locale kit, there is already some
> work done on collators. There is a base implementation and two
> variations for French and German. For the monetary, number, and date
> formats, there are only stubs that do nothing except return the string
> they get as an argument.
> The idea of ICU is to gather all the language data. I don't think I
> can write collators and formatting rules for all the languages in the
> world (I'm not even sure what the right sorting order for a French
> collator should be :)). I don't think duplicating ICU's work is a good
> idea. If we decide not to take their source code, I think I would use
> the same data files anyway.

That was basically what I did for the BUnicodeChar class in the current 
locale kit - it uses the same data files as ICU, but implements its own 
API to access them. I think we should use the ICU backend or its data 
files wherever possible; otherwise it would be a lot of (superfluous) 
work to maintain this stuff ourselves.
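
Just to illustrate what I mean by a thin API on top of ICU: if we do
link against it, the wrapper layer can be as small as this (a sketch
only, assuming ICU's unicode/uchar.h is available; the names are
illustrative and not the actual BUnicodeChar class):

#include <unicode/uchar.h>

// Illustrative wrapper over ICU's character data -- roughly the idea
// behind BUnicodeChar, but with made-up names.
struct UnicodeChar {
	static bool IsAlpha(UChar32 c) { return u_isalpha(c); }
	static bool IsDigit(UChar32 c) { return u_isdigit(c); }
	static UChar32 ToLower(UChar32 c) { return u_tolower(c); }
	static UChar32 ToUpper(UChar32 c) { return u_toupper(c); }
};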

> We probably don't need the whole ICU source tree. Actually, it is
> already split into multiple parts. We can probably avoid the layout
> engine (which helps to draw different scripts in a GUI). I think I can
> start with icu/source/common and icu/source/i18n, adding the compiled
> data files. This should cover most of the work. Then the layout engine
> could be added later, when we start looking at right-to-left scripts,
> which will probably need some rework in the interface kit.

I think the main problem we will run into is the fact that we need to 
support the C locale stuff - and that is tightly integrated into some 
other libroot functionality. That means it probably does not make sense 
to keep that part in an external library.
As Ingo said, a prerequisite would be to sort out our wchar support, as 
that will be needed by ICU as well. It would be nice if we could use 
UTF-8 as the native character set throughout the system, and in ICU as 
well, although I don't know whether ICU supports that properly (it 
probably uses wchar_t everywhere, I would guess).
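
As far as I can tell, ICU's C API is built around its own UChar type
and comes with UTF-8 conversion helpers in unicode/ustring.h, so a
round trip from UTF-8 and back would look roughly like this (a sketch,
assuming the ICU headers are available):

#include <stdio.h>
#include <unicode/ustring.h>

int
main()
{
	const char* utf8In = "héllo";
	UChar utf16[64];
	char utf8Out[64];
	int32_t length;
	UErrorCode status = U_ZERO_ERROR;

	// UTF-8 -> ICU's internal representation
	u_strFromUTF8(utf16, 64, &length, utf8In, -1, &status);
	// ... hand 'utf16' to ICU ...
	// and back to UTF-8 for the rest of the system
	u_strToUTF8(utf8Out, sizeof(utf8Out), &length, utf16, -1, &status);

	printf("%s (%s)\n", utf8Out, u_errorName(status));
	return 0;
}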

> The main drawback of ICU is of course the lack of integration with the
> Be API. No BStrings in there... There are two solutions if we decide
> to use it. Either we make a fork and integrate it tightly into Haiku,
> using the available API and avoiding a stack of layers, or we build
> the locale kit as a wrapper between the ICU world and the Be API. That
> doesn't sound as clean, but it would allow us to keep ICU's files
> mostly unchanged and follow the updates of their main branch.

I would have a look at ICU to see how much of the libc functionality it 
duplicates (like wchar support), and how much we can use directly.
What I would like to see are base libraries that are not too large, 
with data files that can be removed or stripped down if only one or a 
few languages need to be supported, so that a very small but functional 
distribution can be created. The base install should come with all 
files by default, though.

> Finally, ICU uses the CLDR data from Unicode. So we could use this
> data and write our own code on top of it as well. This data is enough
> for collation and number/date/money formatting. There is also data for
> segmenting text (inserting line breaks in the right places). There are
> only 3 MB of zipped data in the CLDR, so it is much smaller than ICU.
> 
> As the CLDR data is just XML files, we could use them and avoid adding
> a big piece of code like ICU. That could mean I'll spend my summer
> writing C++ instead of messing with Jamfiles to get ICU into the Haiku
> build tree :)

I think that depends on how much of ICU we need for libc, and how it 
could be integrated with the rest of the API -- and in libroot, we 
don't have BStrings either. IIRC I liked the API of the current locale 
kit, so if we can preserve it for the most part in a wrapper, that 
would be a good start. And for the parts where we only want ICU's data, 
we could still decide to access the data files directly.
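
Just to make the wrapper idea a bit more concrete, it could look
roughly like this (a sketch only; BNumberFormat and its Format() method
are made-up names, not an existing class):

#include <String.h>
#include <unicode/locid.h>
#include <unicode/numfmt.h>
#include <unicode/unistr.h>

// Hypothetical Locale Kit class wrapping ICU: BString in, BString out,
// so that callers never see a UnicodeString.
class BNumberFormat {
public:
	static status_t Format(double number, const char* locale, BString& out)
	{
		UErrorCode status = U_ZERO_ERROR;
		icu::NumberFormat* format
			= icu::NumberFormat::createInstance(icu::Locale(locale), status);
		if (U_FAILURE(status))
			return B_ERROR;

		icu::UnicodeString result;
		format->format(number, result);

		// convert ICU's UTF-16 result to UTF-8 for the BString
		char buffer[64];
		int32_t length = result.extract(0, result.length(), buffer,
			sizeof(buffer), "UTF-8");
		out.SetTo(buffer, length);

		delete format;
		return B_OK;
	}
};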

In any case, I would start with the libc integration, and only then see 
how to do the Locale Kit on top of that, or at least have it use the 
same backend (i.e. it could use private functionality in libroot for 
this, too).
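
Purely as an illustration of what "using the same backend" could mean
-- all names below are made up -- libroot could have a small private
interface that both the POSIX locale functions and the Locale Kit talk
to:

#include <SupportDefs.h>
#include <time.h>

// Made-up private libroot interface, for illustration only: setlocale()
// and strftime() in libroot would call into this, and the Locale Kit
// could reuse the very same object through a private accessor instead
// of duplicating the functionality.
class LocaleBackend {
public:
	virtual					~LocaleBackend() {}
	virtual const char*		SetLocale(int category, const char* locale) = 0;
	virtual status_t		FormatTime(char* buffer, size_t bufferSize,
								const char* format, const struct tm* time) = 0;
};

// hypothetical private symbol exported by libroot
extern "C" LocaleBackend* __get_locale_backend();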

Bye,
   Axel.

