[openbeos] Re: AW: Re: AW: Locale Kit

  • From: "Axel Dörfler" <axeld@xxxxxxxxxxxxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Mon, 15 Dec 2003 14:05:08 +0100 CET

Pascal Goguey <pascal@xxxxxxxxxx> wrote:
> In case of french, if I understood Axel's method correctly,
> well, let's take an example: illettré, île, iliaque cannot be
> sorted by a plain sort function because î is outside of ASCII,
> and therefore greater than any of the other letters. The regular
> sort would put île after zythum.

Right, that's the ASCII only problem.

> So the proposed method (apparently) consists in first stripping
> these strings to temporary ASCII strings, sorting, and then ordering
> the original strings in the same order.
> 
> But there is a logical mistake here. Let's call:
> Strip: a function that removes accents and the like.
> A_Order : ascii order
> F_Order : french dictionary order
> 
> A_Order(strip(s1), strip(s2)) can be deduced from F_Order(s1, s2)
> BUT:
> F_Order(s1, s2) cannot be deduced from A_Order( strip(s1), strip(s2))
> 
> Here is an example:
> 
> These two words, cote and côte, should appear in this order:
> côte should come after cote.
> 
> If you perform an ASCII sort of the stripped strings, you end up
> sorting cote and cote, and since the strings are equal, you cannot
> decide which of the original strings comes first. No surprise here,
> you lose information by stripping.
> It's a good quick approximation, but not a fully working method.

It works fully for many languages, and you can easily extend it to 
do what you want it to do. The current implementation just translates 
"à" to "a", for example. It could also do something like:
        "a" -> "a0"
        "á" -> "a1"
        "à" -> "a2"
        "â" -> "a3"
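
As a rough sketch of that extended mapping (not the Locale Kit code;
the table ACCENT_KEYS, its weights, and the function sort_key are
made up for illustration), each character is replaced by its base
letter plus a digit, and the resulting keys are compared as plain
strings:

```python
# Hypothetical weight table, following the "a" -> "a0" scheme above:
# each accented letter maps to its base letter plus a digit for the accent.
ACCENT_KEYS = {
    "á": "a1", "à": "a2", "â": "a3",
    "é": "e1", "è": "e2", "ê": "e3",
    "î": "i3",
    "ô": "o3",
}

def sort_key(text):
    """Build a key string; characters not in the table get weight 0."""
    return "".join(ACCENT_KEYS.get(ch, ch + "0") for ch in text.lower())

words = ["zythum", "île", "côte", "cote"]
print(sorted(words, key=sort_key))
```

With this table, "cote" and "côte" no longer produce equal keys, so
the information lost by plain stripping is kept in the digits.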

The current implementation lets you compare strings as is, but also 
retrieve the string that represents a string's sort order, so that two 
such strings can be compared directly with memcmp() or strcmp().
Also, we need to differentiate between the primary and secondary 
collation levels. The primary level should not differentiate between 
"a" and "á", while the secondary should. I will have to recheck how 
exactly this is done in other localisation efforts, though (currently, 
I have implemented the German telephone book order so that it changes 
the primary level; I am not sure this is correct).
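
To illustrate the two levels, here is a minimal sketch, assuming the
primary level is the accent-stripped text and the secondary level is
a per-letter list of accents. The function two_level_key is
hypothetical and only shows the idea; it uses Python's unicodedata
module to split letters from their combining accents, which is not
how the Locale Kit does it. Comparing the primary part first means an
accent only decides the order when the base letters are equal, so
"côte" still sorts before "coter" (with weights interleaved as in the
"a0"/"a1" scheme, it would sort after it).

```python
import unicodedata

def two_level_key(text):
    """Return (primary, secondary): base letters, then accents per letter."""
    primary = []
    secondary = []
    # NFD splits "ô" into "o" followed by a combining circumflex.
    for ch in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(ch) and secondary:
            # Attach the accent to the preceding base letter.
            secondary[-1] += (ord(ch),)
        else:
            primary.append(ch)
            secondary.append(())
    return ("".join(primary), tuple(secondary))

words = ["zythum", "île", "coter", "côte", "cote", "illettré", "iliaque"]
print(sorted(words, key=two_level_key))

# Primary level only: "a" and "á" are not distinguished.
print(two_level_key("a")[0] == two_level_key("á")[0])
```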

Bye,
   Axel.

