On 2003-12-15 at 14:05:08 [+0100], you wrote:
> Pascal Goguey <pascal@xxxxxxxxxx> wrote:
> > In case of French, if I understood Axel's method correctly,
> > well, let's take an example: illettré, île, iliaque cannot be
> > sorted by a plain sort function because î is outside of ASCII,
> > and therefore greater than any of the other letters. The regular
> > sort would put île after zythum.
>
> Right, that's the ASCII only problem.
>
> > So the proposed method (apparently) consists in first stripping
> > these strings to temporary ASCII strings, sorting, and then ordering
> > the original strings in the same order.
> >
> > But there is a logical mistake here. Let's call:
> > Strip: a function that removes accents and the like.
> > A_Order: ASCII order
> > F_Order: French dictionary order
> >
> > A_Order(strip(s1), strip(s2)) can be deduced from F_Order(s1, s2)
> > BUT:
> > F_Order(s1, s2) cannot be deduced from A_Order(strip(s1), strip(s2))
> >
> > Here is an example:
> >
> > These two words: cote and côte should appear in this sequence.
> > côte should be after cote.
> >
> > If you perform an ASCII sort of the stripped strings, you end up
> > sorting cote and cote, and since the strings are equal, you cannot
> > decide which of the original strings comes first. No surprise here,
> > you lose information by stripping.
> > It's a good quick approximation, but not a fully working method.
>
> It's fully working for many languages, but you can easily extend it to
> do what you want it to do. The current implementation just translates
> "à" to "a", for example. It could also do something like:
> "a" -> "a0"
> "á" -> "a1"
> "à" -> "a2"
> "â" -> "a3"
>
> The current implementation allows comparing strings as is, but also
> getting the string that represents their order, which allows a direct
> memcmp() or strcmp() of two strings.
> Also, we need to differentiate between the primary and secondary
> collation level.
> The primary should not differentiate between "a" and
> "á" while the secondary should. I will have to recheck about how
> exactly this is done in other localisation efforts, though (currently,
> I have implemented the German telephone book order to change the
> primary level; I am not sure this is correct).

Well, I guess we simply (at least for French) just need to sort
depending on the "number of changes". For instance, to change "côte" to
ASCII you have 1 change (ô => o), so if you have to compare it to
"cote" it should come after.

Now, I guess another problem is comparing "été" with "ètè" (even if the
second doesn't exist; it's an example). Both need 2 changes to get to
ASCII. Here you have to set an internal order, which you then use, as
there is no real order.

My 0.002 cents,

Olivier
-- 
"A man does what he does because he sees the world as he sees it" A.K
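
P.S. To make the primary/secondary idea concrete, here is a rough Python
sketch, not the actual implementation discussed in this thread. The
primary key is the accent-stripped string, and the secondary key is a
tuple of per-letter accent ranks that only breaks ties. The rank table
(ACCENT_RANK) and the function name sort_key are made up for the
example; the ranks simply follow the "a0/a1/a2/a3" numbering quoted
above and are an arbitrary "internal order", not real French dictionary
rules.

```python
import unicodedata

# Hypothetical accent ranks, mirroring the "a0, a1, a2, a3" idea:
# no accent < acute < grave < circumflex. Purely an assumed order.
ACCENT_RANK = {
    "\u0301": 1,  # combining acute accent
    "\u0300": 2,  # combining grave accent
    "\u0302": 3,  # combining circumflex accent
}
UNKNOWN_RANK = 9  # any other combining mark sorts after the known ones


def sort_key(word):
    """Return (primary, secondary) for a two-level comparison."""
    primary = []    # base letters only: what "strip" would produce
    secondary = []  # one accent rank per base letter
    # NFD splits "ô" into "o" followed by a combining circumflex.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                # Record the accent on the letter it belongs to.
                secondary[-1] = ACCENT_RANK.get(ch, UNKNOWN_RANK)
        else:
            primary.append(ch.lower())
            secondary.append(0)
    return "".join(primary), tuple(secondary)


words = ["zythum", "île", "côte", "iliaque", "cote", "illettré", "ètè", "été"]
print(sorted(words, key=sort_key))
# ['cote', 'côte', 'été', 'ètè', 'île', 'iliaque', 'illettré', 'zythum']
```

Because the stripped string is compared first, île lands before
iliaque instead of after zythum, and the accent ranks are only
consulted when the stripped strings are equal (cote/côte, été/ètè).
Counting the changes alone would leave été and ètè tied at 2, which is
why the sketch keeps one rank per letter instead of a single count.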