On Tuesday 16 December 2003 05:36 am, you wrote:
> On 2003-12-15 at 14:05:08 [+0100], you wrote:
> > Pascal Goguey <pascal@xxxxxxxxxx> wrote:
> > > In case of French, if I understood Axel's method correctly,
> > > well, let's take an example: illettré, île, iliaque cannot be
> > > sorted by a plain sort function because î is outside of ASCII,
> > > and therefore greater than any of the other letters. The regular
> > > sort would put île after zythum.
> >
> > Right, that's the ASCII only problem.
> >
> > > So the proposed method (apparently) consists in first stripping
> > > these strings to temporary ASCII strings, sorting, and then
> > > ordering the original strings in the same order.
> > >
> > > But there is a logical mistake here. Let's call:
> > > strip: a function that removes accents and the like.
> > > A_Order: ASCII order
> > > F_Order: French dictionary order
> > >
> > > A_Order(strip(s1), strip(s2)) can be deduced from F_Order(s1, s2)
> > > BUT:
> > > F_Order(s1, s2) cannot be deduced from A_Order(strip(s1), strip(s2))
> > >
> > > Here is an example:
> > >
> > > These two words, cote and côte, should appear in this sequence:
> > > côte should be after cote.
> > >
> > > If you perform an ASCII sort of the stripped strings, you end up
> > > sorting cote and cote, and since the strings are equal, you cannot
> > > decide which of the original strings comes first. No surprise here,
> > > you lose information by stripping.
> > > It's a good quick approximation, but not a fully working method.
> >
> > It's fully working for many languages, but you can easily extend it to
> > do what you want it to do. The current implementation just translates
> > "à" to "a", for example.
> > It could also do something like:
> > "a" -> "a0"
> > "á" -> "a1"
> > "à" -> "a2"
> > "â" -> "a3"
> >
> > The current implementation allows comparing strings as is, but also
> > getting the string that represents its order, which allows for a
> > direct memcmp() or strcmp() of two strings.
> > Also, we need to differentiate between the primary and secondary
> > collation level. The primary should not differentiate between "a" and
> > "á" while the secondary should. I will have to recheck how exactly
> > this is done in other localisation efforts, though (currently, I have
> > implemented the German telephone book order to change the primary
> > level; I am not sure this is correct).
>
> Well, I guess we simply (at least for French) just need to sort
> depending on the "number of changes".
> For instance, to change "côte" to ASCII you have 1 change (ô => o),
> so if you have to compare it to "cote" it should be after.
>
> Now, I guess another problem is to compare "été" with "ètè" (even if the
> second doesn't exist; it's an example).
> Both need 2 changes to become ASCII. Here you have to set an internal
> order which you then use, as there is no real order.
>

Spanish provides some interesting problems. I am not sure how accented
vowels are handled in comparisons (a vowel with an accent is not a
different letter, the accent just changes which syllable gets the
emphasis). The n with a ~ (I don't know how to type it correctly) is a
different letter than n and should follow n. Those quirks are not much
different from the quirks discussed for the other languages. The weird
part is that "LL" and "RR" are each considered a single letter and follow
their single counterparts (so "llamar" comes after "luz").
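
To make the primary/secondary idea quoted above a bit more concrete, here
is a minimal sketch in Python. It is only an illustration under my own
assumptions, not Axel's implementation and not the real French dictionary
rules: the names collation_key and ACCENT_RANK are made up, and the
ranking of the accents simply follows the "a0" .. "a3" numbering quoted
above. It does not attempt the Spanish "LL" case.

```python
import unicodedata

# Hypothetical secondary weights for combining marks. The relative order
# of the accents is an assumption taken from the "a0".."a3" example.
ACCENT_RANK = {
    "\u0301": 1,  # combining acute accent
    "\u0300": 2,  # combining grave accent
    "\u0302": 3,  # combining circumflex accent
}


def collation_key(word):
    """Return a (primary, secondary) pair for a two-level comparison.

    primary   : the word with accents stripped (what strip() would give)
    secondary : one accent weight per base letter, 0 meaning "no accent"
    """
    primary = []
    secondary = []
    # NFD splits an accented letter into base letter + combining mark(s).
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            # A mark modifies the preceding base letter. Unknown marks get
            # rank 9; if a letter carries several marks, the last one wins.
            if secondary:
                secondary[-1] = ACCENT_RANK.get(ch, 9)
        else:
            primary.append(ch)
            secondary.append(0)
    return "".join(primary), tuple(secondary)


words = ["zythum", "côte", "île", "cote", "iliaque", "illettré"]
ordered = sorted(words, key=collation_key)
print(ordered)
```

Because the accent weights are kept in a separate second component rather
than mixed into the stripped string, they only matter when the stripped
strings are equal. That is what puts "cote" before "côte" without
pushing "île" past "zythum", and it also gives "été" and "ètè" a fixed
(if arbitrary) relative order.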