[openbeos] Re: AW: Re: AW: Locale Kit

  • From: "François Revol" <revol@xxxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Mon, 15 Dec 2003 12:54:52 +0100 CET

> Hello,
> 
> >  The Romanian language uses Latin characters (it is a Latin language)
> > and has some special characters, but they are optional. I mean you can
> > sort it like you said you can for French. The special characters are
> > special forms of i, t, s and a.
> 
> In the case of French, if I understood Axel's method correctly,
> well, let's take an example: illettré, île, iliaque cannot be
> sorted by a plain sort function because î is outside of ASCII,
> and therefore greater than any of the other letters. The regular
> sort would put île after zythum.
> 
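To make the île/zythum point concrete, here is a minimal Python sketch. It assumes that a "plain sort" means comparing code points, which is what Python's built-in sorted() does for strings; the word list is the one from the paragraph above.

```python
# Plain sort: Python compares strings code point by code point.
words = ["zythum", "illettré", "île", "iliaque"]

plain = sorted(words)
print(plain)  # ['iliaque', 'illettré', 'zythum', 'île']

# 'î' is U+00EE, which is greater than 'z' (U+007A),
# so every word starting with 'î' lands after every ASCII word.
print(ord("î") > ord("z"))  # True
```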
> So the proposed method (apparently) consists of first stripping
> these strings to temporary ASCII strings, sorting those, and then
> putting the original strings in the same order.
> 
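The strip-then-sort idea described above could look roughly like this in Python. This is only a sketch of the idea as described in this thread, not Axel's actual code: strip_accents is a hypothetical helper that removes combining marks after Unicode decomposition, and it does not handle letters with no decomposition (œ or ø, for instance).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks (é -> e, î -> i)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zythum", "illettré", "île", "iliaque"]

# Sort the original strings by their stripped form; the stripped
# copies only exist as temporary sort keys.
approx = sorted(words, key=strip_accents)
print(approx)  # ['île', 'iliaque', 'illettré', 'zythum']
```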
> But there is a logical mistake here. Let's call:
> Strip: a function that removes accents and the like.
> A_Order: ASCII order
> F_Order: French dictionary order
> 
> A_Order(Strip(s1), Strip(s2)) can be deduced from F_Order(s1, s2)
> BUT:
> F_Order(s1, s2) cannot be deduced from A_Order(Strip(s1), Strip(s2))
> 
> Here is an example:
> 
> These two words, cote and côte, should appear in this sequence:
> côte should come after cote.
> 
> If you perform an ASCII sort of the stripped strings, you end up
> sorting cote and cote, and since the strings are equal, you cannot
> decide which of the original strings comes first. No surprise here:
> you lose information by stripping.
> It's a good quick approximation, but not a fully working method.
> 
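The information loss is easy to reproduce. In the sketch below, strip_accents and two_level_key are hypothetical helpers. Breaking ties on the original string is only one simple way to get a deterministic result; it is an assumption made for illustration, and real dictionary rules for accents are more involved.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks (ô -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both words collapse to the same key, so the key cannot order them.
print(strip_accents("cote") == strip_accents("côte"))  # True

# Python's sort is stable: equal keys keep their input order,
# so the result depends on how the list happened to arrive.
print(sorted(["côte", "cote"], key=strip_accents))  # ['côte', 'cote']
print(sorted(["cote", "côte"], key=strip_accents))  # ['cote', 'côte']


def two_level_key(text: str) -> tuple[str, str]:
    """Stripped form first, original string as a tie-breaker."""
    return (strip_accents(text), text)


print(sorted(["côte", "cote"], key=two_level_key))  # ['cote', 'côte']
```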
> In Japanese, it is even a bit trickier since all the characters
> that can be ordered come in at least two versions (hiragana and
> katakana).
> The Chinese characters used in Japanese cannot really be sorted.
> Well, of course they can, but the Chinese character ordering is not
> used in dictionaries, for instance. At least not in the dictionaries
> I know.
> 
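The hiragana/katakana point can be sketched the same way. In Unicode the katakana block sits 0x60 code points after the hiragana block, so a plain sort puts every hiragana word before every katakana word. The to_hiragana helper below is hypothetical and only folds the basic range U+30A1 to U+30F6; it ignores kanji, long-vowel marks and voiced-sound subtleties.

```python
def to_hiragana(text: str) -> str:
    """Hypothetical helper: map katakana U+30A1..U+30F6 onto hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


words = ["さくら", "カメラ", "いぬ", "アメ"]

# Plain code point order: all hiragana words first, then all katakana.
print(sorted(words))                   # ['いぬ', 'さくら', 'アメ', 'カメラ']

# Folding both scripts onto one gives a single a-i-ka-sa sequence.
print(sorted(words, key=to_hiragana))  # ['アメ', 'いぬ', 'カメラ', 'さくら']
```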
At least stripping is useful when trying to match strings and
validate them. For example, under Unix, the LC_* environment variables
modify the behaviour of isalpha(). If LC_ALL=fr, then isalpha('é')
returns true.
Now for sorting, still, getting
"cote, côte, coup"
in this order is better than
"cote, coup, côte"
which is what you get with plain ASCII sorting,
as it puts all the ASCII characters first, then the accented ones.
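
For the matching use mentioned above, a minimal sketch in Python could look like this. Both strip_accents and loosely_equal are hypothetical names, and ignoring case as well as accents is an extra assumption on my side.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks (ô -> o, é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring accents and case."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print(loosely_equal("Côte", "cote"))  # True
print(loosely_equal("côte", "cité"))  # False
```

Here the loss of information is exactly what you want: the user typing "cote" still finds "côte".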

François.
