[recoll-user] Re: Problems with character "substitution"

  • From: jfd@xxxxxxxxxx
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Mon, 2 Apr 2012 12:40:43 +0200

Anders Johansson writes:
 > 
 > 2012-03-31 20:05, jfd@xxxxxxxxxx skrev:
 > > Does Google use the accents when searching in Swedish ?
 > >
 > > Cheers,
 > >
 > > jf
 > >
 > 
 > Yes. Actually, those aren't really accents in Swedish but completely 
 > different vowels, which means the words have different meanings.
 > For example:
 > "Vän" means "friend" and "van" means "used to". They surely have 
 > different etymological roots.
 > 
 > ä, å and a, and likewise ö and o, should be considered different 
 > characters in Swedish, and this is also what Google (and other services 
 > as well) does for Swedish.
 > 
 > Maybe there could be an option to not consider those specific "accents" 
 > as accents? Maybe set automatically if the user's locale is something 
 > with "sv". The same thing should apply for the letter å in Danish and 
 > Norwegian (where the characters æ=ä and ø=ö probably are counted 
 > separately already)
 > More information:
 > http://en.wikipedia.org/wiki/Swedish_alphabet (the first paragraph sums 
 > it up well)
 > http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet
 > 
 > Greetings,
 > Anders Johansson

Thanks for this information.

Actually, this is not a simple issue at all because of the way
unaccenting currently works in Recoll (it's based on the Unicode
decomposition, which makes no distinctions between essential and
non-essential diacritics).

There is not much difficulty with å: I can just make an exception while
generating the tables.
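
To illustrate why the decomposition-based approach treats all diacritics
alike, here is a minimal Python sketch. It is not Recoll's actual table
generator; it only assumes that unaccenting amounts to Unicode NFD
decomposition followed by dropping the combining marks. The function name
`unaccent` and its `keep` parameter are hypothetical, introduced just to
show what "making an exception" for a letter could look like.

```python
import unicodedata

def unaccent(text, keep=frozenset()):
    """Strip diacritics via NFD decomposition, leaving letters in `keep` intact.

    `keep` is a hypothetical exception set, not a Recoll setting.
    """
    # Compose first so that each kept letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)  # exception: do not decompose this letter
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop combining marks (ring, diaeresis, acute, ...), keep base letters.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

if __name__ == "__main__":
    print(unaccent("vän"))                        # no exceptions
    print(unaccent("år", keep=frozenset("åÅ")))   # å kept as a distinct letter
```

Without exceptions, "vän" and "van" end up as the same term, which is
exactly the problem described for Swedish.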

The ä and ö are more difficult because of their common use in other
languages, e.g. German.

If there is a German speaker on the list, maybe they could tell us whether,
for German, it's appropriate to keep the umlaut for searching or whether
it's better to merge searches for ä and a? That is, will you always search
for "Bäume", or just type "Baume" when feeling lazy (as I'd do in the
equivalent French case)?

If keeping the umlaut is OK for German, I guess I can just keep these three
letters intact in all cases. Otherwise I have a real problem, as there is
currently no way in Recoll to have language-dependent handling of
unaccenting, because the unaccenting tables are static.
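
For comparison, a language-dependent variant could look roughly like the
Python sketch below. This is only an illustration of the idea and does not
correspond to anything that exists in Recoll: the `KEEP_BY_LANG` table, the
`unaccent_for` function and the choice of language codes are all
assumptions, with the letter sets taken from the discussion above.

```python
import unicodedata

# Hypothetical per-language exceptions (letters that must stay distinct).
_RAW = {
    "sv": "åäöÅÄÖ",  # Swedish: å, ä, ö are separate vowels
    "da": "åÅ",      # Danish
    "nb": "åÅ",      # Norwegian
}
KEEP_BY_LANG = {
    lang: frozenset(unicodedata.normalize("NFC", letters))
    for lang, letters in _RAW.items()
}

def unaccent_for(text, locale_name):
    """Unaccent `text`, keeping the letters listed for the locale's language."""
    # "sv_SE.UTF-8" -> "sv"; unknown languages get no exceptions.
    lang = locale_name.split("_")[0].lower()
    keep = KEEP_BY_LANG.get(lang, frozenset())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)
```

The cost of such a scheme is that the index and the query must be
unaccented with the same language setting, which static tables avoid.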

Cheers,

jf
