[recoll-user] Re: Problems with character "substitution"

From: <jfd@xxxxxxxxxx>
To: recoll-user@xxxxxxxxxxxxx
Date: Mon, 9 Apr 2012 14:51:55 +0200
Anders Johansson writes:
 > 
 > 2012-03-31 20:05, jfd@xxxxxxxxxx skrev:
 > > Does Google use the accents when searching in Swedish ?
 > >
 > > Cheers,
 > >
 > > jf
 > >
 > 
 > Yes. Actually, those aren't really accents in Swedish but completely
 > different vowels, which means the words have different meanings.
 > For example:
 > "Vän" means "friend" and "van" means "used to". They surely have 
 > different etymological roots.
 > 
 > ä, å and a, as well as ö and o, should be considered different
 > characters in Swedish, and this is also what Google (and other services
 > as well) does for Swedish.
 > 
 > Maybe there could be an option to not consider those specific "accents"
 > as accents? Maybe set automatically if the user's locale is something
 > with "sv". The same thing should apply for the letter å in Danish and
 > Norwegian (where the characters æ=ä and ø=ö are probably counted
 > separately already).
 > More information:
 > http://en.wikipedia.org/wiki/Swedish_alphabet (first paragraph sums
 > it up well)
 > http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet
 > 
 > Greetings,
 > Anders Johansson

Hi,

I have implemented a tentative solution to this issue, as a new
configuration parameter specifying a list of characters which should not
be translated strictly according to the Unicode database.

The parameter goes into ~/.recoll/recoll.conf, and should be
encoded as UTF-8. Here follow the config file comments, with the example
for Swedish (to be uncommented of course):

# A list of characters, encoded in UTF-8, which should be handled specially
# when converting text to unaccented lowercase. For example, in Swedish,
# the letter a with diaeresis has full alphabet citizenship and should not
# be turned into an a. 
# Each element in the space-separated list has the special character as
# first character and the translation following. The handling of both the
# lowercase and uppercase versions of a character should be specified, as
# membership in the list will turn off both standard accent and case
# processing. Example for Swedish:
# unac_except_trans =  åå Åå ää Ää öö Öö
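
To illustrate how such an exception list behaves, here is a minimal Python
sketch. It is not the Recoll implementation, only an approximation of the
idea: the names parse_exceptions and fold are hypothetical, and the default
handling (decompose, drop combining marks, lowercase) is an assumed
stand-in for the real unaccenting code.

```python
import unicodedata

def parse_exceptions(value):
    """Map the first character of each list element to its translation.

    Hypothetical helper: the text is normalized to NFC first so that each
    special character is a single code point.
    """
    value = unicodedata.normalize("NFC", value)
    return {item[0]: item[1:] for item in value.split()}

def fold(text, exceptions):
    """Unaccent and lowercase text, except for characters in the list."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exceptions:
            # Listed characters bypass both accent and case processing,
            # which is why the uppercase forms must be listed too.
            out.append(exceptions[ch])
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            base = "".join(c for c in decomposed
                           if not unicodedata.combining(c))
            out.append(base.lower())
    return "".join(out)

SWEDISH = parse_exceptions("åå Åå ää Ää öö Öö")

print(fold("Vän", SWEDISH))         # vän
print(fold("Vän", {}))              # van
print(fold("Café Ärlig", SWEDISH))  # cafe ärlig
```

With the Swedish list, "Vän" and "van" stay distinct terms, while characters
that are not listed (such as é) are still unaccented as before.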

The list can be extended or modified to taste. The main constraint is that
it should be consistent with what you actually type when searching. You
need a full reindex each time you change it.

The source changes implementing this are currently in the main branch on
Bitbucket, and there is also a snapshot on the web site:
  http://www.recoll.org/betarecoll-2682.tar.gz

I believe the code to be stable: this snapshot is release 1.17.1 with a few
minor fixes and this slightly bigger unaccenting exceptions change.

I have not implemented an automatic solution based on the locale yet, but
this would be an easy addition in the future, now that the base mechanism
is in place.

If someone wants to test this, I'd be glad to have some feedback.  Contact
me if you need a binary package. There will probably be a 1.17.2 not too
far away, but no date is set.

Regards,

jf