[greenstone_pt] Re: About a multilingual prototype

  • From: John Rose <john.rose1@xxxxxxx>
  • To: greenstone_pt@xxxxxxxxxxxxx
  • Date: Thu, 19 Feb 2009 10:58:30 +0100

Dear Claudia,

I am a bit confused. I thought that the purpose of this list was to discuss (in Portuguese) the evaluation/improvement, promotion and use of Greenstone in Portuguese-speaking countries (including use of local languages in those countries) and to provide help to users with questions/problems.

If we want to have a general discussion on multilingualism in digital libraries, then perhaps we should have another list for this, to which we would invite participants worldwide who are interested in this problem. I guess that in such a discussion the contributions would probably be in English to ensure maximum mutual understanding.

Coming back to Chinese: I am not sure why Nadia has been focusing on this rather than, for example, on Arabic or Russian, which like Chinese are UNESCO languages using non-Latin characters and with fully operational Greenstone interfaces. I don't think that the problem of pinyin versus Chinese ideograms is so fundamentally different from correctly transliterating Arabic or Russian into Latin script. Of course Chinese is more complicated, since I believe there is not always a unique mapping between a pinyin syllable, even with the tone indicated, and the corresponding Chinese ideogram. But some ambiguities exist in almost all transliteration schemes, as does the problem that many scholarly works, especially older ones, use non-standard or alternative transliteration schemes.

Greenstone has no special functionality to support double use of a language, in its native character form and in transliterated form. This could be interesting for linguistic scholars, but the vast majority of speakers of a language would want to access information in their native character set, not through transliterated characters. It would technically be possible to provide a pinyin user interface and also to search on metadata and/or full text in pinyin or ideograms or even (I believe, but am not certain) mixed combinations, but I have not seen an example of this sort of specialized linguistic DL application.
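To make the ambiguity concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the table `CHAR_TO_PINYIN` is a tiny hand-picked sample, the helper names `invert` and `to_pinyin` are hypothetical, and none of this is Greenstone functionality. It shows that converting ideograms to toned pinyin is a simple lookup, while the reverse direction yields several candidate characters.

```python
from collections import defaultdict

# Tiny hand-picked sample (an assumption for illustration, not a real
# transliteration table): each character mapped to one toned pinyin reading.
CHAR_TO_PINYIN = {
    "马": "mǎ",
    "码": "mǎ",
    "是": "shì",
    "事": "shì",
    "市": "shì",
}


def invert(char_to_pinyin):
    """Build the reverse table: toned pinyin -> list of candidate characters."""
    pinyin_to_chars = defaultdict(list)
    for ch, py in char_to_pinyin.items():
        pinyin_to_chars[py].append(ch)
    return dict(pinyin_to_chars)


PINYIN_TO_CHARS = invert(CHAR_TO_PINYIN)


def to_pinyin(text):
    """Transliterate character by character; unknown characters pass through."""
    return " ".join(CHAR_TO_PINYIN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Character -> pinyin is a plain lookup in this sample table.
    print(to_pinyin("是马"))
    # Pinyin -> character is one-to-many, even with the tone mark.
    for syllable, candidates in PINYIN_TO_CHARS.items():
        print(syllable, candidates)
```

A search that treats both forms as equivalent would therefore have to index the pinyin form alongside the characters, because the characters cannot be recovered from the pinyin alone.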

Greenstone is trying to provide, evaluate and maintain the largest number possible of language interfaces. Because of the immense amount of work involved, and the importance of having users take responsibility for deciding which languages to use, all of the language interface work is undertaken by volunteer translators.

I hope this clarifies things. Perhaps it would be best to move the discussion on Chinese to individual correspondence if you want to proceed? Our Chinese specialist Anna Huang is receiving this message and could perhaps provide any further advice she might have on this specific subject directly to you and Nadia. Best regards, John

At 02:43 19/02/2009, you wrote:
Dears,
As a linguist who does not understand very well what you're talking about: if we put the Chinese data in pinyin (Nadia, I found the name: the romanization of Mandarin), could it work? That is, is it possible to build the Chinese data in both systems, pinyin and Chinese ideograms, in a way that they are equivalent for this system? Is this GLI translation facility capable of translating between character systems, or better, is transliteration available?
Best,
Claudia

2009/2/18 John Rose <john.rose1@xxxxxxx>
Dear Nadia,

I thought we were supposed to be speaking in Portuguese on this list (except for me) (-:

There are 4 different aspects to the language interface: i) the spreadsheets you have to translate the user interface, ii) translations of the metadata names (there is a facility in GLI for translating terms which are not already included in the metadata reference files, which could also be modified if you choose), iii) the language of the metadata, and iv) the language(s) of the documents themselves. All of these can easily be handled for a single language applying to a given collection, and it is also straightforward to separate a collection of documents in several languages into sub-collections (by cross-collection searching or by partitioning the indexes).

But right now, I understand, the metadata names in the search boxes will not change to the language of a changed language preference (they will stay in the language in which the collection was built). However, the classifier names will change if you have translated them with the GLI translation facility. I also understand that the former situation will be improved in the next version (v2.82).

There is a bug in v2.81 with exploding CDS/ISIS databases, and there is a rather complicated workaround that I could provide. Otherwise this works fine with v2.80 and will be fixed in the next release (it is probably already fixed in the nightly snapshot releases, if you want to use one). It is probably the same with BibTeX, for which v2.80 should also be fine.

Chinese is special in that it does not separate words with spaces. Version 2.80 separates the characters internally so that text searches are possible, and v2.81 extends this to searches of metadata content. I'm not surprised that there were problems with v2.73. Please note that this segmentation problem is specific to Chinese. Other languages with non-Latin character sets (Arabic, Tamil, etc.) have worked fine before because the words are separated by spaces.
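As an illustration only (this is not Greenstone's actual code), the idea of separating the characters so that a space-based indexer can handle them might be sketched in Python as follows. The function names `is_cjk` and `segment_cjk` are hypothetical, and the code-point ranges are an assumption that covers only the basic CJK Unified Ideographs block and Extension A.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block
    (U+4E00..U+9FFF) or in Extension A (U+3400..U+4DBF)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def segment_cjk(text: str) -> str:
    """Surround every CJK ideograph with spaces, then normalize whitespace,
    so that each ideograph becomes its own space-delimited token."""
    pieces = []
    for ch in text:
        pieces.append(f" {ch} " if is_cjk(ch) else ch)
    # split() with no arguments collapses runs of whitespace.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    print(segment_cjk("Greenstone 数字图书馆 v2.81"))
    # Text whose words are already separated by spaces is left as it is.
    print(segment_cjk("مكتبة رقمية"))
```

Per-character tokens are the simplest approach; proper word segmentation for Chinese needs a dictionary or a statistical model, which this sketch does not attempt.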

Bonne continuation, very interesting, waiting for further experiments, John


At 20:39 18/02/2009, you wrote:
Hi John (and all),

Right now I have a small prototype with the languages listed below, mainly from
Portuguese-speaking countries.
I am at the first step, checking how far we can go with the languages
and trying to discover whether there is a limit. At least for now, the only
problem is listing UTF-8 languages with a different script, like Chinese.
The idea is to have documents and interfaces in several languages,
so that someone who knows only Kaigang would be able to access the system.
The next step would be to translate the Dublin Core information for each item,
so that someone who speaks Kaigang knows that there is something in Kabuverdianu
about the subject he is searching for.

I am using Greenstone 2.73 only because I wasn't able to explode some BibTeX
data in the latest version (and I was already used to it...). But other versions
and applications are welcome. We can exchange experiences too.

I am attaching a screenshot of the title list and the languages list. You can see
that the Chinese title is missing, but I am able to do a search
in Chinese. (Since it's just a first prototype, please
forgive the simple interface.)

Languages list:
 Chechewa
 Forro
 Ganda
 Guinea Bissau Creole
 Kabuverdianu
 Kaigang
 Kikongo
 Mandarin
 Oshiwambo


Regards,
nadia.
[Attachments: titles.JPG, languages.JPG, search_chinese.JPG]



               John B. Rose
               1 Bis, Rue des Châtre-Sacs
               92310 Sèvres
               France
               Email: <john.rose1@xxxxxxx>
(in case of bounce then send to <johnrose@xxxxxxxxxxxxxxxxxx>)




--
Claudia Wanderley
tel. +55 19 91362441

