[greenstone_pt] Re: About a multilingual prototype

From: John Rose <john.rose1@xxxxxxx>
To: greenstone_pt@xxxxxxxxxxxxx
Date: Thu, 19 Feb 2009 10:58:30 +0100

Dear Claudia,

I am a bit confused. I thought that the subject of this list was to discuss (in Portuguese) the evaluation/improvement, promotion and use of Greenstone in Portuguese-speaking countries (including use of local languages in those countries) and to provide help to users with questions/problems.

If we want to have a general discussion on multilingualism in digital libraries, then perhaps we should have another list for this, to which we would invite participants worldwide who are interested in this problem. I guess that in such a discussion the contributions would probably be in English to ensure maximum mutual understanding.

Coming back to Chinese: I am not sure why Nadia has been focusing on this rather than, for example, on Arabic or Russian, which like Chinese are UNESCO languages using non-Latin characters and with fully operational Greenstone interfaces. I don't think that the problem of pinyin versus Chinese ideograms is so fundamentally different from correctly transliterating Arabic or Russian into Latin script. Of course Chinese is more complicated, since there is, I believe, not always a unique mapping between a pinyin syllable, even with the tone indicated, and the corresponding Chinese ideogram. But some ambiguities exist in almost all transliteration schemes, as does the problem that many scholarly works, especially older ones, use non-standard or alternative transliteration schemes.

Greenstone has no special functionality to support double use of a language, in its native character form and in transliterated form. This could be interesting for linguistic scholars, but the vast majority of speakers of a language would want to access information in their native character set, not through transliterated characters. It would technically be possible to provide a pinyin user interface and also to search on metadata and/or full text in pinyin or ideograms or even (I believe, but am not certain) mixed combinations, but I have not seen an example of this sort of specialized linguistic DL application.
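
As a rough illustration of the ambiguity described above, here is a minimal Python sketch. It is not a Greenstone feature. The table `SAMPLE_READINGS` and the helpers `candidates`, `is_ambiguous` and `expand_query` are hypothetical names, and the data is a tiny hand-made sample rather than a real pinyin dictionary. It shows that even a tone-marked syllable can correspond to more than one ideogram, so a pinyin query has to be expanded into several candidate characters.

```python
import unicodedata

# Hypothetical, hand-made sample: a few tone-marked Mandarin syllables and
# some of the characters read that way. A real table would be far larger.
SAMPLE_READINGS = {
    "mā": ["妈"],
    "má": ["麻"],
    "mǎ": ["马", "码"],
    "mà": ["骂"],
}


def _nfc(text):
    """Normalize to NFC so composed and decomposed tone marks compare equal."""
    return unicodedata.normalize("NFC", text)


# Syllable -> characters, with normalized keys.
PINYIN_TO_HANZI = {_nfc(syl): chars for syl, chars in SAMPLE_READINGS.items()}

# Character -> syllable (each sample character has one reading here).
HANZI_TO_PINYIN = {
    ch: syl for syl, chars in PINYIN_TO_HANZI.items() for ch in chars
}


def candidates(syllable):
    """Return every sample character that shares this tone-marked reading."""
    return list(PINYIN_TO_HANZI.get(_nfc(syllable), []))


def is_ambiguous(syllable):
    """True when the syllable maps to more than one character."""
    return len(candidates(syllable)) > 1


def expand_query(term):
    """Expand a pinyin term into candidate characters; pass other terms through."""
    matches = candidates(term)
    return matches if matches else [term]
```

With this sample, `expand_query` turns a pinyin term into a list of candidate ideograms to search for, and leaves a term that is already an ideogram unchanged. The reverse direction (ideogram to pinyin) is unambiguous only because each sample character was given a single reading.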

Greenstone is trying to provide, evaluate and maintain the largest possible number of language interfaces. Because of the immense amount of work involved, and the importance of having users take responsibility for deciding which languages to use, all of the language interface work is undertaken by volunteer translators.

I hope this clarifies things. Perhaps it would be best to move the discussion on Chinese to individual correspondence if you want to proceed? Our Chinese specialist Anna Huang is receiving this message and could perhaps provide any further advice which she might have on this specific subject directly to you and Nadia.

Best regards, John


At 02:43 19/02/2009, you wrote:

Dear all,
As a linguist who does not understand very well what you are talking about: if we put the Chinese data in pinyin (Nadia, I found the name: the romanization of Mandarin), could it work? That is, is it possible to build the Chinese data in both systems, pinyin and Chinese ideograms, in a way that makes them equivalent for this system? Is this GLI translation capable of inter/trans-character translations, or better, is transliteration available?
Best,
Claudia

2009/2/18 John Rose <john.rose1@xxxxxxx>
Dear Nadia,
I thought we were supposed to be speaking in Portuguese on this list (except for me) (-:
There are 4 different aspects to the language interface: i) the spreadsheets used to translate the user interface; ii) translations of the metadata names (there is a facility in GLI for translation of terms which are not already included in the metadata reference files, which could also be modified if you choose); iii) the language of the metadata; and iv) the language(s) of the documents themselves. All of these can easily be handled for a single language applying to a given collection, and it is also straightforward to separate a collection of documents in several languages into sub-collections (by cross-collection searching or by partitioning the indexes).
But right now, as I understand it, the metadata names in the search boxes will not change to match a changed language preference (they will stay in the language in which the collection was built). However, the classifier names will change if you have translated them with the GLI translation facility. I also understand that the former situation will be improved in the next version (v2.82).
There is a bug in v2.81 with exploding CDS/ISIS databases, and there is a rather complicated procedure to get around this that I could provide. Otherwise this works fine with v2.80 and will be fixed in the next release (it is probably already fixed in the nightly snapshot releases if you want to use one). Probably it is the same thing with BibTeX, for which v2.80 should also be fine.
Chinese is special in that it does not separate words. v2.80 separates the characters internally so that text searches are possible. v2.81 extends this to searches of metadata content. I'm not surprised that there were problems with v2.73. Please note that this segmentation problem is specific to Chinese. Other languages with non-Latin character sets (Arabic, Tamil, etc.) have worked fine before because the words are separated by spaces.
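
To make the idea of separating characters concrete, here is a minimal Python sketch. It is not Greenstone's actual code. The function names are hypothetical, and it assumes that only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as Chinese. It puts a space around every Chinese character so that an indexer that splits on whitespace treats each character as its own searchable token.

```python
import re

# Assumption: only the main CJK Unified Ideographs block is handled here;
# a production indexer would need to cover further Unicode ranges.
CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def separate_cjk_characters(text):
    """Surround each CJK character with spaces, then collapse repeated spaces."""
    spaced = CJK_CHAR.sub(r" \1 ", text)
    return " ".join(spaced.split())


def tokenize(text):
    """Split text into whitespace-delimited tokens after character separation."""
    return separate_cjk_characters(text).split()
```

Text in space-delimited scripts passes through `tokenize` unchanged apart from whitespace cleanup, which matches the observation above that such languages needed no special handling.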
Bonne continuation. Very interesting; I am waiting for further experiments. John
At 20:39 18/02/2009, you wrote:
Hi John (and all),

Right now I have a small prototype with the languages listed below, mainly from
Portuguese-speaking countries.
I am at the first step, checking how far we can go with the languages
and trying to discover whether there is a limit. At least for now, the only
problem is listing UTF-8 languages with a different alphabet, like Chinese.
The idea is to have documents and interfaces in several languages,
so if someone knows only Kaigang, that person would be able to access the system.
The next step would be to translate the Dublin Core information for each item,
so someone who speaks Kaigang knows that there is something in kabuverdianu
about the subject they are searching for.

I am using Greenstone 2.73 only because I wasn't able to explode some BibTeX
data in the latest version (and I was already used to it...). But other versions
and applications are welcome. We can exchange experience too.
I am attaching screenshots of the title list and the language list. You can see
that the Chinese title is missing, but I am able to do a search
in Chinese. (Since it's just a first prototype, please
forgive me for the simple interface.)

Languages list:
 Chechewa
 Forro
 Ganda
 Guinea Bissau Creole
 kabuverdianu
 Kaigang
 Kikongo
 Mandarin
 Oshiwambo


Regards,
nadia.
[Attachments: titles.JPG, languages.JPG, search_chinese.JPG]



               John B. Rose
               1 Bis, Rue des Châtre-Sacs
               92310 Sèvres
               France
               Email: <john.rose1@xxxxxxx>
(in case of bounce then send to <johnrose@xxxxxxxxxxxxxxxxxx>)
--
Claudia Wanderley
tel. +55 19 91362441




