[greenstone_pt] Re: About a multilingual prototype
- From: John Rose <john.rose1@xxxxxxx>
- To: greenstone_pt@xxxxxxxxxxxxx
- Date: Thu, 19 Feb 2009 10:58:30 +0100
Dear Claudia,
I am a bit confused. I thought that the purpose
of this list was to discuss (in Portuguese) the
evaluation/improvement, promotion and use of
Greenstone in Portuguese-speaking countries
(including use of local languages in those
countries) and to provide help to users with questions/problems.
If we want to have a general discussion on
multilingualism in digital libraries, then
perhaps we should have another list for this, in
which we would invite participants worldwide who
are interested in this problem. I guess that in
such a discussion the contributions would
probably be in English to ensure maximum mutual understanding.
Coming back to Chinese (though I am not sure why Nadia
has been focusing on this rather than, for
example, on Arabic or Russian, which like Chinese
are UNESCO languages using non-Latin characters
and with fully operational Greenstone interfaces):
I don't think that the problem of pinyin versus
Chinese ideograms is so fundamentally different
from correctly transliterating Arabic or Russian
into Latin script (of course Chinese is more
complicated since there is I believe not always a
unique mapping between a pinyin phoneme, even
with the tone indicated, and the corresponding
Chinese ideogram, but some ambiguities exist in
almost all transliteration schemes - as well as
the problem that many scholarly works, especially
older ones, use non-standard or alternative
transliteration schemes). Greenstone has no
special functionality to support double use of a
language - in its native character form and in
transliterated form. This could be interesting
for linguistic scholars but the vast majority of
speakers of a language would want to access
information in their native character set, not
through transliterated characters. It would
technically be possible to provide a pinyin user
interface and also to search on metadata and/or
full text in pinyin or ideograms or even (I
believe, but am not certain) mixed combinations, but
I have not seen an example of this sort of
specialized linguistic DL application.
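To make the ambiguity mentioned above concrete, here is a minimal sketch in Python. The small table is a hand-picked sample written for illustration only (it is not part of Greenstone and is far from a complete dictionary), and the names `HANZI_TO_PINYIN`, `invert` and `candidates` are hypothetical. Tones are written as trailing digits (numbered pinyin).

```python
from collections import defaultdict

# Hand-picked illustrative sample, not a complete dictionary.
# Keys are ideograms; values are pinyin syllables with a tone digit.
HANZI_TO_PINYIN = {
    "妈": "ma1",
    "马": "ma3",
    "码": "ma3",
    "是": "shi4",
    "事": "shi4",
    "市": "shi4",
}


def invert(table):
    """Group ideograms by their pinyin syllable (one syllable, many ideograms)."""
    result = defaultdict(list)
    for hanzi, pinyin in table.items():
        result[pinyin].append(hanzi)
    return dict(result)


PINYIN_TO_HANZI = invert(HANZI_TO_PINYIN)


def candidates(syllable):
    """Return every ideogram in the sample that is read as `syllable`."""
    return PINYIN_TO_HANZI.get(syllable, [])
```

Going from ideogram to pinyin is a plain lookup in this sample, but `candidates("shi4")` returns several ideograms even with the tone given, so the reverse direction needs context or a human choice.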
Greenstone is trying to provide, evaluate and
maintain the largest number possible of language
interfaces. Because of the immense amount of work
involved, and the importance of having users take
responsibility for deciding which languages to
use, all of the language interface work is undertaken by volunteer translators.
I hope this clarifies things. Perhaps it would be best to
move the discussion on Chinese to individual
correspondence if you want to proceed? Our
Chinese specialist Anna Huang is receiving this
message and could perhaps provide any further
advice which she might have on this specific
subject directly to you and Nadia.
Best regards, John
At 02:43 19/02/2009, you wrote:
Dear all,
As a linguist who does not understand very well what
you're discussing: if we put the Chinese data
in pinyin (Nadia, I found the name: the
romanization of Mandarin), could it work?
That is, is it possible to build the Chinese
data in both systems, pinyin and Chinese
ideograms, in a way that makes them equivalent for
this system? Is the GLI translation facility capable of
translating between character systems, or rather,
is transliteration available?
Best,
Claudia
2009/2/18 John Rose <john.rose1@xxxxxxx>
Dear Nadia,
I thought we were supposed to be speaking in
Portuguese on this list (except for me) (-:
There are four different aspects to the language
interface: i) the spreadsheets used to
translate the user interface, ii) translations
of the metadata names (there is a facility in
GLI for translation of terms which are not
already included in the metadata reference
files, which could also be modified if you
choose), iii) the language of the metadata, and
iv) the language(s) of the documents themselves.
All of these can easily be handled for a single
language applying to a given collection, and it
is also straightforward to separate a collection
of documents in several languages into
sub-collections (by cross-collection searching or by partitioning the indexes).
But right now, I understand, the metadata names
in the search boxes will not change to the
language of a changed language preference (they
will stay in the language in which the
collection was built). However, the classifier
names will change if you have translated them
with the GLI translation facility. I also
understand that the former situation will be
improved in the next version (v2.82).
There is a bug in v2.81 with exploding CDS/ISIS
databases, and there is a rather complicated
procedure to get around this that I could
provide. Otherwise this works fine with v2.80 and will
be fixed in the next release (probably already in
the nightly snapshot releases if you want to use
those). Probably it is the same thing with
BibTeX, for which v2.80 should also be fine.
Chinese is special in that it does not separate
words with spaces. v2.80 separates the characters internally
so that text searches are possible. v2.81
extends this to searches of metadata
content. I'm not surprised that there were
problems with v2.73. Please note that this
segmentation problem is special for Chinese.
Other languages with non-Latin character sets
(Arabic, Tamil, etc.) have worked fine before
because the words are separated by spaces.
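The character separation described above can be sketched as follows. This is only an illustration of the general idea, written in Python under simplifying assumptions; it is not Greenstone's actual code. The function names `is_cjk` and `segment_cjk` are hypothetical, and only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) is handled.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def segment_cjk(text: str) -> str:
    """Surround each ideograph with spaces so that an indexer which
    splits on whitespace sees every ideograph as a separate term."""
    pieces = []
    for ch in text:
        if is_cjk(ch):
            pieces.append(" " + ch + " ")
        else:
            # Text in space-delimited scripts is passed through unchanged.
            pieces.append(ch)
    # Collapse the extra whitespace introduced above.
    return " ".join("".join(pieces).split())
```

With this sketch, `segment_cjk("数字图书馆")` yields one term per ideograph, while Latin, Arabic or Tamil text comes out unchanged because its words are already delimited by spaces.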
Keep up the good work; this is very interesting, and I look forward to further experiments.
John
At 20:39 18/02/2009, you wrote:
Hi John (and all),
Right now I have a small prototype with the languages listed below, mainly from
Portuguese-speaking countries.
I am at the first step, checking how far we can go with the languages
and trying to discover whether we hit a limit. At least for now, the only
problem is listing UTF-8 languages with a different script, like Chinese.
The idea is to have documents and interfaces in several languages,
so that someone who knows only Kaigang would be able to access the system.
The next step would be to translate the Dublin Core information for each item,
so that someone who speaks Kaigang knows that there is something in Kabuverdianu
about the subject he is searching for.
I am using Greenstone 2.73 only because I wasn't able to explode some BibTeX
data in the latest version (and I was already used
to it...). But other versions
and applications are welcome. We can exchange experience too.
I am attaching a screenshot of the title list and
the language list. You can see
that the Chinese title is missing, but I am able to do a search
in Chinese. (Since it's just a first prototype, please
forgive the simple interface.)
Languages list:
Chechewa
Forro
Ganda
Guinea Bissau Creole
Kabuverdianu
Kaigang
Kikongo
Mandarin
Oshiwambo
Regards,
nadia.
John B. Rose
1 Bis, Rue des Châtre-Sacs
92310 Sèvres
France
Email: <john.rose1@xxxxxxx>
(in case of bounce then
send to <johnrose@xxxxxxxxxxxxxxxxxx>)
--
Claudia Wanderley
tel. +55 19 91362441