[haiku-doc] Machine translation (was: Re: Wiki for translation/localization teams)

From: Sean Healy <jalopeura@xxxxxxxxxxx>
To: haiku-development@xxxxxxxxxxxxx, "haiku-doc@xxxxxxxxxxxxx" <haiku-doc@xxxxxxxxxxxxx>
Date: Sat, 31 Oct 2009 22:56:29 +0100

Double posting this to the development and doc lists

Star's seed wrote:

I tested a translation software to
translate the documentation: OmegaT (java / gpl)

I am currently in a Masters program in computational linguistics, so I am familiar with this subject area.

Machine translation is nowhere near good enough to provide adequate translations of arbitrary text, especially considering that one of the desired target languages is Hungarian, which doesn't even belong to the same language superfamily as English, and is significantly structurally different.

I have occasionally done freelance translating from Finnish (in the same superfamily as Hungarian) to English. In that work, I have seen what machine translation does when converting between two such fundamentally different languages. You end up with incomprehensible gibberish.

If you want a translation that is both good and fully automatic, you need to restrict the source language to a particular vocabulary and a particular set of syntactic structures.
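
To make the idea of a restricted vocabulary concrete, here is a minimal sketch of a checker that flags words outside an approved list. The word list and the function name are hypothetical, chosen only for illustration; a real list would have to be agreed on by the doc team, and this sketch ignores syntax entirely.

```python
import re

# Hypothetical approved vocabulary (an assumption for this example only).
APPROVED_WORDS = {"click", "the", "button", "to", "open", "a", "file", "window"}

def find_unapproved_words(sentence, approved=APPROVED_WORDS):
    """Return the words in `sentence` that are not in the approved vocabulary."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in approved]

print(find_unapproved_words("Click the button to open a file."))
print(find_unapproved_words("Hit the button to launch a file."))
```

The first call prints an empty list; the second prints `['hit', 'launch']`, the two words a writer would have to replace.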

To reformulate a meme common among developers: Quality of Translation. Full Automation. Freedom of Linguistic Expression. Pick any two.

It can be done, and in fact, I'd be willing to help do it; I'm actually looking for a project for this year that's supposed to lead into my Masters project and thesis next year. I'd love to be able to give something to Haiku and fulfill scholastic requirements at the same time.

But it will require work from our doc team. That's why, despite my interest in the area, I have held off mentioning machine translation. Now that someone else has brought it up, I'd like to point out the requirements.

1) We would have to pick a set of vocabulary to use in our English source docs, and we'd have to restrict those docs to a relatively simple syntax. In particular, we couldn't use vocabulary or syntax that could be ambiguous. In the case of an ambiguous word we really couldn't do without, we'd have to select a particular meaning which that word would always have, and find a different word for each additional meaning.

2) Then we'd have to reformulate our existing documentation to fit those restrictions.

3) Then our translators would have to determine a one-to-one mapping of our vocabulary items, as well as a mapping of syntactic structures. In fact, the translators would probably have to help in step one in determining the source vocabulary. (Note: Only content words have to be one-to-one. Function words like prepositions and conjunctions can be more fluid, and would be handled together with the syntactic structures.)
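
As a rough illustration of the one-to-one requirement in step 3, the sketch below checks a content-word table for target words shared by more than one source word, and refuses to map any word that is not in the table. The table, its Finnish entries (base forms only) and the function names are assumptions made for this example. Function words and inflection are left out on purpose, since those would be handled with the syntactic structures.

```python
from collections import defaultdict

# Hypothetical English -> Finnish content-word table (base forms only).
CONTENT_WORDS = {
    "file": "tiedosto",
    "window": "ikkuna",
    "button": "painike",
}

def shared_targets(mapping):
    """Return target words that more than one source word maps to."""
    sources_by_target = defaultdict(list)
    for source, target in mapping.items():
        sources_by_target[target].append(source)
    # Keep only the targets that break the one-to-one rule.
    return {t: s for t, s in sources_by_target.items() if len(s) > 1}

def map_content_words(words, mapping):
    """Replace each content word; fail loudly on words outside the table."""
    missing = [w for w in words if w not in mapping]
    if missing:
        raise ValueError(f"not in controlled vocabulary: {missing}")
    return [mapping[w] for w in words]

print(shared_targets(CONTENT_WORDS))
print(map_content_words(["window", "file"], CONTENT_WORDS))
```

An empty result from `shared_targets` means the table is one-to-one. Raising an error for unknown words, instead of passing them through untranslated, is what keeps the output trustworthy once the process is fully automated.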

Essentially, this is a lot of work up front. But IF we can get it all set up, and IF our doc writers stick to the restrictions, then it's fully automated afterward, and any updates to the English docs are immediately reflected in the target languages. So it's either a lot of work up front, or little bits of work every time a change is made to keep the translations up to date.

So it depends on what resources the doc team has, and where it's willing to spend those resources.
