[haiku-doc] Machine translation (was: Re: Wiki for translation/localization teams)

  • From: Sean Healy <jalopeura@xxxxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx, "haiku-doc@xxxxxxxxxxxxx" <haiku-doc@xxxxxxxxxxxxx>
  • Date: Sat, 31 Oct 2009 22:56:29 +0100

Double-posting this to the development and doc lists.

Star's seed wrote:

I tested a translation software to
translate the documentation: OmegaT (java / gpl)

I am currently in a Master's program in computational linguistics, so I am familiar with this subject area.

Machine translation is nowhere near good enough to provide adequate translations of arbitrary text, especially considering that one of the desired target languages is Hungarian, which doesn't even belong to the same language superfamily as English and is structurally very different.

I have occasionally done freelance translating from Finnish (in the same superfamily as Hungarian) to English. In that work, I have seen what machine translation does when converting between two such fundamentally different languages. You end up with incomprehensible gibberish.

If you want a translation that is both good and fully automatic, you need to restrict the source language to a particular vocabulary and a particular set of syntactic structures.

To reformulate a meme common among developers: Quality of Translation. Full Automation. Freedom of Linguistic Expression. Pick any two.

It can be done, and in fact I'd be willing to help do it; I'm actually looking for a project for this year that's supposed to lead into my Master's project and thesis next year. I'd love to be able to give something to Haiku and fulfill scholastic requirements at the same time.

But it will require work from our doc team. That's why, despite my interest in the area, I have held off mentioning machine translation. Now that someone else has brought it up, I'd like to point out the requirements.

1) We would have to pick a set of vocabulary to use in our English source docs, and we'd have to restrict those docs to a relatively simple syntax. In particular, we couldn't use vocabulary or syntax that could be ambiguous. In the case of an ambiguous word we really couldn't do without, we'd have to select a particular meaning which that word would always have, and find a different word for each additional meaning.

2) Then we'd have to reformulate our existing documentation to fit those restrictions.

3) Then our translators would have to determine a one-to-one mapping of our vocabulary items, as well as a mapping of syntactic structures. In fact, the translators would probably have to help in step one in determining the source vocabulary. (Note: Only content words have to be one-to-one. Function words like prepositions and conjunctions can be more fluid, and would be handled together with the syntactic structures.)
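To make the three requirements above a bit more concrete, here is a minimal Python sketch of the two checks they imply: flagging words in a source sentence that fall outside the agreed vocabulary, and verifying that the content-word lexicon really is one-to-one. Everything in it is an assumption made for illustration: the word lists, the tiny English-to-Hungarian lexicon, and the function names are hypothetical, not an existing Haiku tool or a vetted glossary.

```python
import re

# Hypothetical controlled vocabulary of content words (step 1).
ALLOWED_CONTENT_WORDS = {"open", "file", "window", "menu"}

# Function words are not held to the one-to-one rule; they would be
# handled together with the syntactic structures (see the note in step 3).
FUNCTION_WORDS = {"the", "a", "an", "to", "in", "and", "or", "of"}

# Hypothetical content-word lexicon (step 3). Illustration only.
LEXICON = {"file": "fájl", "window": "ablak", "menu": "menü"}


def unknown_words(sentence, allowed=ALLOWED_CONTENT_WORDS,
                  function_words=FUNCTION_WORDS):
    """Return the sorted words in `sentence` that are outside the vocabulary."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sorted({w for w in words
                   if w not in allowed and w not in function_words})


def one_to_one_violations(lexicon):
    """Return target words that more than one source word maps to."""
    sources_by_target = {}
    for source, target in lexicon.items():
        sources_by_target.setdefault(target, []).append(source)
    return {target: sources
            for target, sources in sources_by_target.items()
            if len(sources) > 1}


if __name__ == "__main__":
    print(unknown_words("Open the file in the big window."))
    print(one_to_one_violations(LEXICON))
```

A check like this could run whenever the English docs change, so that a writer who strays outside the vocabulary finds out before the automatic translation is regenerated. It only covers vocabulary; the syntactic restrictions would need a real parser and are not attempted here.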

Essentially, this is a lot of work up front. But IF we can get it all set up, and IF our doc writers stick to the restrictions, then it's fully automated afterward, and any updates to the English docs are immediately reflected in the target languages. So it's either a lot of work up front, or little bits of work every time a change is made to keep the translations up to date.

So it depends on what resources the doc team has, and where it's willing to spend those resources.
