[liblouis-liblouisxml] SV: Re: SV: Re: SV: Re: SV: 8 dots contracted with caps was: are the swap opcodes broken?

From: Bue Vester-Andersen <bue@xxxxxxxxxxxxxxxxxx>
To: <liblouis-liblouisxml@xxxxxxxxxxxxx>
Date: Fri, 20 Jan 2017 14:48:10 +0100

Hi Bert,

I think I have a good example. Let us still use the string “txt” or rather
“.txt”. This is the mark of a file extension, and according to the Danish
rules, file names, URLs, email addresses, etc. should be written in grade 1,
i.e. without contractions. However, it is still unclear whether the x should be
preceded by a letsign, since a human is supposed to be able to interpret it as a
file name and then know that the x should be read as “x” and not as “mm” or in
Danish “or”. In other words, we have something like a circular definition or
process.

Without the letsign, this example is completely ambiguous, because liblouis has
no way of knowing if you want to write “.txt” or “.tmmt” (or in Danish “.tort”).
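
To make the ambiguity concrete, here is a minimal Python sketch. It does not use liblouis; the READINGS table is a toy assumption that only encodes the two readings of the x cell mentioned above, and candidate_readings is a hypothetical helper name.

```python
from itertools import product

# Toy table (an assumption for illustration, not real liblouis data):
# each braille cell, written as ASCII braille, maps to its possible
# print readings.
READINGS = {
    "x": ["x", "mm"],  # the letter x, or the part-word contraction
}

def candidate_readings(braille):
    """Return every print string the braille cells could stand for."""
    options = [READINGS.get(cell, [cell]) for cell in braille]
    return ["".join(choice) for choice in product(*options)]

print(candidate_readings(".txt"))  # ['.txt', '.tmmt']
```

Without a letsign or some other context, nothing in the braille itself selects one of the two candidates.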

Forward translation also has its odd corners, especially in many Germanic
languages where contraction always takes place within the boundaries of
syllables. In Danish, the word “vandret” can mean two things: “vand-ret” means
horizontal, and “van-dret” means walked or migrated. In the first case, you
may/should use the contraction for “nd” (the letter q), but in the second case,
this contraction is not allowed. Liblouis has no way of knowing which is the
correct word. I have heard of similar examples in other languages.

In this particular case, I chose to omit the “nd” contraction, reasoning that
it is better not to use a contraction that is allowed than to actually use one
that is not allowed.

Dropped signs can also be a problem, both when translating and
back-translating. Many of them have a meaning both as a punctuation sign and as
a contraction, e.g. dots 235. Usually, there are rules on how to use these signs,
so you don’t confuse contractions and punctuation when reading. However, it is
usually easy to come up with examples that are obvious to the human mind but
ambiguous to the computer. Perhaps not so much in Danish as in English or
German.

Letters with accents are another example of complete ambiguity. The letter e
can have many accents, but if the accents are not a part of the given Braille
code, there is usually only one way to mark a “foreign” accent. So, when
back-translating, Liblouis cannot know which accent was used in the original
text.

Btw: This last case would not be caught by a back-translate/re-translate test
cycle. The incorrectly back-translated accent would still result in the same
Braille accent marker when re-translated.

Usually, I test back-translation with a translation/back-translation cycle and
then test the back-translated text against the original.
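
The difference between the two test cycles can be sketched in a few lines of Python. This is a toy model, not liblouis: the "^" accent marker, the choice of é as the back-translation, and the forward/backward functions are all assumptions made for illustration.

```python
ACCENTED_E = "éèêë"
ACCENT_MARK = "^"  # hypothetical stand-in for the single "foreign accent" sign

def forward(text):
    """Toy forward translation: every accented e becomes marker + e."""
    return "".join(ACCENT_MARK + "e" if ch in ACCENTED_E else ch for ch in text)

def backward(braille):
    """Toy back-translation: the marker can only be mapped to one accent."""
    return braille.replace(ACCENT_MARK + "e", "é")

original = "crème"
braille = forward(original)
restored = backward(braille)

# Back-translate/re-translate cycle: the wrong accent goes unnoticed.
print(forward(restored) == braille)  # True

# Translate/back-translate cycle compared against the original: caught.
print(restored == original)  # False
```

Only the comparison against the original print text reveals that the accent was lost.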

On the whole, the problem with Braille is the fact that it was never designed
to be a one-to-one representation of ink print, but rather a practical system to
enable blind people to read and write. The rules were never made by
mathematicians to comply with strict logic, but by people who were often
willing to sacrifice clarity for brevity and logic for practical
usefulness. So, I don’t think we can ever reach perfection in translation and
back-translation, but we can strive for excellence.

That said, I think both we and Liblouis are doing a great job, especially in
the areas where focused work is being put in. Concerning Danish Braille, I am
particularly impressed by what the hyphenation algorithm has done for correct
contraction of compound words. This has been haunting all previous attempts at
Danish Braille translation and has usually led to light-year-long lists of
exceptions.

When the Danish tables have become somewhat more stable (and I don’t have to
fear for them being broken with every commit :-), I would like to have a look
at some proper tables for German back-translation, that is, if no one more
qualified is already doing it. I already have a quite good working knowledge of
German Braille, but, of course, I would need to brush up on the rules. Do you,
by chance, have any authoritative material on German braille in electronic
format?

Bue

From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bert Frees
Sent: 20 January 2017 11:44
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] Re: SV: Re: SV: Re: SV: 8 dots contracted with
caps was: are the swap opcodes broken?

OK thanks. Well, in this particular example it's pretty clear what the correct
back-translation is, right? And this case isn't that hard for a computer
program to solve. Do you also have examples where it is less clear, or even
completely ambiguous?

I imagine a lot of braille codes have cases even without capitals that pose
challenges on automatic back-translation. I have to admit I have no idea what
Liblouis does at the moment, and haven't thought about back-translation in
general much at all, so this could be a pointless or naive question, but I'll
ask anyway: wouldn't it be a good strategy to validate different
back-translation scenarios by forward-translating them again?
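
A minimal Python sketch of that validation idea, under loudly stated assumptions: the forward function below is a toy with one contraction (mm written as x), a hypothetical ";" letsign for a literal x, and a hypothetical no-contraction trigger for ".txt" that omits the letsign. None of this is taken from the real liblouis tables.

```python
LETSIGN = ";"                # hypothetical letter sign
NOCONT_TRIGGERS = (".txt",)  # hypothetical strings that switch off contractions

def forward(text):
    """Toy forward translation, for illustration only."""
    if any(trigger in text for trigger in NOCONT_TRIGGERS):
        return text  # grade 1: no contractions, and here no letsign either
    # Contracted text: mark a literal x, then contract mm to the x cell.
    return text.replace("x", LETSIGN + "x").replace("mm", "x")

def validate(braille, candidates):
    """Keep the candidates that forward-translate back to the same braille."""
    return [c for c in candidates if forward(c) == braille]

print(validate("xt", ["xt", "mmt"]))        # ['mmt']
print(validate(".txt", [".txt", ".tmmt"]))  # ['.txt', '.tmmt']
```

In this toy model the check removes a candidate whose forward translation would have needed a letsign, but it cannot separate two candidates that legitimately produce identical braille.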

2017-01-19 23:37 GMT+01:00 Bue Vester-Andersen <bue@xxxxxxxxxxxxxxxxxx>:

Hi Bert,

Technical or computer unfriendly? Probably a bit of both, but not impossible, I
think.

I will try to give an example where back-translation might go wrong:

Take the string “TXT”. Never mind that it is also a computer term and should
probably therefore not be contracted in the first place.

If capsnocont is in effect, it will be translated as either ,,txt or ,t,x,t
depending on the status of capsword (plain TXT in 8 dots). So far, so good. No
contraction anyway.

Back-translating ,,txt you get TXT because the begcapsword tells liblouis to
not use contraction rules when back-translating.

However, back-translating ,t,x,t or TXT, you get TMmT, unless Liblouis knows
that it should use the capsnocont rule whenever it sees two consecutive caps,
or unless the x had a letsign in addition to the capslettersign.
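
The effect of such a "two consecutive caps" rule can be sketched in Python as follows. This is a toy back-translator, not liblouis behaviour: the CONTRACTIONS table, the parse helper and the caps_heuristic flag are hypothetical, "," stands for the capital letter sign, and the begcapsword form ,,txt is deliberately not handled.

```python
CAPS = ","                  # capital letter sign in ASCII braille
CONTRACTIONS = {"x": "mm"}  # toy table, for illustration only

def parse(braille):
    """Split ASCII braille into (is_capital, cell) pairs."""
    cells, i = [], 0
    while i < len(braille):
        if braille[i] == CAPS and i + 1 < len(braille):
            cells.append((True, braille[i + 1]))
            i += 2
        else:
            cells.append((False, braille[i]))
            i += 1
    return cells

def back_translate(braille, caps_heuristic):
    """Expand contractions, optionally skipping runs of capital letters."""
    cells = parse(braille)
    out = []
    for i, (is_cap, cell) in enumerate(cells):
        prev_cap = i > 0 and cells[i - 1][0]
        next_cap = i + 1 < len(cells) and cells[i + 1][0]
        in_caps_run = is_cap and (prev_cap or next_cap)
        if caps_heuristic and in_caps_run:
            text = cell  # treat the run as capsnocont: no contractions
        else:
            text = CONTRACTIONS.get(cell, cell)
        out.append(text.capitalize() if is_cap else text)
    return "".join(out)

print(back_translate(",t,x,t", caps_heuristic=False))  # TMmT
print(back_translate(",t,x,t", caps_heuristic=True))   # TXT
```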

The rules for letsigns in this context might be different from language to
language, hence the computer unfriendliness. The Danish rules are unclear on
this, but I think most people would use a letsign in a case like this one.

So, it is mainly a question of securing the correct back-translation, even if
there is no begcapsword sign to indicate clearly that contraction rules should
not be used here.

Hope it makes more sense now.

Bue

From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bert Frees
Sent: 19 January 2017 09:55
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] Re: SV: Re: SV: 8 dots contracted with caps was:
are the swap opcodes broken?

2017-01-18 21:17 GMT+01:00 Bue Vester-Andersen <bue@xxxxxxxxxxxxxxxxxx>:

Hi Bert,

Regarding your first "btw" I don't quite understand what the problem is.
Maybe you are overthinking it?

The problem is that the back-translator could apply contraction rules because
it does not know that it is in a no-contractions state. A German example would
be the letters that are also used as partword contractions, i.e. q, x, and y.
In Danish, we have similar letters: q, w, x, and z. If capsnocont is defined
and the back-translator sees a begcapsword, it knows that contraction rules
should not be applied. But if no begcapsword is used, it should react to seeing
two or more capital letters. A similar problem occurs with the nocont opcode
where a certain text string triggers the no-contractions state, e.g. http://,
.txt, or .zip. Hope it makes sense.

Sorry, didn't realize you were talking about backward translation at first. But
I'm still not sure whether this is about a technical difficulty (that can be
solved), about a non-computer-friendly braille code, or about a fundamental
problem in the braille code? Some real examples would be nice. (Sorry I must
sound stupid but some things are just not so easy to grasp without proper
braille knowledge). Thanks.

Regarding your second btw, yes perhaps you are right. But in which category
fall words that are not fully uppercase, but also not only the first letter?

Hmm, good question. I don’t know about the rules for this in other languages,
but I would say that mixed caps should be treated like all caps. Otherwise, you
could have some very confusing combinations of contracted and uncontracted
braille within the same word. The alternative is to have three separate
opcodes: singlecapsnocont, mixedcapsnocont, and allcapsnocont. I think that
would be overkill, but of course I might be proven wrong. :)

Yes I agree extra opcodes would probably be overkill. Just wanted to know how
mixed caps should be treated. Documentation should make this clear.

2017-01-17 20:56 GMT+01:00 Bue Vester-Andersen <bue@xxxxxxxxxxxxxxxxxx>:

Btw: Testing backwards made me aware of a little snag: If capsnocont has been
defined, contraction rules should of course not be used when in capsword mode.
This should be easy enough when begcapsword/endcapsword are also defined.
However, if begcapsword/endcapsword are not defined, we have to assume a
capsword situation and activate capsnocont if capital letters or contractions
appear after each other.

Btw: according to the manual, capsnocont only affects all caps words, not words
with only the first letter capitalized. This is fine for the current purpose,
but I think there are languages where you cannot contract words with first cap
either. Until recently, this was the case in Danish 6 dots grade 2, but the
rules have been changed, so that it now behaves more like English in this
respect. Perhaps “allcapsnocont” would be a better name with respect to what it
does. If we then need an opcode to stop contraction of single caps, we could
use the name capsnocont. What do you say?

