[liblouis-liblouisxml] Re: SV: Re: SV: Re: Inconsistent behavior with begcaps and endcaps

  • From: Davy Kager <DavyKager@xxxxxxxxxx>
  • To: "'liblouis-liblouisxml@xxxxxxxxxxxxx'" <liblouis-liblouisxml@xxxxxxxxxxxxx>
  • Date: Thu, 6 Aug 2015 06:48:03 +0000

Hi Bue,

The capsmodechars opcode has been merged into the feature/ueb_update_code
branch, so you could give that a try. This is mostly Michael Gray's work; mine
focused mainly on applying his new opcodes to the Dutch table.
Unfortunately I haven't looked at back-translation at all (and I don't think
Michael has either), since my goal was to get forward-translation working. I
may have time to look at back-translation in the fall. We need to validate the
new opcodes and check them for regressions with Bert anyway, so if you find
something odd please write in!

Also be aware that the branch currently also contains the emphmodechars opcode.
I reverted this afterwards because it has issues. So ignore that one for now.

Regarding the 'cancelsign', maybe Michael's numericmodechars and
numericnocontchars opcodes can be of use to you. The latter proved quite useful
if you defined the contractsign, but sadly the Dutch rules are too twisty to
make number translation work with it. Even with multi-pass rules there is still
an unsolved corner case. Soapbox moment: 8-dot uncontracted braille is the
future!

Davy

-----Original message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bue Vester-Andersen
Sent: Tuesday, 4 August 2015 15:24
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] SV: Re: SV: Re: Inconsistent behavior with
begcaps and endcaps

Hi Davy,

Yes, I think the capsmodechars opcode would benefit the Danish tables as well,
especially if it were legal to have one dot pattern represent multiple signs,
e.g. letsign and endcaps. Like in Dutch, our letsign is really more a
cancelsign (or whatever would be the better English word for it). Dot 6 cancels
contraction (only for the next character), caps mode and number mode. Also,
most anything that isn't a capital letter will cancel the caps mode.

Emphasis is different. We use the same dot pattern for all emphasis (dots 56),
and also for end emphasis. However, I am luckier than you in that an emphasis
sequence must always be terminated explicitly, even if it is just one single
letter.

Has the capsmodechars opcode been merged into master yet, and do you know how
well it works for back-translation?

Bue

-----Original message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Davy Kager
Sent: 4 August 2015 08:45
To: 'liblouis-liblouisxml@xxxxxxxxxxxxx'
Subject: [liblouis-liblouisxml] Re: SV: Re: Inconsistent behavior with begcaps and
endcaps

Hi Bue,

the opcode is the opposite of what you suggest, i.e. 'do not end caps
characters'. The behavior is as follows:
* By default everything except an uppercase letter ends a sequence of caps
within a word.
* All characters defined with 'capsmodechars' also do not end a sequence of
caps within a word. In Dutch the hyphen is one such character.

How would this rule be defined for back-translation? Unless the translator
adds an endcaps/letsign/some_equivalent_sign, the back-translator would not
know when to stop outputting capitals. Of course, the presence of a
back-translated character which is not a letter and not a capsmodechar should
switch to small letters.

In Dutch, and I think in UEB and EBAE as well, the 'multiple consecutive caps'
sign is different from the 'single caps' sign. For the single caps it's always
obvious when to stop. For the multi-caps sign you stay in capital mode until
the word ends (indicated by whitespace or the end of the input) or until a
non-capsmode character is back-translated. Were you to write USB-cable in Dutch
you would need an endcaps sign after the hyphen to indicate the end of the caps
sequence. This is needed because the hyphen is a 'capsmodechar'. So yes, you do
need some extra signs to make clear when caps end, but that is true for forward
translation as well.
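
The back-translation rule described above can be sketched in a few lines of
Python. This is only a toy model under stated assumptions, not liblouis code:
the marker strings (",," for multiple caps, "," for a single capital, "'" for
endcaps) and the function name are hypothetical, and cells are represented as
plain ASCII characters.

```python
# Toy model of back-translating caps; all markers are hypothetical.
CAPS_WORD = ",,"         # assumed 'multiple consecutive caps' sign
CAPS_SINGLE = ","        # assumed 'single caps' sign
END_CAPS = "'"           # assumed explicit endcaps sign
CAPS_MODE_CHARS = {"-"}  # characters that do NOT end caps mode (as in Dutch)


def back_translate(braille: str) -> str:
    out = []
    caps_mode = False   # set by the multi-caps sign
    caps_next = False   # set by the single-caps sign
    i = 0
    while i < len(braille):
        if braille.startswith(CAPS_WORD, i):
            caps_mode = True
            i += len(CAPS_WORD)
            continue
        ch = braille[i]
        i += 1
        if ch == CAPS_SINGLE:
            caps_next = True
            continue
        if ch == END_CAPS:
            caps_mode = False
            continue
        if ch.isalpha():
            out.append(ch.upper() if (caps_mode or caps_next) else ch)
            caps_next = False
            continue
        # Any non-letter that is not a capsmodechar (whitespace included)
        # ends caps mode.
        if ch not in CAPS_MODE_CHARS:
            caps_mode = False
        caps_next = False
        out.append(ch)
    return "".join(out)


print(back_translate(",,usb-'cable"))  # endcaps sign after the hyphen
print(back_translate(",,usb-cable"))   # no endcaps sign: caps mode continues
```

Without the endcaps sign after the hyphen, the model stays in caps mode for the
rest of the word, which is the USB-CABLE result reported at the start of this
thread.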

It gets more problematic with emphasis. Not only are the bold/ital/underline
signs all the same in Dutch, but they also indicate multiple things. In
particular, the indicator for a single emphasized character is the same as that
for indicating multiple consecutive characters in a word. The only way to know
which applies is to look for an end emphasis sign, and if you find one you know
you were dealing with multiple emphasized characters. But if emphasis ends at a
word boundary you don't insert the end sign because whitespace is defined to
end emphasis mode. So this can't really be back-translated properly. I think
this is a problem with many tables.

I think that some of the problems with our different definition and role of
letsign could be solved by allowing one cell to represent multiple signs. In
Danish, we could then have:

letsign 6
endcaps 6

Currently, these lines won't work. When the dot 6 is seen by the letsign
test, the character is eaten, so it will never be seen by endcaps. At least,
that is how I think it works.

Dutch defines two signs: the end caps/emphasis sign (dot 6) and the 'second
meaning' sign (dot 5). The first applies to both caps and all types of
emphasis, and this works reasonably well for forward-translation. The 'second
meaning' sign is currently handled by multi-pass rules. This too works well
except for back-translation. Dutch has no contractions, so I think letsign is
less relevant. The table doesn't define it and so I'm not running into issues
with that.

The only problem with the end caps/emphasis sign is that one occurrence is
defined to end all modes that were active at the time. So if you wanted to
write USB-cable and emphasize the whole word except for the last two letters,
i.e. 'USB-cab', then you would need to 'restart' emphasis mode after the hyphen
because an end caps sign is inserted there which also cancels emphasis. I don't
think liblouis handles this correctly right now, and I'm not sure if it can be
fixed in a way that isn't totally confusing. In any case, this makes
back-translation of emphasis pretty much infeasible.
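
To make the 'restart' problem concrete, here is a small Python sketch of a
forward translator in which a single end sign cancels every active mode. It is
a simplified model, not the liblouis implementation: the marker strings and the
function name are hypothetical, only runs of capitals are modeled (no separate
single-caps sign), and whitespace is assumed to end all modes silently.

```python
# Toy model: one end sign cancels both caps and emphasis (markers hypothetical).
BEG_CAPS = ",,"          # assumed sign opening a run of capitals
BEG_EMPH = "_"           # assumed sign opening emphasis
END_MODES = "'"          # assumed single sign that ends caps AND emphasis
CAPS_MODE_CHARS = {"-"}  # characters that keep caps mode alive


def translate(text, emphasized):
    """text: print string; emphasized: one bool per character."""
    out = []
    caps = emph = False
    for ch, want_emph in zip(text, emphasized):
        if ch.isspace():
            # A word boundary ends all modes without any sign.
            caps = emph = False
            out.append(ch)
            continue
        want_caps = ch.isupper() or (caps and ch in CAPS_MODE_CHARS)
        if (caps and not want_caps) or (emph and not want_emph):
            # The end sign cancels everything that was active.
            out.append(END_MODES)
            caps = emph = False
        if want_emph and not emph:
            # Emphasis has to be (re)started if it should still be active.
            out.append(BEG_EMPH)
            emph = True
        if ch.isupper() and not caps:
            out.append(BEG_CAPS)
            caps = True
        out.append(ch.lower())
    return "".join(out)


# 'USB-cab' emphasized, 'le' not emphasized
print(translate("USB-cable", [True] * 7 + [False] * 2))
```

In this model the end sign after the hyphen is followed by a second emphasis
sign. A back-translator cannot tell whether that pair means "caps ended,
emphasis continues" or "everything ended, new emphasis begins", which
illustrates why back-translating emphasis looks infeasible here.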

Davy

-----Original message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bue Vester-Andersen
Sent: Monday, 3 August 2015 16:38
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] SV: Re: Inconsistent behavior with begcaps
and endcaps

Hi Davy,

You wrote:

the opcode is the opposite of what you suggest, i.e. 'do not end caps
characters'. The behavior is as follows:
* By default everything except an uppercase letter ends a sequence of caps
within a word.
* All characters defined with 'capsmodechars' also do not end a sequence of
caps within a word. In Dutch the hyphen is one such character.

How would this rule be defined for back-translation? Unless the translator adds
an endcaps/letsign/some_equivalent_sign, the back-translator would not know
when to stop outputting capitals. Of course, the presence of a back-translated
character which is not a letter and not a capsmodechar should switch to small
letters.

I think that some of the problems with our different definition and role of
letsign could be solved by allowing one cell to represent multiple signs. In
Danish, we could then have:

letsign 6
endcaps 6

Currently, these lines won't work. When the dot 6 is seen by the letsign
test, the character is eaten, so it will never be seen by endcaps. At least,
that is how I think it works.
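
The suspected behavior can be illustrated with a tiny Python model. It only
mirrors the guess made above (the first matching test consumes the cell) and
does not reflect how liblouis is actually implemented; both function names are
hypothetical.

```python
# Toy model of the guess above; not based on the liblouis source.
def first_match_dispatch(cells, signs):
    """Each cell is consumed by the first sign definition that matches it."""
    fired = []
    for cell in cells:
        for name, dots in signs:
            if cell == dots:
                fired.append(name)
                break  # the cell is "eaten"; later definitions never see it
    return fired


def multi_sign_dispatch(cells, signs):
    """Alternative: one cell may represent several signs at once."""
    return [[name for name, dots in signs if cell == dots] for cell in cells]


# Mirrors the two table lines: letsign 6 / endcaps 6
signs = [("letsign", "6"), ("endcaps", "6")]
print(first_match_dispatch(["6"], signs))
print(multi_sign_dispatch(["6"], signs))
```

Under the first-match model only letsign ever fires for dot 6, while the second
function shows the "one cell, multiple signs" behavior being asked for.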

Hope it makes sense.

Bue

-----Original message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Davy Kager
Sent: 3 August 2015 09:00
To: 'liblouis-liblouisxml@xxxxxxxxxxxxx'
Subject: [liblouis-liblouisxml] Re: Inconsistent behavior with begcaps and endcaps

Hi Bue,

Funny, this is the set of corner cases I've been writing about for a few weeks
now. Maybe capturing all languages in a handful of opcodes is too ambitious? I
also can't help but notice that, again, the contracted words you and Susan
provided aren't any shorter than the print version. Interesting to see a
multi-cell caps sign and endcaps sign used like that.

Besides, in many languages we don't have a specific endcaps marker other than
letsign or the presence of a punctuation character.
Yes, and even the word 'letsign' doesn't work well for all languages, or
languages have more than one such sign.

I would suggest the following:

1. Behavior should be consistent for begcaps between forward and backward
translation.
Yes.

2. A new opcode called "autoendcaps characters". When this opcode is used,
the characters following autoendcaps imply the end of a sequence of capital
letters without the need for any further marker. It tells the translator to
use a new begcaps sign if another sequence of capital letters follows. It
also tells the back-translator to end the sequence of capital letters when
one of the characters after autoendcaps is encountered.

I have added such an opcode to the UEB work done by Michael. Of course this is
based on opcodes and semantics that were introduced for UEB, which as Susan
wrote has different rules from EBAE. This means that the opcode is the opposite
of what you suggest, i.e. 'do not end caps characters'. The behavior is as
follows:
* By default everything except an uppercase letter ends a sequence of caps
within a word.
* All characters defined with 'capsmodechars' also do not end a sequence of
caps within a word. In Dutch the hyphen is one such character.
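
As a rough illustration of the two rules above for forward translation, here
is a minimal Python sketch. It is not the code from the branch: the marker
strings and the function name are hypothetical, and it assumes that an explicit
endcaps sign is emitted whenever a lowercase letter follows while caps mode is
still active.

```python
# Toy forward translator for the capsmodechars semantics (markers hypothetical).
CAPS_WORD = ",,"         # assumed sign for multiple consecutive caps
CAPS_SINGLE = ","        # assumed sign for a single capital
END_CAPS = "'"           # assumed explicit endcaps sign
CAPS_MODE_CHARS = {"-"}  # e.g. a table line such as: capsmodechars -


def forward_translate(text: str) -> str:
    out = []
    caps_mode = False
    for i, ch in enumerate(text):
        if ch.isupper():
            if not caps_mode:
                nxt = text[i + 1:i + 2]  # empty string at end of input
                if nxt.isupper():
                    out.append(CAPS_WORD)
                    caps_mode = True
                else:
                    out.append(CAPS_SINGLE)
            out.append(ch.lower())
        elif caps_mode and ch in CAPS_MODE_CHARS:
            # A capsmodechar does not end the caps sequence.
            out.append(ch)
        else:
            # Everything else ends the sequence. A letter needs an explicit
            # sign (assumption); other characters end it implicitly.
            if caps_mode and ch.isalpha():
                out.append(END_CAPS)
            caps_mode = False
            out.append(ch)
    return "".join(out)


print(forward_translate("USB-cable"))
print(forward_translate("USB cable"))
```

With the hyphen listed as a capsmodechar, USB-cable needs the endcaps sign
after the hyphen, whereas USB cable does not because the space already ends the
sequence.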

I'm not sure if this is enough for EBAE and if it solves your problem. As I
wrote earlier back-translation is probably still not fully functional.

The (slightly outdated) code lives here:
https://github.com/liblouis/liblouis/tree/feature/ueb_update_code

Davy

-----Original message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bue Vester-Andersen
Sent: Saturday, 1 August 2015 19:21
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] Inconsistent behavior with begcaps and endcaps

Hi,

the opcodes begcaps and endcaps do not behave consistently between forward and
backward translation if punctuation characters like - are involved.

With en-us-g2.ctb, you could try the combination USB-cable (yes, I know it is
not spelled like that in English, but pick a better example yourself).

USB-cable translates to
,,usb-ca#
which then back-translates to
USB-CABLE (all caps)

Besides, in many languages we don't have a specific endcaps marker other than
letsign or the presence of a punctuation character.

I would suggest the following:

1. Behavior should be consistent for begcaps between forward and backward
translation.

2. A new opcode called "autoendcaps characters". When this opcode is used, the
characters following autoendcaps imply the end of a sequence of capital letters
without the need for any further marker. It tells the translator to use a new
begcaps sign if another sequence of capital letters follows. It also tells the
back-translator to end the sequence of capital letters when one of the
characters after autoendcaps is encountered.
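
A minimal Python sketch of how the proposed autoendcaps semantics could give a
consistent round trip. The opcode does not exist in liblouis; the marker
strings, the function names and the fallback endcaps sign for letters are all
assumptions made for illustration.

```python
# Toy model of the proposed 'autoendcaps' opcode (everything here is assumed).
BEG_CAPS = ",,"        # assumed begcaps sign for a run of capitals
CAPS_SINGLE = ","      # assumed sign for a single capital
END_CAPS = "'"         # assumed explicit endcaps sign, used only as a fallback
AUTO_END_CAPS = {"-"}  # hypothetical table line: autoendcaps -


def forward(text: str) -> str:
    out, caps = [], False
    for i, ch in enumerate(text):
        if ch.isupper():
            if not caps:
                if text[i + 1:i + 2].isupper():
                    out.append(BEG_CAPS)
                    caps = True
                else:
                    out.append(CAPS_SINGLE)
            out.append(ch.lower())
        else:
            # autoendcaps characters and whitespace end caps without a marker.
            if caps and not (ch in AUTO_END_CAPS or ch.isspace()):
                out.append(END_CAPS)
            caps = False
            out.append(ch)
    return "".join(out)


def backward(braille: str) -> str:
    out, caps, caps_next, i = [], False, False, 0
    while i < len(braille):
        if braille.startswith(BEG_CAPS, i):
            caps = True
            i += len(BEG_CAPS)
            continue
        ch = braille[i]
        i += 1
        if ch == CAPS_SINGLE:
            caps_next = True
            continue
        if ch == END_CAPS:
            caps = False
            continue
        if ch in AUTO_END_CAPS or ch.isspace():
            caps = False  # same rule as in forward(): implicit end of caps
        out.append(ch.upper() if (caps or caps_next) else ch)
        caps_next = False
    return "".join(out)


for word in ("USB-cable", "USB-C cable", "USBs"):
    print(word, "->", forward(word), "->", backward(forward(word)))
```

Because both directions consult the same character set, a new begcaps sign is
emitted after the hyphen when more capitals follow, and the back-translator
drops out of caps mode at the same point.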

Best regards,
Bue



For a description of the software, to download it and links to project pages go
to http://liblouis.org
DISCLAIMER:
De informatie verzonden met dit e-mail bericht is uitsluitend bestemd voor de
geadresseerde. Indien u niet de beoogde geadresseerde bent, verzoeken wij u
vriendelijk dit aan de afzender te melden (of via:
info@xxxxxxxxxx<mailto:info@xxxxxxxxxx>) en het origineel en eventuele kopieën
te verwijderen.

The information sent in this e-mail is solely intended for the individual or
company to whom it is addressed. If you received this message in error, please
notify the sender immediately (or mail to
info@xxxxxxxxxx<mailto:info@xxxxxxxxxx>) and delete the original message and
possible copies.