[liblouis-liblouisxml] Re: Capital/Emphasis update

  • From: Michael Gray <mgray@xxxxxxx>
  • To: "liblouis-liblouisxml@xxxxxxxxxxxxx" <liblouis-liblouisxml@xxxxxxxxxxxxx>
  • Date: Tue, 17 Feb 2015 19:28:59 +0000

I hope the following answers your questions.

>> Words and characters are determined by characters defined as letters.  If the
>> emphasis markings do not start at the beginning of a word it is shifted to the
>> beginning of the first word.  If the emphasis markings do not end at a word
>> end, it is shifted to the end of the last word.
>>
>> Characters marked as capital are merged with other characters marked as
>> capitals if the characters between them are defined as spaces.

>I don't quite understand. Does this mean that A B C is treated as a single
>uppercase word?

A B C is treated as a passage: ,,,a ;b ;c,' .  The previous approach did not 
work well with the UEB standard, which is why it was replaced. 
The current behavior is as follows (and subject to change).  This behavior 
seems to best match the majority of the examples in the UEB.

1.  All words are marked using *word and *wordstop, checking whether 
singleletter* should be used instead.  It is also checked whether each word is 
completely covered, i.e. all caps, all underlined, etc.
2.  All runs of consecutive whole words (words completely covered) that are at 
least len*phrase words long are converted to passages using firstletter* and 
lastletter* (or firstword* and lastwordbefore* or lastwordafter*, see next).
3.  (Capitalization only)  All words that are not in a passage are checked for 
word resets, i.e. hyphens, apostrophes, etc.

Words are determined as symbols-sequences (Rules of Unified English Braille 
2013, page 8):
symbols-sequence:  an unbroken string of braille signs, whether alphabetic or 
non-alphabetic, preceded and followed by space (also referred to as 
symbols-word) 

If any of the opcodes are not defined, the corresponding stage is skipped and 
the resulting translation is undefined.
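
The first two stages above can be sketched roughly as follows.  This is only an 
illustration in Python, not the LibLouis implementation: the function names, 
the '0'/'1' emphasis string, and the len_phrase parameter (standing in for the 
len*phrase opcode value) are assumptions made for the example.

```python
def symbols_sequences(text):
    """Return (start, end) spans of space-delimited symbols-sequences."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def find_passages(text, emphasis, len_phrase):
    """Group consecutive fully covered words into passages.

    emphasis holds one '0' or '1' flag per character of text (hypothetical
    representation).  Returns (first_word_index, last_word_index) pairs.
    """
    words = symbols_sequences(text)
    # A word is "whole" when every one of its characters is marked.
    whole = [all(emphasis[i] == "1" for i in range(s, e)) for s, e in words]
    passages, run_start = [], None
    for idx, covered in enumerate(whole + [False]):  # sentinel closes last run
        if covered and run_start is None:
            run_start = idx
        elif not covered and run_start is not None:
            if idx - run_start >= len_phrase:
                passages.append((run_start, idx - 1))
            run_start = None
    return passages


text = "a BIG RED dog"
flags = "".join("1" if c.isupper() else "0" for c in text)
print(find_passages(text, flags, len_phrase=2))  # [(1, 2)]
```

Runs shorter than len_phrase are left alone here, so they would keep their 
word-level marking from stage 1.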

>I'm curious, are you keeping firstletter* and lastletter* solely to preserve
>backwards-compatibility, or do they still have a real function in the new
>design? In the former case, I vote for dropping them for the benefit of
>simplicity. We don't have to worry about backwards-compatibility too much. It's
>easy enough to write a conversion script to update existing tables to the new
>syntax.

They would not be necessary for UEB, so yes, they would be kept only for 
backwards-compatibility.  In the most recent update they are not implemented, 
but are still there.  It would not be a problem to implement them, but I was 
going to ask this list whether or not they should be.  If someone does want 
that behavior then they would have to be implemented, as I don't think the 
remaining opcodes can replicate it.

>My first reaction was that while this adds opcodes and therefore complexity, it
>still isn't obvious to me whether it actually covers more cases than before or
>not. (I'm only talking about emphasis now, for capitals it is obvious!) Perhaps
>I should have a look at some concrete UEB examples before questioning, but
>anyway.

I felt that it is better to have a bunch of opcodes that each do one thing 
rather than have opcodes do several things depending on when, how, where, etc. 
they were used.  I originally had some opcodes used for several things, but I 
decided that giving each opcode just one function makes the design more 
flexible and easier to document and implement.  Also, the emphases all use the 
same set of opcodes, so one would only have to understand that one set.

I wrote a tool which allows me to test examples directly from the UEB standard. 
For capitalization it is 80% correct for the examples from chapter 8 of Rules 
of Unified English Braille (not including examples dealing with large text 
elements).  10% of those require what is described next.  The majority of the 
remaining failures seem to have to do with how LibLouis handles letter 
indicators.  I have attached the most recent list.
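
For anyone who wants to work with the attached list programmatically, here is a 
minimal Python sketch that reads records in the in:/emp:/ueb:/lou: layout used 
in the attachment and reports those where the two braille lines differ.  It is 
not the tool mentioned above; the function names and the handling of wrapped 
lines are assumptions.

```python
FIELDS = ("in", "emp", "ueb", "lou")


def parse_records(lines):
    """Collect in:/emp:/ueb:/lou: fields into one dict per example."""
    records, current, key = [], {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # the list is double-spaced, so blank lines carry no meaning
        if line.startswith("#"):
            key = None  # comment or section header; ignore its wrapped tail too
            continue
        prefix, sep, rest = line.partition(":")
        if sep and prefix in FIELDS:
            if prefix == "in" and current:
                records.append(current)
                current = {}
            key = prefix
            current[key] = rest.strip()
        elif key is not None:
            # Assumed: an unprefixed line continues the previous field.
            current[key] += " " + line.strip()
    if current:
        records.append(current)
    return records


def failures(records):
    """Return the records whose LibLouis output differs from the UEB text."""
    return [r for r in records if r.get("ueb") != r.get("lou")]
```

Calling failures(parse_records(open(path))) on a saved copy of the list would 
give the mismatching examples; path is whatever file name the copy was saved 
under.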

>One thing that would be useful is to be able to define characters that "break"
>an uppercase passage, and characters that don't. For example in Dutch, the
>characters that are not breaking (apart from letters), are minus, plus,
>ampersand, full stop, and apostrophe. How does this work in UEB and how are you
>handling that?

I added a passage_break bit and a word_reset bit to the typeform array so users 
can specify these things manually.  The examples below are from the UEB 
standard.  The middle line, if present, holds the emphasis markings, and a 
mono-spaced font works best for viewing.

The passage_break bit signifies that a new passage starts on this character and 
any other passages must stop before it.  Examples (@ indicates passage_break):

He worked for the ABC. A BBC journalist reported ...
00000000000000000000000@0000000000000000000000000000
,he "w$ = ! ,,abc4 ,a ,,bbc j|rnali/ report$ 444

STOP RUNNING NOW! It's dangerous.
000000000000000000@00000000000000
,,,/op runn+ n{6,' ,x's dang}|s4

INITIALS OF WRITER/initials of secretary
000000000000000000@000000000000000000000
,,,9itials ( writ},'_/9itials ( secret>y
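
As a rough illustration of the passage_break rule, the Python sketch below 
splits a character span wherever the marker line has an '@'.  The span 
representation and the function name are assumptions for the example, not part 
of LibLouis.

```python
def split_at_breaks(span, breaks):
    """Split a (start, end) character span at every passage_break position.

    breaks has '@' at each character where a new passage must start; any
    passage already open stops just before that character.
    """
    start, end = span
    pieces, piece_start = [], start
    for i in range(start + 1, end):
        if breaks[i] == "@":
            pieces.append((piece_start, i))
            piece_start = i
    pieces.append((piece_start, end))
    return pieces


text = "INITIALS OF WRITER/initials of secretary"
marks = ["0"] * len(text)
marks[text.index("/")] = "@"  # break on the slash, as in the example above
for s, e in split_at_breaks((0, len(text)), "".join(marks)):
    print(repr(text[s:e]))
```

Each resulting piece would then be checked on its own for whether it still 
qualifies as a passage.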

The word_reset bit specifies that a word indicator stops at that point in the 
word and will need to be repeated if it continues.  I was originally going to 
automatically add a word_reset to the hyphen opcode, and create apostrophe(') 
and initial(.) opcodes for this purpose, but just having the word reset on any 
non-alphabetic character worked just as well.  It would not be a problem to add 
an opcode so that these characters could be designated in the table files.  
Examples (P indicates word reset):

McGRAW-HILL
,mc,,graw-,,hill

UPPERCASE-lowercase
,,upp}case-l{}case

MERRY-GO-ROUND
,,m}ry-,,g-,,r.d

WELCOME TO McDONALD'S
,,welcome ,,to ,mc,,donald',s

www.BLASTSoundMachine.com
000000000P000000000000000
www4,,bla/,s.d,ma*9e4com

ATandT
0P0000
,a,t&,t
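
The word reset behavior can be sketched in Python as below: a word is cut at 
every non-alphabetic character and at any manually flagged index, and each 
piece is then judged on its own.  The function name and the resets argument 
(indices carrying the word_reset bit) are assumptions for illustration only.

```python
def reset_segments(word, resets=None):
    """Split word into pieces at which a capital word indicator restarts.

    A reset happens at every non-alphabetic character and at every index in
    resets.  Returns (piece, is_all_caps) pairs; separators are dropped.
    """
    resets = set(resets or ())
    pieces, current = [], ""
    for i, ch in enumerate(word):
        if i in resets and current:
            pieces.append(current)
            current = ""
        if ch.isalpha():
            current += ch
        elif current:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return [(p, p.isupper()) for p in pieces]


print(reset_segments("MERRY-GO-ROUND"))
# [('MERRY', True), ('GO', True), ('ROUND', True)]
print(reset_segments("ATandT", resets=[1]))
# [('A', True), ('TandT', False)]
```

Deciding which indicator each piece actually receives (single letter, word, or 
none) is left out of this sketch.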


I have added the 5 transcriber-defined typeform indicators.  Each of the five 
behaves the same way and follows the same opcode design as the other emphases:

? = 1, 2, 3, 4, 5
singlelettertrans?
trans?word
trans?wordstop
lentrans?phrase
firstwordtrans? 
lastwordaftertrans? 
firstlettertrans?
lastlettertrans?
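
To make the naming pattern concrete, this small Python sketch expands the 
templates listed above for ? = 1 to 5.  It only generates the names; the 
variable and function names are made up for the example.

```python
TEMPLATES = [
    "singlelettertrans?",
    "trans?word",
    "trans?wordstop",
    "lentrans?phrase",
    "firstwordtrans?",
    "lastwordaftertrans?",
    "firstlettertrans?",
    "lastlettertrans?",
]


def expand(templates, numbers=range(1, 6)):
    """Replace '?' in each template with each indicator number."""
    return [t.replace("?", str(n)) for n in numbers for t in templates]


names = expand(TEMPLATES)
print(len(names))   # 40
print(names[:3])    # ['singlelettertrans1', 'trans1word', 'trans1wordstop']
```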


Please let me know if these changes cause any problems with any other 
languages as I have so far just focused on UEB.

MRG
#   8.3.1



in:  20B

ueb: #bj,b

lou: #bj;,b



#~FAIL  en-ueb-g1.ctb:  noletsignafter 
.  en-ueb-g2.ctb:  noletsignafter .

in:  C. O. Linkletter

ueb: ;,c4 ,o4 ,l9klett}

lou: ,c4 ,o4 ,l9klett}



in:  B-E-L-I-E-V-E

ueb: ;;,b-,e-,l-,i-,e-,v-,e

lou: ;,b-;,e-;,l-,i-;,e-;,v-,e



#   8.3.2



#   8.3.3



in:  Voyage À Nice

ueb: ,voyage ,~*a ,nice

lou: ,voyage ;,~*a ,nice



#   8.4.2



in:  (R)AC

ueb: "<,r">,,ac

lou: "<;,r">,,ac



in:  B&B

ueb: ,b`&,b

lou: ;,b`&,b



in:  AT&T

ueb: ,,at`&,t

lou: ,,at`&;,t



#~FAIL  FOR SALE: 1975 FIREBIRD works?

#~~emp

#~000000@000

in:  SWIFT & CO.

ueb: ,,swift `& ,,co4

lou: ,,,swift `& co4,'



#   8.5.3



#~FAIL  en-ueb-g1.ctb:  midnum / 456-34

in:  BUY FAHRENHEIT 9/11 ON E-BAY

emp: 0000111111111111111000000000

ueb: ,,,buy .1fahr5heit .1#i_/#aa on 
;e-bay,'

lou: ,,,buy .1fahr5heit .1#i_/aa on 
;e-bay,'



#   8.5.4



in:  "... at 11:00 AM" MARKHAM 
ECONOMIST AND SUN

ueb: 8444 at #aa3#jj ,,am0,-,,,m>kham 
economi/ & sun,'

lou: "8444 at #aa3jj ,,,am0",-m>kham 
economi/ & sun,'



in:  … (See Attachment A).  A CSP 
(Carriage Service Provider) has  
obligations to …

ueb: 444 "<,see ,atta*;t ,a">4 ,a ,,csp 
"<,c>riage ,s}vice ,provid}"> has 
obliga;ns to 444

lou: ' "<,see ,atta*;t ,,,a">4  a csp,' 
"<,c>riage ,s}vice ,provid}"> has  
obliga;ns to '



#   8.6.2



in:  XXIInd

ueb: ,,xxii,'nd

lou: ,,xxi9,'d



in:  B-U-S

ueb: ;;,b-,u-,s

lou: ;,b-;,u-;,s



in:  [£]

ueb: .<,.s.>

lou: .<;,.s.>



in:  Voyage À Nice

emp: 1111111111111

ueb: .7,voyage ,~*a ,nice.'

lou: .7,voyage ;,~*a ,nice.'



in:  CD

ueb: ;,,cd

lou: ,,cd



in:  AC SMITH

ueb: ;,,ac ,,smi?

lou: ,,ac ,,smi?



in:  V-NECK SWEATERS FOR SALE!

ueb: ;,,,v-neck sw1t}s = sale6,'

lou: ,,;,v-neck sw1t}s = sale6,'



in:  CD CDs

ueb: ;,,cd ,,cd,'s

lou: ,,cd ,,cd,'s


