[liblouis-liblouisxml] Re: Capital/Emphasis update

  • From: Michael Gray <mgray@xxxxxxx>
  • To: "liblouis-liblouisxml@xxxxxxxxxxxxx" <liblouis-liblouisxml@xxxxxxxxxxxxx>
  • Date: Mon, 23 Feb 2015 17:28:27 +0000

>>> I'm curious, are you keeping firstletter* and lastletter* solely to preserve
>>> backwards-compatibility, or do they still have a real function in the new
>>> design? In the former case, I vote for dropping them for the benefit of
>>> simplicity. We don't have to worry about backwards-compatibility too
>>> much. It's easy enough to write a conversion script to update existing
>>> tables to the new syntax.
>>
>> They would not be necessary for UEB, so yes it will be for
>> backwards-compatibility.  In the most recent update they are not implemented,
>> but are still there.  It would not be a problem to implement them, but I was
>> going to ask this list whether or not they should be.  If someone does want
>> that behavior then they would have to be implemented, as I don't think the
>> remaining opcodes can replicate that behavior.
>
>Why don't you leave them unimplemented for now, we write a conversion script
>for the existing tables, and check if we still need them or not? Does that
>sound right?

The firstword* and lastword*after can map directly to firstletter* and 
lastletter*.  lastword*before would be the problem as there is no real 
equivalent functionality implemented.

>One could even go a step further, making each opcode's function even more
>limited. I had a eureka moment the other day, I'll throw it out here in case
>you find it useful.
>
>The idea was to have a set of low-level opcodes/marks that can be inserted
>alone or in pairs (or in triples, etc.) and that can be further processed using
>multipass rules. This would give us the flexibility that is needed to cover all
>thinkable cases, and maybe it would be even easier to implement and to
>understand.
>
>For example, there could be 6 marks per emphasis type: begin*, end*, word*,
>firstword*, lastword*before and lastword*after (where * is ital, caps,
>etc.). begin* and end* are placed at the beginning/end of passages. word*,
>firstword*, lastword*before and lastword*after would have the same meaning as
>now except that they are never placed in the middle of words. The beginning/end
>of passages can coincide with the beginning/end of words. Multipass rules are
>used to process sequences of marks. Combinations of emphasis (e.g. italic and
>bold) and combinations of emphasis and capitalisation could be handled in a
>similar way.

I think I follow what you are saying.  In dealing with multiple emphases, the
UEB standard (9.8.1) says that the order does not matter if they start and end
at the same place, but if they don't start at the same place then they should
be nested, i.e. the first typeform started should be the last typeform
closed.  It also seems to say (8.7.1) that the capitalization indicators should
always be immediately after the letters they are modifying, before any other
typeform indicators.  Right now I have LibLouis determining all of these things
before the main translation loop even starts.
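
As a rough illustration of the nesting rule only (not how LibLouis does it
internally), the "first started, last closed" ordering can be sketched with a
stack. The function name, the (start, end, name) span format, and the
"begin"/"end" event labels are all made up for this example:

```python
def nest_indicators(spans):
    """Order begin/end events so the first typeform opened is the last closed.

    spans: list of (start, end, name) tuples, end exclusive (hypothetical format).
    Returns a list of (position, "begin" | "end", name) tuples.
    Raises ValueError if two spans overlap without one containing the other.
    """
    # Earlier starts first; for equal starts, the longer span opens first.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    stack = []   # currently open spans, innermost last
    events = []
    for start, end, name in ordered:
        # Close every open span that has finished before this one begins.
        while stack and stack[-1][1] <= start:
            done = stack.pop()
            events.append((done[1], "end", done[2]))
        # A span that outlives the span it starts inside cannot be nested.
        if stack and end > stack[-1][1]:
            raise ValueError("spans overlap without nesting")
        stack.append((start, end, name))
        events.append((start, "begin", name))
    # Close whatever is still open, innermost first.
    while stack:
        done = stack.pop()
        events.append((done[1], "end", done[2]))
    return events


if __name__ == "__main__":
    # Bold passage inside an italic passage: bold must close before italic.
    for event in nest_indicators([(0, 20, "ital"), (5, 10, "bold")]):
        print(event)
```

Spans that start and end at the same place come out in input order, which
matches the point that their relative order does not matter.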

>>> One thing that would be useful is to be able to define characters that
>>> "break" an uppercase passage, and characters that don't. For example in
>>> Dutch, the characters that are not breaking (apart from letters), are minus,
>>> plus, ampersand, full stop, and apostrophe. How does this work in UEB and
>>> how are you handling that?
>>
>> I added a passage_break bit and a word_reset bit to the typeform array so
>> users can specify these things manually.  The examples below are from the
>> UEB standard.  The middle line, if there, is the emphasis, and mono-spaced font
>> works best for viewing.
>>
>> The passage_break bit signifies that a new passage starts on this character
>> and any other passages must stop before it.  Examples (@ indicates
>> passage_break):
>>
>> He worked for the ABC. A BBC journalist reported ...
>> 00000000000000000000000@0000000000000000000000000000
>> ,he "w$ = ! ,,abc4 ,a ,,bbc j|rnali/ report$ 444
>>
>> STOP RUNNING NOW! It's dangerous.
>> 000000000000000000@00000000000000
>> ,,,/op runn+ n{6,' ,x's dang}|s4
>>
>> INITIALS OF WRITER/initials of secretary
>> 000000000000000000@000000000000000000000
>> ,,,9itials ( writ},'_/9itials ( secret>y
>
>Okay I can see how this could be useful. But this requires preprocessing right?
>Do you think that in some cases it would be possible to automatically insert
>these passage_break bits? (for example at the beginning of sentences, then we
>would need sentence detection)

With word_resets it is easy as that is dealing with individual characters.
Passages are trickier because it is context that determines where they should
go.  I don't have any ideas at the moment on how to detect passages, but there
may be a way.
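
To make the manual marking concrete, here is a small sketch of how a marker
line like the ones in the quoted examples (0 for nothing, @ for
passage_break) could be turned into passage boundaries. This only reads marks
that are already there; it does not detect passages. The function names and
the marker-line format as an input are assumptions for illustration, not part
of the LibLouis API:

```python
def passage_break_positions(marker_line, break_char="@"):
    """Return the character indices flagged as passage breaks."""
    return [i for i, flag in enumerate(marker_line) if flag == break_char]


def split_at_passage_breaks(text, marker_line, break_char="@"):
    """Split text so that a new piece starts at every flagged character."""
    if len(text) != len(marker_line):
        raise ValueError("text and marker line must have the same length")
    pieces = []
    start = 0
    for pos in passage_break_positions(marker_line, break_char):
        if pos > start:
            pieces.append(text[start:pos])
        start = pos
    pieces.append(text[start:])
    return pieces


if __name__ == "__main__":
    text = "STOP RUNNING NOW! It's dangerous."
    # Break flagged on the "I" of "It's" (index 18), as in the quoted example.
    marker = "0" * 18 + "@" + "0" * 14
    print(split_at_passage_breaks(text, marker))
    # ['STOP RUNNING NOW! ', "It's dangerous."]
```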

>> The word_reset bit specifies that a word indicator stops at that point in the
>> word and will need to be repeated if it continues.  I was originally going to
>> automatically add a word_reset to the hyphen opcode, and create apostrophe(')
>> and initial(.) opcodes for this purpose, but just having the word reset on any
>> non-alphabetic character worked just as well. It would not be a problem to
>> add an opcode so that these characters could be designated in the table files.
>
>Yes, an opcode for defining characters that generate a word_reset bit sounds
>like a good idea to me.

Instead of creating individual opcodes for hyphens, apostrophes, etc., I may
implement this in the same manner as the noletsign opcode.  Let me know if you
want that done.
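
For anyone following along, the word_reset behavior can be sketched roughly
as below: by default every non-alphabetic character resets, and an optional
set of characters stands in for what a table opcode might designate. The
function names and the reset_chars parameter are hypothetical, and the sketch
assumes the whole word carries the emphasis:

```python
def word_reset_bits(word, reset_chars=None):
    """Return one boolean per character; True marks a word_reset.

    With reset_chars=None, any non-alphabetic character resets (the current
    behavior described above). Otherwise only the listed characters do.
    """
    if reset_chars is None:
        return [not ch.isalpha() for ch in word]
    return [ch in reset_chars for ch in word]


def indicator_positions(word, reset_chars=None):
    """Indices where a word indicator would have to be placed or repeated."""
    resets = word_reset_bits(word, reset_chars)
    positions = []
    need_indicator = True  # the word always starts with an indicator
    for i in range(len(word)):
        if resets[i]:
            # The indicator stops here; repeat it at the next ordinary character.
            need_indicator = True
        elif need_indicator:
            positions.append(i)
            need_indicator = False
    return positions


if __name__ == "__main__":
    print(indicator_positions("ABC-DEF"))             # [0, 4]
    print(indicator_positions("O'NEIL-SMITH"))        # [0, 2, 7]
    print(indicator_positions("O'NEIL-SMITH", "-"))   # [0, 7]
```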

