[liblouis-liblouisxml] Re: SV: Re: hyphenation-based contracted braille translation [was: specifying digraphs in libelous tables]

From: Bert Frees <bertfrees@xxxxxxxxx>
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Date: Mon, 09 Jun 2014 15:37:49 +0200

Hello Bue,

Thanks, that answers my question. I now understand that you're actually
using Frank Liang's algorithm for solving the problem of braille
contraction, not that of hyphenation. With Liang's algorithm, translation
tables can be made much smaller than with liblouis' relatively "simple"
rules. You're using libhyphen together with the nocross opcode basically
as a sort of hack to achieve that.

The patterns that you use match the braille contraction rules, and "by
accident" also the hyphenation rules, but in theory the patterns don't
have to be related to hyphenation at all. This approach will work as
long as a certain isolated "chunk" is always translated to the same
braille contraction. The division into chunks is taken care of by
libhyphen.
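
To make the "chunk" idea concrete, here is a minimal sketch in Python. It is
not how liblouis or libhyphen are implemented; the tables and function names
are made up for illustration, and only the dot patterns for "den", "nd" and
"en" come from Bue's "hunden" example (the single letters are the standard
braille letters). Contractions are matched greedily, but never across a
break point.

```python
# Hypothetical toy tables, for illustration only.
CONTRACTIONS = {"den": "12346", "nd": "12345", "en": "126"}
LETTERS = {"d": "145", "e": "15", "h": "125", "n": "1345", "u": "136"}


def translate_chunk(chunk):
    """Translate one chunk, preferring the longest contraction that fits."""
    cells = []
    i = 0
    while i < len(chunk):
        # Try substrings of length >= 2, longest first.
        for j in range(len(chunk), i + 1, -1):
            if chunk[i:j] in CONTRACTIONS:
                cells.append(CONTRACTIONS[chunk[i:j]])
                i = j
                break
        else:
            # No contraction starts here: fall back to the single letter.
            cells.append(LETTERS[chunk[i]])
            i += 1
    return cells


def translate_hyphenated(word):
    """Translate a word whose break points are marked with '-'.

    Each chunk is translated on its own, so no contraction can cross
    a break point.
    """
    cells = []
    for chunk in word.split("-"):
        cells.extend(translate_chunk(chunk))
    return "-".join(cells)


if __name__ == "__main__":
    print(translate_hyphenated("hun-den"))
    print(translate_hyphenated("hunden"))
```

With the break point in place the "nd" contraction cannot be chosen; without
it, the greedy match picks "nd" and "en", which is the result Bue describes
as wrong for Danish.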

Thanks,
Bert

Bue Vester-Andersen writes:

> Hi Bert,
>
> I have just had another look at the German tables, and I think I see your
> point. But if you have access to a whole lot of translated material as well
> as the source texts, then you are already well on your way. The sources could
> easily be compiled into a very large corpus of words that covers all material
> that you have ever produced. You could use the hyphenation file that you
> already have as a starting point for proof-reading. Just remember that the
> hyphenation should match the braille contraction rules, not necessarily the
> official German hyphenation rules. You will probably not be creating the best
> file for normal hyphenation. E.g. in Braille, you need a left and right
> "hyphenation margin" of one, because a syllable break may be anywhere, even
> after the first letter or just before the last one. Well, excuse me if I am
> getting over-excited on your behalf. :-)
>
> No, I don't think it would work well for all languages. It would be most 
> useful with languages where the grade 2 rules state that contractions cannot 
> cross syllable boundaries.
>
> In Danish we basically have two hyphenation systems, a syllable-based one and
> an inflection-based one. Take the word "hunden" (the dog). It consists of
> two parts: "hund" (dog) and "en", which is the suffix that makes the definite
> form. So according to the inflection-based system, I can hyphenate
> "hund-en", but according to the syllable-based system, it is "hun-den". So in
> a normal hyphenation file, the correct entry would be "hun-d-en" to take both
> possibilities into account.
>
> However, in Danish Braille, only the syllable-based option is valid. The
> contraction should be 125-136-1345-12346 (h_u_n_den) and not
> 125-136-12345-126 (h_u_nd_en).
>
> To make matters worse, there are some cases where contractions may actually 
> cross syllable boundaries. They are mainly conflicts between Braille 
> tradition and the ever evolving rules of language. These cases are also best 
> handled in the hyphenation rules.
>
> I hope that answers your question. Please, tell me if there is anything I can 
> do to help you with the hyphenation file.
>
> Bue
>
> -----Original message-----
> From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
> [mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bert Frees
> Sent: 6 June 2014 15:20
> To: liblouis-liblouisxml@xxxxxxxxxxxxx
> Subject: [liblouis-liblouisxml] Re: hyphenation-based contracted braille
> translation [was: specifying digraphs in libelous tables]
>
> Thanks Bue, very useful indeed!
>
> I'm aware of how libhyphen tables work and what the difference is with
> TeX tables, but I haven't used patgen. At my former employer SBS we
> worked with the German hyphenation table from LibreOffice augmented with
> a big list of exception words. (I think I told you before.) These
> exception words are the result of proofreading the braille books that
> SBS produces. This works but maybe new errors will keep on emerging at
> the same rate and proofreading may be necessary for a long time.
>
> I'm not currently working on any hyphenation tables, but that may change
> soon, so learning how to work with patgen could be useful.
>
> But I'm mainly interested in your hyphenation-based approach to
> contracted braille translation. It is an approach I never thought of
> myself. It seems logical that contraction is related to hyphenation, so
> exploiting that is very clever. Unfortunately my knowledge of Braille is
> too limited to know for sure that there is an obvious connection between
> the two.
>
> At SBS, we have always considered hyphenation and translation as two
> separate problems. They even often hinder each other: some braille rules
> needed hyphenation marks in them because they would otherwise wrongly
> eliminate previously inserted break points.
>
> I wonder if your approach would work for all languages.
>
> I still don't really understand how hyphenation and division into
> syllables are different things in Danish.
>
> Thanks,
> Bert
>
>
> Bue Vester-Andersen writes:
>
>> Hi Bert,
>>
>> It is a somewhat lengthy process, but when it works it will create fantastic
>> Braille contraction. No long exception lists etc. The Liblouis rules become
>> much simpler.
>>
>> The hyphenation file is a list of competing patterns with odd numbers
>> indicating "hyphenation allowed" and even numbers indicating "hyphenation not
>> allowed". In principle, the file could be created manually, but that would
>> be a nightmare in most languages, unless the hyphenation rules are very
>> simple. Instead, you use the patgen program from TeX to create the
>> hyphenation files from a list of known-good hyphenated words. The more
>> words, the better. Then you need to convert the file from the TeX format to
>> the format used by LibreOffice and Liblouis. The two formats are very
>> similar, but if you use the TeX format with Liblouis, it will fail silently.
>> You will just get a lot of strange hyphenation errors.
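
As a rough illustration of the "competing patterns" idea, the sketch below
applies Liang-style patterns to a word in Python: at every position between
two letters the highest number from any matching pattern wins, and an odd
winner means a break is allowed. The three patterns are invented for the
"hunden" example, and the function names and the in-memory pattern list are
assumptions; this does not read the TeX or the libhyphen file format.

```python
def parse_pattern(pattern):
    """Split e.g. 'n1d' into the letters 'nd' and the values [0, 1, 0].

    values[k] is the number standing before letter k (0 if none),
    with one extra entry for the position after the last letter.
    """
    letters = ""
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def hyphenate(word, patterns, left_min=1, right_min=1):
    """Insert '-' wherever the winning (highest) pattern value is odd."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] sits before padded[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for k, value in enumerate(values):
                scores[start + k] = max(scores[start + k], value)
            start = padded.find(letters, start + 1)
    pieces = []
    last = 0
    # A break before word[p] corresponds to scores[p + 1].
    for p in range(left_min, len(word) - right_min + 1):
        if scores[p + 1] % 2 == 1:
            pieces.append(word[last:p])
            last = p
    pieces.append(word[last:])
    return "-".join(pieces)


if __name__ == "__main__":
    # Invented patterns: the even '2' in the last one outvotes 'd1en.'.
    print(hyphenate("hunden", ["n1d", "d1en."]))
    print(hyphenate("hunden", ["n1d", "d1en.", "nd2en."]))
```

The default margins of one follow the remark elsewhere in this thread that,
for braille, a break may fall right after the first letter or right before
the last one.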
>>
>> The real trick is finding/making a list of hyphenated words. You will need a
>> corpus or extracted dictionary like Aspell or something like that. The best
>> would be a corpus with words sorted so that the most commonly used appear
>> first. I have created a Python script to create such a corpus from txt
>> files. Maybe, you can also lay your hands on a more official corpus.
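
Bue's script is not shown in this thread, so the following is only a guess
at what such a tool might look like: a small Python sketch that counts the
words in a set of text files and lists them with the most common first. The
function names, the `*.txt` glob and the UTF-8 default are all assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of letters only (Unicode-aware, so Danish and German letters count).
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(texts):
    """Count lower-cased words over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def frequency_sorted(counts):
    """Most frequent words first; ties in alphabetical order."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered]


def corpus_from_directory(directory, encoding="utf-8"):
    """Build the word list from every *.txt file in a directory."""
    paths = sorted(Path(directory).glob("*.txt"))
    texts = (path.read_text(encoding=encoding) for path in paths)
    return frequency_sorted(count_words(texts))


if __name__ == "__main__":
    sample = ["Hunden ser en hund.", "En hund og en kat."]
    print(frequency_sorted(count_words(sample)))
```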
>>
>> Start out by hyphenating a few thousand words manually. Compile the rules
>> with patgen. Hyphenate your corpus using these rules. Proof-read more words
>> and add them to the hyphenation list and compile again ... and so on, until
>> you start getting acceptable results.
>>
>> At some point you may decide to stop proof-reading the hyphenation and only
>> add words where hyphenation errors result in an incorrect contraction.
>>
>> Currently, my hyphenation list contains close to 50,000 words. It gives a
>> near perfect contraction result. Whenever I find errors, I add the words to
>> the hyphenation list. Then I compare the result of contracting the whole
>> corpus of 638,000 words before and after. I proof-read the changes and add
>> them to the list. This process is repeated until I get no changes. So,
>> starting with 5 words, I can easily end up adding a total of 500 words to
>> the list, before a new "steady-state" has been reached. But, as I said, it
>> gives fantastic results, better than any other automatic contraction system
>> that I have seen.
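
The before/after comparison could be sketched like this in Python. The two
translator arguments are stand-ins for "contract this word with the old
hyphenation file" and "with the new one"; how they call into liblouis is
deliberately left out, and the toy dictionaries below exist only to make the
example run.

```python
def changed_translations(words, translate_before, translate_after):
    """Return (word, old, new) for every word whose contraction changed."""
    changes = []
    for word in words:
        old = translate_before(word)
        new = translate_after(word)
        if old != new:
            changes.append((word, old, new))
    return changes


if __name__ == "__main__":
    # Toy stand-ins for two versions of the translation setup.
    before = {"hunden": "125-136-12345-126", "hund": "125-136-12345"}
    after = {"hunden": "125-136-1345-12346", "hund": "125-136-12345"}
    for word, old, new in changed_translations(before, before.get, after.get):
        print(word, old, "->", new)
```

Only the words in the returned list need proof-reading; when the list comes
back empty, the "steady state" described above has been reached.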
>>
>> Working with patgen is not trivial. If you are interested, I can help you in
>> more detail. This was just a general explanation, but I hope you can use
>> it.
>>
>> Bue
>>
>>
>> -----Original message-----
>> From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
>> [mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bert Frees
>> Sent: 6 June 2014 12:26
>> To: liblouis-liblouisxml@xxxxxxxxxxxxx
>> Subject: [liblouis-liblouisxml] Re: SV: specifying digraphs in libelous tables
>>
>> Hi Bue,
>>
>>> I eventually had to make my own hyphenation file, since hyphenation
>>> and division into syllables are not quite the same thing in Danish.
>>
>> That's interesting. Could you elaborate a bit on that?
>> For a description of the software, to download it and links to
>> project pages go to http://www.abilitiessoft.com
