[liblouis-liblouisxml] Re: hyphenation-based contracted braille translation [was: specifying digraphs in libelous tables]

  • From: Bert Frees <bertfrees@xxxxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Fri, 06 Jun 2014 15:19:38 +0200

Thanks Bue, very useful indeed!

I'm aware of how libhyphen tables work and what the difference is with
TeX tables, but I haven't used patgen. At my former employer SBS we
worked with the German hyphenation table from LibreOffice augmented with
a big list of exception words. (I think I told you before.) These
exception words are the result of proofreading the braille books that
SBS produces. This works but maybe new errors will keep on emerging at
the same rate and proofreading may be necessary for a long time.

I'm not currently working on any hyphenation tables, but that may change
soon, so learning how to work with patgen could be useful.

But I'm mainly interested in your hyphenation-based approach to
contracted braille translation. It is an approach I never thought of
myself. It seems logical that contraction is related to hyphenation, so
exploiting that is very clever. Unfortunately my knowledge of Braille is
too limited to know for sure that there is an obvious connection between
the two.

At SBS, we have always considered hyphenation and translation as two
separate problems. They even often hinder each other: some braille rules
needed hyphenation marks in them because they would otherwise wrongly
eliminate previously inserted break points.

I wonder if your approach would work for all languages.

I still don't really understand how hyphenation and division into
syllables are different things in Danish.

Thanks,
Bert


Bue Vester-Andersen writes:

> Hi Bert,
>
> It is a somewhat lengthy process, but when it works it will create fantastic
> Braille contraction. No long exception lists etc. The Liblouis rules become
> much simpler.
>
> The hyphenation file is a list of competing patterns with odd numbers
> indicating "hyphenation allowed" and even numbers indicating "hyphenation not
> allowed". In principle, the file could be created manually, but that would
> be a nightmare in most languages, unless the hyphenation rules are very
> simple. Instead, you use the patgen program from TeX to create the
> hyphenation files from a list of known-good hyphenated words. The more
> words, the better. Then you need to convert the file from the TeX format to
> the format used by LibreOffice and Liblouis. The two formats are very
> similar, but if you use the TeX format with Liblouis, it will fail silently.
> You will just get a lot of strange hyphenation errors.
>
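
To make the odd/even rule concrete, here is a minimal Python sketch of how
such patterns are applied to a word. It is only an illustration, not the code
used by libhyphen or Liblouis: the function names and the two toy patterns are
made up, and it ignores details such as minimum prefix/suffix lengths. Where
several patterns cover the same position, the highest number wins, and a break
is allowed only if that number is odd.

```python
def parse_pattern(pattern):
    """Split a pattern such as '1ba' into its letters and the digit
    attached to each gap (0 where no digit is written)."""
    letters = ""
    digits = [0]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters += ch
            digits.append(0)
    return letters, digits


def hyphenate(word, patterns):
    """Insert '-' at every gap whose winning (highest) number is odd."""
    padded = "." + word.lower() + "."        # '.' marks the word boundaries
    levels = [0] * (len(padded) + 1)         # levels[i] = gap before padded[i]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(digits):
                position = start + offset
                levels[position] = max(levels[position], value)
            start = padded.find(letters, start + 1)
    pieces = []
    last = 0
    for i in range(1, len(word)):
        # The gap between word[i-1] and word[i] is the gap before padded[i+1].
        if levels[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Toy patterns (hypothetical): allow a break before "ba",
# but forbid it when "ba" ends the word.
print(hyphenate("rababa", ["1ba"]))
print(hyphenate("rababa", ["1ba", "2ba."]))
```

The second call shows the "competing" part: the even 2 from the more specific
pattern overrides the odd 1 at the same position.
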
> The real trick is finding/making a list of hyphenated words. You will need a
> corpus or extracted dictionary like Aspell or something like that. The best
> would be a corpus with words sorted so that the most commonly used appear
> first. I have created a Python script to create such a corpus from txt
> files. Maybe, you can also lay your hands on a more official corpus.
>
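
A frequency-sorted word list of the kind described above could be built
roughly as follows. This is a sketch under my own assumptions, not Bue's
script: the names are hypothetical, a "word" is simply taken to be a run of
letters, and everything is lowercased.

```python
import re
from collections import Counter

# A run of letters (including non-ASCII ones such as æ, ø, å);
# digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(texts):
    """Count lowercased words over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts


def frequency_sorted(counts):
    """Return the words, most frequent first; ties in alphabetical order."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered]


sample = ["En hund og en kat.", "En kat"]
print(frequency_sorted(count_words(sample)))
```

For real use, the strings would come from the txt files, for example by
reading each file with pathlib.Path.read_text and an explicit encoding.
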
> Start out by hyphenating a few thousand words manually. Compile the rules
> with patgen. Hyphenate your corpus using these rules. Proof-read more words
> and add them to the hyphenation list and compile again ... and so on, until
> you start getting acceptable results.
>
> At some point you may decide to stop proof-reading the hyphenation and only
> add words where hyphenation errors result in an incorrect contraction.
>
> Currently, my hyphenation list contains close to 50,000 words. It gives a
> near perfect contraction result. Whenever I find errors, I add the words to
> the hyphenation list. Then I compare the result of contracting the whole
> corpus of 638,000 words before and after. I proof-read the changes and add
> them to the list. This process is repeated until I get no changes. So,
> starting with 5 words, I can easily end up adding a total of 500 words to
> the list, before a new "steady-state" has been reached. But, as I said, it
> gives fantastic results, better than any other automatic contraction system
> that I have seen.
>
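
The before/after comparison in this loop can be sketched in a few lines of
Python. The assumption here is that each contraction run has been stored as a
mapping from word to contracted braille; how that mapping is produced (for
example by calling Liblouis) is left out, and the function name is
hypothetical.

```python
def changed_words(before, after):
    """Return {word: (old, new)} for every word whose contraction differs
    between two runs. Words that are new in `after` get old = None."""
    return {
        word: (before.get(word), new)
        for word, new in after.items()
        if before.get(word) != new
    }


# Made-up data: only "b" changed between the two runs.
before = {"a": "x", "b": "y"}
after = {"a": "x", "b": "z"}
print(changed_words(before, after))
```

Only the words returned here need proof-reading; the steady state is reached
when the result is empty.
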
> Working with patgen is not trivial. If you are interested, I can help you in
> more detail. This was just a general explanation, but I hope you can use
> it.
>
> Bue
>
>
> -----Original message-----
> From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
> [mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bert Frees
> Sent: 6 June 2014 12:26
> To: liblouis-liblouisxml@xxxxxxxxxxxxx
> Subject: [liblouis-liblouisxml] Re: SV: specifying digraphs in libelous tables
>
> Hi Bue,
>
>> I eventually had to make my own hyphenation file, since hyphenation
>> and division into syllables are not quite the same thing in Danish.
>
> That's interesting. Could you elaborate a bit on that?
> For a description of the software, to download it and links to
> project pages go to http://www.abilitiessoft.com