[liblouis-liblouisxml] SV: Re: SV: specifying digraphs in libelous tables

  • From: Bue Vester-Andersen <bue@xxxxxxxxxxxxxxxxxx>
  • To: <liblouis-liblouisxml@xxxxxxxxxxxxx>
  • Date: Fri, 6 Jun 2014 14:19:27 +0200

Hi Bert,

It is a somewhat lengthy process, but when it works it will create fantastic
Braille contraction. No long exception lists etc. The Liblouis rules become
much simpler.

The hyphenation file is a list of competing patterns with odd numbers
indicating "hyphenation allowed" and even numbers indicating "hyphenation not
allowed". In principle, the file could be created manually, but that would
be a nightmare in most languages, unless the hyphenation rules are very
simple. Instead, you use the patgen program from TeX to create the
hyphenation files from a list of known-good hyphenated words. The more
words, the better. Then you need to convert the file from the TeX format to
the format used by LibreOffice and Liblouis. The two formats are very
similar, but if you use the TeX format with Liblouis, it will fail silently.
You will just get a lot of strange hyphenation errors.
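
To illustrate how the competing patterns interact, here is a minimal Python
sketch of the general TeX-style pattern matching. It is not the Liblouis
code, the function names are my own, and the patterns in the example are
made up rather than taken from a real hyphenation file. It also ignores the
minimum prefix and suffix lengths that real hyphenators apply.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'a1b2c' into its letters and inter-letter levels.

    Returns ('abc', [0, 1, 2, 0]): one level before each letter plus one
    after the last letter. Missing digits count as 0.
    """
    letters = ""
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def hyphenation_points(word, patterns):
    """Return the indexes k where a break word[:k] + '-' + word[k:] is allowed."""
    padded = "." + word.lower() + "."   # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)    # scores[i] = level just before padded[i]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            # Competing patterns: the highest level wins at each position.
            for offset, level in enumerate(levels):
                pos = start + offset
                scores[pos] = max(scores[pos], level)
            start = padded.find(letters, start + 1)
    # Odd level = break allowed, even level = break not allowed.
    return [k for k in range(1, len(word)) if scores[k + 1] % 2 == 1]


def insert_hyphens(word, points):
    """Join the pieces of the word with '-' at the given break indexes."""
    pieces, last = [], 0
    for k in points:
        pieces.append(word[last:k])
        last = k
    pieces.append(word[last:])
    return "-".join(pieces)
```

With the made-up patterns 1d and 2d., the word "windowed" gets a single break
point, win-dowed: the odd 1 allows a break before each d, but the higher even
2 forbids it before the final d.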

The real trick is finding or making a list of hyphenated words. You will
need a corpus or a word list extracted from a dictionary such as Aspell. The
best would be a corpus with words sorted so that the most commonly used
appear first. I have created a Python script to create such a corpus from
txt files. Maybe you can also get hold of a more official corpus.
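
A frequency-sorted word list of this kind could be built roughly as follows.
This is only a sketch and not the actual script mentioned above; the function
names, the word-matching regular expression and the assumption that the
files are UTF-8 are all mine.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of letters only (digits and underscores excluded); works for æ, ø, å.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(texts):
    """Count lower-cased words over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts


def frequency_sorted(counts):
    """Most frequent words first; ties are broken alphabetically."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered]


def read_texts(directory):
    """Yield the contents of every .txt file in a directory (assumed UTF-8)."""
    for path in sorted(Path(directory).glob("*.txt")):
        yield path.read_text(encoding="utf-8")
```

Calling frequency_sorted(count_words(read_texts("some_directory"))) would then
give the words of all the txt files in that (hypothetical) directory, most
common first.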

Start out by hyphenating a few thousand words manually. Compile the rules
with patgen. Hyphenate your corpus using these rules. Proof-read more words
and add them to the hyphenation list and compile again ... and so on, until
you start getting acceptable results.

At some point you may decide to stop proof-reading the hyphenation and only
add words where hyphenation errors result in an incorrect contraction.

Currently, my hyphenation list contains close to 50,000 words. It gives a
near perfect contraction result. Whenever I find errors, I add the words to
the hyphenation list. Then I compare the result of contracting the whole
corpus of 638,000 words before and after. I proof-read the changes and add
them to the list. This process is repeated until I get no changes. So,
starting with 5 words, I can easily end up adding a total of 500 words to
the list, before a new "steady-state" has been reached. But, as I said, it
gives fantastic results, better than any other automatic contraction system
that I have seen.
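
The before/after comparison can be sketched like this in Python. It assumes
you already have two mappings from each corpus word to its contracted output,
one from before and one from after recompiling the patterns; how those are
produced is outside the sketch, and the function name is hypothetical.

```python
def changed_contractions(before, after):
    """Return {word: (old, new)} for words whose contraction has changed.

    `before` and `after` map each corpus word to its contracted form.
    Words missing from either mapping are ignored.
    """
    return {
        word: (before[word], after[word])
        for word in before
        if word in after and before[word] != after[word]
    }
```

Only the words returned here need proof-reading in each round, and the
"steady state" is reached when the result is empty.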

Working with patgen is not trivial. If you are interested, I can help you in
more detail. This was just a general explanation, but I hope you can use
it.

Bue


-----Original message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On behalf of Bert Frees
Sent: 6 June 2014 12:26
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] Re: SV: specifying digraphs in libelous tables

Hi Bue,

> I eventually had to make my own hyphenation file, since hyphenation
> and division into syllables are not quite the same thing in Danish.

That's interesting. Could you elaborate a bit on that?
For a description of the software, to download it and links to
project pages go to http://www.abilitiessoft.com
