[liblouis-liblouisxml] Re: ISO8859-2 encoded hyphenation tables

  • From: Bert Frees <bert.frees@xxxxxxxxxxxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Thu, 09 Sep 2010 11:45:13 +0200


The hyphenation algorithm in liblouis is a modified version of the one
used in OpenOffice. It should work with any ISO code. Try the
lou_checkhyphens test tool.

Thanks.
I tried lou_checkhyphens, but unfortunately the same problem occurred :(. Letters with Unicode code points U+00A0 and higher are not always handled correctly. (I tried entering the words in both ISO 8859-2 and UTF-8. When I enter them in UTF-8, the length of the "hyphenation mask" sometimes doesn't even match the length of the input.)
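
One possible reason for a mask/input length mismatch is that characters and bytes are being counted differently. This is only a guess on my part, not something confirmed in the liblouis source. The sketch below just shows how the two counts diverge for a word containing c with acute (U+0107); the sample word is made up for illustration.

```python
# A word with three plain letters and three "c with acute" (U+0107).
word = "bbb\u0107\u0107\u0107"

char_count = len(word)                       # number of Unicode characters
utf8_bytes = len(word.encode("utf-8"))       # U+0107 takes 2 bytes in UTF-8
latin2_bytes = len(word.encode("iso8859_2")) # 1 byte per character in ISO 8859-2

print(char_count, utf8_bytes, latin2_bytes)
```

If a tool sized the mask from one of these counts and the input from another, the lengths would disagree for UTF-8 input only, which is consistent with what I see, but again that is an assumption.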

If you put your test string through
liblouisxml you may get different results, because liblouisxml has a
hyphenation routine that decides whether to use the liblouis routine
based on the number of characters that overflow a line.

Yes, I am aware of that. I also know that back-translation must be performed first.

What are the rules for making hyphenation tables? I've been trying to
find them for a long time.

The TeX hyphenation algorithm is explained at <http://en.wikipedia.org/wiki/TeX#Hyphenation_and_justification>. Basically, the digits in a pattern are weights for the positions between letters: an odd number means the word can be split at that position, an even number means it cannot, and where several patterns match the same position the highest number wins. OpenOffice.org (Hunspell) uses a modified implementation of the original TeX algorithm, so the standard TeX hyphenation patterns need to be converted first, but how that conversion works is not entirely clear to me. More info at <http://wiki.services.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org#1._Download_up-to-date_TeX_hyphenation_patterns>.
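
To make the odd/even rule concrete, here is a minimal sketch of the pattern-matching step in Python. It is not the liblouis or OpenOffice.org code. The function names are invented for this example, and it leaves out things a real implementation has (exception lists, minimum prefix/suffix lengths, a trie for fast lookup).

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as 'b1c' into its letters and the
    weights for the positions before, between and after them."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)   # weight for the gap at this position
        else:
            pos += 1                 # move past one letter
    return letters, weights

def hyphenation_points(word, patterns):
    """Return the indexes in `word` before which a split is allowed."""
    padded = "." + word + "."        # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1) # scores[i] = weight of gap before padded[i]
    for letters, weights in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                # where patterns overlap, the highest number wins
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    # odd weight: split allowed; even weight: split forbidden
    return [k for k in range(1, len(word)) if scores[k + 1] % 2 == 1]

def hyphenate(word, patterns):
    """Join the pieces of `word` with '-' at every allowed split."""
    pieces, prev = [], 0
    for point in hyphenation_points(word, patterns):
        pieces.append(word[prev:point])
        prev = point
    pieces.append(word[prev:])
    return "-".join(pieces)

print(hyphenate("bbbccc", ["b1c"]))          # only the b1c pattern
print(hyphenate("bbbccc", ["b1c", "b2cc"]))  # b2cc overrides b1c
```

With only `b1c` the word is split between the last b and the first c. Adding the made-up pattern `b2cc` puts an even, higher number on the same position, so the split is suppressed.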


Bert

Thanks,
John

On Wed, Sep 08, 2010 at 01:54:00PM +0200, Bert Frees wrote:
    Hi listers,

    I've been experimenting a little with hyphenation tables because I want to
    understand them better, and there's not much about them in the
    documentation. I think liblouis has a problem with hyphenation tables that
    are not encoded in ISO8859-1.

    As an example, I've made a small translation table and hyphenation table.
    The hyphenation table is encoded in ISO8859-2 and has only one entry,
    which says that b and c should always be split.

    ****************** Translation table **************
    space \x0020       0      (blank)
    uplow \x0042\x0062 12     (letter b)
    uplow \x0106\x0107 146    (letter c with acute)
    uplow \x00C6\x00E6 123456 (letter ae)
    ***************************************************

    ****************** Hyphenation table **************
    ISO8859-2
    b1c
    ***************************************************

    Then, if I try to transcribe a file with the string

    "bbbccc bbbccc bbbccc bbbccc bbbccc bbbccc bbbccc bbbccc ..."

    the words are not split. Strangely enough, when I transcribe the string

    "bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae
    bbbaeaeae ..."

    the words are split!! It is obvious that liblouis confuses the letters c
    with acute (Unicode U+0107, byte E6 in ISO8859-2) and ae (Unicode U+00E6). In
    the Polish hyphenation table (hyph_pl_PL.dic) I noticed the letter c is
    represented by "/c" (slash-c). But changing "b1c" into "b1/c" doesn't
    solve the problem either.

    Anybody got any idea of what the cause of this problem might be?

    Bert
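
The suspected mix-up can be reproduced outside liblouis. The sketch below assumes the hyphenation table is read byte by byte and each byte is taken as an ISO 8859-1 character regardless of the charset named on the first line. That is a hypothesis about the cause, not something taken from the liblouis source.

```python
# The single byte 0xE6 means different letters in the two charsets.
raw = b"\xe6"
as_latin2 = raw.decode("iso8859_2")  # c with acute, U+0107
as_latin1 = raw.decode("iso8859_1")  # ae ligature, U+00E6

# The pattern "b1" + c-with-acute, stored in ISO 8859-2 as in the table above.
pattern_bytes = "b1\u0107".encode("iso8859_2")

# Read with the wrong charset, the pattern now names the ae letter instead.
misread = pattern_bytes.decode("iso8859_1")

print(hex(ord(as_latin2)), hex(ord(as_latin1)), ascii(misread))
```

A pattern misread this way would match the words containing ae and never the words containing c with acute, which is the behaviour described above.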
