[liblouis-liblouisxml] Re: ISO8859-2 encoded hyphenation tables

From: Bert Frees <bert.frees@xxxxxxxxxxxxxxxx>
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Date: Thu, 09 Sep 2010 11:45:13 +0200

The hyphenation algorithm in liblouis is a modified version of the one
used in OpenOffice. It should work with any ISO code. Try the
lou_checkhyphens test tool.


Thanks.

I tried lou_checkhyphens, but unfortunately the same problem occurred :(.
Letters with Unicode code points U+00A0 and higher are not always handled
correctly. (I tried entering the words in both ISO 8859-2 and UTF-8. When
I enter them in UTF-8, the length of the "hyphenation mask" sometimes
doesn't even match the length of the input.)
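
A minimal Python sketch of why the encoding matters here. It only illustrates the byte values involved and is not taken from the liblouis sources. The idea that the mask is computed per byte rather than per character is an assumption, not something confirmed in this thread.

```python
# The same byte means different letters in different 8-bit encodings.
raw = bytes([0xE6])
as_latin2 = raw.decode("iso8859-2")  # c with acute, U+0107
as_latin1 = raw.decode("iso8859-1")  # ae ligature, U+00E6

print(f"ISO8859-2: U+{ord(as_latin2):04X}")
print(f"ISO8859-1: U+{ord(as_latin1):04X}")

# In UTF-8, c with acute takes two bytes. A mask built per byte
# (an assumption) would therefore be longer than the word measured
# in characters.
word = "bbb\u0107\u0107\u0107"
utf8_bytes = word.encode("utf-8")

print("characters:", len(word))
print("UTF-8 bytes:", len(utf8_bytes))
```

If a table declared as ISO8859-2 were read as if it were ISO8859-1, a pattern written for c with acute would end up matching ae instead.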

If you put your test string through
liblouisxml you may get different results, because liblouisxml has a
hyphenation routine that decides whether to use the liblouis routine
based on the number of characters that overflow a line.

Yes, I am aware of that. I also know that back-translation must be
performed first.

What are the rules for making hyphenation tables? I've been trying to
find them for a long time.

The TeX hyphenation algorithm is explained at
<http://en.wikipedia.org/wiki/TeX#Hyphenation_and_justification>.
Basically, an odd number between two letters means the word can be split
there, an even number means it cannot, and higher numbers take precedence
over lower ones. OpenOffice.org (Hunspell) uses a modified implementation
of the original TeX algorithm and therefore needs the standard hyphenation
patterns to be converted, but how that conversion works is not entirely
clear to me. More info at
<http://wiki.services.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org#1._Download_up-to-date_TeX_hyphenation_patterns>.
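
The odd/even rule can be sketched in a few lines of Python. This is a simplified illustration of TeX-style pattern matching, not the liblouis or OpenOffice.org implementation. The function names are made up for this example, and the minimum prefix and suffix lengths that real implementations enforce are left out.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'b1c' into letters 'bc' and levels [0, 1, 0]."""
    letters = ""
    levels = [0]  # levels[k] is the number in the gap before letters[k]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def hyphenation_points(word, patterns):
    """Return the indices i where a break is allowed before word[i]."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[k] is the gap before padded[k]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                # Higher numbers take precedence over lower ones.
                scores[start + offset] = max(scores[start + offset], level)
            start = padded.find(letters, start + 1)
    # padded[i + 1] is word[i]; an odd score means a break is allowed.
    return [i for i in range(1, len(word)) if scores[i + 1] % 2 == 1]


def hyphenate(word, patterns, sep="-"):
    """Insert sep at every allowed break point."""
    points = set(hyphenation_points(word, patterns))
    return "".join((sep if i in points else "") + ch for i, ch in enumerate(word))


print(hyphenate("bbbccc", ["b1c"]))          # the single-entry test table
print(hyphenate("bbbccc", ["b1c", "bb2c"]))  # an even 2 overrides the odd 1
```

With only the pattern b1c, the word bbbccc gets a break between the last b and the first c. Adding the hypothetical pattern bb2c puts an even, higher number in the same gap, which suppresses that break.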



Bert

Thanks,
John

On Wed, Sep 08, 2010 at 01:54:00PM +0200, Bert Frees wrote:

    Hi listers,

    I've been experimenting a little with hyphenation tables because I want to
    understand them better, and there's not much about them in the
    documentation. I think liblouis has a problem with hyphenation tables that
    are not encoded in ISO8859-1.

    As an example, I've made a small translation table and hyphenation table.
    The hyphenation table is encoded in ISO8859-2 and has only one entry,
    which says that b and c should always be split.

    ****************** Translation table **************
    space \x0020       0      (blank)
    uplow \x0042\x0062 12     (letter b)
    uplow \x0106\x0107 146    (letter c with acute)
    uplow \x00C6\x00E6 123456 (letter ae)
    ***************************************************

    ****************** Hyphenation table **************
    ISO8859-2
    b1c
    ***************************************************

    Then, if I try to transcribe a file with the string

    "bbbccc bbbccc bbbccc bbbccc bbbccc bbbccc bbbccc bbbccc ..."

    the words are not split. Strangely enough, when I transcribe the string

    "bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae bbbaeaeae
    bbbaeaeae ..."

    the words are split! It seems that liblouis confuses the letter c with
    acute (Unicode U+0107, byte 0xE6 in ISO8859-2) with ae (Unicode U+00E6). In
    the Polish hyphenation table (hyph_pl_PL.dic) I noticed the letter c is
    represented by "/c" (slash-c). But changing "b1c" into "b1/c" doesn't
    solve the problem either.

    Anybody got any idea of what the cause of this problem might be?

    Bert


For a description of the software and to download it go to
http://www.jjb-software.com
