[liblouis-liblouisxml] Re: SV: Re: SV: Multipass back translation.

  • From: "Michael Whapples" <dmarc-noreply@xxxxxxxxxxxxx> (Redacted sender "mwhapples" for DMARC)
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Fri, 6 Jan 2017 17:11:15 +0000

Maybe the way LibLouis currently works would lead to the result in your example, although I would say it does not necessarily have to be that way.

The result you gave, @1-4, would require that pass2 (and presumably pass3 and pass4) mutate the input rather than copy to a new string. If the output is separate from the input and rules in that pass only match against the input, then the rule:
pass2 @3-2 @4
could never match in an input string of @1-3.

Furthermore, even under a model that mutates the input, I still do not see how the result you suggest could occur if the rules are applied as I described.

Cursor at index=0: neither pass2 rule matches (remember the content of the replacement brackets must be a @1). Therefore do not modify, and advance the cursor by one.
Cursor at index=1: again neither matches. The second does not match because there is no @2 following the @3 in the input yet. Advance the cursor.
Cursor at index=2: the pass2 @1-3[] @2 rule matches because, whilst the focus is nothing, it is preceded by @1-3. Insert @2 at the position of the replacement brackets; the input is now @1-3-2. The cursor does not advance.
Cursor at index=2: unfortunately we end up in a loop because @1-3[] still matches. The other rule, @3-2, will not match because the inferred brackets make it really [@3-2], and the cursor is after the @3.

Yes, though: with the mutating-input model you could do some interesting things along the lines of what you said. Rules like:
pass2 @1-3[] @2
pass2 @3[@2] @4
would give @1-3-4. This pairing also gets you out of the looping situation, because the second rule becomes enabled once the first has inserted the @2.
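
The walk-through above can be replayed with a small toy model. This is only a sketch of the proposed mutating-input behaviour, not of how LibLouis is implemented. The names Rule, matches and run_pass are made up for illustration, and two details are assumptions the message does not spell out: when several rules match at the cursor the one with the longest focus wins, and after a rule is applied the cursor advances by the length of the focus it matched.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Rule(NamedTuple):
    """Hypothetical representation of a pass2 rule, split at the brackets."""
    before: Tuple[str, ...]  # cells that must sit immediately before the cursor
    focus: Tuple[str, ...]   # cells inside the replacement brackets
    action: Tuple[str, ...]  # cells that replace the focus


def matches(rule: Rule, cells: Sequence[str], cursor: int) -> bool:
    """True if `before` ends at the cursor and `focus` starts at it."""
    start = cursor - len(rule.before)
    if start < 0:
        return False
    end = cursor + len(rule.focus)
    return (tuple(cells[start:cursor]) == rule.before
            and tuple(cells[cursor:end]) == rule.focus)


def run_pass(cells: Sequence[str], rules: Sequence[Rule],
             max_steps: int = 20) -> Tuple[List[str], bool]:
    """Apply rules in place; return (cells, finished).

    `finished` is False when the step limit is reached, which is how
    this sketch reports an endless loop.
    """
    cells = list(cells)
    cursor = 0
    steps = 0
    while cursor <= len(cells):
        if steps == max_steps:
            return cells, False
        steps += 1
        candidates = [r for r in rules if matches(r, cells, cursor)]
        if not candidates:
            cursor += 1  # nothing matched: leave the input alone, move on
            continue
        # Assumption: prefer the rule that consumes the most input.
        rule = max(candidates, key=lambda r: len(r.focus))
        cells[cursor:cursor + len(rule.focus)] = rule.action
        # Assumption: advance by the matched focus, so [] advances by zero.
        cursor += len(rule.focus)
    return cells, True


# pass2 @1-3[] @2  together with  pass2 @3-2 @4  (brackets inferred as [@3-2])
looping_rules = [Rule(("1", "3"), (), ("2",)),
                 Rule((), ("3", "2"), ("4",))]
# pass2 @1-3[] @2  together with  pass2 @3[@2] @4
paired_rules = [Rule(("1", "3"), (), ("2",)),
                Rule(("3",), ("2",), ("4",))]

looping_result, looping_finished = run_pass(["1", "3"], looping_rules)
paired_result, paired_finished = run_pass(["1", "3"], paired_rules)
print(looping_finished)                 # False: the step limit was hit
print("@" + "-".join(paired_result))    # @1-3-4
```

With the first pair the empty-focus rule keeps matching at index 2 and the run is cut off by the step limit. With the second pair the inserted @2 lets the @3[@2] rule fire, which consumes one cell and moves the cursor past the point where @1-3[] matches.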

What this has highlighted, though, is that my suggestion could still get stuck in loops. In fact, under my suggestion empty replacement brackets [] would always cause an endless loop, but I guess that could easily be checked for and reported as an error.
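
A check of that kind might look like the following. It is a minimal sketch under the proposed model only (in current LibLouis tables empty brackets are used legitimately), check_rule is a hypothetical name rather than anything in LibLouis, and a rule is assumed to be an opcode, a test and an action separated by whitespace.

```python
def check_rule(line: str) -> None:
    """Raise ValueError if the test part contains empty replacement brackets.

    Hypothetical table-compile-time check for the proposed cursor model,
    in which a rule with [] would never advance the cursor.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected opcode, test and action: {line!r}")
    opcode, test = fields[0], fields[1]
    if "[]" in test:
        raise ValueError(
            f"{opcode} rule has empty replacement brackets: {test}")


check_rule("pass2 @3[@2] @4")  # accepted, no exception
try:
    check_rule("pass2 @1-3[] @2")
except ValueError as err:
    print(err)
```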

It would definitely require that @56* be accepted in the third column, as without it some rules might be impossible.
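
For illustration, the rewrite Bert suggests in the quoted message further down (turning "pass2 []<x> <y>" into "pass2 <x> <y>*") is mechanical enough to sketch as a string transformation. The function name is made up, the sketch assumes a rule is exactly three whitespace-separated fields, and it says nothing about whether LibLouis accepts the resulting action.

```python
def rewrite_leading_empty_brackets(rule: str) -> str:
    """Rewrite 'passN []<x> <y>' as 'passN <x> <y>*'.

    Rules that do not start their test with [] are returned unchanged.
    """
    parts = rule.split()
    if len(parts) != 3:
        return rule
    opcode, test, action = parts
    # Only a leading [] followed by something to match is rewritten.
    if not test.startswith("[]") or test == "[]":
        return rule
    return f"{opcode} {test[2:]} {action}*"


print(rewrite_leading_empty_brackets("pass2 []%englishLetter. @56"))
# pass2 %englishLetter. @56*
```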

Michael Whapples
On 06/01/2017 16:22, Bert Frees wrote:

One problem with your proposed alternative behavior is that if you have the following two rules:

pass2 @1-3[] @2
pass2 @3@2 @4

and the input string to the second pass is @1-3, then it would result in @1-4. In other words, a replacement string is processed again in the same pass, which in a way is also unintuitive. With the current behavior you can't have that.

Regarding the documentation: yes it could be more precise and comprehensive. Any volunteers?




2017-01-06 16:06 GMT+01:00 Michael Whapples <dmarc-noreply@xxxxxxxxxxxxx>:

    Well, OK, maybe it is sort of said in the documentation, but it's
    not as clear as it could be, and I think the significance gets
    lost amongst everything else in there.


    Where I think it is not as clear as it could be is the reference
    to the replaced text; that is where the terminology is maybe not
    precise enough.


    Using these two rules (slightly modified from the past):

    pass2 @1-3[] @2

    pass2 []@1-3 @2

    In the first case the @1-3 is copied to the output by the rule.
    Whilst it is maybe not modified, and so technically not replaced,
    it is still being handled. Maybe wrongly, I had just taken the
    term replaced text to mean handled text. If somehow this could be
    reworded to emphasise that it is after the closing replacement
    bracket, this might help.


    Now moving to the second rule, I would then expect the @1-3 to
    also be handled and copied by the rule, but that is not so. This
    is where it becomes unintuitive: stuff before [] is handled but
    stuff after is not.


    In fact, if anything were to be changed, I would actually have
    the current cursor position relate to the opening replacement
    bracket [, with anything before it searched back from the cursor.
    My reasons are:

    1. This is how regular expressions work, so it may as well work
    like other systems people may be familiar with.

    2. Interaction of rules would be more consistent. If a table had
    these two rules:

    pass2 @1-3[] @2

    pass2 @3 @4

    and we give the string @1-3 to the second pass, we currently get:

    @1-3-2

    The pass2 @3 @4 rule does not get applied. If, though, the table
    had these two rules:

    pass2 []@1-3 @2

    pass2 @3 @4

    we currently get @2-1-4. So this did allow the pass2 @3 @4 rule to
    be applied. With the change I described, the pass2 @3 @4 rule
    would be applied in both cases.


    I am not saying we must change it; after all, it could take some
    work to ensure tables still work correctly. However, if a change
    is going to be made, then this would be my preferred option.


    In the meantime, though, maybe the documentation could highlight
    the significance of where the cursor is placed when context or
    multipass rules are applied.


    Michael Whapples


    On 06/01/2017 13:01, Bert Frees wrote:
    This is all in the documentation, although maybe not in so many
    words. See the last paragraph of "2.11 The Context and Multipass
    Opcodes".

    As far as I remember Christian just quoted me from an email in
    which I was actually asking about the inner workings myself, long
    time ago (I looked it up, 2010). What I wrote then was just what
    I was guessing based on experimentation, John said it was correct
    and Christian just copied it to the documentation. After reading
    it again, I don't think it's 100% accurate though.

    I get what you are trying to say with your example. I'm not sure
    what the reason is for not advancing the cursor in the second
    case. I guess it's in order to be able to do more in a single
    pass. It's indeed not super-intuitive.

    The important question is: are there cases in which we want to
    advance to the end of the entire match, not just to the end of
    the square brackets? The answer is yes: see for example the
    "pass2 []%englishLetter. @56" case. So the next question is: are
    there any cases where this can't be solved with the asterisk?
    That is, instead use "pass2 %englishLetter. @56*", or more
    generally, convert "pass2 []<x> <y>" to "pass2 <x> <y>*".

    If there are no such cases, I wouldn't touch the algorithm. If
    there are such cases, we should try to find a solution.




    2017-01-06 12:35 GMT+01:00 Michael Whapples <dmarc-noreply@xxxxxxxxxxxxx>:

        Thank you, that has actually made it very clear why some
        rules with [] work while others get into the loop.


        It seemed a bit odd that something before [] would advance
        the cursor when something after does not. Take two rules like:

        pass2 @1[] @1

        pass2 []@1 @1

        Why should the first rule advance the cursor when the second
        does not?


        I understand from you explaining how the internals work, but
        without that internal workings knowledge it does not seem
        logical or intuitive that before and after are handled
        differently.


        Maybe this could also be a documentation improvement: add
        something which states that the cursor will be moved to the
        position just after the replacement brackets when a rule is
        applied.


        Another small note, it might be worth being extremely precise
        in terminology here. The term "replacement" might mean the
        brackets [] or it may mean what you are replacing it with. So
        maybe refer to the brackets [] as the "replacement brackets"
        or the "replacement group".


        Michael Whapples


        On 06/01/2017 09:33, Bert Frees wrote:
        Yes, it is like other translation rules. However, with both
        multipass and other translation rules, it is not the first
        match that is used, but rather the best match. Only one
        matching rule is used; the rest are ignored. Processing
        resumes at the first character after the replacement. This
        means that if the replacement starts at offset 0 and has
        length 0, the processing resumes at the same place which
        results in an endless loop.

        2017-01-06 8:09 GMT+01:00 Dave Mielke <dave@xxxxxxxxx>:

            I'm having trouble understanding when a multipass opcode
            (e.g. pass2) moves on
            to the next character. It doesn't seem to be like
            translation rules where the
            first one that matches is used, the rest are skipped,
            and processing resumes at
            the next character after the replacement.

            Are they all always processed, or is the first one that
            matches the only one
            that's processed?

            Where does processing resume after a match?

            --
            Dave Mielke           | 2213 Fox Crescent | The Bible is the very Word of God.
            Phone: 1-613-726-0014 | Ottawa, Ontario   | http://Mielke.cc/bible/
            EMail: Dave@xxxxxxxxx | Canada  K2A 1H7   | http://FamilyRadio.org/
            For a description of the software, to download it and
            links to
            project pages go to http://liblouis.org






