[liblouis-liblouisxml] Re: Fix for extra space after 'changetable'

  • From: Bert Frees <bertfrees@xxxxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Mon, 07 Jul 2014 19:52:54 +0200

Hi Keith,

I appreciate the issue with bad markup, and I agree that there needs to
be some logic that handles whitespace correctly and automatically. I
wonder, however, whether such a more sophisticated algorithm should be
part of the formatter at all. Fixing markup errors seems like a task that
could be done in a preprocessing step. It also seems likely to depend a
lot on the XML supplier and the type of errors, so we can't make too
many assumptions either.
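
To make the preprocessing idea concrete, here is a minimal sketch of one
possible cleanup pass, written in C only because that is the language of
the rest of the code base. It works on a plain character buffer rather
than on XML nodes, and the name `collapse_whitespace` is hypothetical; it
is not an existing liblouis or liblouisutdml function. Which whitespace
counts as insignificant would still have to be decided per supplier.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of liblouisutdml.
   Collapses every run of whitespace in `s` to a single space, in place,
   and drops leading and trailing whitespace. Returns the new length. */
static size_t collapse_whitespace(char *s)
{
    assert(s != NULL);

    char *out = s;          /* next position to write to */
    int pending_space = 0;  /* 1 if a space is owed before the next word */

    for (const char *in = s; *in != '\0'; in++) {
        if (isspace((unsigned char)*in)) {
            /* Only remember the space if some text was already written,
               so leading whitespace is dropped. */
            pending_space = (out != s);
        } else {
            if (pending_space)
                *out++ = ' ';
            pending_space = 0;
            *out++ = *in;
        }
    }
    /* A still-pending space at this point is trailing and is discarded. */
    *out = '\0';
    return (size_t)(out - s);
}
```

Applied to a buffer holding `"  a US \n\t state  "`, this leaves
`a US state` in the buffer. A real preprocessing step would have to be
markup-aware, which this sketch deliberately is not.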

Another example of something that liblouis can handle but that is not
really part of braille translation is the `correct` rules in some
liblouis tables, introduced to fix OCR errors. This, too, could be done
in a preprocessing step. The difference is that it's easy to remove
translation rules, whereas it's much harder to change the behaviour of
liblouisutdml because it's coded in C.
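
For reference, the C change proposed further down the thread boils down
to one extra look-ahead condition. The following is a self-contained
sketch of that condition only. The struct is a simplified stand-in for
the fields of `ud` quoted in the thread (the real structure and buffer
types in liblouisutdml differ), and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_TRANS_LENGTH 16 /* stand-in for MAX_TRANS_LENGTH */

/* Simplified stand-in for the fields of `ud` mentioned in the thread. */
struct sketch_ud {
    unsigned int translated_buffer[SKETCH_MAX_TRANS_LENGTH];
    int translated_length;
    unsigned int text_buffer[SKETCH_MAX_TRANS_LENGTH];
};

/* Appends a separating space (32) to the translated output, unless the
   output is empty, is full, already ends in a space or control character,
   or the next text to translate itself starts with a space.
   Returns 1 if a space was added, 0 otherwise. */
static int sketch_add_separator(struct sketch_ud *ud)
{
    assert(ud != NULL);

    if (ud->translated_length > 0
        && ud->translated_length < SKETCH_MAX_TRANS_LENGTH
        && ud->translated_buffer[ud->translated_length - 1] > 32
        && ud->text_buffer[0] != 32) { /* the proposed look-ahead */
        ud->translated_buffer[ud->translated_length++] = 32;
        return 1;
    }
    return 0;
}
```

Even in this reduced form the open question is visible: the guard only
looks at the first character of the next chunk of text, so it says
nothing about cases where no space follows at all.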


Bert


Keith Creasy writes:

> Hi Bert.
>
> The problem with "requiring" anything to be in the original XML is that we 
> (those who use the files to produce Braille) don't really have control over 
> it. Essentially we can't require anything and have to work with what we get. 
> If things are done correctly whitespace isn't really an issue because the DTD 
> or schema dictates when white space is "significant" or not and when it can 
> be ignored. I believe a lot of the white-space handling code is there to 
> mostly correct what is essentially bad markup, at least in the context of 
> processing XML. I think we do need some of this code in there but it perhaps 
> needs to be a little more sophisticated regarding when it either adds or 
> removes white space.
>
>
> -----Original Message-----
> From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx 
> [mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On Behalf Of Bert Frees
> Sent: Monday, July 07, 2014 1:06 PM
> To: liblouis-liblouisxml@xxxxxxxxxxxxx
> Subject: [liblouis-liblouisxml] Re: Fix for extra space after 'changetable'
>
> OK John, thanks for the math example. I think what I'm suggesting is: a 
> Braille formatter shouldn't be concerned about adding whitespace (unless it 
> is for positioning and indentation, obviously). It should only have to 
> *remove* whitespace, and more specifically if it's 'insignificant'. There 
> should be clear and simple rules about when whitespace is significant and 
> when not. (I think I suggested a set of rules a while ago inspired by the 
> XHTML spec.)
>
> If space needs to be present before and after mathematical expressions, it's 
> better to require it to be already present in the XML. I think that approach 
> is safer than assuming space is always needed, because always adding space 
> can lead to annoying situations such as the one Paul describes.
>
> I'm not sure about Paul's solution, it seems more like a workaround. There 
> might not even be a space after the abbr tag in his example. Will the patch 
> still work in that case?
>
> By the way, Paul, I'm only trying to feed the discussion in the hope others 
> will get excited about fixing the problem. Don't expect me to make any code 
> changes in liblouisutdml, I haven't got any time allocated for that anymore, 
> unfortunately.
>
>
> Bert
>
>
>
> John J. Boyer writes:
>
>> insert_translation may be called more than once in translating a 
>> block of text such as a paragraph. This happens if different tables 
>> are needed. For example, if a paragraph contains text followed by 
>> MathML followed by more text then the literary text table will be 
>> used, then a math table, then the literaryTextTable again. The math 
>> should ordinarily be preceded and followed by a space. Of course, 
>> something more discriminating could be used. If you like Paul's 
>> solution you could try it. A test to make sure it doesn't break 
>> something else would also be nice.
>>
>> John
>>
>>
>> On Mon, Jul 07, 2014 at 05:55:33PM +0200, Bert Frees wrote:
>>> John,
>>> 
>>> You say that "the space is added to keep things from running 
>>> together". Remind me, why exactly is that needed again? Is it because 
>>> you remove space at another place in the code? If so, shouldn't the 
>>> logic that removes space be a little more conservative so that it 
>>> doesn't remove space that has to be added again later?
>>> 
>>> Bert
>>> 
>>> 
>>> Paul Wood writes:
>>> 
>>> > Hi John,
>>> > We have looked at the code and found that the last 3 statements of 
>>> > change_table.c are:
>>> >   insert_translation (ud->main_braille_table);
>>> >   ud->main_braille_table = oldTable;
>>> >   pop_sem_stack ();
>>> >
>>> > This last 'insert_translation' calls the code I referred to in my 
>>> > previous email. Because the text being translated doesn't end 
>>> > with a space, e.g. 'a <abbr>US</abbr> state', the 
>>> > 'insert_translation' code ADDS a space. As the next bit to be 
>>> > translated starts with a space, we end up with two spaces. As far 
>>> > as we can tell, there is nothing else that would add the space in 
>>> > the next two lines. What we suggest is that the code does a look 
>>> > forward: if the next character to be translated is a space then it 
>>> > doesn't add that space, but otherwise it does. Is this a workable 
>>> > solution?
>>> > Suggested solution:
>>> >
>>> > The line is in 'transcriber.c' and is in the function 
>>> > 'insert_translation (const char *table)'
>>> > It's replacing:
>>> > if (ud->translated_length > 0 && ud->translated_length <
>>> >        MAX_TRANS_LENGTH &&
>>> >        ud->translated_buffer[ud->translated_length - 1] > 32)
>>> >      {
>>> >      ud->translated_buffer[ud->translated_length++] = 32;
>>> >
>>> > with:
>>> > if (ud->translated_length > 0 && ud->translated_length <
>>> >        MAX_TRANS_LENGTH &&
>>> >        ud->translated_buffer[ud->translated_length - 1] > 32 &&
>>> >        ud->text_buffer[0] != 32)
>>> >      {
>>> >      ud->translated_buffer[ud->translated_length++] = 32;
>>> >
>>> > Thanks
>>> > Paul
>>> >
>>> >
>>> >
>>> > On 30/06/2014 17:03, John J. Boyer wrote:
>>> >> The space is added to keep things from running together. Your 
>>> >> concern about breaking something is justified, since this is the 
>>> >> function that handles all translations. The problem is more 
>>> >> likely to be in the function that actually handles changetable.
>>> >>
>>> >> John
>>> >>
>>> >> On Mon, Jun 30, 2014 at 03:40:11PM +0100, Paul Wood wrote:
>>> >>> Hi Guys,
>>> >>> We have a university student volunteering with us for the summer! 
>>> >>> So he thinks he has found the cause of the extra space after the 
>>> >>> changetable opcode. I'm worried it will break something else and 
>>> >>> I don't think we can run the checks as we are using windows.
>>> >>>
>>> >>> The line is in 'transcriber.c' and is in the function 
>>> >>> 'insert_translation (const char *table)'
>>> >>> It's after:
>>> >>> if (ud->translated_length > 0 && ud->translated_length <
>>> >>>        MAX_TRANS_LENGTH &&
>>> >>>        ud->translated_buffer[ud->translated_length - 1] > 32)
>>> >>>      {
>>> >>>
>>> >>> and is:
>>> >>> ud->translated_buffer[ud->translated_length++] = 32;
>>> >>>
>>> >>> He tells me 32 is the ASCII code for space, so basically it's adding a space.
>>> >>> Please tell me if we can do the checks under Windows and what 
>>> >>> else we need to do, i.e. create a fork etc.
>>> >>> Thanks
>>> >>> Paul
>>> >>>
>>> >>> --
>>> >>> Paul Wood, Chief Technical Officer *Torch Trust* Torch House, 
>>> >>> Torch Way, Market Harborough, Leics. LE16 9HL, UK Direct Line: 
>>> >>> *+44(0)1858 438269*
>>> >>> Tel: *+44(0)1858 438260*, Fax: *+44(0)1858 438275*
>>> >>> Email: paulw@xxxxxxxxxxxxxx <mailto:paulw@xxxxxxxxxxxxxx>
>>> >>> Website: www.torchtrust.org <http://www.torchtrust.org/>
>>> >>>
>>> 
>>> For a description of the software, to download it and links to 
>>> project pages go to http://www.abilitiessoft.com