[liblouis-liblouisxml] Re: Fix for extra space after 'changetable'

  • From: Keith Creasy <kcreasy@xxxxxxx>
  • To: "liblouis-liblouisxml@xxxxxxxxxxxxx" <liblouis-liblouisxml@xxxxxxxxxxxxx>
  • Date: Mon, 7 Jul 2014 18:03:08 +0000

Bert.

Sure, pre-processing is an option. I don't know whether that alone would remove 
the need for LibLouis to handle some whitespace issues itself, but it might.

I've found that a lot of XML files are pretty messy.

-----Original Message-----
From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx 
[mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On Behalf Of Bert Frees
Sent: Monday, July 07, 2014 1:53 PM
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Subject: [liblouis-liblouisxml] Re: Fix for extra space after 'changetable'

Hi Keith,

I appreciate the issue with bad markup and I agree that there needs to be some 
logic that can automatically handle whitespace correctly. I wonder however if 
such a more sophisticated algorithm should be part of the formatter or not. 
Fixing markup errors seems like a task that could be done in a preprocessing 
step. It also seems like it could depend a lot on the XML supplier and the 
type of errors, so we can't make too many assumptions either.

Another example of something that liblouis can handle but that is not really 
part of braille translation is the `correct` rules in some liblouis tables, 
introduced to fix OCR errors. This is also something that could be done in a 
preprocessing step. The difference is that it's easy to remove translation 
rules, whereas it's much harder to change the behaviour of liblouisutdml 
because it's coded in C.


Bert
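
As an illustration of the kind of preprocessing step discussed here, the
following is a minimal sketch in C that collapses each run of whitespace to a
single space before the text reaches the formatter. The function name
collapse_whitespace is hypothetical and is not part of liblouis or
liblouisutdml. The rule it applies (any run of whitespace counts as one space)
is only an assumption about what a simple whitespace rule might look like;
real input may need rules that depend on the element and on the XML supplier.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of liblouis or liblouisutdml.
 * Collapses every run of whitespace in `s` to a single space, in place,
 * and returns the new length. Leading and trailing runs are reduced to
 * one space rather than stripped, so no word boundary is lost. */
size_t collapse_whitespace(char *s)
{
    assert(s != NULL);
    size_t out = 0;      /* next write position */
    int in_space = 0;    /* non-zero while inside a whitespace run */

    for (size_t in = 0; s[in] != '\0'; in++) {
        if (isspace((unsigned char)s[in])) {
            if (!in_space)
                s[out++] = ' ';   /* keep only the first of the run */
            in_space = 1;
        } else {
            s[out++] = s[in];
            in_space = 0;
        }
    }
    s[out] = '\0';
    return out;
}
```

Called on a buffer holding "a  \n\t US   state", this leaves "a US state" in
the buffer and returns 10.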


Keith Creasy writes:

> Hi Bert.
>
> The problem with "requiring" anything to be in the original XML is that we 
> (those who use the files to produce Braille) don't really have control over 
> it. Essentially we can't require anything and have to work with what we get. 
> If things are done correctly whitespace isn't really an issue because the DTD 
> or schema dictates when white space is "significant" or not and when it can 
> be ignored. I believe a lot of the white-space handling code is there to 
> mostly correct what is essentially bad markup, at least in the context of 
> processing XML. I think we do need some of this code in there but it perhaps 
> needs to be a little more sophisticated regarding when it either adds or 
> removes white space.
>
>
> -----Original Message-----
> From: liblouis-liblouisxml-bounce@xxxxxxxxxxxxx 
> [mailto:liblouis-liblouisxml-bounce@xxxxxxxxxxxxx] On Behalf Of Bert 
> Frees
> Sent: Monday, July 07, 2014 1:06 PM
> To: liblouis-liblouisxml@xxxxxxxxxxxxx
> Subject: [liblouis-liblouisxml] Re: Fix for extra space after 'changetable'
>
> OK John, thanks for the math example. I think what I'm suggesting is: 
> a Braille formatter shouldn't be concerned about adding whitespace 
> (unless it is for positioning and indentation, obviously). It should 
> only have to *remove* whitespace, and more specifically if it's 
> 'insignificant'. There should be clear and simple rules about when 
> whitespace is significant and when not. (I think I suggested a set of 
> rules a while ago inspired by the XHTML spec.)
>
> If space needs to be present before and after mathematical expressions, it's 
> better to require it to be already present in the XML. I think that approach 
> is safer than assuming space is always needed, because always adding space 
> can lead to annoying situations such as the one Paul describes.
>
> I'm not sure about Paul's solution, it seems more like a workaround. There 
> might not even be a space after the abbr tag in his example. Will the patch 
> still work in that case?
>
> By the way, Paul, I'm only trying to feed the discussion in the hope others 
> will get excited about fixing the problem. Don't expect me to make any code 
> changes in liblouisutdml, I haven't got any time allocated for that anymore, 
> unfortunately.
>
>
> Bert
>
>
>
> John J. Boyer writes:
>
>> insert_translation may be called more than once in translating a 
>> block of text such as a paragraph. This happens if different tables 
>> are needed. For example, if a paragraph contains text followed by 
>> MathML followed by more text then the literary text table will be 
>> used, then a math table, then the literaryTextTable again. The math 
>> should ordinarily be preceded and followed by a space. Of course, 
>> something more discriminating could be used. If you like Paul's 
>> solution you could try it. A test to make sure it doesn't break 
>> something else would also be nice.
>>
>> John
>>
>>
>> On Mon, Jul 07, 2014 at 05:55:33PM +0200, Bert Frees wrote:
>>> John,
>>> 
>>> You say that "the space is added to keep things from running 
>>> together". Remind me, why exactly is that needed again? Is it 
>>> because you remove space at another place in the code? If so, 
>>> shouldn't the logic that removes space be a little more conservative 
>>> so that it doesn't remove space that has to be added again later?
>>> 
>>> Bert
>>> 
>>> 
>>> Paul Wood writes:
>>> 
>>> > Hi John,
>>> > We have looked at the code and found that the last 3 statements of 
>>> > change_table.c are:
>>> >   insert_translation (ud->main_braille_table);
>>> >    ud->main_braille_table = oldTable;
>>> >    pop_sem_stack ();
>>> >
>>> > This last 'insert_translation' calls the code I referred to in my 
>>> > previous email. Because the text being translated doesn't end with 
>>> > a space, e.g. 'a <abbr>US</abbr> state', the 'insert_translation' 
>>> > code ADDS a space. As the next bit to be translated starts with a 
>>> > space, we end up with two spaces. As far as we can tell, nothing 
>>> > in the next two lines would add the space. What we suggest is that 
>>> > the code looks ahead: if the next character to be translated is a 
>>> > space it doesn't add one, but otherwise it does. Is this a 
>>> > workable solution?
>>> > Suggested solution:
>>> >
>>> > The line is in 'transcriber.c' and is in the function 
>>> > 'insert_translation (const char *table)'
>>> > It's replacing:
>>> > if (ud->translated_length > 0 &&
>>> >     ud->translated_length < MAX_TRANS_LENGTH &&
>>> >     ud->translated_buffer[ud->translated_length - 1] > 32)
>>> >   {
>>> >     ud->translated_buffer[ud->translated_length++] = 32;
>>> >
>>> > with:
>>> > if (ud->translated_length > 0 &&
>>> >     ud->translated_length < MAX_TRANS_LENGTH &&
>>> >     ud->translated_buffer[ud->translated_length - 1] > 32 &&
>>> >     ud->text_buffer[0] != 32)
>>> >   {
>>> >     ud->translated_buffer[ud->translated_length++] = 32;
>>> >
>>> > Thanks
>>> > Paul
>>> >
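
The look-ahead condition proposed above can be tried out in isolation with a
small sketch like the one below. It is not the liblouisutdml code: the struct
is a simplified stand-in whose field names follow the snippets in this thread,
while the types, the tiny MAX_TRANS_LENGTH value and the function name
add_separator are assumptions made for illustration. It also assumes, as the
patch implies, that text_buffer holds the text that is about to be translated.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRANS_LENGTH 64   /* assumed small value, for illustration only */

/* Simplified stand-in for the document state. Field names follow the
 * snippets in this thread; types and sizes are assumptions. */
struct user_data {
    char translated_buffer[MAX_TRANS_LENGTH];
    int translated_length;
    const char *text_buffer;   /* text still waiting to be translated */
};

/* Hypothetical wrapper around the patched condition. Appends a space
 * (code 32) only if the output is non-empty, has room, ends in a
 * character above 32, and the upcoming text does not already start
 * with a space. Returns 1 if a space was added, 0 otherwise. */
int add_separator(struct user_data *ud)
{
    assert(ud != NULL && ud->text_buffer != NULL);
    if (ud->translated_length > 0 &&
        ud->translated_length < MAX_TRANS_LENGTH &&
        ud->translated_buffer[ud->translated_length - 1] > 32 &&
        ud->text_buffer[0] != 32) {
        ud->translated_buffer[ud->translated_length++] = 32;
        return 1;
    }
    return 0;
}
```

With the output ending in "US" and the upcoming text being " state", no space
is added, so the double space goes away. With the upcoming text being "state"
(no space after the closing abbr tag), a space is still added, exactly as
before the patch.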
>>> >
>>> >
>>> > On 30/06/2014 17:03, John J. Boyer wrote:
>>> >> The space is added to keep things from running together. Your 
>>> >> concern about breaking something is justified, since this is the 
>>> >> function that handles all translations. The problem is more 
>>> >> likely to be in the function that actually handles changetable.
>>> >>
>>> >> John
>>> >>
>>> >> On Mon, Jun 30, 2014 at 03:40:11PM +0100, Paul Wood wrote:
>>> >>> Hi Guys,
>>> >>> We have a university student volunteering with us for the summer! 
>>> >>> So he thinks he has found the cause of the extra space after the 
>>> >>> changetable opcode. I'm worried it will break something else and 
>>> >>> I don't think we can run the checks as we are using Windows.
>>> >>>
>>> >>> The line is in 'transcriber.c' and is in the function 
>>> >>> 'insert_translation (const char *table)'
>>> >>> It's after:
>>> >>> if (ud->translated_length > 0 && ud->translated_length <
>>> >>>        MAX_TRANS_LENGTH &&
>>> >>>        ud->translated_buffer[ud->translated_length - 1] > 32)
>>> >>>      {
>>> >>>
>>> >>> and is:
>>> >>> ud->translated_buffer[ud->translated_length++] = 32;
>>> >>>
>>> >>> He tells me 32 is the ASCII code for space, so basically it's adding a space.
>>> >>> Please tell me if we can do the checks under Windows and what 
>>> >>> else we need to do, i.e. create a fork etc.
>>> >>> Thanks
>>> >>> Paul
>>> >>>
>>> >>> --
>>> >>>
>>> >>> Paul Wood, Chief Technical Officer *Torch Trust* Torch House, 
>>> >>> Torch Way, Market Harborough, Leics. LE16 9HL, UK Direct Line:
>>> >>> *+44(0)1858 438269*
>>> >>> Tel: *+44(0)1858 438260*, Fax: *+44(0)1858 438275*
>>> >>> Email: paulw@xxxxxxxxxxxxxx <mailto:paulw@xxxxxxxxxxxxxx>
>>> >>> Website: www.torchtrust.org <http://www.torchtrust.org/>
>>> >>>
>>> >>> ____________________________________________________
>>> >>>
>>> >>> Chief Executive: Dr Gordon Temple Charity No. 1095904
>>> >>>
>>> >>> Privileged/Confidential Information may be contained in this message.
>>> >>> If you are not the intended recipient please destroy this 
>>> >>> message and kindly notify the sender by reply email. The 
>>> >>> computer from which this mail originates is equipped with virus 
>>> >>> screening software.
>>> >>> However Torch Trust cannot guarantee that the mail and its 
>>> >>> attachments are free from virus infection.
>>> >>>
>>> 
>>> For a description of the software, to download it and links to 
>>> project pages go to http://www.abilitiessoft.com
>

