[brailleblaster] Re: tika and text translation in brailleblaster

  • From: "John J. Boyer" <john@xxxxxxxxxxxxxx>
  • To: brailleblaster@xxxxxxxxxxxxx
  • Date: Mon, 25 Oct 2010 15:00:04 -0500

Blank lines (2 newlines) are the conventional manner of indicating 
paragraph breaks in text files, not just in liblouisutdml. This is used 
in tex and also in texinfo. Single newlines should not be indicated with 
<br/>, since they are usually unimportant. <pre>...</pre> is 
inappropriate, since people are interested in paragraphs. Tables and 
lists cannot really be marked up properly when they occur in a plain text 
file.
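
To illustrate the blank-line convention, here is a minimal Python sketch. 
It is not code from tika or liblouisutdml; the function name 
split_paragraphs is made up for this example, and it assumes that single 
newlines inside a paragraph can be treated as ordinary spaces.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs at blank lines.

    Single newlines inside a paragraph are treated as ordinary
    whitespace (an assumption of this sketch), so each paragraph
    comes back as one line of text.
    """
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks and runs of spaces; drop empty blocks.
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]


sample = "First line\nsecond line.\n\n\nNext paragraph.\n"
paragraphs = split_paragraphs(sample)
```

With the sample above, the two lines before the blank lines come back as 
one paragraph and the text after them as a second one.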

tika is deficient in not escaping the & and < characters it finds in 
text files. I hope this is fixed in the upcoming release.
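
As a rough sketch of the escaping in question, the following Python uses 
the standard library function xml.sax.saxutils.escape, which replaces &, 
< and > with entity references. This is only an illustration of the 
idea, not tika's code; paragraph_to_xhtml is a name made up for this 
example.

```python
from xml.sax.saxutils import escape


def paragraph_to_xhtml(paragraph):
    """Wrap one paragraph of plain text in <p>...</p>.

    escape() turns & into &amp;, < into &lt; and > into &gt;,
    so the text cannot be mistaken for markup.
    """
    return "<p>" + escape(paragraph) + "</p>"


markup = paragraph_to_xhtml("R&D: a < b")
# markup is now "<p>R&amp;D: a &lt; b</p>"
```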

If BrailleBlaster can recognize a file as plain text, it would probably 
be better to pass it directly to liblouisutdml with the configuration 
option formatFor set to utd. This will give a file similar to the sample 
I posted on the www.abilitiessoft.com website a few days ago.

Similarly, if BrailleBlaster can recognize a file as Daisy, it should be 
passed directly to liblouisutdml with formatFor set to utd.

Other types of files will have to go through tika. As I pointed out, we 
need to have tika optionally produce Daisy output, not xhtml.
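
The routing described above could be sketched roughly as follows, in 
Python. Everything here is an assumption made for illustration: the 
names Route and choose_route, the test for a .txt suffix, and the check 
for a <dtbook root element as a sign of a Daisy file. How BrailleBlaster 
would actually recognize file types is not settled.

```python
from enum import Enum
from pathlib import Path


class Route(Enum):
    DIRECT = "pass to liblouisutdml with formatFor set to utd"
    VIA_TIKA = "convert with tika first"


def choose_route(filename, head):
    """Pick a processing route for a file.

    filename: name of the input file.
    head: the first few hundred characters of the file, as text.
    Both detection rules below are simple heuristics assumed for
    this sketch, not an established detection method.
    """
    # Heuristic 1: treat a .txt suffix as plain text.
    if Path(filename).suffix.lower() == ".txt":
        return Route.DIRECT
    # Heuristic 2: treat a dtbook root element as Daisy.
    if "<dtbook" in head:
        return Route.DIRECT
    # Anything else goes through tika.
    return Route.VIA_TIKA


route = choose_route("report.doc", "")
# route is Route.VIA_TIKA
```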

John

On Mon, Oct 25, 2010 at 01:33:02PM -0500, qubit wrote:
> Greetings all --
> In my discussions on the developers and users lists for the tika document 
> translator, I have been focusing my attention first on the processing of txt 
> files.  In tika, txt files are converted to xhtml (the internal 
> representation), but as JohnB has discovered, it does not insert adequate 
> markup.  In particular, in the case of 2 consecutive newlines, tika should 
> insert a paragraph <p> to conform to the way current text files are handled 
> in liblouisutdml.  (John, please feel free to correct any misstatements.)
> 
> Anyway, looking more closely at the markup, I am wondering about other markup 
> as well, such as a line break <br> when a newline is encountered, and also 
> properly escaping special symbols in xhtml so there are no surprises when 
> the output xhtml is rendered by a browser.  For example, supposing the input 
> txt file contains an ampersand: '&'.  This is a special symbol to a 
> browser, so it will try to interpret the & and the characters after it as an entity. 
> Similarly for a '<' symbol.  (I actually got bitten by this type of thing 
> once as I sent a little text html tutorial to someone with an AOL account. 
> Turned out that AOL passed all the markup in my tutorial through to html, 
> and when my friend read it, he was seeing the tutorial with all the examples 
> rendered, rather than the source text.  What was more, he couldn't 
> understand me when I said he wasn't looking at what I typed, and concluded 
> my tutorial was unreadable.)
> 
> So in translating from text/plain files, I believe that xhtml special 
> symbols should be escaped: &lt; &gt; &amp; &nbsp; &#nnn; for numbers, etc.
> tika is currently not doing this, but it would be easy to fix.
> 
> Now that leads to the final issue: While exchanging mail on the tika lists, 
> when I got to the part about inserting markup, one person asked if I should 
> recognize lists and tables and insert relevant markup.  I believe this would 
> be a bad idea and out of the scope of a translator.
> 
> Finally, rather than insert markup for paragraphs and newlines, would it be 
> better to have tika generate a large <pre></pre> block around the text? This 
> would not exclude the escaping of special characters, but would take care of 
> counting whitespace to insert markup.
> 
> Any and all comments are welcome.
> Thanks.
> --le
> 

-- 
John J. Boyer, Executive Director
GodTouches Digital Ministry, Inc.
http://www.godtouches.org
Madison, Wisconsin, USA
Peace, Love, Service

