[brailleblaster] Re: More Clarifications

  • From: Michael Whapples <mwhapples@xxxxxxx>
  • To: brailleblaster@xxxxxxxxxxxxx
  • Date: Tue, 07 Aug 2012 22:14:00 +0100

Hello,
A few questions regarding the text tika gives from PDF:
1. Does tika insert page break characters where the PDF starts a new page? I know some PDF to text tools do this but I don't know about tika.
2. How would liblouisutdml handle these page break characters? Would that result in a page break in the Braille although one isn't really wanted?
3. If tika inserts a blank line between paragraphs, then wouldn't it be better to see if we could actually fix tika (or the PDF processor tika uses, I think it might be PDFBox) so that the HTML has proper paragraphs?
4. How does tika determine the flow of the text? Adobe Reader has a few options for this. Also, does tika attempt to maintain the layout (like pdftotext does when you give it the -layout option)? An example of where I think this might cause an issue is a PDF with multiple columns (not a table), such as a page from a science journal that is split into two columns for visual appearance purposes only.

Michael Whapples
On 07/08/2012 21:41, John J. Boyer wrote:
Hi Francois,

liblouisutdml treats blank lines in a plain text file as paragraph
separators, so if the file has those we will hopefully get sensible
paragraphs.

John

On Tue, Aug 07, 2012 at 02:26:22PM -0400, François Ouellette wrote:
John: I tried a few approaches when importing non-XML files with Tika,
one that produces text only, that we just display on the daisy view,
and one that produces an XML file (in fact it is an XHTML file). When
walking the XHTML with XOM we get what is displayed today on the daisy
view. I tried opening the XHTML with liblouisutdml using a sem file
but the results were not very good. The problem is that with Tika we
usually get meta, heading and title elements, but only one <p> element
that holds all the extracted text. I have not tried processing the
Tika text file with liblouisutdml as you indicated earlier. This may
be the better option, as we would also get a UTD structure and an
initial translation.

F.

On Tue, Aug 7, 2012 at 1:47 PM, John J. Boyer
<john.boyer@xxxxxxxxxxxxxxxxx> wrote:
UTF-8 should be translated only for display purposes. liblouisutdml
requires the original UTF-8.

When I said that text files should be processed by calling
translateTextFile with formatFor utd I was thinking of plain text, not
text derived from imported files such as pdf. It would probably be more
consistent to let tika handle even plain text, converting it to xml.

John

On Tue, Aug 07, 2012 at 11:45:15AM -0400, François Ouellette wrote:
When importing non-XML documents with foreign or special characters
they may contain Unicode expressions such as \u00e9 since they were
not processed by liblouisutdml. I have a routine to find the
corresponding codepoints and display the corresponding character. I
haven't done much testing yet but I guess that when saving as UTD
these should be processed correctly.
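
As a rough illustration of the kind of routine described above, the following Python sketch replaces literal \uXXXX sequences with the characters they name. The function name decode_u_escapes is hypothetical and is not part of BrailleBlaster. It handles only four-hex-digit escapes and leaves all other text untouched; escapes that encode surrogate pairs would need extra handling.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text: str) -> str:
    """Replace each literal \\uXXXX sequence with the character it names.

    Hypothetical helper for illustration only.
    """
    # int(..., 16) turns the hex digits into a code point; chr() gives the character.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    # The raw string keeps the backslashes literal, as they would appear in imported text.
    print(decode_u_escapes(r"Fran\u00e7ois caf\u00e9"))
```

Because the conversion is done only on the displayed copy, the original text can still be passed on unchanged.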

François.

On Tue, Aug 7, 2012 at 10:04 AM, John J. Boyer <john@xxxxxxxxxxxxxx> wrote:
Hi Francois,

What context are you considering when you ask about UTF-8? If these
codes occur in xml documents they are automatically handled by
liblouisutdml on translation. What does Java do when you attempt to
display them?

Thanks,
John

On Tue, Aug 07, 2012 at 07:59:06AM -0400, François Ouellette wrote:
John: thanks for the clarifications. We are half-way through for the
brf files, I will add a method to read and backtranslate them.

What about UTF-8? Is BB supposed to recognize the \u sequences and
change them to the corresponding characters?

Thanks.
François.

On Mon, Aug 6, 2012 at 8:53 PM, John J. Boyer
<john.boyer@xxxxxxxxxxxxxxxxx> wrote:
My vision is that BrailleBlaster will be able to display and edit any
flavor of xml, just as liblouisutdml can translate any flavor.
Liblouisutdml accomplishes this by using a sort of pattern-matching
virtual machine. The semantic-action files are the "programs" for this
VM. If I had it to do over again I would format them somewhat
differently. Each line would contain first the pattern, then the
"instruction", then parameters, separated by white space. Optionally,
an equals sign could be inserted between the patterns and the
instructions, so Java could accept them as properties files.

Most of the patterns are literal, such as "p", "span,class,italic", and so
on. Patterns can also be XPath expressions.

The instruction is either the name of an action to be applied to the
pattern, a style or a macro.

The parameters are bits of text to be inserted between the texts
contained in the subtree of the patterns. For an example, see nemeth.sem.
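
To make the proposed line layout concrete, here is a small Python sketch that parses lines of the form pattern, instruction, parameters, with an optional equals sign after the pattern. It models the hypothetical format described above, not the format liblouisutdml actually reads. The names SemRule and parse_sem_line and the sample lines are made up for illustration, and the sketch assumes that patterns contain no white space or equals signs (so XPath patterns are out of scope) and that lines starting with # are comments.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SemRule:
    pattern: str           # e.g. "p" or "span,class,italic"
    instruction: str       # name of an action, a style, or a macro
    parameters: List[str]  # bits of text to insert; may be empty


# pattern, then either "=" (with optional spaces) or plain white space,
# then the instruction, then whatever is left as parameters.
_LINE = re.compile(r"^([^\s=]+)(?:\s*=\s*|\s+)(\S+)\s*(.*)$")


def parse_sem_line(line: str) -> Optional[SemRule]:
    """Parse one line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"expected 'pattern instruction [parameters]': {line!r}")
    pattern, instruction, rest = match.groups()
    return SemRule(pattern, instruction, rest.split())


if __name__ == "__main__":
    for sample in ["p para", "span,class,italic = italicx", "# a comment"]:
        print(parse_sem_line(sample))
```

Accepting the equals sign as an alternative separator is what would let the same file be read as key and value pairs by a properties-style loader.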

For BrailleBlaster, the patterns would be similar, actions would also be
similar in many cases, except that those having to do with Braille would
be dropped, and others, having to do with displaying on a screen would
be added.

This describes the display virtual machine. The edit virtual machine
would be more complex, since there are two types of editing, changing
the text in a text node and adding or deleting nodes. The former is
quite straightforward. The latter will generally require selecting the
name of a style. The definition of style will have to include the name
of the element and any relevant attribute names and values.

On other clarifications: The best way to handle text files is to use the
translateTextFile method with the configuration setting formatFor utd.
This will result in an output file with text paragraphs (separated by
blank lines) enclosed in <p> tags and the Braille translation enclosed
in <brl> tags, as normal. This can then be handled by BrailleBlaster
like any other utd file.
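
The paragraph handling described above can be pictured with a short Python sketch. It does not call liblouisutdml and produces no <brl> elements; it only shows how blank lines can delimit paragraphs and how each paragraph might then be wrapped in <p> tags. The function names are hypothetical, and joining the lines of a paragraph with single spaces is an assumption of this sketch.

```python
import re
from typing import List
from xml.sax.saxutils import escape


def split_paragraphs(text: str) -> List[str]:
    """Split plain text into paragraphs at runs of one or more blank lines."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse the line breaks inside each paragraph into single spaces.
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]


def wrap_paragraphs(text: str) -> str:
    """Wrap each paragraph in <p> tags, escaping XML special characters."""
    return "\n".join(f"<p>{escape(p)}</p>" for p in split_paragraphs(text))


if __name__ == "__main__":
    sample = "First paragraph,\nstill the first.\n\nSecond paragraph.\n"
    print(wrap_paragraphs(sample))
```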

BrailleBlaster is also supposed to handle brf files natively. When these
are recognized they should be displayed in the Braille view. The method
to use is backTranslateFile; formatFor utd should also be specified.
Again the resulting output file can be handled like any other utd file.

John

--
John J. Boyer; President, Chief Software Developer
Abilitiessoft, Inc.
http://www.abilitiessoft.com
Madison, Wisconsin USA
Developing software for people with disabilities


--
John J. Boyer, Executive Director
GodTouches Digital Ministry, Inc.
http://www.godtouches.org
Madison, Wisconsin, USA
Peace, Love, Service





