[brailleblaster] Re: More Clarifications

From: François Ouellette <braille@xxxxxxx>
To: brailleblaster@xxxxxxxxxxxxx
Date: Tue, 7 Aug 2012 19:59:44 -0400
Michael: Tika is basically a text extraction interface, not a document
converter. It extracts whatever it finds and, depending on how much
detail is available in the source document (PDF documents can be
formatted in many ways), it will create paragraph tags if we ask for
XHTML output or simple line breaks if we ask for text only. It does not
seem to deal with page breaks. I haven't tried extracting text from a
document with columns yet.
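
One way to check the page-break question on a given extraction is to look
for form feed characters (U+000C), which some PDF-to-text tools emit at
page boundaries. The following is only a sketch: the class and method
names are made up for illustration, and it assumes the extracted text is
already available as a Java String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or BrailleBlaster.
public class PageBreakCheck {

    /** Splits extracted text into pages at form feed characters (U+000C). */
    public static List<String> splitPages(String text) {
        List<String> pages = new ArrayList<>();
        // A limit of -1 keeps trailing empty pages instead of dropping them.
        for (String page : text.split("\f", -1)) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        String sample = "First page text.\n\fSecond page text.\n";
        System.out.println(splitPages(sample).size() + " page(s) found");
    }
}
```

If the list always has a single entry for multi-page PDFs, the extractor
is probably not emitting page-break characters at all.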

I don't think we should expect too much "magic" to happen here, unless
a certain amount of development can take place. Tika is open source,
so we can look at the PDF parser, but I haven't ventured down this path
yet. I know there are lots of documents coming from government agencies
(at least up here in Canada) and efforts are being made to have them
in an accessible format, but I don't know to what extent these can be
read and processed easily with BB.

I don't know how liblouisutdml deals with page breaks in an input
document. I am just about to experiment with the plain text conversion
feature of liblouisutdml, so I can tell you more soon.
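
To make the blank-line convention discussed in this thread concrete, here
is a small sketch of splitting plain text into paragraphs at blank lines
and wrapping each one in <p> tags. It only illustrates the idea; it is not
liblouisutdml's code, and the class and method names are invented for the
example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not liblouisutdml's implementation.
public class BlankLineParagraphs {

    /** Splits text into paragraphs wherever one or more blank lines occur. */
    public static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        for (String chunk : normalized.split("\n\\s*\n")) {
            // Join the wrapped lines of one paragraph into a single line.
            String paragraph = chunk.trim().replaceAll("\\s+", " ");
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    /** Wraps each paragraph in a <p> element, escaping markup characters. */
    public static String toParagraphMarkup(String text) {
        StringBuilder out = new StringBuilder();
        for (String paragraph : split(text)) {
            String escaped = paragraph.replace("&", "&amp;")
                                      .replace("<", "&lt;")
                                      .replace(">", "&gt;");
            out.append("<p>").append(escaped).append("</p>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(toParagraphMarkup("First line\nsecond line\n\nNext paragraph"));
    }
}
```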

François.

On Tue, Aug 7, 2012 at 5:14 PM, Michael Whapples <mwhapples@xxxxxxx> wrote:
> Hello,
> A few questions regarding the text tika gives from PDF:
> 1. Does tika insert page break characters where the PDF starts a new page? I
> know some PDF to text tools do this but I don't know about tika.
> 2. How would liblouisutdml handle these page break characters? Would that
> result in a page break in the Braille although one isn't really wanted?
> 3. If tika inserts a blank line between paragraphs, then wouldn't it be
> better to see if we could actually fix tika (or the PDF processor tika uses,
> I think it might be PDFBox) so that the HTML has proper paragraphs?
> 4. How does tika determine the flow of the text? Adobe reader has a few
> options for this. Also does tika attempt to maintain the layout (like
> pdftotext does when you give it the -layout option)? I guess an example of
> what I see might cause an issue is where the PDF has multiple columns (not a
> table) like a page from a science journal which may have the page split into
> two columns for visual appearance purposes only.
>
> Michael Whapples
> On 07/08/2012 21:41, John J. Boyer wrote:
>>
>> Hi Francois,
>>
>> liblouisutdml treats blank lines in a plain text file as paragraph
>> separators, so if the file has those we will hopefully get sensible
>> paragraphs.
>>
>> John
>>
>> On Tue, Aug 07, 2012 at 02:26:22PM -0400, François Ouellette wrote:
>>>
>>> John: I tried a few approaches when importing non-XML files with Tika,
>>> one that produces text only, that we just display on the daisy view,
>>> and one that produces an XML file (in fact it is an XHTML file). When
>>> walking the XHTML with XOM we get what is displayed today on the daisy
>>> view. I tried opening the XHTML with liblouisutdml using a sem file
>>> but the results were not very good. The problem is that with Tika we
>>> usually get meta, heading and title elements, but only one <p> element
>>> that holds all the extracted text. I have not tried processing the
>>> Tika text file with liblouisutdml as you indicated earlier. This may
>>> be the better option, as we would also get a UTD structure and an
>>> initial translation.
>>>
>>> F.
>>>
>>> On Tue, Aug 7, 2012 at 1:47 PM, John J. Boyer
>>> <john.boyer@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> UTF-8 should be translated only for display purposes. liblouisutdml
>>>> requires the original UTF-8.
>>>>
>>>> When I said that text files should be processed by calling
>>>> translateTextFile with formatFor utd I was thinking of plain text, not
>>>> text derived from imported files such as pdf. It would probably be more
>>>> consistent to let tika handle even plain text, converting it to xml.
>>>>
>>>> John
>>>>
>>>> On Tue, Aug 07, 2012 at 11:45:15AM -0400, François Ouellette wrote:
>>>>>
>>>>> When importing non-XML documents with foreign or special characters
>>>>> they may contain Unicode expressions such as \u00e9 since they were
>>>>> not processed by liblouisutdml. I have a routine to find the
>>>>> corresponding codepoints and display the corresponding character. I
>>>>> haven't done much testing yet but I guess that when saving as UTD
>>>>> these should be processed correctly.
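
A routine of this kind could look roughly like the sketch below. It is an
assumption about the approach, not the code actually used in BB: it
replaces each backslash-u escape with four hex digits (for example
\u00e9) by the character it names, and the class and method names are
invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the routine used in BrailleBlaster.
public class UnicodeEscapes {

    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Replaces every four-digit escape in the input with its character. */
    public static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslash from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf\\u00e9"));
    }
}
```

The decoded string should be used for display only; the file passed on for
translation would keep its original encoding.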
>>>>>
>>>>> François.
>>>>>
>>>>> On Tue, Aug 7, 2012 at 10:04 AM, John J. Boyer <john@xxxxxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> Hi Francois,
>>>>>>
>>>>>> What context are you considering when you ask about UTF-8? If these
>>>>>> codes occur in xml documents they are automatically handled by
>>>>>> liblouisutdml on translation. What does Java do when you attempt to
>>>>>> display them?
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
>>>>>> On Tue, Aug 07, 2012 at 07:59:06AM -0400, François Ouellette wrote:
>>>>>>>
>>>>>>> John: thanks for the clarifications. We are halfway through the
>>>>>>> brf files; I will add a method to read and backtranslate them.
>>>>>>>
>>>>>>> What about UTF-8? Is BB supposed to recognize the \u sequences and
>>>>>>> change them to the corresponding characters?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> François.
>>>>>>>
>>>>>>> On Mon, Aug 6, 2012 at 8:53 PM, John J. Boyer
>>>>>>> <john.boyer@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> My vision is that BrailleBlaster will be able to display and edit
>>>>>>>> any
>>>>>>>> flavor of xml, just as liblouisutdml can translate any flavor.
>>>>>>>> Liblouisutdml accomplishes this by using a sort of pattern-matching
>>>>>>>> virtual machine. The semantic-action files are the "programs" for
>>>>>>>> this
>>>>>>>> VM. If I had it to do over again I would format them somewhat
>>>>>>>> differently. Each line would contain first the pattern, then the
>>>>>>>> "instruction", then parameters, separated by white space.
>>>>>>>> Optionally,
>>>>>>>> an equals sign could be inserted between the patterns and the
>>>>>>>> instructions, so Java could accept them as properties files.
>>>>>>>>
>>>>>>>> Most of the patterns are literal, such as "p", "span,class,italic",
>>>>>>>> and so
>>>>>>>> on. Patterns can also be XPath expressions.
>>>>>>>>
>>>>>>>> The instruction is either the name of an action to be applied to the
>>>>>>>> pattern, a style or a macro.
>>>>>>>>
>>>>>>>> The parameters are bits of text to be inserted between the texts
>>>>>>>> contained in the subtree of the patterns. For an example, see
>>>>>>>> nemeth.sem
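
As a rough illustration of the proposed line format (pattern, optional
equals sign, instruction, then parameters), the sketch below loads such
lines with java.util.Properties and splits the value into instruction and
parameters. Everything here is an assumption for the example: the class
name, the sample lines, and the instruction and parameter names are
invented, and existing sem files use a different layout.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of the proposed format, not an existing parser.
public class SemLineSketch {

    /**
     * Returns the instruction followed by its parameters for a pattern,
     * or null if the pattern has no entry.
     */
    public static String[] lookup(String semText, String pattern) {
        Properties props = new Properties();
        try {
            // Properties ends the key at the first '=', ':' or white space.
            props.load(new StringReader(semText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(pattern);
        return value == null ? null : value.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Sample lines are made up; instruction names are placeholders.
        String sample = "p = para\n"
                      + "span,class,italic = italicx\n"
                      + "note = someAction before after\n";
        System.out.println(String.join(" | ", lookup(sample, "note")));
    }
}
```

One caveat with this approach: since '=' and ':' end a key, XPath patterns
containing those characters would need escaping in a properties file.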
>>>>>>>>
>>>>>>>> For BrailleBlaster, the patterns would be similar, actions would
>>>>>>>> also be
>>>>>>>> similar in many cases, except that those having to do with Braille
>>>>>>>> would
>>>>>>>> be dropped, and others, having to do with displaying on a screen
>>>>>>>> would
>>>>>>>> be added.
>>>>>>>>
>>>>>>>> This describes the display virtual machine. The edit virtual machine
>>>>>>>> would be more complex, since there are two types of editing,
>>>>>>>> changing
>>>>>>>> the text in a text node and adding or deleting nodes. The former is
>>>>>>>> quite straightforward. The latter will generally require selecting
>>>>>>>> the
>>>>>>>> name of a style. The definition of style will have to include the
>>>>>>>> name
>>>>>>>> of the element and any relevant attribute names and values.
>>>>>>>>
>>>>>>>> On other clarifications: The best way to handle text files is to use
>>>>>>>> the
>>>>>>>> translateTextFile method with the configuration setting formatFor
>>>>>>>> utd.
>>>>>>>> This will result in an output file with text paragraphs (separated
>>>>>>>> by
>>>>>>>> blank lines) enclosed in <p> tags and the Braille translation
>>>>>>>> enclosed
>>>>>>>> in <brl> tags, as normal. This can then be handled by BrailleBlaster
>>>>>>>> like any other utd file.
>>>>>>>>
>>>>>>>> BrailleBlaster is also supposed to handle brf files natively. When
>>>>>>>> these
>>>>>>>> are recognized they should be displayed in the Braille view. The
>>>>>>>> method
>>>>>>>> to use is backTranslateFile; formatFor utd should also be specified.
>>>>>>>> Again, the resulting output file can be handled like any other utd
>>>>>>>> file.
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>> --
>>>>>>>> John J. Boyer; President, Chief Software Developer
>>>>>>>> Abilitiessoft, Inc.
>>>>>>>> http://www.abilitiessoft.com
>>>>>>>> Madison, Wisconsin USA
>>>>>>>> Developing software for people with disabilities
>>>>>>>>
>>>>>>>>
>>>>>> --
>>>>>> John J. Boyer, Executive Director
>>>>>> GodTouches Digital Ministry, Inc.
>>>>>> http://www.godtouches.org
>>>>>> Madison, Wisconsin, USA
>>>>>> Peace, Love, Service
>>>>>>
>>>>>>
>>>> --
>>>> John J. Boyer; President, Chief Software Developer
>>>> Abilitiessoft, Inc.
>>>> http://www.abilitiessoft.com
>>>> Madison, Wisconsin USA
>>>> Developing software for people with disabilities
>>>>
>>>>
>
>