[brailleblaster] Re: Compiled

  • From: François Ouellette <braille@xxxxxxx>
  • To: brailleblaster@xxxxxxxxxxxxx
  • Date: Fri, 20 Jul 2012 00:26:41 -0400

Hi all,

This message came at the right time. I just finished experimenting
with the parser classes offered by Tika, and I can create an XHTML
document from a Word or RTF file that has all the elements for the
title, headings and paragraphs, including text formatting tags like
italic and bold. It even extracts the meta information about the
document (author, encoding, MIME type, number of characters, etc.).

This means that with the proper statements in a sem file we can
translate that into a UTDML structure for BB. This is actually better
than I expected. All this can be achieved with fewer than 10 lines of
Java code! Under the hood, Tika uses SAX classes to make that
happen.
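
As an illustration only (this is not the code from the experiment),
here is a minimal sketch of the SAX side using nothing but the JDK. A
TransformerHandler serializes whatever SAX events it receives. The
class name, the helper methods and the sample content are all
hypothetical, and emitSampleEvents() merely stands in for the parser
so the sketch runs without Tika on the classpath.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SaxToXhtmlSketch {

    // Creates a SAX handler that writes every event it receives as XML text.
    static TransformerHandler newSerializer(StringWriter out)
            throws TransformerConfigurationException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Emits one element that contains only text.
    static void element(ContentHandler h, String name, String text)
            throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", name, name);
    }

    // Hypothetical stand-in for the parser: a heading plus a paragraph
    // with bold text. A real parser would fire these events while
    // reading the Word or RTF file.
    static void emitSampleEvents(ContentHandler h) throws SAXException {
        h.startDocument();
        h.startElement("", "html", "html", new AttributesImpl());
        h.startElement("", "body", "body", new AttributesImpl());
        element(h, "h1", "Title");
        h.startElement("", "p", "p", new AttributesImpl());
        element(h, "b", "bold");
        h.characters(" text".toCharArray(), 0, 5);
        h.endElement("", "p", "p");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    // Runs the stand-in parser against the serializer and returns the markup.
    static String render() {
        try {
            StringWriter out = new StringWriter();
            emitSampleEvents(newSerializer(out));
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

With Tika available, the same handler would presumably be passed to
the parser in place of emitSampleEvents(), for example through
AutoDetectParser.parse(stream, handler, metadata, context), which is
what keeps the real code so short.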

To be continued...

François.

On Thu, Jul 19, 2012 at 11:31 PM, John Gardner
<john.gardner@xxxxxxxxxxxx> wrote:
> Hello all, we should distinguish between formatting and structure.  We need 
> to capture the structure - which includes paragraphs, headings, tables, etc.  
> But we don't need to capture the formatting of those things - whether the 
> paragraph is indented or double spaced, whether the heading is bold or 
> centered. Or...  From the on-going conversation, it isn't clear to me that we 
> are even capturing structure.  I hope I am misunderstanding.
>
> Thanks.
> John G
>
>
>
> -----Original Message-----
> From: brailleblaster-bounce@xxxxxxxxxxxxx 
> [mailto:brailleblaster-bounce@xxxxxxxxxxxxx] On Behalf Of François Ouellette
> Sent: Thursday, July 19, 2012 3:25 PM
> To: brailleblaster@xxxxxxxxxxxxx
> Subject: [brailleblaster] Re: Compiled
>
> I don't know what was there at the beginning, but it looks like it has 
> improved over time, judging by the notes from the consecutive releases.
> Again, it is not a content formatter, it is a content extractor! But we can 
> get XML or XHTML from a file through SAX classes and decide on the resulting 
> format. I am following up on this. Currently we only get unformatted text, 
> but it is a starting point.
>
> François
>
> On Thu, Jul 19, 2012 at 5:40 PM, Michael Whapples <mwhapples@xxxxxxx> wrote:
>> Hello,
>> I remember John Gardner mentioning Tika near the beginning of the
>> Brailleblaster project, but at that time we concluded the formatting
>> from it was not really good enough. Has it improved?
>>
>> Michael Whapples
>> On 19/07/2012 22:17, François Ouellette wrote:
>>>
>>> John: Exactly! Transformation should be a breeze with the sem
>>> statements. I will be sure to follow up.
>>>
>>> François.
>>>
>>> On Thu, Jul 19, 2012 at 3:37 PM, John J. Boyer
>>> <john.boyer@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi Francois,
>>>>
>>>> It is very desirable to get XML output from Tika. liblouisutdml may
>>>> already have a .sem file to handle it. If not, one can be created
>>>> easily.
>>>>
>>>> John
>>>>
>>>> On Thu, Jul 19, 2012 at 03:03:41PM -0400, François Ouellette wrote:
>>>>>
>>>>> (follow-up on previous email)
>>>>> Vic: it seems like we can produce formatted XML or HTML from the
>>>>> extraction, in which case we could retrieve the main formatting
>>>>> elements and replicate them in BB. Let me check on this.
>>>>>
>>>>> François.
>>>>>
>>>>> On Thu, Jul 19, 2012 at 12:26 PM, Vic Beckley
>>>>> <vic.beckley3@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> John and François,
>>>>>>
>>>>>> I got it to compile. I opened a Word 2010 document with it. It
>>>>>> seemed the format of the text was missing. I don't think the
>>>>>> paragraphs were still intact.
>>>>>>
>>>>>> I will do more testing later. I am a little under the weather
>>>>>> today and I think I am going to go rest now. More later. Looks
>>>>>> good so far.
>>>>>>
>>>>>>
>>>>>> Best regards from Ohio, U.S.A.,
>>>>>>
>>>>>> Vic
>>>>>> E-mail: vic.beckley3@xxxxxxxxx
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> --
>>>> John J. Boyer; President, Chief Software Developer Abilitiessoft,
>>>> Inc.
>>>> http://www.abilitiessoft.com
>>>> Madison, Wisconsin USA
>>>> Developing software for people with disabilities
>>>>
>>>>
>>
>>
>
>
>
