Hello all,

We should distinguish between formatting and structure. We need to capture the structure, which includes paragraphs, headings, tables, etc. But we don't need to capture the formatting of those things: whether the paragraph is indented or double spaced, whether the heading is bold or centered.

Or... From the ongoing conversation, it isn't clear to me that we are even capturing structure. I hope I am misunderstanding.

Thanks.
John G

-----Original Message-----
From: brailleblaster-bounce@xxxxxxxxxxxxx [mailto:brailleblaster-bounce@xxxxxxxxxxxxx] On Behalf Of François Ouellette
Sent: Thursday, July 19, 2012 3:25 PM
To: brailleblaster@xxxxxxxxxxxxx
Subject: [brailleblaster] Re: Compiled

I don't know what was there at the beginning, but judging by the notes from the consecutive releases, it looks like it has improved over time. Again, it is not a content formatter, it is a content extractor! But we can get XML or XHTML from a file through SAX classes and decide on the resulting format. I am following up on this. Currently we only get unformatted text, and that is a starting point.

François

On Thu, Jul 19, 2012 at 5:40 PM, Michael Whapples <mwhapples@xxxxxxx> wrote:
> Hello,
> I remember John Gardner mentioning Tika near the beginning of the
> Brailleblaster project, but at that time we concluded the formatting
> from it was not really good enough. Has it improved?
>
> Michael Whapples
> On 19/07/2012 22:17, François Ouellette wrote:
>>
>> John: Exactly! Transformation should be a breeze with the sem
>> statements. I will be sure to follow up.
>>
>> François.
>>
>> On Thu, Jul 19, 2012 at 3:37 PM, John J. Boyer
>> <john.boyer@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi Francois,
>>>
>>> It is very desirable to get xml output from tika. liblouisutdml may
>>> already have a .sem file to handle it. If not, one can be created
>>> easily.
>>>
>>> John
>>>
>>> On Thu, Jul 19, 2012 at 03:03:41PM -0400, François Ouellette wrote:
>>>>
>>>> (follow-up on previous email)
>>>> Vic: it seems like we can produce formatted XML or HTML from the
>>>> extraction, in which case we could retrieve the main formatting
>>>> elements and replicate them in BB. Let me check on this.
>>>>
>>>> François.
>>>>
>>>> On Thu, Jul 19, 2012 at 12:26 PM, Vic Beckley
>>>> <vic.beckley3@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> John and François,
>>>>>
>>>>> I got it to compile. I opened a Word 2010 document with it. It
>>>>> seemed the format of the text was missing. I don't think the
>>>>> paragraphs were still intact.
>>>>>
>>>>> I will do more testing later. I am a little under the weather
>>>>> today and I think I am going to go rest now. More later. Looks
>>>>> good so far.
>>>>>
>>>>>
>>>>> Best regards from Ohio, U.S.A.,
>>>>>
>>>>> Vic
>>>>> E-mail: vic.beckley3@xxxxxxxxx
>>>>>
>>> --
>>> John J. Boyer; President, Chief Software Developer
>>> Abilitiessoft, Inc.
>>> http://www.abilitiessoft.com
>>> Madison, Wisconsin USA
>>> Developing software for people with disabilities