[brailleblaster] Re: Compiled

  • From: François Ouellette <braille@xxxxxxx>
  • To: brailleblaster@xxxxxxxxxxxxx
  • Date: Fri, 20 Jul 2012 08:45:09 -0400

Michael: I guess the advantage of Tika is that it can provide us today
with one single interface to import virtually any kind of file that
can contain text. Nothing is perfect, but it seems it can cover about
80% of our importing needs. It can also give us a formatted XML or
XHTML document, which makes formatting easier. Someone may have turned
it down a few years ago, for legitimate reasons, but the fact remains
that it seems like a mature product today.
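
As a rough sketch of what we could do with that XHTML, here is a small
SAX handler that keeps only the structure (headings and paragraphs) and
ignores styling. It uses only the JDK's own SAX classes. The class name,
the method names, the list of tags and the sample markup are all made up
for illustration; I am assuming Tika's XHTML would look roughly like this
sample, which still needs to be checked against real output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StructureSketch {

    // Elements treated as structural blocks; this choice is an assumption.
    private static final Set<String> BLOCKS = Set.of("h1", "h2", "h3", "p");

    /** Collects one "tag: text" entry per structural block, ignoring inline styling. */
    public static class BlockHandler extends DefaultHandler {
        public final List<String> blocks = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String current; // block currently being read, or null outside one

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (BLOCKS.contains(qName)) {
                current = qName;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text inside inline elements such as <b> is kept, the tag itself is dropped.
            if (current != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals(current)) {
                blocks.add(current + ": " + text.toString().trim());
                current = null;
            }
        }
    }

    /** Parses an XHTML string and returns its structural blocks in document order. */
    public static List<String> extract(String xhtml) throws Exception {
        BlockHandler handler = new BlockHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.blocks;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample, standing in for what an extractor might emit.
        String sample = "<html><body><h1>Chapter 1</h1>"
                + "<p>First <b>bold</b> paragraph.</p>"
                + "<p>Second paragraph.</p></body></html>";
        extract(sample).forEach(System.out::println);
    }
}
```

Since Tika's parsers write to a SAX ContentHandler, a handler along
these lines could probably be passed to it directly instead of going
through a string first, but I have not tried that yet.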

About PDF documents: if they were not created with accessibility in
mind, they can be a bitch to work with. I have run across this issue
before. Some people deliberately create them to be difficult to
extract from, to protect their copyright. This is another issue that
users should be aware of.

François.

On Fri, Jul 20, 2012 at 4:12 AM, Michael Whapples <mwhapples@xxxxxxx> wrote:
> Yes I did mean structure when I said formatting earlier.
>
> As I remember, PDF was a big problem for structure with Tika, but I believe
> there were problems with other formats too.
>
> Maybe things have improved; I hope so.
>
> It is worth noting that PDF is generally quite bad for extracting document
> structure, particularly if the PDF is untagged. PDF is quite a beast:
> there are some parts of the specification which only seem to be supported by
> Adobe Reader, but those parts probably don't really concern us as they
> aren't too relevant for paper-based documents (e.g. embedding Flash content
> and videos into a PDF).
>
> Michael Whapples
> On 20/07/2012 04:31, John Gardner wrote:
>>
>> Hello all, we should distinguish between formatting and structure.  We
>> need to capture the structure - which includes paragraphs, headings, tables,
>> etc.  But we don't need to capture the formatting of those things - whether
>> the paragraph is indented or double spaced, whether the heading is bold or
>> centered. Or...  From the on-going conversation, it isn't clear to me that
>> we are even capturing structure.  I hope I am misunderstanding.
>>
>> Thanks.
>> John G
>>
>>
>>
>> -----Original Message-----
>> From: brailleblaster-bounce@xxxxxxxxxxxxx
>> [mailto:brailleblaster-bounce@xxxxxxxxxxxxx] On Behalf Of François Ouellette
>> Sent: Thursday, July 19, 2012 3:25 PM
>> To: brailleblaster@xxxxxxxxxxxxx
>> Subject: [brailleblaster] Re: Compiled
>>
>> I don't know what was there at the beginning, but it looks like it has
>> improved over time, judging by the notes from the consecutive releases.
>> Again, it is not a content formatter, it is a content extractor! But we
>> can get XML or XHTML from a file through SAX classes and decide on the
>> resulting format. I am following up on this. Currently we only get
>> unformatted text, and it is a starting point.
>>
>> François
>>
>> On Thu, Jul 19, 2012 at 5:40 PM, Michael Whapples <mwhapples@xxxxxxx>
>> wrote:
>>>
>>> Hello,
>>> I remember John Gardner mentioning Tika near the beginning of the
>>> Brailleblaster project, but at that time we concluded the formatting
>>> from it was not really good enough. Has it improved?
>>>
>>> Michael Whapples
>>> On 19/07/2012 22:17, François Ouellette wrote:
>>>>
>>>> John: Exactly! Transformation should be a breeze with the sem
>>>> statements. I will be sure to follow up.
>>>>
>>>> François.
>>>>
>>>> On Thu, Jul 19, 2012 at 3:37 PM, John J. Boyer
>>>> <john.boyer@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Francois,
>>>>>
>>>>> It is very desirable to get XML output from Tika. liblouisutdml may
>>>>> already have a .sem file to handle it. If not, one can be created
>>>>> easily.
>>>>>
>>>>> John
>>>>>
>>>>> On Thu, Jul 19, 2012 at 03:03:41PM -0400, François Ouellette wrote:
>>>>>>
>>>>>> (follow-up on previous email)
>>>>>> Vic: it seems like we can produce formatted XML or HTML from the
>>>>>> extraction, in which case we could retrieve the main formatting
>>>>>> elements and replicate them in BB. Let me check on this.
>>>>>>
>>>>>> François.
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 12:26 PM, Vic Beckley
>>>>>> <vic.beckley3@xxxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> John and François,
>>>>>>>
>>>>>>> I got it to compile. I opened a Word 2010 document with it. It
>>>>>>> seemed the format of the text was missing. I don't think the
>>>>>>> paragraphs were still intact.
>>>>>>>
>>>>>>> I will do more testing later. I am a little under the weather
>>>>>>> today and I think I am going to go rest now. More later. Looks
>>>>>>> good so far.
>>>>>>>
>>>>>>>
>>>>>>> Best regards from Ohio, U.S.A.,
>>>>>>>
>>>>>>> Vic
>>>>>>> E-mail: vic.beckley3@xxxxxxxxx
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> John J. Boyer; President, Chief Software Developer Abilitiessoft,
>>>>> Inc.
>>>>> http://www.abilitiessoft.com
>>>>> Madison, Wisconsin USA
>>>>> Developing software for people with disabilities
>>>>>
>>>>>
>>>
>>
>>
>
>
