[bksvol-discuss] Re: What The New Converter Does

  • From: "Julia" <julia.kulak@xxxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Sun, 5 Apr 2009 23:05:05 -0400

Thanks Jake and Pavi, that really cleared things up.
Julia
----- Original Message ----- From: "Cindy Rosenthal" <popularplace@xxxxxxxxx>
To: <bksvol-discuss@xxxxxxxxxxxxx>
Cc: "Pavi Mehta" <pavim@xxxxxxxxxxxx>
Sent: Friday, April 03, 2009 10:29 PM
Subject: [bksvol-discuss] Re: What The New Converter Does



A big thank you, Jake and Pavi, for that clear explanation, and to Engineering for the new stripper. I'm amazed and impressed.

Cindy
WISH LIST (CALLED REQUESTED ADDITIONS TO THE BOOKSHARE COLLECTION)IS AVAILABLE AT
http://www.friendsofbookshare.org/wish_list/wish_list.htm
www.lljfm.net/bookshare/home.htm

A LIST OF BOOKS CURRENTLY BEING SCANNED IS AVAILABLE AT
http://www.friendsofbookshare.org/
www.lljfm.net/bookshare/home.htm


--- On Fri, 4/3/09, Pavi Mehta <pavim@xxxxxxxxxxxx> wrote:

From: Pavi Mehta <pavim@xxxxxxxxxxxx>
Subject: [bksvol-discuss] Re: What The New Converter Does
To: bksvol-discuss@xxxxxxxxxxxxx
Date: Friday, April 3, 2009, 2:09 PM
Hi Mayrie,

Will check in with engineering on this and get back to you on the list.

Thanks!

Pavi

-discuss-bounce@xxxxxxxxxxxxx
[mailto:bksvol-discuss-bounce@xxxxxxxxxxxxx]
On Behalf Of Mayrie ReNae
Sent: Thursday, April 02, 2009 4:28 PM
To: bksvol-discuss@xxxxxxxxxxxxx
Subject: [bksvol-discuss] Re: What The New Converter Does

Hi Pavi!

    Thank you!  And please thank Jake vociferously for all of us!

    I do still have one question regarding page numbers. Can you please find out the answer for me? I understand completely how and where page numbers appear in DAISY and BRF files. What I want to know is where the page numbers will appear in the XML or HTML files. Will they appear where they appeared in the text of the RTF file? I assume, though that has gotten me into big trouble recently, that since the XML and HTML files were not mentioned in Jake's write-up, the page numbers in the XML and HTML files will appear where they were placed on the page in the RTF file. Which would mean, am I right, that page numbers at the bottoms of pages will appear there in the XML and HTML files?

    Thank you for this final bit of information.

Mayrie



-----Original Message-----
From: bksvol-discuss-bounce@xxxxxxxxxxxxx
[mailto:bksvol-discuss-bounce@xxxxxxxxxxxxx]
On Behalf Of Pavi Mehta
Sent: Thursday, April 02, 2009 4:01 PM
To: bksvol-discuss@xxxxxxxxxxxxx
Subject: [bksvol-discuss] What The New Converter Does

Hi Folks,



Exciting news! At our request, Bookshare's Jake Brownell wrote up a detailed explanation of how the new converter works and how it interfaces with our volunteer work. I distilled his explanation into the three guidelines below (we will include these in the volunteer manual shortly and also let volunteers who are not on this list know). Jake's terrific write-up is included in its entirety at the end of this mail. (Thank you, Jake!)



Guidelines for Chapter Headings & Page Numbers



1. The new converter removes the unwanted running headers and footers (author name, book title, chapter title) at the top or bottom of the page for you. Caveat: The converter is an improvement on the old stripper but occasionally may leave a header in or remove a legitimate piece of text. If you come across an occurrence of "unintended stripping," please report it to us.

2. Ensuring that chapter headings are in a font size bigger than the rest of the text helps the new converter recognize them more easily. You no longer have to do anything beyond that to protect chapter headings.

3. You do not have to move page numbers from the bottom to the top. The converter places recognized page numbers at the top of each page for you.





Jake's Explanation of the New Converter:



We have a new RTF converter as part of the new platform we launched early this year. Along with the RTF converter is a new tool designed to process running headers, running footers, page numbers and chapters. The term "chapters" in this context really means any generic section of a book, but since most books use chapters, for the sake of discussion we'll use that term. The terms "running headers" and "running footers" refer to text in books that is repeated at the top or bottom of nearly all pages and is usually ignored during the reading of a standard print book. Examples of running headers and running footers are the book title, the author's name or a chapter title. Running headers are much more common in practice than running footers.



The tool attempts to do several things. It attempts to identify and remove running headers and running footers from the text, so that this information is not repeated on every page by TTS engines, interrupting the flow of the book. The tool also attempts to identify page numbers on each page and handle them appropriately for each format, DAISY or BRF. (For DAISY this means placing the page number in the special pagenum tag that tells a DAISY player that the enclosed text is a page number. It's that tag that allows a DAISY player to skip to different pages. For BRF books this means placing the page number at the end of a line of dashes so that it can be easily located.)
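
As a rough illustration of the two page-number treatments described above, here is a minimal Python sketch. It is not Bookshare's code: the function names, the 40-character line width, and the "page_N" id format are assumptions made for the example. Only the pagenum tag and the line of dashes ending in the page number come from Jake's description.

```python
from xml.sax.saxutils import escape

# Assumed width of a BRF line; the write-up does not state the real value.
BRF_LINE_WIDTH = 40


def daisy_pagenum(page_number: str) -> str:
    """Wrap a page number in a DTBook-style pagenum element.

    The id format ("page_N") is a hypothetical choice for this sketch and
    assumes a simple numeric or roman-numeral page number.
    """
    text = escape(page_number)
    return '<pagenum page="normal" id="page_{0}">{0}</pagenum>'.format(text)


def brf_page_line(page_number: str, width: int = BRF_LINE_WIDTH) -> str:
    """Build a line of dashes that ends with the page number."""
    dashes = "-" * max(width - len(page_number), 0)
    return dashes + page_number


if __name__ == "__main__":
    print(daisy_pagenum("12"))
    print(brf_page_line("12"))
```

A real BRF file would carry the number in braille notation; the sketch keeps plain digits so the layout is easy to see.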



Does all of this sound familiar? Veteran volunteers might recognize the above steps as those that our old, now defunct tool used to do. Our new tool does each of them more accurately, producing much better results.



The new tool also attempts to locate chapters within a book. If the tool can reasonably identify some sort of consistent divisions throughout a book, it will make appropriate DAISY levels and headings. Note, don't confuse "headings" with "headers." Headings are similar to those found on web pages. This additional markup can help with navigation.



What does all of this imply?



The old tool was overzealous in the removal of text it considered to be a running header or running footer. The new tool is more conservative about what it should remove. For example, the old tool might have considered the text at the start of a chapter to be a running header, e.g. "Chapter 10" or "Chapter 15." Some volunteers elected to "protect" that text by placing a dummy header above it such as "***". This should no longer be necessary with the new tool. In fact, the new tool in the best of circumstances will recognize "Chapter 10" as a new chapter and mark it as such.



Is it still more accurate to strip running headers and footers by hand?



The best result is to remove the running headers and running footers by hand, but this is a time-consuming process. It's also a time-consuming process to ensure the headers match exactly. The new tool will allow minor variations in a running header and footer, but since we wanted to err on the side of caution, some headers or footers might be left in the text.
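
To make the idea of "minor variations" and "erring on the side of caution" concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not the converter's actual algorithm: it looks at the first line of each page only, ignores digits, uses Python's difflib for fuzzy matching, and the two thresholds (min_share, min_ratio) are invented for the example.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize(line):
    """Lower-case a line and drop digits so '2 My Book' matches 'My Book 3'."""
    without_digits = re.sub(r"\d+", "", line)
    return re.sub(r"\s+", " ", without_digits).strip().lower()


def similar(a, b, min_ratio):
    """True when two normalized lines differ only slightly."""
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


def find_running_header(pages, min_share=0.6, min_ratio=0.9):
    """Return the normalized running header, or None when unsure.

    pages is a list of pages, each a list of text lines. Both thresholds
    are hypothetical values chosen for this sketch.
    """
    first_lines = [normalize(page[0]) for page in pages if page]
    first_lines = [line for line in first_lines if line]
    if not first_lines:
        return None
    candidate = Counter(first_lines).most_common(1)[0][0]
    matches = sum(1 for line in first_lines if similar(candidate, line, min_ratio))
    # Conservative: only call it a header if most pages carry it.
    if matches / len(first_lines) >= min_share:
        return candidate
    return None


def strip_running_header(pages, min_share=0.6, min_ratio=0.9):
    """Remove the first line of each page that matches the running header."""
    header = find_running_header(pages, min_share, min_ratio)
    if header is None:
        return pages
    cleaned = []
    for page in pages:
        if page and similar(header, normalize(page[0]), min_ratio):
            cleaned.append(page[1:])
        else:
            cleaned.append(page)
    return cleaned
```

With high thresholds like these, a page whose top line differs too much from the others keeps that line, which is the trade-off described above: a few headers may be left in, but legitimate text is less likely to be removed.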



How are chapters identified among all the text?



We use a few different techniques and may add more in the future. Right now the easiest way to identify chapters is when the text of the chapter heading is larger than the rest of the text. For example, the normal text might be 12 pt while the chapter text is in 16 pt. Other factors can affect the identification, but that's an easy rule of thumb.
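
The font-size rule of thumb can be sketched in a few lines of Python. This is a simplified illustration, not the converter's real logic: it assumes each paragraph arrives as a (text, font size in points) pair, treats the size covering the most characters as the body size, and uses an invented threshold (min_ratio) for "larger than the rest of the text."

```python
from collections import Counter


def find_chapter_headings(paragraphs, min_ratio=1.2):
    """Return the indexes of paragraphs that look like chapter headings.

    paragraphs is a list of (text, font_size_pt) pairs. min_ratio is a
    hypothetical threshold: a heading must be at least 20% larger than
    the body text.
    """
    if not paragraphs:
        return []
    # Weight each font size by the amount of text set in it, so the
    # body size wins even when a book has many short headings.
    characters_per_size = Counter()
    for text, size in paragraphs:
        characters_per_size[size] += len(text)
    body_size = characters_per_size.most_common(1)[0][0]
    return [
        index
        for index, (text, size) in enumerate(paragraphs)
        if text.strip() and size >= body_size * min_ratio
    ]


if __name__ == "__main__":
    sample = [
        ("Chapter 10", 16),
        ("It was a dark and stormy night.", 12),
        ("Chapter 11", 16),
        ("Morning came.", 12),
    ]
    print(find_chapter_headings(sample))
```

Under these assumptions, 16 pt headings over 12 pt body text are picked up, while a heading left at the body size would be missed, which is why guideline 2 asks for a larger font.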



Some books have page numbers at the top and others at the bottom; does it matter where they are in the scan?



The easy answer is no, it does not. When processing a book we look at text between two page breaks. When a page number is located either at the top or bottom of the page, the text between the page breaks is associated with that number. When generating DAISY and BRF we place the associated page number in the correct spot, which for both formats is at the beginning of the page. So effectively if the page number is at the bottom of the page, we move it to the top.
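
The page-break logic described above might look something like the following Python sketch. It is a toy version under stated assumptions: pages are separated by a form-feed character rather than real RTF page breaks, a page number is a line containing only digits, and all names are made up for the example.

```python
import re

# Assumed page-break marker for this sketch; the real converter reads RTF.
PAGE_BREAK = "\f"
# Assumed page-number shape: a line holding nothing but digits.
PAGE_NUMBER = re.compile(r"^\s*(\d+)\s*$")


def split_page(page_text):
    """Return (page_number or None, body_lines) for one page of text."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    if not lines:
        return None, []
    # Check the top line first, then the bottom line.
    for index in (0, -1):
        match = PAGE_NUMBER.match(lines[index])
        if match:
            body = lines[1:] if index == 0 else lines[:-1]
            return match.group(1), body
    return None, lines


def associate_page_numbers(book_text):
    """Pair each page's text with its number, wherever the number sat."""
    return [split_page(page_text) for page_text in book_text.split(PAGE_BREAK)]


if __name__ == "__main__":
    book = "Chapter 10\nIt was dark.\n12\fThe rain fell.\n13\f14\nMorning came."
    for number, body in associate_page_numbers(book):
        # The number is emitted first, i.e. at the top of the page.
        print(number, body)
```

Because the number is stored separately from the body text, the output step can always write it at the beginning of the page, regardless of whether the scan had it at the top or the bottom.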







All good things,



Pavi Mehta

Volunteer Coordinator, Bookshare



Benetech

480 S. California Ave., Suite 201

Palo Alto, CA 94306-1609 USA

Phone:  +1 650 644-3459



pavim@xxxxxxxxxxxx



www.benetech.org



The Benetech Initiative - Technology Serving Humanity

A Nonprofit Organization




To unsubscribe from this list send a blank Email to
bksvol-discuss-request@xxxxxxxxxxxxx
put the word 'unsubscribe' by itself in the subject line.  To get a list of available commands, put the word 'help' by itself in the subject line.
