[bksvol-discuss] Re: What The New Converter Does

  • From: "Bob" <rwiley@xxxxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Thu, 2 Apr 2009 18:15:41 -0500

Excellent. Thanks Pavi.

This goes into the keepers file. Tell Jake we'll hold him personally 
responsible, and we know where he lives.

Where does Jake live?

Bob
  ----- Original Message ----- 
  From: Pavi Mehta 
  To: bksvol-discuss@xxxxxxxxxxxxx 
  Sent: Thursday, April 02, 2009 6:01 PM
  Subject: [bksvol-discuss] What The New Converter Does


  Hi Folks,

   

  Exciting news! At our request, Bookshare's Jake Brownell wrote up a detailed 
explanation of how the new converter works and how it interfaces with our 
volunteer work. I distilled his explanation into the three guidelines below (we 
will include these in the volunteer manual shortly and also let volunteers who 
are not on this list know). Jake's terrific write-up is included in its 
entirety at the end of this mail. (Thank you, Jake!)

   

  Guidelines for Chapter Headings & Page Numbers

   

  1.       The new converter removes the unwanted running headers and footers 
(author name, book title, chapter title) at the top or bottom of the page for 
you. Caveat: The converter is an improvement on the old stripper, but it may 
occasionally leave a header in or remove a legitimate piece of text. If you come 
across an occurrence of "unintended stripping," please report it to us.

   

  2.       Ensuring that chapter headings are in a font size bigger than the 
rest of the text helps the new converter recognize them more easily. You no 
longer have to do anything beyond that to protect chapter headings.

   

  3.        You do not have to move page numbers from the bottom to the top. 
The converter places recognized page numbers at the top of each page for you.

   

   

  Jake's Explanation of the New Converter:

   

  We have a new RTF converter as part of the new platform we launched early this 
year. Along with the RTF converter is a new tool designed to process running 
headers, running footers, page numbers and chapters. The term chapters in this 
context is really any generic section of a book, but since most books use 
chapters, for the sake of discussion we'll use that term. The terms running 
headers and running footers refer to text in books that is repeated at the 
top or bottom of nearly all pages and is usually ignored during the 
reading of a standard print book. Examples of running headers and running 
footers are the book title, the author's name or a chapter title. Running 
headers are much more common in practice than running footers.

   

  The tool attempts to do several things. It attempts to identify and remove 
running headers and running footers from the text, so that this information is 
not repeated on every page by TTS engines, interrupting the flow of the book. 
The tool also attempts to identify page numbers on each page and handle them 
appropriately for each format, DAISY or BRF. (For DAISY this means placing the 
page number in the special pagenum tag that tells a DAISY player that the 
enclosed text is a page number. It's that tag that allows a DAISY player to 
skip to different pages. For BRF books this means placing the page number at 
the end of a line of dashes so that it can be easily located.)
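
  As a rough illustration of the two output styles described above, here is a minimal Python sketch. It is not the converter's actual code. The function names, the id scheme, the page="normal" attribute value and the 40-character line width are all assumptions made for the example, and the BRF line uses plain print digits rather than real braille number signs.

```python
from xml.sax.saxutils import escape

# Assumed line width; the write-up does not say how long the dash line is.
BRF_LINE_WIDTH = 40


def daisy_pagenum(page_label: str) -> str:
    """Wrap a page label in a pagenum element (id scheme is hypothetical)."""
    label = escape(page_label)
    return f'<pagenum page="normal" id="page_{label}">{label}</pagenum>'


def brf_page_line(page_label: str, width: int = BRF_LINE_WIDTH) -> str:
    """Fill a line with dashes and put the page label at the end of it."""
    dashes = "-" * max(width - len(page_label), 0)
    return dashes + page_label


if __name__ == "__main__":
    print(daisy_pagenum("12"))
    print(brf_page_line("12"))
```

  The point of the sketch is only that the same recognized page number ends up in a different wrapper depending on the target format.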

   

  Does all of this sound familiar? Veteran volunteers might recognize the above 
steps as those that our old, now defunct tool used to do. Our new tool does 
each of them more accurately, producing much better results.

   

  The new tool also attempts to locate chapters within a book. If the tool can 
reasonably identify some sort of consistent divisions throughout a book, it 
will make appropriate DAISY levels and headings. Note: don't confuse "headings" 
with "headers." Headings are similar to those found on web pages. This 
additional markup can help with navigation.

   

  What does all of this imply?

   

  The old tool was overzealous in the removal of text it considered to be a 
running header or running footer. The new tool is more conservative about what 
it should remove. For example, the old tool might have considered the text at 
the start of a chapter to be a running header, e.g. "Chapter 10" or "Chapter 
15." Some volunteers elected to "protect" that text by placing a dummy header 
above it such as "***". This should no longer be necessary with the new tool. 
In fact, the new tool in the best of circumstances will recognize "Chapter 10" 
as a new chapter and mark it as such.

   

  Is it still more accurate to strip running headers and footers by hand?

   

  The best result comes from removing the running headers and running footers by 
hand, but this is a time-consuming process. It's also time-consuming to 
ensure the headers match exactly. The new tool will allow minor variations in a 
running header or footer, but since we wanted to err on the side of caution, 
some headers or footers might be left in the text.
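
  To make the idea of "minor variations" and caution concrete, here is a hedged Python sketch of one way such a check could work. It is an illustration only, not how the converter is implemented. It looks just at the first line of each page, and the function names and both thresholds (min_share, min_similarity) are invented for the example.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Lowercase, collapse spaces, and mask digit runs so small differences fade."""
    return re.sub(r"\d+", "#", " ".join(line.lower().split()))


def strip_running_headers(pages, min_share=0.6, min_similarity=0.9):
    """Drop a page's first line only if it closely matches a line topping most pages.

    pages is a list of pages, each page a list of text lines.
    Both thresholds are illustrative guesses, not values from the converter.
    """
    first_lines = [normalize(page[0]) for page in pages if page]
    if not first_lines:
        return [list(page) for page in pages]

    # The most frequent first line is the header candidate.
    candidate = Counter(first_lines).most_common(1)[0][0]

    def looks_like_header(page):
        if not page:
            return False
        score = SequenceMatcher(None, normalize(page[0]), candidate).ratio()
        return score >= min_similarity

    matches = [looks_like_header(page) for page in pages]

    # Be conservative: if the candidate is not on most pages, remove nothing.
    if sum(matches) / len(pages) < min_share:
        return [list(page) for page in pages]

    return [list(page[1:]) if hit else list(page)
            for page, hit in zip(pages, matches)]


if __name__ == "__main__":
    book = [
        ["Chapter 1", "It begins."],
        ["THE LONG ROAD", "More text."],
        ["THE LONG R0AD", "A scanning error in the header."],
        ["THE LONG ROAD", "Still more text."],
    ]
    for page in strip_running_headers(book):
        print(page)
```

  Raising min_similarity makes the check stricter, which leaves more headers in the text but lowers the risk of removing a legitimate line such as "Chapter 1".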

   

  How are chapters identified among all the text?

   

  We use a few different techniques and may add more in the future. Right now 
the easiest way to identify chapters is when the text of the chapter heading is 
larger than the rest of the text. For example, the normal text might be 12 pt 
while the chapter heading is in 16 pt. Other factors can affect the 
identification, but that's an easy rule of thumb.
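
  The font-size rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Paragraph type, the function name and the min_ratio threshold are hypothetical, and the real tool weighs other factors that are not shown here.

```python
from collections import Counter
from typing import List, NamedTuple


class Paragraph(NamedTuple):
    text: str
    font_size: float  # in points


def find_chapter_headings(paragraphs: List[Paragraph], min_ratio: float = 1.2) -> List[str]:
    """Return the text of paragraphs set clearly larger than the body text.

    min_ratio is an assumed threshold, chosen only for this example.
    """
    if not paragraphs:
        return []

    # Treat the size that covers the most characters as the body size,
    # so a handful of short headings cannot be mistaken for it.
    weight_by_size = Counter()
    for para in paragraphs:
        weight_by_size[para.font_size] += len(para.text)
    body_size = weight_by_size.most_common(1)[0][0]

    return [para.text for para in paragraphs
            if para.font_size >= body_size * min_ratio]


if __name__ == "__main__":
    sample = [
        Paragraph("Chapter 10", 16),
        Paragraph("The rain had not stopped for three days.", 12),
        Paragraph("Chapter 11", 16),
        Paragraph("By morning the river had risen again.", 12),
    ]
    print(find_chapter_headings(sample))
```

  With 12 pt body text and a ratio of 1.2, anything at 14.4 pt or above counts as a heading, so the 16 pt lines in the sample are picked out.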

   

  Some books have page numbers at the top and others at the bottom; does it 
matter where they are in the scan?

   

  The easy answer is no, it does not. When processing a book we look at text 
between two page breaks. When a page number is located either at the top or 
bottom of the page, the text between the page breaks is associated with that 
number. When generating DAISY and BRF we place the associated page number in 
the correct spot, which for both formats is at the beginning of the page. So 
effectively if the page number is at the bottom of the page, we move it to the 
top.
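
  The page-number handling described above can be sketched as follows. This is a simplified illustration, not the converter's code. It assumes that a form-feed character marks each page break, that a page number is a line containing only digits, and that a "[Page N]" line is an acceptable stand-in for the format-specific output. All function names are hypothetical.

```python
import re

PAGE_BREAK = "\f"  # assumed page-break marker for this example
PAGE_NUMBER = re.compile(r"^\s*(\d+)\s*$")  # a line holding only digits


def split_pages(text):
    """Return a (page_number_or_None, body_lines) pair for each page."""
    pages = []
    for raw_page in text.split(PAGE_BREAK):
        lines = [line for line in raw_page.splitlines() if line.strip()]
        number = None
        if lines:
            # Look at the top line first, then the bottom line.
            for index in (0, -1):
                match = PAGE_NUMBER.match(lines[index])
                if match:
                    number = match.group(1)
                    del lines[index]
                    break
        pages.append((number, lines))
    return pages


def render(pages):
    """Emit each page with its number, when one was found, at the top."""
    out = []
    for number, lines in pages:
        if number is not None:
            out.append(f"[Page {number}]")
        out.extend(lines)
    return "\n".join(out)


if __name__ == "__main__":
    scan = "1\nFirst page text.\fSecond page text.\n2\fNo number here."
    print(render(split_pages(scan)))
```

  Because the number is attached to the text between two page breaks before anything is written out, it does not matter whether the scan had it at the top or the bottom.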

   

   

   

  All good things,

   

  Pavi Mehta

  Volunteer Coordinator, Bookshare

   

  Benetech 

  480 S. California Ave., Suite 201

  Palo Alto, CA 94306-1609 USA

  Phone:  +1 650 644-3459

   

  pavim@xxxxxxxxxxxx

   

  www.benetech.org

   

  The Benetech Initiative - Technology Serving Humanity 

  A Nonprofit Organization

   
