[bksvol-discuss] What The New Converter Does

  • From: "Pavi Mehta" <pavim@xxxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Thu, 2 Apr 2009 16:01:06 -0700

Hi Folks,

 

Exciting news! At our request, Bookshare's Jake Brownell wrote up a
detailed explanation of how the new converter works and how it
interfaces with our volunteer work. I distilled his explanation into the
three guidelines below (we will include these in the volunteer manual
shortly and also let volunteers who are not on this list know). Jake's
terrific write-up is included in its entirety at the end of this mail.
(Thank you, Jake!)

 

Guidelines for Chapter Headings & Page Numbers

 

1.       The new converter removes the unwanted running headers and
footers (author name, book title, chapter title) at the top or bottom of
the page for you. Caveat: The converter is an improvement on the stripper but
occasionally may leave a header in or remove a legitimate piece of text.
If you come across an occurrence of "unintended stripping" please report
it to us.

 

2.       Ensuring that chapter headings are in a font size bigger than
the rest of the text helps the new converter recognize them more easily.
You no longer have to do anything beyond that to protect chapter
headings.

 

3.        You do not have to move page numbers from the bottom to the
top. The converter places recognized page numbers at the top of each
page for you.

 

 

Jake's Explanation of the New Converter:

 

We have a new RTF converter as part of the new platform we launched early
this year. Along with the RTF converter is a new tool designed to
process running headers, running footers, page numbers and chapters. The
term chapters in this context is really any generic section of a book,
but since most books use chapters, for the sake of discussion we'll use
that term. The terms running headers and running footers refer to text
in books that repeats at the top or bottom of nearly all pages
and is usually ignored when reading a standard print
book. Examples of running headers and running footers are the book
title, the author's name or a chapter title. Running headers are much
more common in practice than running footers.

 

The tool attempts to do several things. It attempts to identify and
remove running headers and running footers from the text, so that this
information is not repeated on every page by TTS engines, interrupting
the flow of the book. The tool also attempts to identify page numbers on
each page and handle them appropriately for each format, DAISY or BRF.
(For DAISY this means placing the page number in the special pagenum tag
that tells a DAISY player that the enclosed text is a page number. It's
that tag that allows a DAISY player to skip to different pages. For BRF
books this means placing the page number at the end of a line of dashes
so that it can be easily located.)

 

Does all of this sound familiar? Veteran volunteers might recognize the
above steps as those that our old, now defunct tool used to do. Our new
tool does each of them more accurately, producing much better results.

 

The new tool also attempts to locate chapters within a book. If the tool
can reasonably identify some sort of consistent divisions throughout a
book, it will make appropriate DAISY levels and headings. Note, don't
confuse "headings" with "headers." Headings are similar to those found
on web pages. This additional markup can help with navigation.

 

What does all of this imply?

 

The old tool was overzealous in the removal of text it considered to be
a running header or running footer. The new tool is more conservative
about what it should remove. For example, the old tool might have
considered the text at the start of a chapter to be a running header,
e.g. "Chapter 10" or "Chapter 15." Some volunteers elected to "protect"
that text by placing a dummy header above it such as "***". This should
no longer be necessary with the new tool. In fact, the new tool in the
best of circumstances will recognize "Chapter 10" as a new chapter and
mark it as such.

 

Is it still more accurate to strip running headers and footers by hand?

 

The best result comes from removing the running headers and running
footers by hand, but this is a time-consuming process. It's also
time-consuming to ensure the headers match exactly. The new tool will allow
minor variations in a running header or footer, but since we wanted to
err on the side of caution, some headers or footers might be left in the
text.
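
To give a feel for what "allowing minor variations" can mean, here is a small Python sketch. It is not the converter's actual code; the function names, the normalization rule, and the 40% threshold are all assumptions made up for illustration. It treats the first line of a page as a running header only if, after ignoring case, digits and punctuation, the same text opens a large share of the pages.

```python
import re
from collections import Counter


def normalize(line):
    # Ignore case, digits and punctuation so "THE GREAT BOOK 12" and
    # "The Great Book 13" compare as the same text (assumed rule).
    return re.sub(r"[\d\W_]+", " ", line).strip().lower()


def strip_running_headers(pages, min_share=0.4):
    """pages is a list of pages; each page is a list of text lines.

    min_share is a hypothetical threshold: the share of pages that must
    open with the same normalized text before it counts as a header.
    """
    counts = Counter(normalize(page[0]) for page in pages if page)
    repeated = {
        text for text, n in counts.items()
        if text and n / len(pages) >= min_share
    }
    cleaned = []
    for page in pages:
        if page and normalize(page[0]) in repeated:
            cleaned.append(page[1:])   # drop the running header
        else:
            cleaned.append(list(page))  # leave the page untouched
    return cleaned


pages = [
    ["THE GREAT BOOK 12", "text a"],
    ["The Great Book  13", "text b"],
    ["Chapter 10", "text c"],
    ["The Great Book 15", "text d"],
]
cleaned = strip_running_headers(pages)
```

In this example the "Chapter 10" line survives because it opens only one page in four, which is the cautious behavior described above: when in doubt, the text stays in.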

 

How are chapters identified among all the text?

 

We use a few different techniques and may add more in the future. Right
now the easiest way to identify chapters is when the text of the chapter
heading is larger than the rest of the text. For example the normal
text might be 12 pt while the chapter text is in 16 pt. Other factors
can affect the identification, but that's an easy rule of thumb.
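
The font size rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the converter is written: the input shape (text paired with a point size), the function name, and the 1.2 ratio are hypothetical. The idea is to find the size that carries most of the text, call that the body size, and flag lines that are clearly larger.

```python
from collections import Counter


def find_heading_candidates(lines, min_ratio=1.2):
    """lines is a list of (text, font_size_in_points) pairs.

    min_ratio is an assumed cutoff: a line counts as a heading candidate
    when its size is at least min_ratio times the body size.
    """
    # Weight each size by the number of characters set in it, so the
    # body size is the one that carries the most text.
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    if not weight:
        return []
    body_size = weight.most_common(1)[0][0]
    return [
        text for text, size in lines
        if text.strip() and size >= body_size * min_ratio
    ]


sample = [
    ("Chapter 10", 16),
    ("It was a dark and stormy night.", 12),
    ("The rain fell.", 12),
    ("Chapter 11", 16),
    ("More text follows here.", 12),
]
headings = find_heading_candidates(sample)
```

With 12 pt body text and 16 pt chapter text, as in the example above, both chapter lines are picked up and the body lines are not.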

 

Some books have page numbers at the top and others at the bottom; does
it matter where they are in the scan?

 

The easy answer is no, it does not. When processing a book we look at
text between two page breaks. When a page number is located either at
the top or bottom of the page, the text between the page breaks is
associated with that number. When generating DAISY and BRF we place the
associated page number in the correct spot, which for both formats is at
the beginning of the page. So effectively if the page number is at the
bottom of the page, we move it to the top.
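
The page number handling described above can be pictured with the following Python sketch. It rests on several assumptions that are not from Jake's write-up: pages are separated by a form feed character, a page number is a line containing only digits, the DAISY output is shown as a bare pagenum tag without the attributes a real DAISY file would carry, and the BRF dash line is an arbitrary 30 dashes. The function names are hypothetical.

```python
import re

# Assumed rule: a page number is a line that holds nothing but digits.
PAGE_NUMBER = re.compile(r"^\s*(\d+)\s*$")


def split_pages(text):
    # Assumed page break marker: the form feed character.
    return text.split("\f")


def extract_page_number(page_text):
    """Return (page_number_or_None, remaining_lines) for one page."""
    lines = page_text.splitlines()
    nonblank = [i for i, line in enumerate(lines) if line.strip()]
    if not nonblank:
        return None, lines
    # Look at the first and the last non-blank line of the page.
    for idx in (nonblank[0], nonblank[-1]):
        match = PAGE_NUMBER.match(lines[idx])
        if match:
            return match.group(1), lines[:idx] + lines[idx + 1:]
    return None, lines


def render_page(number, body, fmt):
    """Place the page number at the top, whatever its original position."""
    text = "\n".join(body).strip("\n")
    if number is None:
        return text
    if fmt == "daisy":
        marker = f"<pagenum>{number}</pagenum>"  # simplified tag
    else:
        marker = "-" * 30 + number  # BRF: number at the end of a dash line
    return marker + "\n" + text


number, body = extract_page_number("First line\nSecond line\n\n42\n")
daisy_page = render_page(number, body, "daisy")
brf_page = render_page(number, body, "brf")
```

Here the number 42 sits at the bottom of the scanned page, and both outputs start with it, which is why volunteers do not need to move page numbers by hand.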

 

 

 

All good things,

 

Pavi Mehta

Volunteer Coordinator, Bookshare

 

Benetech 

480 S. California Ave., Suite 201

Palo Alto, CA 94306-1609 USA

Phone:  +1 650 644-3459

 

pavim@xxxxxxxxxxxx

 

www.benetech.org

 

The Benetech Initiative - Technology Serving Humanity 

A Nonprofit Organization

 
