[bksvol-discuss] Re: What The New Converter Does

From: "Pavi Mehta" <pavim@xxxxxxxxxxxx>
To: <bksvol-discuss@xxxxxxxxxxxxx>
Date: Mon, 6 Apr 2009 11:40:03 -0700
Hi Mayrie and All,

Here's Jake's response to your follow-up question on the converter.

"The XML file is the main DAISY file in that it holds the contents of the book. 
It's often called the DTBook. The markup in the DTBook requires that page 
number tags come at the start of pages. So, if our tools successfully recognize 
page numbers they will be moved from the bottom to the top and appear at the 
top of a page in the XML.

It is _possible_ that if page numbers are not recognized in the text, those 
numbers stay in as part of the general text, and thus would remain at the 
bottom. In that case, the page numbers would appear as regular text within a 
paragraph, not as part of a page number tag. 

There are guidelines for producing good DTBooks on the DAISY website at: 
http://www.daisy.org/z3986/structure/SG-DAISY3/index.html#contents
Those guidelines usually describe the preferred way to mark up books, including 
the proper placement of page number tags."
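
As a rough illustration of the two cases Jake describes, here is a minimal Python sketch. The function name page_to_dtbook and the id scheme are made up for this example and are not Bookshare's code; the pagenum element with page="normal" follows the DTBook markup as I understand it.

```python
from xml.sax.saxutils import escape


def page_to_dtbook(paragraphs, page_number=None):
    """Render one page as a DTBook-style fragment (illustrative only).

    paragraphs:  list of paragraph strings for the page
    page_number: the recognized page number as a string, or None
    """
    lines = []
    if page_number is not None:
        # Recognized page number: it goes in a pagenum tag at the START
        # of the page, wherever it sat in the RTF.
        lines.append(
            f'<pagenum page="normal" id="page_{page_number}">'
            f"{escape(page_number)}</pagenum>"
        )
    for text in paragraphs:
        lines.append(f"<p>{escape(text)}</p>")
    return "\n".join(lines)


# Case 1: the number was recognized and moves to the top as a tag.
recognized = page_to_dtbook(["It was a dark night."], page_number="12")

# Case 2: the number was not recognized, so it stays as ordinary
# paragraph text at the bottom of the page.
unrecognized = page_to_dtbook(["It was a dark night. 12"])
```

In the first case the fragment begins with the pagenum tag; in the second there is no pagenum tag at all and the "12" is just part of the paragraph.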

Hope this helps!

Pavi

Pavi Mehta
Volunteer Coordinator, Bookshare

Benetech 
480 S. California Ave., Suite 201
Palo Alto, CA 94306-1609 USA
Phone:  +1 650 644-3459
 
pavim@xxxxxxxxxxxx

www.benetech.org
 
The Benetech Initiative - Technology Serving Humanity 
A Nonprofit Organization




-----Original Message-----
From: bksvol-discuss-bounce@xxxxxxxxxxxxx 
[mailto:bksvol-discuss-bounce@xxxxxxxxxxxxx] On Behalf Of Mayrie ReNae
Sent: Thursday, April 02, 2009 4:28 PM
To: bksvol-discuss@xxxxxxxxxxxxx
Subject: [bksvol-discuss] Re: What The New Converter Does

Hi Pavi!

        Thank you!  And please thank Jake vociferously for all of us!

        I do still have one question regarding page numbers.  Can you please
find out the answer for me?  I understand completely how and where page
numbers appear in daisy and brf files.  What I want to know is where the
page numbers will appear in the xml or html files?  Will they appear where
they appeared in the text of the rtf file?  I assume, though that has gotten
me into big trouble recently, that since the xml and html files were not
mentioned in Jake's write-up, that the page numbers in the xml and html files
will appear where they were placed on the page in the rtf file.  Which would
mean, am I right, that page numbers at the bottoms of pages will appear
there in the xml and html files?

        Thank you for this final bit of information.

Mayrie

 

-----Original Message-----
From: bksvol-discuss-bounce@xxxxxxxxxxxxx
[mailto:bksvol-discuss-bounce@xxxxxxxxxxxxx] On Behalf Of Pavi Mehta
Sent: Thursday, April 02, 2009 4:01 PM
To: bksvol-discuss@xxxxxxxxxxxxx
Subject: [bksvol-discuss] What The New Converter Does

Hi Folks,

 

Exciting news! At our request, Bookshare's Jake Brownell wrote up a detailed
explanation of how the new converter works and how it interfaces with our
volunteer work. I distilled his explanation into the three guidelines below
(we will include these in the volunteer manual shortly and also let
volunteers who are not on this list know). Jake's terrific write-up is
included in its entirety at the end of this mail. (Thank you, Jake!)

 

Guidelines for Chapter Headings & Page Numbers

 

1.       The new converter removes the unwanted running headers and footers
(author name, book title, chapter title) at the top or bottom of the page
for you.
Caveat: The converter is an improvement on the stripper but occasionally may
leave a header in or remove a legitimate piece of text. If you come across
an occurrence of "unintended stripping" please report it to us.

 

2.       Ensuring that chapter headings are in a font size bigger than the
rest of the text helps the new converter recognize them more easily. You no
longer have to do anything beyond that to protect chapter headings.

 

3.        You do not have to move page numbers from the bottom to the top.
The converter places recognized page numbers at the top of each page for
you.

 

 

Jake's Explanation of the New Converter:

 

We have a new RTF converter as part of the new platform we launched early
this year. Along with the RTF converter is a new tool designed to process
running headers, running footers, page numbers and chapters. The term
chapters in this context is really any generic section of a book, but since
most books use chapters, for the sake of discussion we'll use that term. The
terms running headers and running footers refer to text in books that is
repetitious at the top or bottom of nearly all pages and is something
usually ignored during the reading of a standard print book. Examples of
running headers and running footers are the book title, the author's name or
a chapter title. Running headers are much more common in practice than
running footers.

 

The tool attempts to do several things. It attempts to identify and remove
running headers and running footers from the text, so that this information
is not repeated on every page by TTS engines, interrupting the flow of the
book. The tool also attempts to identify page numbers on each page and
handle them appropriately for each format, DAISY or BRF. (For DAISY this
means placing the page number in the special pagenum tag that tells a DAISY
player that the enclosed text is a page number. It's that tag that allows a
DAISY player to skip to different pages. For BRF books this means placing
the page number at the end of a line of dashes so that it can be easily
located.)
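
For the BRF side, the page marker described above can be sketched in Python as follows. This is only an illustration: the function name brf_page_line and the 40-cell line width are assumptions, and the sketch writes plain digits, whereas a real BRF file would encode the number in braille ASCII.

```python
def brf_page_line(page_number, width=40):
    """Return a line of dashes with the page number at the end.

    width is an assumed braille line length, not a documented
    setting of the converter.
    """
    label = str(page_number)
    # Fill the rest of the line with dashes so the number sits
    # flush against the right-hand end and is easy to locate.
    dashes = "-" * (width - len(label))
    return dashes + label


line = brf_page_line(12)
```

The result is always one full line wide, starts with dashes, and ends with the page number.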

 

Does all of this sound familiar? Veteran volunteers might recognize the
above steps as those that our old, now defunct tool used to do. Our new tool
does each of them more accurately, producing much better results.

 

The new tool also attempts to locate chapters within a book. If the tool can
reasonably identify some sort of consistent divisions throughout a book, it
will make appropriate DAISY levels and headings. Note: don't confuse
"headings" with "headers." Headings are similar to those found on web pages.
This additional markup can help with navigation.

 

What does all of this imply?

 

The old tool was overzealous in the removal of text it considered to be a
running header or running footer. The new tool is more conservative about
what it should remove. For example, the old tool might have considered the
text at the start of a chapter to be a running header, e.g. "Chapter 10" or
"Chapter 15." Some volunteers elected to "protect" that text by placing a
dummy header above it such as "***". This should no longer be necessary with
the new tool. In fact, the new tool in the best of circumstances will
recognize "Chapter 10" as a new chapter and mark it as such.

 

Is it still more accurate to strip running headers and footers by hand?

 

The best result is to remove the running headers and running footers by
hand, but this is a time-consuming process. It's also a time-consuming
process to ensure the headers match exactly. The new tool will allow minor
variations in a running header and footer, but since we wanted to err on the
side of caution, some headers or footers might be left in the text.
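
To give a feel for what "allowing minor variations" while staying cautious could look like, here is a minimal Python sketch. It is not the converter's actual algorithm: the function names, the 0.9 similarity threshold, and the 60% share of pages are all assumptions chosen for illustration, and it uses Python's standard difflib for the fuzzy comparison.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    # Drop digits so "12 THE GREAT BOOK" and "14 THE GREAT BOOK" compare equal.
    return re.sub(r"\d+", "", line).strip().lower()


def similar(a, b, threshold):
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_running_header(pages, threshold=0.9, min_share=0.6):
    """Return the normalized header text, or None if nothing is repeated enough.

    pages is a list of pages, each a list of text lines.
    """
    first_lines = [normalize(page[0]) for page in pages if page]
    best, best_count = None, 0
    for candidate in first_lines:
        if not candidate:
            continue
        count = sum(1 for other in first_lines if similar(candidate, other, threshold))
        if count > best_count:
            best, best_count = candidate, count
    # Conservative rule: only call it a running header if it tops most pages.
    if best is not None and best_count >= min_share * len(first_lines):
        return best
    return None


def strip_running_header(pages, threshold=0.9, min_share=0.6):
    header = find_running_header(pages, threshold, min_share)
    cleaned = []
    for page in pages:
        if header and page and similar(header, normalize(page[0]), threshold):
            cleaned.append(list(page[1:]))  # drop the matching first line
        else:
            cleaned.append(list(page))      # leave the page untouched
    return cleaned


pages = [
    ["THE GREAT BOOK", "text a"],
    ["THE GREAT BOOK", "text b"],
    ["THE GREAT BOOK.", "text c"],   # minor variation, e.g. a stray period
    ["Chapter 10", "text d"],        # legitimate text, must be kept
]
cleaned = strip_running_header(pages)
```

With a high threshold, a slightly different header is still removed while "Chapter 10" is kept. Raising the threshold further makes the sketch more cautious, at the cost of leaving more headers in the text.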

 

How are chapters identified among all the text?

 

We use a few different techniques and may add more in the future. Right now
the easiest way to identify chapters is when the text of the heading is
slightly larger than the rest of the text. For example the normal text might
be 12 pt while the chapter text is in 16 pt. Other factors can affect the
identification, but that's an easy rule of thumb.
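
The font-size rule of thumb can be sketched in Python like this. It is a simplified illustration, not the converter's logic: the function name, the (text, font size) input shape, and the 1.2 ratio are assumptions, and the real tool weighs other factors as well.

```python
from collections import Counter


def find_chapter_headings(paragraphs, min_ratio=1.2):
    """Return the texts whose font size is clearly larger than the body text.

    paragraphs: list of (text, font_size_in_points) pairs.
    min_ratio:  assumed cutoff; 16 pt against a 12 pt body is a ratio of 1.33.
    """
    if not paragraphs:
        return []
    # Treat the size that covers the most characters as the body text size.
    weight = Counter()
    for text, size in paragraphs:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    return [text for text, size in paragraphs if size >= body_size * min_ratio]


book = [
    ("Chapter 10", 16),
    ("It was a dark and stormy night, and the rain fell in torrents.", 12),
    ("Chapter 11", 16),
    ("The next morning was bright and clear.", 12),
]
headings = find_chapter_headings(book)
```

Here the 12 pt paragraphs are taken as body text, so the two 16 pt lines come back as chapter headings.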

 

Some books have page numbers at the top and others at the bottom; does it
matter where they are in the scan?

 

The easy answer is no, it does not. When processing a book we look at text
between two page breaks. When a page number is located either at the top or
bottom of the page, the text between the page breaks is associated with that
number. When generating DAISY and BRF we place the associated page number in
the correct spot, which for both formats is at the beginning of the page. So
effectively if the page number is at the bottom of the page, we move it to
the top.
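
The idea of associating a page number with the text between two page breaks can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name is made up, page breaks are represented by a form feed character, and only plain Arabic numerals standing alone on the first or last line are treated as page numbers.

```python
import re

# A line that contains nothing but a number, e.g. "12".
PAGE_NUMBER = re.compile(r"^\s*(\d+)\s*$")


def split_pages(text, page_break="\f"):
    """Return a list of (page_number, lines) pairs, one per page.

    page_number is None when no number is recognized; in that case
    every line stays in the body exactly as it was.
    """
    pages = []
    for chunk in text.split(page_break):
        lines = [ln for ln in chunk.splitlines() if ln.strip()]
        number = None
        if lines:
            # Look at the top of the page first, then at the bottom.
            for index in (0, -1):
                match = PAGE_NUMBER.match(lines[index])
                if match:
                    number = match.group(1)
                    del lines[index]  # the number now lives outside the body
                    break
        pages.append((number, lines))
    return pages


scan = "13\nThe story continues\fAnd it ends\n14"
pages = split_pages(scan)
```

Because each number is stored separately from the body lines, whatever writes the DAISY or BRF output can place it at the beginning of the page whether the scan had it at the top or at the bottom.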

 

 

 

All good things,

 

Pavi Mehta

Volunteer Coordinator, Bookshare

 

Benetech 

480 S. California Ave., Suite 201

Palo Alto, CA 94306-1609 USA

Phone:  +1 650 644-3459

 

pavim@xxxxxxxxxxxx

 

www.benetech.org

 

The Benetech Initiative - Technology Serving Humanity 

A Nonprofit Organization

 


 To unsubscribe from this list send a blank Email to
bksvol-discuss-request@xxxxxxxxxxxxx
put the word 'unsubscribe' by itself in the subject line.  To get a list of 
available commands, put the word 'help' by itself in the subject line.
