[bksvol-discuss] Miraculous multiplication of book pages -- a bug of biblical proportion?

  • From: Guido Corona <guidoc@xxxxxxxxxx>
  • To: bksvol-discuss@xxxxxxxxxxxxx
  • Date: Tue, 10 Aug 2004 11:38:21 -0500

Ken,  that worries me. I had not yet heard of K1000 KES to RTF conversion 
fouling page boundaries this badly.  I am contacting Kurzweil on this 
subject.  They might need the settings file (OST) you used to reproduce 
the problem.
The only setting which in my view may be the cause of these extra pages is 
in recognition settings:  partial columns should be ignored rather than 
kept.

In the meantime,  are you saying that a further K1000 conversion from RTF 
to TXT cleans up the spurious page breaks?  If that is the case,  I will 
try this in the next book where I encounter the problem.
I suggest that,  until we get the problem fixed programmatically,  you 
convert problem books to TXT before submission to Bookshare.

Guido
If we can provide Kurzweil with reproducible scenarios,  I am 
sure they can address the problem.
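
For anyone who wants to compare page counts before and after conversion, here is a rough sketch in Python. It rests on two assumptions that I have not confirmed for K1000 output: that hard page breaks in the RTF file are written as the standard RTF control word `\page`, and that the TXT export separates pages with form feed characters. The function names are made up for illustration.

```python
import re

# Assumption: hard page breaks are stored as the RTF control word \page.
# The lookahead skips longer control words such as \pagebb.
# Limitation: an escaped backslash followed by the text "page" would be
# miscounted; this is a sketch, not an RTF parser.
PAGE_BREAK = re.compile(r"\\page(?![a-z])")


def count_rtf_pages(rtf_text: str) -> int:
    """Return the page count implied by hard page breaks (breaks + 1)."""
    return len(PAGE_BREAK.findall(rtf_text)) + 1


def count_txt_pages(txt_text: str) -> int:
    """Return the page count, assuming form feeds separate the pages."""
    return txt_text.count("\f") + 1


if __name__ == "__main__":
    sample_rtf = r"{\rtf1 Page one\page Page two\page\pagebb Page three}"
    sample_txt = "Page one\fPage two\fPage three"
    print(count_rtf_pages(sample_rtf), count_txt_pages(sample_txt))
```

If the two counts for the same book differ widely, that would be a reproducible case worth sending along with the settings file.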

Guido D. Corona
IBM Accessibility Center,  Austin Tx.
IBM Research,
Phone:  (512) 838-9735
Email: guidoc@xxxxxxxxxxx

Visit my weekly Accessibility WebLog at:
http://www-3.ibm.com/able/weblog/corona_weblog.html





"Kenneth A. Cross" <crossk@xxxxxxxxxxxx> 
Sent by: bksvol-discuss-bounce@xxxxxxxxxxxxx
08/10/2004 11:01 AM
Please respond to: bksvol-discuss
To: <bksvol-discuss@xxxxxxxxxxxxx>
cc:
Subject: [bksvol-discuss] Re: 550 books in the download queue






Actually, there is a very interesting problem, at least in Kurzweil 8.1. I 
just saved a 219-page book from KES to RTF.  The RTF version had five 
hundred plus pages.  More pointedly, after I deleted the extra pages and 
saved the book, the pages reappeared when the file was loaded and saved 
again.  That had nothing to do with the scanning and very little to do with 
the actual reading of the book.  That same book, however, saved as a 
two-hundred-page TXT file, and if you want books in that format, that is 
easy enough to do. 
----- Original Message ----- 
From: Guido Corona 
To: bksvol-discuss@xxxxxxxxxxxxx 
Sent: Tuesday, August 10, 2004 11:34 AM
Subject: [bksvol-discuss] Re: 550 books in the download queue


Kenneth,  I do agree that we cannot make things perfect,  especially if 
graphs and pictures are involved,  which typically generate optical noise. 
 On the other hand I do urge all scanning volunteers to perform a modicum 
of cleanup of their materials prior to posting to the system. 

1.  PLEASE do perform a book integrity check.  If a print copy has 320 
pages, there is no reason whatsoever that the etext should contain 1760 
pages instead. 
There should be a 1 to 1 correspondence in the body text between printed 
pages and etext pages.  Any volunteer can in most cases easily remove some 
duplicate pages,  but can't fix things when they are totally out of 
kilter. Furthermore, it is a lot easier for the submitter to integrate any 
missing pages as he/she still has access to the print copy. 
In most cases,  even if a few missing pages need to be inserted,  a page 
integrity check for a book does not take more than 15 minutes and would 
save all reviewers a lot of wasted time. 
In some cases there is -- as already mentioned -- a tremendous discrepancy 
between pages in the printed book and etext pages,  where the etext has 5 
or 6 times the number of pages in the original.  This likely points to 
scanning/OCR settings that are way off,  or an OCR package being used 
which is less than adequate.   

2.  Lots of broken page headers make a book very tiring to read.  Please 
fix them or remove them.  Kurzweil lets you remove page headers 
automatically.  Version 8 was a little radical in this regard and ended 
up also removing things it was not supposed to.  The newest version 9,  just 
announced yesterday,  now has an option for 'careful' header removal. 
Yesterday I worked on one of your books.  Kurzweil removed 190 headers. 
Approx 120 headers I removed manually.  It took me all of 15 minutes to do 
the cleanup.  As I had already performed a page integrity check and had 
come up with perfect correspondence after removal of duplicate pages,  I 
also removed page numbers to do a faster job.  Just want to get the 
backlog down quickly. 

3.  Ah yes,  those amazing synopses. Let us all try to be informative. 
The short one should give our paying customers a very brief sense of the 
book.  If we are inclined to give more detailed info,  a longer and useful 
description can go in the long synopsis.  I confess I hardly ever make up 
my own synopsis,  but I liberally borrow from the front matter of the 
book,  the back cover of paperbacks and the front and back flaps of hard 
covers.  Synopses such as "Set in Alabama",  "It's all in the Title", 
"Thorough Treatment of the matter at hand" are unfortunately not 
noticeably helpful and will only cause our paying customers to get 
irritated and lose faith in Bookshare. 
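
The page integrity check in point 1 can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the etext is plain text with pages separated by form feed characters, and the `integrity_report` name and the 10% `tolerance` default are hypothetical choices of mine, not anything Bookshare or Kurzweil prescribes.

```python
from collections import defaultdict


def integrity_report(etext: str, print_page_count: int,
                     tolerance: float = 0.1) -> dict:
    """Compare etext pages against the print page count.

    Assumes pages are separated by form feeds; the tolerance is an
    arbitrary illustrative threshold.
    """
    if print_page_count <= 0:
        raise ValueError("print_page_count must be positive")

    pages = etext.split("\f")

    # Group page numbers by whitespace-normalized text to spot duplicates.
    seen = defaultdict(list)
    empty = []
    for number, page in enumerate(pages, start=1):
        key = " ".join(page.split())
        if key:
            seen[key].append(number)
        else:
            empty.append(number)

    duplicates = [numbers for numbers in seen.values() if len(numbers) > 1]
    ratio = len(pages) / print_page_count
    return {
        "etext_pages": len(pages),
        "ratio": ratio,
        "duplicates": duplicates,   # lists of page numbers with identical text
        "empty": empty,             # page numbers with no text at all
        "ok": abs(ratio - 1.0) <= tolerance,
    }


if __name__ == "__main__":
    report = integrity_report("one\ftwo\ftwo\f\fthree", print_page_count=3)
    print(report)
```

A ratio far above 1, such as the 320 versus 1760 case above, suggests a settings or OCR problem and not just a handful of duplicate pages.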

Of course,  there is a lot more that submitters can do to improve the 
quality of their postings,  but even a little upfront cleanup and a 
clean submission process will enable us to offer a quality product to OUR 
PAYING CUSTOMERS. 

NOTE:  By the way,  Kurzweil will start shipping K1000 version 9 shortly. 
I have used the beta and found OCR quality improved even over earlier 
versions.  I will post the announcement shortly.  Cost of the upgrade is 
$95 or $0.00,  depending on the status of your account. 


