[bksvol-discuss] Re: Miraculous multiplication of book pages -- a bug of biblical proportion?

  • From: "Kenneth A. Cross" <crossk@xxxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Tue, 10 Aug 2004 18:26:52 -0400

No, I don't have 3000.  I sent Stephen the entire file, and we are talking 
about it.
  ----- Original Message ----- 
  From: Guido Corona 
  To: bksvol-discuss@xxxxxxxxxxxxx 
  Sent: Tuesday, August 10, 2004 2:45 PM
  Subject: [bksvol-discuss] Re: Miraculous multiplication of book pages -- a 
bug of biblical proportion?



  Thanks Ken.  You do not happen to be using K3000 rather than K1000 by any 
chance? 
  Let Stephen know exactly which product and version you are using. 
  It may also be a good idea to send Stephen a few sample images of pages that 
give the most bizarre results,  e.g. title pages where each character in 
the title seems to be assigned its own page. 

  Thanks, 

  Guido 



  Guido D. Corona
  IBM Accessibility Center,  Austin Tx.
  IBM Research,
  Phone:  (512) 838-9735
  Email: guidoc@xxxxxxxxxxx

  Visit my weekly Accessibility WebLog at:
  http://www-3.ibm.com/able/weblog/corona_weblog.html




        "Kenneth A. Cross" <crossk@xxxxxxxxxxxx> 
        Sent by: bksvol-discuss-bounce@xxxxxxxxxxxxx 
        08/10/2004 12:59 PM 
              Please respond to
              bksvol-discuss 


              To <bksvol-discuss@xxxxxxxxxxxxx>  
              cc  
              Subject [bksvol-discuss] Re: Miraculous multiplication of book 
pages -- a bug of biblical proportion? 


  I have sent them the kes file to study.  The kes file of about two hundred 
pages converted to 500 plus pages in rtf.  The same kes file came out at the 
same two hundred pages when I saved it as txt instead.  I will also get the 
settings file to them. 
  ----- Original Message ----- 
  From: Guido Corona 
  To: bksvol-discuss@xxxxxxxxxxxxx 
  Sent: Tuesday, August 10, 2004 12:38 PM 
  Subject: [bksvol-discuss] Miraculous multiplication of book pages -- a bug of 
biblical proportion? 


  Ken,  that worries me.  I had not yet heard of K1000 Kes to RTF conversion 
fouling page boundaries this badly.  I am contacting Kurzweil on this 
subject.  They might need the settings file (OST) you used to reproduce the 
problem. 
  The only setting which in my view may be the cause of these extra pages is in 
recognition settings:  partial columns should be ignored rather than kept. 

  In the meantime,  are you saying that a further K1000 conversion from RTF to 
TXT cleans up the spurious page breaks?  If that is the case,  I will try this 
in the next book where I encounter the problem. 
  I suggest that,  until we get the problem fixed programmatically,  you 
convert problem books to TXT before submission to Bookshare. 

  Guido 
  for assistance.  If we can provide them with reproducible scenarios,  I am 
sure they can address the problem. 



        "Kenneth A. Cross" <crossk@xxxxxxxxxxxx> 
        Sent by: bksvol-discuss-bounce@xxxxxxxxxxxxx 
        08/10/2004 11:01 AM 
              Please respond to
              bksvol-discuss 


       
              To <bksvol-discuss@xxxxxxxxxxxxx>  
              cc  
              Subject [bksvol-discuss] Re: 550 books in the download queue 


              

       




  Actually, there is a very interesting problem, at least in Kurzweil 8.1.  I 
just saved a 219 page book from kes to rtf.  The rtf version had five hundred 
plus pages.  More pointedly, after I deleted the extra pages and saved the 
book, they reappeared when the file was loaded and saved again.  That had nothing 
to do with the scanning and very little to do with the actual reading of the 
book.  Now that same book saved as a two hundred page txt book, and if you want 
books in that format that is easy enough to do.   
  ----- Original Message ----- 
  From: Guido Corona 
  To: bksvol-discuss@xxxxxxxxxxxxx 
  Sent: Tuesday, August 10, 2004 11:34 AM 
  Subject: [bksvol-discuss] Re: 550 books in the download queue 


  Kenneth,  I do agree that we cannot make things perfect,  especially if 
graphs and pictures are involved,  which typically generate optical noise.  On 
the other hand I do urge all scanning volunteers to perform a modicum of 
cleanup of their materials prior to posting to the system. 

  1.  PLEASE do perform a book integrity check.  If a print copy has 320 pages, 
there is no reason whatsoever that the etext should contain 1760 pages instead. 
  
  There should be a 1 to 1 correspondence in the bodytext between printed pages 
and etext pages.  Any volunteer can in most cases easily remove some duplicate 
pages,  but can't fix things when they are totally out of kilter. Furthermore, 
it is a lot easier for the submitter to integrate any missing pages,  as he/she 
still has access to the print copy. 
  In most cases,  even if a few missing pages need to be inserted,  a page 
integrity check for a book does not take more than 15 minutes and would save 
all reviewers a lot of wasted time. 
  In some cases there is -- as already mentioned -- a tremendous discrepancy 
between pages in the printed book and etext pages,  where the etext has 5 or 6 
times the number of pages in the original.  This likely points to scanning/OCR 
settings that are way off,  or an OCR package being used which is less than 
adequate.   
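
  The page count part of such an integrity check could be partly automated.  
The following is only a rough sketch in Python, not anything Kurzweil or 
Bookshare provides.  It assumes that the RTF export marks each page boundary 
with the standard RTF \page control word and that the TXT export separates 
pages with a form feed character; neither assumption is confirmed in this 
thread.  The function names and the 10 percent tolerance are made up for 
illustration.

```python
import re

# Matches either an escaped backslash (a literal "\" in the text) or the
# RTF \page control word. Longer control words such as \pagebb are excluded.
_TOKEN = re.compile(r"\\\\|\\page(?![a-zA-Z])")


def count_rtf_pages(rtf_text: str) -> int:
    """Return 1 + the number of hard page breaks (\\page) in rtf_text."""
    breaks = sum(1 for m in _TOKEN.finditer(rtf_text) if m.group() == "\\page")
    return breaks + 1


def count_txt_pages(text: str) -> int:
    """Count pages in plain text, ASSUMING a form feed separates pages."""
    return text.count("\f") + 1


def page_check(etext_pages: int, print_pages: int, tolerance: float = 0.10) -> bool:
    """True if the etext page count is within `tolerance` of the print count."""
    return abs(etext_pages - print_pages) <= tolerance * print_pages


if __name__ == "__main__":
    # Figures taken from the thread: a 219 page book that came out as 500 plus pages.
    print(page_check(219, 219))  # counts agree
    print(page_check(500, 219))  # far too many etext pages
```

  A check like this only flags a mismatch in the totals.  It cannot tell 
which pages are duplicated or missing, so the manual comparison against the 
print copy is still needed.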

  2.  Lots of broken page headers make a book very tiring to read.  Please fix 
them or remove them.  Kurzweil lets you remove page headers automatically.  
Version 8 was a little radical in this regard and ended up also removing things 
it was not supposed to.  The newest version 9,  just announced yesterday,  now has 
an option for 'careful' header removal.  Yesterday I worked on one of your books.  
Kurzweil removed 190 headers.  Approx 120 headers I removed manually.  It took 
me all of 15 minutes to do the cleanup.  As I had already performed a page 
integrity check and had come up with perfect correspondence after removal of 
duplicate pages,  I also removed page numbers to do a faster job.  Just want to 
get the backlog down quickly. 
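
  For volunteers working from plain text, the idea behind automatic header 
removal can be sketched as follows.  This is not how Kurzweil does it, which 
the thread does not describe; it is a simple heuristic under stated 
assumptions: the book is already split into a list of page strings, a running 
header is the first non-blank line of a page, and a header counts as such 
only if the same text (ignoring digits and case) opens at least 20 percent 
of the pages.  All names and the threshold are hypothetical.

```python
import re
from collections import Counter


def _normalise(line: str) -> str:
    # Drop digits (page numbers) and collapse whitespace so that
    # "12  MY BOOK" and "MY BOOK 13" compare equal.
    return " ".join(re.sub(r"\d+", " ", line).lower().split())


def first_line(page: str) -> str:
    """Return the first non-blank line of a page, or "" if there is none."""
    for line in page.splitlines():
        if line.strip():
            return line
    return ""


def find_running_headers(pages, min_share=0.2):
    """Return normalised first lines that open at least min_share of the pages."""
    counts = Counter(_normalise(first_line(p)) for p in pages)
    counts.pop("", None)  # ignore empty pages and bare page numbers
    threshold = max(2, min_share * len(pages))
    return {text for text, n in counts.items() if n >= threshold}


def strip_headers(pages, headers):
    """Remove the first non-blank line of each page if it is a known header."""
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        for i, line in enumerate(lines):
            if line.strip():
                if _normalise(line) in headers:
                    del lines[i]
                break  # only the first non-blank line is ever considered
        cleaned.append("\n".join(lines))
    return cleaned
```

  Exact matching like this will miss headers that OCR has garbled, which is 
consistent with the experience above that a good share of broken headers 
still has to be removed by hand.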

  3.  Ah yes,  those amazing synopses.  Let us all try to be informative.  The 
short one should give our paying customers a very brief sense of the book.  If 
we are inclined to give more detailed info,  a longer and useful description 
can go in the long synopsis.  I confess I hardly ever make up my own synopsis,  
but I liberally borrow from the front matter of the book,  the back cover of 
paperbacks and the front and back flaps of hard covers.  Synopses such as "Set 
in Alabama",  "It's all in the Title",  "Thorough Treatment of the matter at 
hand" are unfortunately not noticeably helpful and will only cause our paying 
customers to get irritated and lose faith in Bookshare. 

  Of course,  there is a lot more that submitters can do to improve the quality 
of their postings,  but even a little upfront cleanup and a clean submission 
process will enable us to offer a quality product to OUR PAYING CUSTOMERS. 

  NOTE:  By the way,  Kurzweil will start shipping K1000 version 9 shortly.  I 
have used the beta and found OCR quality further improved over earlier versions.  
I will post the announcement shortly.  Cost of the upgrade is $95 or $0.00,  
depending on the status of your account. 


