[bksvol-discuss] Re: Become A Black Belt Submitter

  • From: james.homme@xxxxxxxxxxxx
  • To: bksvol-discuss@xxxxxxxxxxxxx
  • Date: Fri, 15 Aug 2008 07:43:40 -0400

Dear Monica,
What a wonderful set of tips, and entertaining to read, too. I will put
them in my saved-info folder in case I ever start scanning. Thank you for
writing this. I think that this should go into the manual.



James D Homme, Usability Engineering, Highmark Inc.,
james.homme@xxxxxxxxxxxx, 412-544-1810

"The difference between those who get what they wish for and those who
don't is action. Therefore, every action you take is a complete
success, regardless of the results." -- Jerrold Mundis

From: "Monica Willyard"
Sent by: bksvol-discuss-bounce@xxxxxxxxxxxx
Date: 08/14/2008 06:19
To: "Bookshare Volunteers" <bksvol-discuss@xxxxxxxxxxxxx>
Subject: [bksvol-discuss] Become A Black Belt Submitter

Hi, everyone. I wrote an email about getting really clear scans for one of
our volunteers, and it occurred to me that someone on this list might
benefit from it. It's a little on the long side. I hope something in it
will help you. If I've said anything confusing, please ask me about it. I
know many of you have done a lot of scanning, so I'm focusing on things
that may not have occurred to you. I'll call them my top ten scanning tips.
(grin) They work from my experience, and you may find that you need to
experiment to find something that works well for you. Also, I use Kurzweil
for scanning. Openbook users may find some of this to be useful, but some
of it won't apply. I do have Openbook 7 and used it for several years. So
I'll do my best to help you translate these to Openbook if that's what you
use.

I got a lot of these ideas from volunteers I've been fortunate enough to
work with over the past 2 years. Jim Baugh, Louise, Pratik, Jake, Scott,
Shelley, and Gerald taught me so much about good scanning. Thanks guys.
(smile) You rock!

1. Start with some solid settings in Kurzweil that will work most of the
time. You may know your way around Kurzweil well. I don't know if you've
thought to work on these settings though since they're not obvious. Under
the settings menu, in the general tab, make sure that your confidence
threshold is set to at least 98.5. Why? Kurzweil defaults to 95 percent,
and that means that it optimizes scans for a lower level of accuracy. That
means you won't get the best results from optimization. That also means
more clean-up on the backside, and that's a pain in the neck. The other
setting in general that you may want to turn on if you have some disk space
is the option to keep scanned images. This feature lets you re-recognize
pages if they have issues. Sometimes just changing something like detect
columns will make that page come out right without you having to totally
rescan the page. Once you've read through the book, Kurzweil will let you
remove the scanned images from the book to reduce the file size.

There are three final settings that you may find useful for scanning most
fiction. These work well for me, especially with library books. They're all
under the recognition tab. Column identification should be enabled. Partial
columns should be ignored, and suspicious regions should be ignored. This
flies in the face of what Nick has recommended on the Kurzweil list, so I'd
better explain. When scanning books, it's somewhat common to get a shadow
from the spine of the book. It often makes a narrow column of a tab
character and a random group of numbers or letters. If you turn off column
identification, these random letters are mingled with the regular text.
Turning on the column detection separates this garbage from the text, and
ignoring partial columns and suspicious regions removes it during OCR. If a
page needs column detection turned off due to a table, and you have
retained images of the scanned page, you can easily change the recognition
settings and just re-recognize the page from the scanned image. Do you see
how this could save you time and hassle?

Once you have settings you like, save them as default so you can start
scanning without worrying about them each time you start Kurzweil.

2. Prepare your book for scanning, and you'll get better results from the
start. Before you begin to scan a book, run your fingers lightly through
the pages to remove any possible ink, dust, or other particles that may be
on the pages. If the book is a library book, flip through the book in
sections of about fifteen pages or so, gently pressing your fingers along
the inner spine to encourage the book to lie flat. If the book belongs to
you, especially if it's a paperback, flip through sections as with a library
book, but bend the book back so that its outer covers almost touch. You're
giving your book some flexibility stretches while not breaking its spine.
This is especially important for thick books or two-page scanning mode and
will keep you from having to push down as hard on books while you scan.

3. Optimize and verify settings for your book. Before scanning a book, open
to the center and use the optimize feature. The Kurzweil staff says that
optimization should be used in one-page mode so it can get the best idea of
how the print works in your book. Scan four or five pages after
optimization to determine if any adjustments in settings need to be made.
Kurzweil does a fairly good job picking the optimal settings to scan a
particular book unless the print quality is exceptionally bad. If you're
planning to scan in two-page mode, you can turn two-page mode back on once
you're finished with optimization.

4. When in doubt, go for grey-scale. Grey-scale is the best and most
reliable thing to try when optimization doesn't produce the quality that
you need. Try grey-scale with brightness of around 65 and a resolution of
300 DPI. It's really great for scanning mass market paperbacks. Grey-scale
will make your scans slower, and its scanned images are larger than those
made with static thresholding. It gives the best page representation
though, compared to other forms of thresholding. If you're using a Canon or
Visioneer scanner, grey-scale will save your bacon! (grin) Please note that
Openbook 7 doesn't implement grey-scale correctly, so automatic contrast is
probably your best choice.
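
To put a rough number on "larger," here is a small Python sketch of the
arithmetic. It assumes uncompressed images, 8 bits per pixel for grey-scale,
1 bit per pixel for a black-and-white image after thresholding, and a
letter-size scan area of 8.5 by 11 inches. None of these figures come from
Kurzweil, whose saved images are likely compressed, so treat the result as a
ballpark only. The function name is made up for illustration.

```python
# Hypothetical helper: raw (uncompressed) size of one scanned page image.
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int) -> int:
    width_px = round(width_in * dpi)    # pixels across the scan area
    height_px = round(height_in * dpi)  # pixels down the scan area
    return width_px * height_px * bits_per_pixel // 8


# Assumed scan area: 8.5 x 11 inches at the 300 DPI suggested in tip 4.
grey = raw_page_bytes(8.5, 11, 300, 8)   # 8-bit grey-scale
bw = raw_page_bytes(8.5, 11, 300, 1)     # 1-bit black and white
print(grey, bw, grey // bw)              # grey-scale is 8 times larger here
```

Under these assumptions a grey-scale page is about 8.4 million bytes against
about 1 million for black and white, which is why keeping scanned images takes
some disk space.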

5. Catch bad scans as they happen. There is a friendly debate among
submitters about whether to scan in batches or to scan pages and recognize
them one at a time. There are pros and cons on both sides. I do a sort of
modified batch style. I scan a book while on the phone or doing something
else but don't use the scan repeatedly feature for one reason. I want to
catch badly scanned pages as they happen. It saves me from hunting for a
page to rescan it later. So I scan a page and let my scan recognize while
I'm turning to the next page. I wait for Kurzweil to tell me its confidence
number. I make this really easy because I've turned off the progress
messages for Kurzweil's scanning and recognition and have it set to play a
chime when scanning and recognition are finished. So if Kurzweil says
something, it's the confidence number letting me know that the page scanned
below the accuracy threshold I've set. If the statistics say 97 percent
confidence level or less, rescan the page to try for a better scan.
Otherwise, you will have to struggle with many errors on the page.
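
The rescan rule in this tip can be written out as a tiny Python sketch.
Kurzweil does not offer anything like this as a script. The function names
and the idea of collecting confidence numbers per page are assumptions made
purely to illustrate the rule. The 97 percent cutoff is the one given above.

```python
# Hypothetical sketch of the tip 5 rule: rescan at 97 percent or less.
RESCAN_AT_OR_BELOW = 97.0


def needs_rescan(confidence: float, cutoff: float = RESCAN_AT_OR_BELOW) -> bool:
    """Return True when a page's confidence is at or below the cutoff."""
    return confidence <= cutoff


def pages_to_rescan(confidences: dict[int, float]) -> list[int]:
    """Given page number -> confidence, list the pages worth rescanning."""
    return sorted(page for page, conf in confidences.items()
                  if needs_rescan(conf))


# Made-up confidence numbers for three pages.
print(pages_to_rescan({1: 99.1, 2: 96.4, 3: 97.0}))
```

Checking each page as it comes off the scanner, rather than building a list
afterward, is the point of the tip. The list version is only here to show the
comparison.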

6. Your scanner needs TLC too. Books can be dirty or dusty sometimes. Mass
market paperbacks can leave a residue of ink dust on your scanner. Keep the
scanner glass clean by using a dry, lint-free cloth. Never use anything wet
like an alcohol pad or baby wipe. That will create little bubbles under the
scanner glass and will cause problems in future scans.

7. When scanning a book, do a spot check every 15 or 20 pages. Look at the
last page or two of the file to make sure the settings are still producing
accurate results.

8. After doing a scan, run rank spelling. It will let you see your spelling
errors and will put them in the order of their prevalence in your scan. If
you find some words that Kurzweil doesn't know, you may want to add them to
your word list so they won't be flagged in future scans. I don't do this
for proper names unless it's a name that will keep cropping up in future
books. I do add words that are valid but that Kurzweil doesn't have in its
internal word list. You'll find that doing this over time helps Kurzweil do
a better job for you when you're cleaning up your scans.
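
For readers curious about what ranking misspellings involves, here is a
minimal Python sketch of the general idea: count the words that are not in a
word list and sort them by how often they appear. This is not how Kurzweil
implements rank spelling. The function name and the tiny word list are
assumptions for illustration.

```python
import re
from collections import Counter


def rank_unknown_words(text: str, known_words: set[str]) -> list[tuple[str, int]]:
    """Return (word, count) pairs for unknown words, most frequent first."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(w for w in words if w not in known_words)
    # Sort by count (highest first), then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# A stand-in word list; a real one would hold many thousands of words.
known = {"went", "the", "door", "left"}
print(rank_unknown_words("Diey went dirough the door. Diey left.", known))
```

Adding a valid word to the list, as described above, simply takes it out of
the ranking in later scans.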

9. Keep the de-speckle setting turned off for most books. You may need it
with hardcover books because they sometimes have a text decoration on the
pages. Otherwise, de-speckle can interfere with OCR and actually cause more
errors than it solves.

10. Whether to use auto-corrections when scanning is another topic of
debate. I believe it can be a good thing if used carefully.
I should note that Gerald has pointed out that Openbook has some
auto-corrections that cause problems with books and should be fixed by
users of that program. Kurzweil seems to do a good job for me, and it makes
my work easier. I loaded up a bunch of my older scans that have been
lurking on my hard drive for over a decade and ran auto-correction on them.
What an improvement! I might actually get to submit some of them now. Here
are a few auto-corrections I have added to my Kurzweil list.

dirough for through
diough for though
diought for thought
diey for they
diere for there
dieir for their
cornpany for company
cornfortable for comfortable
tiiing for thing
rnany for many
anydiing for anything
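
If you ever need to apply a list like this to plain text outside of Kurzweil,
a short Python sketch along these lines would do it. It is an illustration
only, not Kurzweil's own mechanism, and the function names are made up. It
matches whole words only, so a word like cornfield is left alone.

```python
import re

# The corrections listed above: misread form -> intended word.
CORRECTIONS = {
    "dirough": "through", "diough": "though", "diought": "thought",
    "diey": "they", "diere": "there", "dieir": "their",
    "cornpany": "company", "cornfortable": "comfortable",
    "tiiing": "thing", "rnany": "many", "anydiing": "anything",
}

# \b on both sides restricts matches to whole words.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b", re.IGNORECASE)


def _replace(match: re.Match) -> str:
    found = match.group(0)
    fixed = CORRECTIONS[found.lower()]
    # Keep a leading capital, for example at the start of a sentence.
    return fixed.capitalize() if found[0].isupper() else fixed


def apply_corrections(text: str) -> str:
    return _PATTERN.sub(_replace, text)


print(apply_corrections("Diey walked dirough the cornpany gate."))
```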

If you use Openbook, you may want to remove a few of the corrections in its
default list. I regularly find these in books scanned in Openbook and have
to fix them as I read.

modem for modern
torn for tom
glock for clock
morn for mom
bum for burn
corn for com

That last one causes problems for anyone scanning Star Trek books because
Kirk presses his corn badge to talk to the ship. (grin) If a word like
command is hyphenated between two pages, you get corn-mand. Meanwhile,
Batman dials into the internet with his modern, tries to stop a crook named
torn from shooting him with a clock, and puts the dirty burn in cuffs until
mom-ing. See how auto-corrections can go wrong if you're not careful?
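
One way to spot risky entries before they bite is to check whether the word
being replaced is itself a real word. Here is a minimal Python sketch of that
check. The function name and the small word list are assumptions for
illustration, and the sample entries are read as "replace the first word with
the second."

```python
def risky_corrections(corrections: dict[str, str],
                      real_words: set[str]) -> list[str]:
    """Return the entries whose 'misread' form is itself a real word."""
    return sorted(wrong for wrong in corrections
                  if wrong.lower() in real_words)


# Three entries from the Openbook list above, plus one safe entry.
sample = {"modem": "modern", "glock": "clock", "bum": "burn",
          "dirough": "through"}

# A stand-in word list; a real check would use a full dictionary.
words = {"modem", "glock", "bum", "modern", "clock", "burn", "through"}

print(risky_corrections(sample, words))  # "dirough" is not flagged
```

An entry like dirough is safe because it is never a real word. An entry like
modem will change a correctly scanned word every time it appears.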

Whew! We've made it to the end. (grin) I hope some of this makes your scans
easier to work with. It'll give you a foundation to start from anyhow.
Clean-up tips will be another email and will take some thought. I'm better
at doing than explaining things. I do have a system I use though. I just
haven't really written it down. Anyone got a cold Dr. Pepper to share?

Monica Willyard
