[bksvol-discuss] Re: Become A Black Belt Submitter

  • From: "EVAN REESE" <mentat3@xxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Thu, 14 Aug 2008 19:27:43 -0400

Thanks for sending this up. This is all very useful stuff.

I do use Scan Repeatedly, and just hit the Cancel key twice if I get a 
confidence number below the threshold - which on my K1000 is set to 98.7%. If 
I can go twenty or fifty pages without getting a page below that number, then 
it saves me from having to hit the F9 key twenty or fifty times.

I also use autocorrection, but haven't compared a scan with and without it, so 
I cannot take sides in that debate.

In his excellent monograph on getting the best recognition of mass market 
paperbacks, Pratik wrote that grayscale at 400 dots per inch can sometimes 
produce better results than optimized static thresholding. So your point here 
about grayscale is a good one, but increasing the resolution from 300 to 400, 
especially for poor quality print such as you'd get with cheap paperbacks, can 
sometimes give even better recognition. Of course, going above the usual 300 
will also slow down the scan and the recognition, but the extra time invested 
up front is very likely to be more than offset by the time saved cleaning up 
the scan afterward.
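
To put a rough number on that slowdown, here is a small sketch of the arithmetic. It only counts pixels. The assumption that scan and recognition time grow with pixel count is mine, not something Kurzweil documents, and the function names are made up for illustration.

```python
def pixel_count(width_in, height_in, dpi):
    """Pixels in a scan of a page of the given size (inches) at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


def relative_pixels(dpi_new, dpi_old=300):
    """How many times more pixels a scan has at dpi_new than at dpi_old."""
    return (dpi_new / dpi_old) ** 2


if __name__ == "__main__":
    # A 4 x 7 inch page is just an illustrative size for a small paperback.
    print(pixel_count(4, 7, 300))
    print(pixel_count(4, 7, 400))
    print(round(relative_pixels(400), 2))
```

Going from 300 to 400 DPI gives (400/300) squared, or about 1.78 times as many pixels per page, so each image is noticeably bigger even though the page is the same.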

I have scanned the same material with Suspicious Regions kept and ignored, and 
it can really make a difference in the amount of junk you get. So this is 
another good point you make here.

Thanks again.

Evan

  ----- Original Message ----- 
  From: Monica Willyard 
  To: Bookshare Volunteers 
  Sent: Thursday, August 14, 2008 6:19 PM
  Subject: [bksvol-discuss] Become A Black Belt Submitter


  Hi, everyone. I wrote an email about getting really clear scans for one of 
our volunteers, and it occurred to me that someone on this list might benefit 
from it. It's a little on the long side. I hope something in it will help you. 
If I've said anything confusing, please ask me about it. I know many of you 
have done a lot of scanning, so I'm focusing on things that may not have 
occurred to you. I'll call them my top ten scanning tips. (grin) They work from 
my experience, and you may find that you need to experiment to find something 
that works well for you. Also, I use Kurzweil for scanning. Openbook users may 
find some of this to be useful, but some of it won't apply. I do have Openbook 
7 and used it for several years. So I'll do my best to help you translate these 
to Openbook if that's what you need.

  I got a lot of these ideas from volunteers I've been fortunate enough to work 
with over the past 2 years. Jim Baugh, Louise, Pratik, Jake, Scott, Shelley, 
and Gerald taught me so much about good scanning. Thanks guys. (smile) You rock!

  1. Start with some solid settings in Kurzweil that will work most of the 
time. You may know your way around Kurzweil well. I don't know if you've 
thought to work on these settings though since they're not obvious. Under the 
settings menu, in the general tab, make sure that your confidence threshold is 
set to at least 98.5. Why? Kurzweil defaults to 95 percent, and that means that 
it optimizes scans for a lower level of accuracy. That means you won't get the 
best results from optimization. That also means more clean-up on the backside, 
and that's a pain in the neck. The other setting in general that you may want 
to turn on if you have some disk space is the option to keep scanned images. 
This feature lets you re-recognize pages if they have issues. Sometimes just 
changing something like detect columns will make that page come out right 
without you having to totally rescan the page. Once you've read through the 
book, Kurzweil will let you remove the scanned images from the book to reduce 
the file size.

  There are three final settings that you may find useful for scanning most 
fiction. These work well for me, especially with library books. They're all 
under the recognition tab. Column identification should be enabled. Partial 
columns should be ignored, and suspicious regions should be ignored. This flies 
in the face of what Nick has recommended on the Kurzweil list, so I'd better 
explain. When scanning books, it's somewhat common to get a shadow from the 
spine of the book. It often makes a narrow column of a tab character and a 
random group of numbers or letters. If you turn off column identification, 
these random letters are mingled with the regular text. Turning on the column 
detection separates this garbage from the text, and ignoring partial columns 
and suspicious regions removes it during OCR. If a page needs column detection 
turned off due to a table, and you have retained images of the scanned page, 
you can easily change the recognition settings and just re-recognize the page 
from the scanned image. Do you see how this could save you time and hassle?

  Once you have settings you like, save them as default so you can start 
scanning without worrying about them each time you start Kurzweil.

  2. Prepare your book for scanning, and you'll get better results from the 
start. Before you begin to scan a book, run your fingers lightly through the 
pages to remove any possible ink, dust, or other particles that may be on the 
pages. If the book is a library book, flip through the book in sections of 
about fifteen pages or so, gently pressing your fingers along the inner spine 
to encourage the book to lie flat. If the book belongs to you, especially if 
it's a paperback, flip through sections as with a library book, but bend the 
book back so that its outer covers almost touch. You're giving your book some 
flexibility stretches while not breaking its spine. This is especially 
important for thick books or two-page scanning mode and will keep you from 
having to push down as hard on books while you scan.

  3. Optimize and verify settings for your book. Before scanning a book, open 
to the center and use the optimize feature. The Kurzweil staff says that 
optimization should be used in one-page mode so it can get the best idea of how 
the print works in your book. Scan four or five pages after optimization to 
determine if any adjustments in settings need to be made. Kurzweil does a 
fairly good job picking the optimal settings to scan a particular book unless 
the print quality is exceptionally bad. If you're planning to scan in two-page 
mode, you can turn this back on once you're finished with optimization.

  4. When in doubt, go for grey-scale. Grey-scale is the best and most reliable 
thing to try when optimization doesn't produce the quality that you need. Try 
grey-scale with brightness of around 65 and a resolution of 300 DPI. It's 
really great for scanning mass market paperbacks. Grey-scale will make your 
scans slower, and its scanned images are larger than those made with static 
thresholding. It gives the best page representation, though, compared to other 
forms of thresholding. If you're using a Canon or 
Visioneer scanner, grey-scale will save your bacon! (grin) Please note that 
Openbook 7 doesn't implement grey-scale correctly, so automatic contrast is 
probably your best choice.

  5. Catch bad scans as they happen. There is a friendly debate among 
submitters about whether to scan in batches or to scan pages and recognize them 
one at a time. There are pros and cons on both sides. I do a sort of modified 
batch style. I scan a book while on the phone or doing something else but don't 
use the scan repeatedly feature for one reason. I want to catch badly scanned 
pages as they happen. It saves me from hunting for a page to rescan it later. 
So I scan a page and let my scan recognize while I'm turning to the next page. 
I wait for Kurzweil to tell me its confidence number. I make this really easy 
because I've turned off the progress messages for Kurzweil's scanning and 
recognition and have it set to play a chime when scanning and recognition are 
finished. So if Kurzweil says something, it's the confidence number letting me 
know that the page scanned below the accuracy threshold I've set. If the 
statistics say 97 percent confidence level or less, rescan the page to try for 
a better scan. Otherwise, you will have to struggle with many errors on the 
page. 
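
  The rescan rule in this tip can be written down as a tiny sketch. Kurzweil doesn't hand you these numbers through any programming interface that I know of, so the list of confidence values here is assumed to be something you noted by hand, and the function name is made up.

```python
# Cutoff from the tip above: 97 percent or less means rescan.
RESCAN_AT_OR_BELOW = 97.0


def pages_to_rescan(confidences, cutoff=RESCAN_AT_OR_BELOW):
    """Return 1-based page numbers whose confidence is at or below the cutoff.

    `confidences` is a hypothetical list of per-page confidence percentages
    in scan order.
    """
    return [page for page, conf in enumerate(confidences, start=1)
            if conf <= cutoff]


if __name__ == "__main__":
    noted = [99.2, 96.8, 98.9, 97.0, 99.5]
    print(pages_to_rescan(noted))
```

With the sample numbers, pages 2 and 4 come back as the ones to rescan. Catching them as they happen just means you never have to build this list at all.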

  6. Your scanner needs TLC too. Books can be dirty or dusty sometimes. Mass 
market paperbacks can leave a residue of ink dust on your scanner. Keep the 
scanner glass clean by using a dry, lint-free cloth. Never use anything wet 
like an alcohol pad or baby wipe. That will create little bubbles under the 
scanner glass and will cause problems in future scans.

  7. When scanning a book, do a spot check every 15 or 20 pages. Look at the 
last page or two of the file to make sure the settings are still producing 
accurate results. 

  8. After doing a scan, run rank spelling. It will let you see your spelling 
errors and will put them in the order of their prevalence in your scan. If you 
find some words that Kurzweil doesn't know, you may want to add them to your 
word list so they won't be flagged in future scans. I don't do this for proper 
names unless it's a name that will keep cropping up in future books. I do add 
words that are valid but that Kurzweil doesn't have in its internal word list. 
You'll find that doing this over time helps Kurzweil do a better job for you 
when you're cleaning up your scans.
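
  If you want to see the idea behind ranked spelling outside of Kurzweil, here is a rough sketch. It is not how Kurzweil does it internally. The function name and the tiny word list are assumptions for illustration, and a real word list would be much bigger.

```python
import re
from collections import Counter


def rank_unknown_words(text, known_words):
    """Return (word, count) pairs for words not in known_words, most frequent first."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(w for w in words if w not in known_words)
    return counts.most_common()


if __name__ == "__main__":
    known = {"the", "saw", "and", "again"}
    sample = "The rnan saw diem. Diem and diem again, rnan."
    for word, count in rank_unknown_words(sample, known):
        print(word, count)
```

The most common errors float to the top, so fixing the first few entries usually cleans up the largest share of the scan.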

  9. Keep the de-speckle setting turned off for most books. You may need it 
with hardcover books because they sometimes have a text decoration on the 
pages. Otherwise, de-speckle can interfere with OCR and actually cause more 
errors than it solves.

  10. Whether to use auto-corrections when scanning is another topic of 
debate. I believe it can be a good thing if used carefully. I should 
note that Gerald has pointed out that Openbook has some auto-corrections that 
cause problems with books and should be fixed by users of that program. 
Kurzweil seems to do a good job for me, and it makes my work easier. I loaded 
up a bunch of my older scans that have been lurking on my hard drive for over a 
decade and ran auto-correction on them. What an improvement! I might actually 
get to submit some of them now. Here are a few auto-corrections I have added to 
my Kurzweil list.

  dirough for through
  diough for though
  diought for thought
  diey for they
  diere for there
  dieir for their
  cornpany for company
  cornfortable for comfortable
  tiiing for thing
  rnany for many
  anydiing for anything
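
  For anyone who cleans up text outside of Kurzweil, a list like this can be applied with a short script. This is only a sketch. It matches whole words and keeps a leading capital, which is my assumption about sensible behavior and not a description of how Kurzweil's own auto-correction works.

```python
import re

# Misread word -> intended word, taken from the list above.
CORRECTIONS = {
    "dirough": "through",
    "diough": "though",
    "diought": "thought",
    "diey": "they",
    "diere": "there",
    "dieir": "their",
    "cornpany": "company",
    "cornfortable": "comfortable",
    "tiiing": "thing",
    "rnany": "many",
    "anydiing": "anything",
}

# Whole-word, case-insensitive match on any of the misread words.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b",
    re.IGNORECASE,
)


def _fix(match):
    word = match.group(0)
    fixed = CORRECTIONS[word.lower()]
    # Keep a capital first letter, e.g. at the start of a sentence.
    return fixed.capitalize() if word[0].isupper() else fixed


def autocorrect(text):
    """Replace each known misread word in text with its correction."""
    return _PATTERN.sub(_fix, text)


if __name__ == "__main__":
    print(autocorrect("Diey went dirough the cornpany door."))
```

None of the words on the left is a real English word, which is what makes this list safe to apply without checking each change.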


  If you use Openbook, you may want to remove a few of the corrections in its 
default list. I regularly find these in books scanned in Openbook and have to 
fix them as I read.

  modem for modern
  tom for torn
  glock for clock
  morn for mom
  bum for burn
  com for corn

  That last one causes problems for anyone scanning Star Trek books because 
Kirk presses his corn badge to talk to the ship. (grin) If a word like command 
is hyphenated between two pages, you get corn-mand. Meanwhile, Batman dials 
into the internet with his modern, tries to stop a crook named torn from 
shooting him with a clock, and puts the dirty burn in cuffs until mom-ing. See 
how auto-corrections can go wrong if you're not careful?
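
  One way to be careful is to check a correction list for rules that would rewrite a real word. The sketch below does that. The rule list follows the Batman and Star Trek examples above, and the small set of real words stands in for a proper dictionary. Both are illustrative, and the function name is made up.

```python
# Word that gets replaced -> what it gets replaced with.
RULES = {
    "modem": "modern",
    "tom": "torn",
    "glock": "clock",
    "morn": "mom",
    "bum": "burn",
    "com": "corn",
    "dirough": "through",
}

# Stand-in for a real dictionary plus names and jargon that show up in books.
REAL_WORDS = {"modem", "tom", "glock", "morn", "bum", "com"}


def risky_rules(rules, real_words):
    """Return the rule words that are legitimate words and should not be auto-corrected."""
    return sorted(wrong for wrong in rules if wrong.lower() in real_words)


if __name__ == "__main__":
    for wrong in risky_rules(RULES, REAL_WORDS):
        print(wrong, "->", RULES[wrong])
```

Only dirough survives the check, since it is the only entry that could never be a word the author meant to write.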

  Whew! We've made it to the end. (grin) I hope some of this makes your scans 
easier to work with. It'll give you a foundation to start from anyhow. Clean-up 
tips will be another email and will take some thought. I'm better at doing than 
explaining things. I do have a system I use though. I just haven't really 
written it down. Anyone got a cold Dr. Pepper to share?


-- 
Monica Willyard
