[bksvol-discuss] How to Effectively Scan a Book

  • From: "Lisa Hall" <lhall10@xxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>, "Vivian Seki" <vootsa@xxxxxxxxxxxxx>, "Sharon Dresser" <sdresser@xxxxxxxx>, "Catherine Grinda" <cgrinda@xxxxxxxx>
  • Date: Mon, 24 Oct 2005 20:02:33 -0500

Hi everyone,

 

Here is an article I found that was written a few years ago on the ESight
web site. The original page where the article was found is located at 

 

http://www.esight.org/View.cfm?x=773

 

How to Effectively Scan a Book

 

By: Kelly Pierce

 

Here's a bevy of tricks, tips and caveats about how to scan a book or other
document with the greatest accuracy. Included are suggestions about scanner
settings, document preparation, and enhancing software performance.

The Scanning Process

Scanning

Processing

Error Correction


 

From: Nan - eSight - Friday, June 21, 2002

 

One of the most common questions you may encounter as a blind computer
user is how to get a high quality scan of a book or document. You may have
been lured into believing that the results of scanning tests between the
Open Book scanning package produced by Freedom Scientific and the Kurzweil
1000 produced by Kurzweil Educational Systems are all you'll likely get.

 

In fact, you may find yourself imitating the reviewers in the articles about
scanning. How often have you, as they sometimes suggest, simply selected the
normal default setting in your software and accepted the results as the kind
of access that can be obtained?

 

Actually, you can take a number of steps to improve your resulting scans. To
improve scanning results, however, it is important to understand the
scanning process.

 

The Scanning Process

 

The process of taking a printed book or document and turning it into a
computer file that can be read out loud to a blind person consists of these
three parts:

 

. Scanning

. Processing

. Correcting

 

An image of the document is captured with a scanner using very bright light.
This image is then processed through an optical character recognition (OCR)
program. The resulting computer file is then put through a spell checker to
correct scanning errors.

 

I will offer tips and suggestions for each part of this scanning process.

 


 

Scanning

 

The scan is the image or picture taken of the book or document. Any
improvements of the source image will improve the accuracy of the end
product. It all begins by choosing a good scanner. In recent years, scanning
software has come to support scores and possibly hundreds of different
scanners. The scanners themselves have drastically dropped in price. Some
cost less than $100.

 

There are so many choices in scanners, and they can vary widely in quality
and purpose. Not all produce good results for scanning text and OCR. Be sure
to check web sites with extensive hardware reviews, such as Cnet.com, for an
honest opinion of the scanner you are considering as a purchase.

 

The key item is the resolution of the scanned image. The more dots per inch
(dpi) the better -- although, after a certain point (usually 600 dpi), added
resolution doesn't dramatically improve the optical character recognition
(OCR) of the image.

 

However, the scanners with the best reviews are typically those with high
dpi. This is not the main reason why they were rated so highly. Instead, the
scanners themselves were designed and manufactured better, and part of the
design includes a high dpi (among other factors). The highly rated scanners
are not of the $70 variety, but spending a little extra money on the scanner
purchase will typically deliver a better scanner.

 

Regardless of price, consider the scanner's purpose. Some scanners are
specifically designed for photographs and graphics instead of general
purpose scanning. These scanners do not deliver optimum results for the OCR
process. Similarly, some scanners are designed to be lightweight and
portable. They may be excellent tools, if you are on the road, but realize
that they may not be as solidly built with high quality parts and assemblies
as the larger scanners that will sit on a desk. Performance may suffer as a
result. Read product information carefully to see if the scanner's purpose
fits your need for desktop scanning of text.

 

When you have your scanner and its drivers installed, check the resolution
setting. With some scanners, graphic resolution of 400 dpi yields slightly
better results than the default 300 dpi setting. The tradeoff is the
additional time needed to both scan and process the image into text. I
changed this setting on my Epson 1640 and found noticeably better results.
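
To get a rough feel for that tradeoff, the short Python sketch below
compares how many pixels the scanner has to capture at each resolution. It
assumes a US letter page (8.5 by 11 inches), which is not stated in the
article, and the function name pixel_count is only illustrative. Actual scan
and recognition times depend on the scanner and software, so treat the ratio
as a rough indicator rather than a timing prediction.

```python
def pixel_count(width_in, height_in, dpi):
    """Number of pixels in a scan of the given page size at the given dpi."""
    return round(width_in * dpi) * round(height_in * dpi)

# Assumption: a US letter page, 8.5 x 11 inches.
pixels_300 = pixel_count(8.5, 11, 300)
pixels_400 = pixel_count(8.5, 11, 400)

# Ratio of image data the software has to handle at 400 dpi versus 300 dpi.
ratio = pixels_400 / pixels_300

print(pixels_300, pixels_400, round(ratio, 2))
```

Going from 300 to 400 dpi raises the resolution by a third, but the image
holds roughly 1.78 times as many pixels, because the increase applies in both
directions across the page.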

 

An item frequently overlooked in improving scanning quality is cleaning the
glass scanning bed. It is easy for a blind person to forget how readily ink
from newsprint, magazines, photocopies, and other sources comes off,
resulting in dirty hands. It is not uncommon for fingers to be visibly
darkened after reading the Sunday newspaper. If all of this is so, you can
only imagine the grime on the scanning glass. Regular cleaning is important.

 

To clean the scanning glass, use mild soap or a glass cleaner like Windex.
Put it on a soft cloth; do not pour it on the glass. Pouring it on the glass
could get the "inch" ruler scales wet. These are the X and Y axes on the
sides of the scanner. Some scanners mark these scales tactilely. The bottom
of the rulers is often used for the scanner's internal calibration area
before beginning a scan. Don't do anything foolish or excessive in that area.

 

Rubbing alcohol should not be used as a cleaner because it is often impure.
Glass cleaners clean well, but you must be sure to remove all of it or else
it will leave a film. This film is not noticed on windows or mirrors, but
the scanner's bright light causes the scans to show this film, leading to
degraded scans. A way to remove this film is to go back over the
Windex-cleaned glass with vinegar diluted with water on a wet cloth. The more
meticulous the cleaning the less likelihood that a film will be left behind.
Windex works, but persistence may sometimes be needed to make sure that all
has been removed.

 

Before cleaning, be sure to check the scanner's manual about recommended
cleaning. Some scanners use a plastic or non-glass scanning bed and certain
cleaning agents, including vinegar, may damage these surfaces.

 

The most important preventive care for the glass scanning bed is avoiding
scratches. Common advice is to remove all staples and paperclips before
scanning. Less well known is that paper towels can make fine scratches on
optical surfaces. For example, camera owners would never consider using
paper towels on a camera lens. They use a soft cloth instead.

 

The grit and dirt on the glass can also cause fine scratches. This is why
the scanning glass should be wiped instead of scrubbed. A good scrubbing
grinds the grit into the glass in addition to removing it.

 

Pay attention to the cloth used. Many can leave little specks of lint
behind. Some suggest using an old clean diaper or an old clean tee shirt for
minimal lint.

 

Documents with the highest contrast scan best. This is why documents printed
with shades of gray or on colored paper scan poorly or not at all. Running
the document through a black and white photocopier can make an unconvertible
document readable. The document is now on white paper and, typically, the
contrast has been sharpened.

 

The brightness or contrast of the scanned image significantly affects
scanning quality. With document scanning, the brightness or contrast setting
of your scanner darkens or lightens the text on the page. If the image (that
is, the text) is too dark, the OCR software will misinterpret open letter
forms so that an "F" will be interpreted as a "P" and a lower case "H" may be
interpreted as a lower case "B," for example. If the document is scanned too
light, letterforms may be broken so an upper case "B" may be interpreted as
an upper case "E."

 

The default for scanning packages is a normal scan, which is usually at 50
percent brightness or contrast. Depending on the scanner, print quality, and
OCR engine, "automatic thresholding" (also called "automatic brightness
control" and "automatic contrast") may need to be "on" or "off." People have
reported that they usually gain better results when this setting is turned
"on."

 

With the normal setting, contrast/brightness is set at a fixed point,
usually 50 percent -- with a value of 0 being the lightest setting and a
value of 100 the darkest. With automatic contrast, the scanning system takes
an educated guess at the best scanning setting, typically between the values
of 40 and 60.

 

While automatic contrast/brightness may be better than the normal default
setting, this usually only works well for scanning individual items of
unknown origin and quality, such as mail and meeting handouts. The best
results for scanning longer documents, such as books and reports, occur
after you have customized the scanning software to the scanner and the
document to be scanned.

 

To do this, first choose the scanning engine. Open Book uses three scanning
engines, and the Kurzweil 1000 uses two scanning engines. In most cases, the
FineReader engine delivers the best results. Verify this yourself with a
sample page from a book and scan the page with each engine. Run each page
through a spell checker and count the errors on each page. Also, consider
whether the misspelled word is recognizable enough to be corrected. Or is it
not recognizable, even in context, so it cannot be corrected without
actually looking at the printed text?
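
If you save each engine's output for the sample page as plain text, the
counting can be done with a small script instead of stepping through a spell
checker by hand. The Python sketch below is only an illustration: the names
count_spelling_errors and best_engine, the engine labels, and the tiny word
list are all made up for the example, and a real comparison would need a
full word list. It also cannot judge whether a misspelled word is still
recognizable, so that part remains a matter of listening to the page.

```python
import re

def count_spelling_errors(text, dictionary):
    """Count words in text that are not in dictionary (a set of lowercase words)."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(1 for w in words if w.lower().strip("'") not in dictionary)

def best_engine(results, dictionary):
    """Return the engine label whose recognized text has the fewest unknown words."""
    return min(results, key=lambda name: count_spelling_errors(results[name], dictionary))

# Hypothetical word list and hypothetical output from two engines.
dictionary = {"the", "quick", "brown", "fox"}
results = {
    "engine_a": "The quick brown fox",
    "engine_b": "Tbe qu1ck brown fox",
}

for name, text in results.items():
    print(name, count_spelling_errors(text, dictionary))
print("fewest errors:", best_engine(results, dictionary))
```

Note that a digit inside a word splits it into two fragments here, so one bad
word can count as two errors. That is acceptable for comparing engines on the
same page, as long as the same counting rule is used for each.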

 

Once you have found the best scanning engine, using the normal or automatic
brightness/contrast setting, switch the setting to custom
brightness/contrast. Start at value 50. Spell check the page, count the
errors, and evaluate the overall quality of the document. Next, increase the
contrast/brightness value. Many find it best to go in increments of five.
Then spell check the page and count the errors as before. Determine if the
resulting page has more or fewer errors than the previous page. If there are
fewer errors, then increase the value again by another five until an
increment of five results in more errors.

 

When a scanned page has more errors than the page from the previous setting,
go back in steps of one until you reach two adjacent values that make little
difference in the error count. If you found more errors by increasing the
value by five from 50, use the same process as above, except decrease in
steps of five. You have now found the correct brightness. This is the ideal
setting for the scanner and the document. Be sure to write the value down
and save it on your system.
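
The search just described can be written down as a short procedure. The
Python sketch below is one possible reading of it, not part of any scanning
package. The function calibrate_brightness is hypothetical, and the
count_errors function passed to it stands in for the manual work of scanning
the sample page at a given setting and counting the spell checker's errors.
As a simplification, the fine pass checks every single step within four of
the best coarse value rather than stopping as soon as two neighboring values
look alike.

```python
def calibrate_brightness(count_errors, start=50, coarse=5, low=0, high=100):
    """Find the brightness value with the fewest errors near the start value.

    count_errors(value) is assumed to scan the sample page at that setting
    and return the number of spell-check errors.
    """
    cache = {}

    def errors_at(value):
        # Each new value means one real scan, so never repeat a setting.
        if value not in cache:
            cache[value] = count_errors(value)
        return cache[value]

    best, best_errors = start, errors_at(start)

    # Coarse pass: step up by five while errors fall; if the first step up
    # is no better, step down by five instead.
    for direction in (+1, -1):
        moved = False
        value = best + direction * coarse
        while low <= value <= high:
            if errors_at(value) >= best_errors:
                break
            best, best_errors, moved = value, errors_at(value), True
            value += direction * coarse
        if moved:
            break

    # Fine pass: try single steps on both sides of the best coarse value.
    for value in range(max(low, best - coarse + 1), min(high, best + coarse - 1) + 1):
        if errors_at(value) < best_errors:
            best, best_errors = value, errors_at(value)

    return best, best_errors

# Made-up error curve with its minimum at 63, for demonstration only.
print(calibrate_brightness(lambda b: abs(b - 63) + 2))
```

With the made-up curve above, the coarse pass tries 50, 55, 60, 65, and 70,
and the fine pass then settles on 63. To base the setting on several sample
pages, count_errors can return the total errors across all of them.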

 

For the most part, many other similar documents will be either at or very
near this setting. End users have reported settings ranging anywhere from 50
to 76 percent brightness/contrast. There is no magic number that delivers
the best results. The settings will vary with each brand and model. The
setting should be re-calibrated for each book for best results, and several
pages in different parts of the book should be evaluated to determine an
overall setting because print and paper quality may be uneven.

 

Once optimal results have been reached for one variable or feature, repeat
the process of scanning the sample page, counting errors with a spell
checker, and comparing results for the next one. I have mentioned a number
of variables in this article. Change and optimize one setting at a time and
see how it affects the resulting recognized page before starting to work on
the next feature and changing the next setting. Doing otherwise will leave
you unsure what features and settings should be optimized for your book and
scanner.

 

To make scanning go twice as fast, use "two-page mode" when scanning books.
Both the Open Book and K1000 support this feature, which lets you scan two
pages at a time and ensures that each physical page of the book is stored in
its own logical page.

 

For additional efficiency, use the "continuous scan" or "express batch"
feature. This continuously scans pages with no processing and eliminates the
need to press the scan key for each page. I usually set a time interval of
20 to 25 seconds to ensure the page is on the scanner at a full 90-degree
angle and is fully flat on the glass. If you scan regularly, you can likely
shorten the time interval to a more aggressive value.
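
To see what these two features mean for a whole book, the small Python
sketch below estimates total scanning time. The 300-page book is an invented
example, the function name scanning_minutes is hypothetical, and the
estimate assumes every scan takes exactly one time interval, which ignores
page turning problems and rescans.

```python
import math

def scanning_minutes(page_count, seconds_per_scan=25, two_page_mode=True):
    """Estimate minutes of continuous scanning for a book of page_count pages."""
    pages_per_scan = 2 if two_page_mode else 1
    scans = math.ceil(page_count / pages_per_scan)
    return scans * seconds_per_scan / 60

# Assumed example: a 300-page book at a 25-second interval.
print(scanning_minutes(300, two_page_mode=False))  # one page per scan
print(scanning_minutes(300, two_page_mode=True))   # two pages per scan
```

Under these assumptions the book takes a little over two hours one page at a
time and about an hour in two-page mode, before any recognition work begins.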

 

The importance of scanning books and documents flat against the glass and
straight against the edges of the scanner cannot be emphasized enough.
Wrinkled pages may need to be flattened with a warm iron. Be sure that the
text of the page can fit flat on the glass. Open the book very wide and make
sure the entire page is flat on the glass, pressing firmly on the book.

 

Some books, particularly textbooks, have gutters (the inner margins next to
the book's spine) very close to the binding, making it very difficult to
scan the pages. If this is so, it may be necessary to unbind the book so the
pages are separated and then can lie flat on the scanning glass.

 

People use a number of methods to do this -- from employing table saws to
using sharpened scissors to cut groups of pages from the book very close to
the binding. Be certain to keep pages in order and to trim off any rough
edges that would not allow pages to lie straight and flat.

 

Yes, today's scanning software has a default "de-skewing" feature that
corrects crooked pages, but this is not a license for a lazy and sloppy
scanning job. Spending a few extra seconds on each page to align it properly
will allow the software to correct the inevitable imperfections rather than
attempt to compensate for a bad scanning job.

 

Before scanning, evaluate the paper quality of the book. With very thin
paper, the bright scanner light can cause bleed through of characters on the
opposite side of the page. If the book is unbound and in single sheets, scan
each page with the lid closed. If the inside of the lid is white, also try
taping a piece of black paper to the lid so the light is absorbed and not
reflected. If the pages are attached to the book, try placing a black piece
of paper, dark cardboard, or a dark, durable paper behind the page.

 


 

Processing

 

The conversion or recognition process occurs after the document has been
scanned. Many feel that little can be done about this stage except to buy
one scanning system or another. To a certain extent, this is true. Both Open
Book and the Kurzweil 1000 primarily use the same underlying OCR software,
FineReader. This program is also sold as a stand-alone product and may be a
scanning solution for blind computer users extremely comfortable with
Windows.

 

The clearly noticeable differences in performance stem from using different
versions of the OCR software. As I write this in the spring of 2002,
Kurzweil 1000 uses FineReader version five and Open Book uses version four.
However, the latest FineReader version is six, which will be available to
developers shortly and hopefully included in product upgrades. Be sure to
look "under the hood" of your scanning software to learn what version of the
scanning engine your software is using.

 

In many cases, "speckle removal" improves recognition quality; sometimes, it
makes it worse. If this feature isn't turned on, enable it and compare the
results.

 

To automatically correct common recognition errors, activate the "automatic
correction" setting. This is similar to "auto correct" in word processors,
but its dictionary of misspelled words and their replacements consists of
the errors typically generated through the character recognition process.

 


 

Error Correction

 

All scanned text contains some errors -- often not many with today's
technology. But they are in the text nonetheless. To find these errors, you
need to run your document through a spell checker.

 

However, before proofreading with a spell checker, search for characters
that are typically not found in a book (such as special symbols found in
"shifted states" on the keyboard as well as "bullet" and "tab"). It is best
to remove these characters manually instead of with global "search and
replace." An added benefit to this approach is that you sometimes find
patterns to add to the automatic correction list.
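
If the recognized text has been saved as a plain text file, a short script
can list where such characters occur so that each one can be reviewed by
hand. The Python sketch below is only an illustration. The SUSPECT set is an
assumed starting list that you would adjust for your own book (a book about
computers may legitimately contain many of these symbols), and the function
name find_suspect_characters is hypothetical. It reports locations and does
not change the text, in keeping with the advice to avoid global replacement.

```python
# Assumed list of characters that rarely belong in ordinary book text:
# shifted keyboard symbols, the tab character, and the bullet.
SUSPECT = set("~`@#$%^&*_+=|\\<>{}[]\t\u2022")

def find_suspect_characters(text):
    """Return (line number, column, character) for each suspect character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT:
                hits.append((line_no, col, ch))
    return hits

# Invented sample of recognized text.
sample = "Chapter 1\nThe c@t sat.\n\u2022 on the mat"
for line_no, col, ch in find_suspect_characters(sample):
    print("line", line_no, "column", col, repr(ch))
```

Reviewing the reported lines one at a time also makes it easier to notice
recurring mistakes, such as "@" standing in for "a," that are worth adding
to the automatic correction list.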

 

During error correction with the spell checker, you will find obvious
entries to add to the auto correction list. This will improve results and
efficiency over time. Many entries, though, will not be so obvious. Take a
conservative and careful approach. If you permanently replace "he" with
"be," you will have a real mess on your hands and might spend much time in
the future fixing this preventable disaster.
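
One way to stay conservative is to accept a correction only when the
misrecognized form is not itself a real word, and to replace whole words
only. The Python sketch below illustrates that rule. It is not how Open Book
or the Kurzweil 1000 store their correction lists, and the names safe_to_add
and apply_corrections, along with the tiny word list, are hypothetical.

```python
import re

def safe_to_add(wrong, dictionary):
    """A correction is only safe if the misrecognized form is not a real word."""
    return wrong.lower() not in dictionary

def apply_corrections(text, corrections):
    """Replace whole words only, so 'tbe' is fixed but 'tbeory' is left alone."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, lambda match, r=right: r, text)
    return text

# Hypothetical word list for the example.
dictionary = {"he", "be", "the"}

print(safe_to_add("he", dictionary))   # "he" is a real word: do not add it
print(safe_to_add("tbe", dictionary))  # "tbe" is not a word: reasonable to add
print(apply_corrections("tbe cat saw tbe tbeory", {"tbe": "the"}))
```

The first check is what rules out the "he" to "be" entry: both are valid
words, so no automatic rule can tell which one the page really contained.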

 

Resources

 

Looking for more tips, tricks, and ideas to improve the quality of the books
you scan? Check these two newsgroups on Usenet:

 

comp.periphs.scanners

 

alt.comp.periphs.scanner

 

Both address scanner types, selection, hardware and software installation,
applications, and problems.

 

A book is available that is more comprehensive than this article. It is:
"OCR With a Smile!: An Operator's Guide to Optical Character Recognition,"
Fred F. Ross, paperback (July 1998), House of Scanning; ISBN: 0966590406

 

This book is about optical character recognition (OCR) systems, how to
operate them for maximum throughput and how to achieve the highest possible
accuracy in the process. This comprehensive reference gives you a clear idea
of what must be done and how to start doing it when no software manual
addresses the specific problem.

 

The information and tips given in this book will save you money and
countless hours of editing agony. The book addresses the true fundamentals
for operating an OCR system and offers valuable information about accuracy
enhancements, document preparation, scanner settings, error classification,
tips, tricks, and caveats regardless of the computer platform, operating
system, or software application you use. You'll learn how to analyze
documents, scan at optimum speed, and handle the cleanup process in the most
efficient manner possible.

 

Unfortunately, this book is not available in alternative format, but, if you
have access to a scanning system, you can create your own accessible
version. Just scan it, using the suggestions listed above. The major online
booksellers have new and used versions for sale.

 

 

 

Lisa Hall,

Former Consultant for Adaptive Technology for Northwest Vista College, a
college of the Alamo Community College District. 

Web page: http://home.satx.rr.com/lisahall

Phone: (210) 829-4571

E-mail and MSN I.D.: lhall10@xxxxxxxxxxx

 

 

 

 

 
