Hi everyone,

Here is an article I found that was written a few years ago on the ESight web site. The original page where the article was found is located at http://www.esight.org/View.cfm?x=773

How to Effectively Scan a Book
By: Kelly Pierce

Here's a bevy of tricks, tips and caveats about how to scan a book or other document with the greatest accuracy. Included are suggestions about scanner settings, document preparation, and enhancing software performance.

One of the most common questions you may encounter as a blind computer user is how to get a high-quality scan of a book or document. You may have been lured into believing that the results of scanning tests between the Open Book scanning packages produced by Freedom Scientific and the Kurzweil 1000 produced by Kurzweil Educational Systems are all you'll likely get. In fact, you may find yourself imitating the reviewers in the articles about scanning. How often have you, as they sometimes suggest, simply selected the normal default setting in your software and accepted the results as the best access that can be obtained?

Actually, you can take a number of steps to improve your resulting scans. To improve scanning results, however, it is important to understand the scanning process.

The Scanning Process

The process of taking a printed book or document and turning it into a computer file that can be read out loud to a blind person consists of these three parts:

1. Scanning
2. Processing
3. Correcting

An image of the document is captured with a scanner using very bright light. This image is then processed through an optical character recognition (OCR) program. The resulting computer file is then put through a spell checker to correct scanning errors.
I will offer tips and suggestions for each part of this scanning process.

Scanning

The scan is the image or picture taken of the book or document. Any improvement of the source image will improve the accuracy of the end product.

It all begins by choosing a good scanner. In recent years, scanning software has come to support scores and possibly hundreds of different scanners. The scanners themselves have drastically dropped in price. Some cost less than $100. There are so many choices in scanners, and they can vary widely in quality and purpose. Not all produce good results for scanning text and OCR. Be sure to check web sites with extensive hardware reviews, such as Cnet.com, for an honest opinion of the scanner you are considering as a purchase.

The key item is the resolution of the scanned image. The more dots per inch (dpi) the better -- although, after a certain point (usually 600 dpi), added resolution doesn't dramatically improve the optical character recognition (OCR) of the image. The scanners with the best reviews are typically those with high dpi, but this is not the main reason why they were rated so highly. Instead, the scanners themselves were designed and manufactured better, and part of the design includes a high dpi (among other factors). The highly rated scanners are not of the $70 variety, but a little extra money on the scanner purchase will typically deliver a better scanner.

Regardless of price, consider the scanner's purpose. Some scanners are specifically designed for photographs and graphics instead of general purpose scanning. These scanners do not deliver optimum results for the OCR process. Similarly, some scanners are designed to be lightweight and portable. They may be excellent tools if you are on the road, but realize that they may not be as solidly built with high quality parts and assemblies as the larger scanners that will sit on a desk. Performance may suffer as a result.
Read product information carefully to see if the scanner's purpose fits your need for desktop scanning of text.

When you have your scanner and its drivers installed, check the resolution setting. With some scanners, a graphic resolution of 400 dpi yields slightly better results than the default 300 dpi setting. The tradeoff is the additional time needed to both scan and process the image into text. I changed this setting on my Epson 1640 and found noticeably better results.

An item frequently overlooked in improving scanning quality is cleaning the glass scanning bed. It is easy for a blind person to forget how readily ink from newsprint, magazines, photocopies, and other sources comes off, resulting in dirty hands. It is not uncommon for fingers to be visibly darkened after reading the Sunday newspaper. If all of this is so, you can only imagine the grime on the scanning glass. Regular cleaning is important.

To clean the scanning glass, use mild soap or a glass cleaner like Windex. Put it on a soft cloth; do not pour it on the glass. Pouring it on the glass could get the "inch" ruler scales wet. These are the X and Y axes on the sides of the scanner. Some scanners mark these scales tactilely. The area at the bottom of the rulers is often used as the scanner's internal calibration area before beginning a scan. Don't do anything foolish or excessive in that area. Rubbing alcohol should not be used as a cleaner because it is often impure.

Glass cleaners clean well, but you must be sure to remove all of the cleaner or else it will leave a film. This film is not noticed on windows or mirrors, but the scanner's bright light causes the scans to show this film, leading to degraded scans. A way to remove this film is to go back over the Windex-cleaned glass with vinegar diluted with water on a wet cloth. The more meticulous the cleaning, the less likely it is that a film will be left behind. Windex works, but persistence may sometimes be needed to make sure that all of it has been removed.
Before cleaning, be sure to check the scanner's manual about recommended cleaning. Some scanners use a plastic or non-glass scanning bed and certain cleaning agents, including vinegar, may damage these surfaces.

The most important preventive care for the glass scanning bed is avoiding scratches. Common advice is to remove all staples and paperclips before scanning. Less well known is that paper towels can make fine scratches on optical surfaces. For example, camera owners would never consider using paper towels on a camera lens. They use a soft cloth instead. The grit and dirt on the glass can also cause fine scratches. This is why the scanning glass should be wiped instead of scrubbed. A good scrubbing grinds the grit into the glass in addition to removing it. Pay attention to the cloth used. Many can leave little specks of lint behind. Some suggest using an old clean diaper or an old clean tee shirt for minimal lint.

Documents with the highest contrast scan best. This is why documents printed with shades of gray or on colored paper scan poorly or not at all. Running the document through a black and white photocopier can make an unconvertible document readable. The document is now on white paper and, typically, the contrast has been sharpened.

The brightness or contrast of the scanned image significantly affects scanning quality. With document scanning, the brightness or contrast setting of your scanner darkens or lightens the text on the page. If the image (that is, the text) is too dark, the OCR software will misinterpret open letter forms so that an "F" will be interpreted as a "P" and a lower case "H" may be interpreted as a lower case "B," for example. If the document is scanned too light, letterforms may be broken so an upper case "B" may be interpreted as an upper case "E." The default for scanning packages is a normal scan, which is usually at 50 percent brightness or contrast.
Depending on the scanner, print quality, and OCR engine, "automatic thresholding" (also called "automatic brightness control" and "automatic contrast") may need to be "on" or "off." People have reported that they usually gain better results when this setting is turned "on." With the normal setting, contrast/brightness is set at a fixed point, usually 50 percent -- with a value of 0 being the lightest setting and a value of 100 the darkest. With automatic contrast, the scanning system takes an educated guess at the best scanning setting, typically between the values of 40 and 60.

While automatic contrast/brightness may be better than the normal default setting, it usually only works well for scanning individual items of unknown origin and quality, such as mail and meeting handouts. The best results for scanning longer documents, such as books and reports, occur after you have customized the scanning software to the scanner and the document to be scanned.

To do this, first choose the scanning engine. Open Book uses three scanning engines, and the Kurzweil 1000 uses two scanning engines. In most cases, the FineReader engine delivers the best results. Verify this yourself: scan a sample page from a book with each engine, run each result through a spell checker, and count the errors on each page. Also, consider whether each misspelled word is recognizable enough to be corrected, or whether it is so unrecognizable, even in context, that it cannot be corrected without actually looking at the printed text.

Once you have found the best scanning engine using the normal or automatic brightness/contrast setting, switch the setting to custom brightness/contrast. Start at value 50. Spell check the page, count the errors, and evaluate the overall quality of the document. Next, increase the contrast/brightness value. Many find it best to go in increments of five. Then spell check the page and count the errors as before.
Determine whether the resulting page has more or fewer errors than the previous page. If there are fewer errors, increase the value by another five, and keep going until an increment of five results in more errors. When a scanned page has more errors than the page from the previous value, go back in steps of one until you reach two neighboring values where moving up or down makes little difference. If you found more errors by increasing the value by five from 50, use the same process as above, except decrease in steps of five.

You have now found the correct brightness. This is the ideal setting for the scanner and the document. Be sure to write the value down and save it on your system. For the most part, many other similar documents will be either at or very near this setting. End users have reported settings ranging anywhere from 50 to 76 percent brightness/contrast. There is no magic number that delivers the best results. The settings will vary with each brand and model. The setting should be re-calibrated for each book for best results, and several pages in different parts of the book should be evaluated to determine an overall setting because print and paper quality may be uneven.

Once optimal results have been reached for one variable or feature, repeat the process of scanning the sample page, counting errors with a spell checker, and comparing results for the next one. I have mentioned a number of variables in this article. Change and optimize one setting at a time, and see how it affects the resulting recognized page before starting to work on the next feature and changing the next setting. Doing otherwise will leave you unsure which features and settings should be optimized for your book and scanner.

To make scanning go about twice as fast, use "two-page mode" when scanning books. Both Open Book and the Kurzweil 1000 support this feature, which lets you scan two pages at a time and ensures that each physical page of the book is stored in its own logical page.
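
The brightness calibration described above (coarse steps of five, then fine steps of one) can be thought of as a small search routine. What follows is only a rough sketch in Python, not something built into Open Book or the Kurzweil 1000. The function `count_errors` is hypothetical: you would supply it yourself, and it stands for "scan the sample page at this brightness, spell check it, and return the number of errors." The sketch also assumes the error count falls toward one best value and rises again after it, which real pages may not do so neatly.

```python
def calibrate_brightness(count_errors, start=50, coarse=5, low=0, high=100):
    """Return the brightness value that gave the fewest spell-check errors.

    count_errors is a hypothetical callable supplied by the user: it takes
    a brightness value (0-100) and returns the error count for a test scan.
    """
    cache = {}

    def errors(value):
        # Every new value means another scan, so remember earlier results.
        if value not in cache:
            cache[value] = count_errors(value)
        return cache[value]

    best = start

    # Decide the coarse direction: try five higher first, then five lower.
    direction = 0
    for step in (coarse, -coarse):
        candidate = best + step
        if low <= candidate <= high and errors(candidate) < errors(best):
            direction = step
            break

    # Coarse pass: keep moving in steps of five while errors keep falling.
    while direction:
        candidate = best + direction
        if not (low <= candidate <= high) or errors(candidate) >= errors(best):
            break
        best = candidate

    # Fine pass: move in steps of one, in either direction, while it helps.
    for step in (1, -1):
        while low <= best + step <= high and errors(best + step) < errors(best):
            best += step

    return best


def simulated_errors(value):
    # Stand-in for a real scan: pretends the page reads best at 63.
    return abs(value - 63) + 2


print(calibrate_brightness(simulated_errors))  # prints 63
```

With a real scanner, each call to `count_errors` costs a full scan and spell check, which is why the sketch remembers results it has already seen. As the article advises, repeat the measurement on several pages from different parts of the book before settling on a value.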
For additional efficiency, use the "continuous scan" or "express batch" feature. This continuously scans pages with no processing and eliminates the need to press the scan key for each page. I usually set a time interval of 20 to 25 seconds to ensure the page is on the scanner at a full 90-degree angle and is fully flat on the glass. If you scan regularly, you can likely shorten the time interval to a more aggressive value.

The importance of scanning books and documents flat against the glass and straight against the edges of the scanner cannot be emphasized enough. Wrinkled pages may need to be flattened with a warm iron. Be sure that the text of the page can fit flat on the glass. Open the book very wide and make sure the entire page is flat on the glass, pressing firmly on the book.

Some books, particularly textbooks, have gutters (the inner margins next to the book's spine) very close to the binding, making it very difficult to scan the pages. If this is so, it may be necessary to unbind the book so the pages are separated and can then lie flat on the scanning glass. People use a number of methods to do this -- from employing table saws to using sharpened scissors to cut groups of pages from the book very close to the binding. Be certain to keep pages in order and to trim off any rough edges that would not allow pages to lie straight and flat.

Yes, today's scanning software has a default "de-skewing" feature that corrects crooked pages, but this is not a license for a lazy and sloppy scanning job. Spending a few extra seconds on each page to align it properly will allow the software to correct the inevitable imperfections rather than attempt to compensate for a bad scanning job.

Before scanning, evaluate the paper quality of the book. With very thin paper, the bright scanner light can cause bleed-through of characters on the opposite side of the page. If the book is unbound and in single sheets, scan each page with the lid closed.
If the inside of the lid is white, also try taping a piece of black paper to the lid so the light is absorbed and not reflected. If the pages are attached to the book, try placing a black piece of paper, dark cardboard, or a dark, durable paper behind the page.

Processing

The conversion or recognition process occurs after the document has been scanned. Many feel that little can be done about this stage except to buy one scanning system or another. To a certain extent, this is true. Both Open Book and the Kurzweil 1000 primarily use the same underlying OCR software, FineReader. This program is also sold as a stand-alone product and may be a scanning solution for blind computer users extremely comfortable with Windows.

The clearly noticeable differences in performance stem from using different versions of the OCR software. As I write this in the spring of 2002, Kurzweil 1000 uses FineReader version five and Open Book uses version four. However, the latest FineReader version is six, which will be available to developers shortly and hopefully included in product upgrades. Be sure to look "under the hood" of your scanning software to learn what version of the scanning engine your software is using.

In many cases, "speckle removal" improves recognition quality; sometimes, it makes it worse. If this feature isn't turned on, enable it and compare the results.

To automatically correct common recognition errors, activate the "automatic correction" setting. This is similar to "auto correct" in word processors, but its dictionary of misspelled words and their replacements consists of the errors typically generated by the character recognition process.

Error Correction

All scanned text contains some errors -- often not many with today's technology. But they are in the text nonetheless. To find these errors, you need to run your document through a spell checker.
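
For readers who like to see the idea in code, counting likely recognition errors on a page can be approximated by checking each word against a word list. This is only a rough sketch in Python and is not how the spell checkers in Open Book or the Kurzweil 1000 work. `KNOWN_WORDS` is a tiny made-up list for illustration, and `find_suspect_words` is a hypothetical helper name; a real check would load a full dictionary.

```python
import re

# Hypothetical, tiny word list for illustration only; a real check would
# load a full dictionary file instead.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat"}


def find_suspect_words(text, known_words=KNOWN_WORDS):
    """Return the words in recognized text that are missing from the word list."""
    # Words are runs of letters, optionally joined by apostrophes (e.g. "don't").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return [word for word in words if word.lower() not in known_words]


page = "The cat sat on tbe mat."
suspects = find_suspect_words(page)
print(len(suspects), suspects)  # prints: 1 ['tbe']
```

The absolute count matters less than the comparison: run the same sample page through two engines or two brightness settings and see which yields fewer suspect words. A word-list check cannot catch a recognition error that happens to produce another real word.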
Before proofreading with a spell checker, search for characters that are typically not found in a book (such as special symbols found in "shifted states" on the keyboard as well as "bullet" and "tab"). It is best to remove these characters manually instead of with a global "search and replace." An added benefit of this approach is that you sometimes find patterns to add to the automatic correction list.

During error correction with the spell checker, you will find obvious entries to add to the auto correction list. This will improve results and efficiency over time. Many entries, though, will not be so obvious. Take a conservative and careful approach. If you permanently replace "he" with "be," you will have a real mess on your hands and might spend much time in the future fixing this preventable disaster.

Resources

Looking for more tips, tricks, and ideas to improve the quality of the books you scan? Check these two newsgroups on Usenet:

comp.periphs.scanners
alt.comp.periphs.scanner

Both address scanner types, selection, hardware and software installation, applications, and problems.

A book is available that is more comprehensive than this article. It is:

"OCR With a Smile!: An Operator's Guide to Optical Character Recognition," Fred F. Ross, paperback (July 1998), House of Scanning; ISBN: 0966590406

This book is about optical character recognition (OCR) systems, how to operate them for maximum throughput and how to achieve the highest possible accuracy in the process. This comprehensive reference gives you a clear idea of what must be done and how to start doing it when no software manual addresses the specific problem. The information and tips given in this book will save you money and countless hours of editing agony.
The book addresses the true fundamentals for operating an OCR system and offers valuable information about accuracy enhancements, document preparation, scanner settings, error classification, tips, tricks, and caveats regardless of the computer platform, operating system, or software application you use. You'll learn how to analyze documents, scan at optimum speed, and handle the cleanup process in the most efficient manner possible.

Unfortunately, this book is not available in an alternative format, but, if you have access to a scanning system, you can create your own accessible version. Just scan it, using the suggestions listed above. The major online booksellers have new and used copies for sale.

Lisa Hall, Former Consultant for Adaptive Technology for Northwest Vista College, a college of the Alamo Community College District.
Web page: http://home.satx.rr.com/lisahall
Phone: (210) 829-4571
E-mail and MSN I.D.: lhall10@xxxxxxxxxxx