This should be of some interest to many! Copied from another list.

Begin forwarded message:

Subject: [Promotion-technology] Fwd: Announcing PDF2OCR

Now available at
http://www.EmpowermentZone.com/pdf2ocr.zip

PDF2OCR 1.0
Released September 14, 2007
Public Domain by Jamal Mazrui

Following up on a tip from Ken Perry about the open source Tesseract-OCR project at Google, I have tried to use this OCR engine to build a free program for producing accessible text from an image-based PDF. Such files are created by scanning equipment or software printer drivers that save only the picture of text, without the actual characters themselves. This makes them inaccessible to most PDF viewing utilities, which extract text but do not perform OCR on images. I could not find an existing Windows solution on the web, but did get useful ideas from Linux-oriented ones.

What I am calling PDF2OCR combines Tesseract from
http://code.google.com/p/tesseract-ocr
with the GhostScript interpreter from
http://ghostscript.com

GhostScript creates a .tif file from the .pdf file of interest, and then Tesseract creates a .txt file from that. The current implementation is a batch file, pdf2ocr.bat, with the following syntax on the command line:

pdf2ocr SourceRootName

where SourceRootName is the name of a PDF file without the .pdf extension. This produces a text file with the same name except for a .txt extension. The PDF name can include a directory path, but not embedded spaces. For example,

pdf2ocr c:\temp\test

produces

c:\temp\test.txt

When complete, the batch file prints tesseract.log to the screen -- a file that is recreated for each conversion.

Installation consists of unzipping the pdf2ocr.zip archive to a target directory, e.g., to one called

C:\PDF2OCR

This directory contains the executable files, as well as three subdirectories with support files. The gsdata subdirectory contains many files I gathered from an installed GhostScript directory tree.
The tessdata subdirectory contains language support for Tesseract (I have only distributed English files, but other languages are available from the Google site). The misc subdirectory contains sample files, some source code, and this documentation.

A sample image-based PDF is named mlk.pdf -- the letter Martin Luther King, Jr. wrote from the Birmingham Jail. Another sample is debate.pdf -- the legal agreement between the Bush and Kerry campaigns concerning Presidential debates. Of the two commercial OCR programs I tested, Kurzweil 1000 and PDF Magic, each converted one of these files well but failed entirely on the other (a different one for each). Their results, as well as those of PDF2OCR, are provided in text files. Please understand that Tesseract is not the best OCR available, though it is generally considered the best free OCR at present.

In order to run the batch file from any directory, you can add the PDF2OCR directory to the path of a console session with a command like the following:

set path=c:\pdf2ocr;%path%

You can add the path for every console session via the Advanced tab page of the System applet in Control Panel.

To easily convert multiple PDFs in a directory, I have also created a utility called dir2ocr.exe. Simply pass the name of the directory to process as a parameter, e.g.,

dir2ocr c:\temp

If no parameter is passed, the current directory is assumed. Source code for this PowerBASIC program, which calls pdf2ocr.bat, is in the files dir2ocr.bas and fn.inc, located in the misc subdirectory.

The PDF2OCR download is large, about 14 megabytes as a compressed archive. Other techniques of getting text from a PDF should probably be tried first. When other tools do not work or are unavailable, however, I hope this helps to bridge an accessibility gap. Feel free to enhance it in the spirit of open source development!
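
For readers who want to experiment, here is a rough Python sketch of the two-step conversion and the directory loop described in this message. It is not the contents of pdf2ocr.bat or dir2ocr.bas, which are not reproduced here. The function names, the executable names gswin32c and tesseract, the tiffgray output device, and the 300 dpi resolution are all assumptions chosen for illustration; the settings the batch file actually uses may differ.

```python
import os
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tesseract_exe="tesseract",
                   device="tiffgray", dpi=300):
    """Build the two command lines that turn <root>.pdf into <root>.txt.

    The executable names, device, and resolution are illustrative
    assumptions, not values taken from pdf2ocr.bat.
    """
    pdf = source_root + ".pdf"
    tif = source_root + ".tif"
    # Step 1: GhostScript renders the PDF pages to a TIFF image.
    gs_cmd = [gs_exe, "-dNOPAUSE", "-dBATCH",
              "-sDEVICE=" + device, "-r" + str(dpi),
              "-sOutputFile=" + tif, pdf]
    # Step 2: Tesseract reads the TIFF and writes <root>.txt
    # (it appends the .txt extension to the output base name itself).
    tess_cmd = [tesseract_exe, tif, source_root]
    return gs_cmd, tess_cmd


def convert(source_root):
    """Run both steps in order; raises CalledProcessError if either fails."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)


def pdf_roots(directory="."):
    """List root names (path without .pdf) of the PDFs in a directory."""
    roots = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext.lower() == ".pdf":
            roots.append(os.path.join(directory, base))
    return roots
```

Under these assumptions, convert(r"c:\temp\test") would correspond to "pdf2ocr c:\temp\test", and looping convert over pdf_roots(r"c:\temp") would correspond to "dir2ocr c:\temp". Passing the arguments as a list rather than one command string also sidesteps the embedded-spaces limitation in this sketch, though that says nothing about the batch file itself.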

Jamal Mazrui
jamal@xxxxxxxxxxxxxxxxxxx