Introducing PDF2OCR and seeking testers

From: Jamal Mazrui <empower@xxxxxxxxx>
To: programmingblind@xxxxxxxxxxxxx
Date: Wed, 12 Sep 2007 19:03:33 -0400 (EDT)

Now available at
http://www.EmpowermentZone.com/pdf2ocr.zip

Following up on a tip from Ken Perry about the open source Tesseract
project at Google, I have tried to use this OCR engine to build a free
program for producing accessible text from an image-based PDF.  Such files
are created by scanning equipment or software printer drivers that save
only the picture of text, without the actual characters themselves.  This
makes their content inaccessible through most PDF viewing utilities, which
extract text but do not perform OCR on images.

I could not find an existing Windows solution on the web, but did get
useful ideas from Linux-oriented ones.  What I am calling PDF2OCR combines
Tesseract from
http://code.google.com/p/tesseract-ocr
with the GhostScript interpreter from
http://ghostscript.com

GhostScript creates a .tif file from the .pdf file of interest, and then
Tesseract creates a .txt file from that.  The current implementation is a
simple batch file, pdf2ocr.bat, with the following syntax on the command
line:
pdf2ocr SourceRootName
where SourceRootName is the name of a PDF file without the .pdf extension.
This produces a text file with the same name except for a .txt extension.
The PDF name can include a directory path, but not embedded spaces.  For
example,
pdf2ocr c:\temp\test
produces
c:\temp\test.txt
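
The two-step conversion described above can be sketched in Python as
follows.  This is only an illustration, not the contents of pdf2ocr.bat:
the function names, the gswin32c executable name, the tiffgray output
device, and the 300 dpi resolution are assumptions chosen for the example,
and the actual batch file may use different options.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tess_exe="tesseract"):
    """Return the Ghostscript and Tesseract command lines for one PDF.

    source_root is the PDF name without the .pdf extension, matching the
    pdf2ocr command-line syntax.
    """
    if " " in source_root:
        # Mirrors the stated limitation: no embedded spaces in the name.
        raise ValueError("source name must not contain embedded spaces")
    pdf = source_root + ".pdf"
    tif = source_root + ".tif"
    # Step 1: Ghostscript renders the PDF to a TIFF image.
    # Device and resolution are assumptions, not taken from pdf2ocr.bat.
    gs_cmd = [gs_exe, "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray", "-r300",
              "-sOutputFile=" + tif, pdf]
    # Step 2: Tesseract reads the TIFF and writes <source_root>.txt.
    tess_cmd = [tess_exe, tif, source_root]
    return gs_cmd, tess_cmd


def run_pdf2ocr(source_root):
    """Run both steps in order, stopping if either program fails."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
```

Calling run_pdf2ocr(r"c:\temp\test") would attempt the same conversion as
the batch example, provided both programs can be found on the path.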

I am seeking feedback on this initial test version.  I want to be sure it
works on computers that have not run the GhostScript installation program
for Windows.  The archive contains an image-based PDF for testing called
debate.pdf (the legal agreement between the Bush and Kerry campaigns
concerning Presidential debates).  Please understand that Tesseract is not
the best OCR available, though it is generally considered the best free
OCR.

Installation consists of unzipping the pdf2ocr.zip archive to a directory,
e.g., to one called
C:\PDF2OCR
The target directory will contain many files that I gathered from
subdirectories of an installed GhostScript directory tree.  It will also
contain one subdirectory called tessdata, which is required by the
tesseract.exe program for language support (I have only distributed
English files, but other languages are available from the Google site).
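
A quick way to confirm that the archive unzipped correctly is to check
for the pieces named above.  The sketch below is a hypothetical helper,
not part of the distribution; it assumes the batch file, tesseract.exe,
and the tessdata language-data subdirectory all sit directly in the
install directory.

```python
import os

# Items expected directly inside the install directory (an assumption
# based on the description of the archive layout).
REQUIRED_ITEMS = ("pdf2ocr.bat", "tesseract.exe", "tessdata")


def missing_items(install_dir, required=REQUIRED_ITEMS):
    """Return the names from `required` that are absent from install_dir."""
    return [name for name in required
            if not os.path.exists(os.path.join(install_dir, name))]
```

An empty list from missing_items(r"C:\PDF2OCR") would suggest the basic
layout is in place; it does not verify the GhostScript files.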

In order to run the batch file from any directory, you can add the PDF2OCR
directory to the path of a console session with a command like the
following:
set path=c:\pdf2ocr;%path%
You can add the path for every console session via the Advanced tab page
of the System applet in Control Panel.
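
For anyone launching the batch file from a script rather than a console,
the same path adjustment can be made for a single child process.  This is
a minimal sketch under that assumption; the function name is hypothetical.

```python
import os


def env_with_pdf2ocr(install_dir, env=None):
    """Return a copy of the environment with install_dir first on PATH."""
    env = dict(os.environ if env is None else env)
    current = env.get("PATH", "")
    # Prepend so the PDF2OCR copies of the programs are found first.
    env["PATH"] = install_dir + os.pathsep + current if current else install_dir
    return env
```

The returned dictionary can be passed as the env argument of
subprocess.run, leaving the path of the parent session unchanged.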

The pdf2ocr.zip download is large, about 14 megs, so it will probably
remain a stand-alone project, rather than being bundled with other
applications I develop.  Feel free to enhance it in the spirit of open
source development!

Jamal

__________
View the list's information and change your settings at 
//www.freelists.org/list/programmingblind
