[guispeak] Announcing PDF2OCR

  • From: Jamal Mazrui <empower@xxxxxxxxx>
  • To: programmingblind@xxxxxxxxxxxxx, program-l@xxxxxxxxxxxxx, guispeak@xxxxxxxxxxxxx, uaccess-l@xxxxxxxxxxxxxx
  • Date: Fri, 14 Sep 2007 00:32:13 -0400 (EDT)

Now available at

Released September 14, 2007
Public Domain by Jamal Mazrui

Following up on a tip from Ken Perry about the open source Tesseract-OCR
project at Google, I have tried to use this OCR engine to build a free
program for producing accessible text from an image-based PDF.  Such files
are created by scanning equipment or software printer drivers that save
only the picture of text, without the actual characters themselves.  This
makes them inaccessible to most PDF viewing utilities, which extract text
but do not perform OCR on images.

I could not find an existing Windows solution on the web, but did get
useful ideas from Linux-oriented ones.  What I am calling PDF2OCR combines
Tesseract from
with the GhostScript interpreter from

GhostScript creates a .tif file from the .pdf file of interest, and then
Tesseract creates a .txt file from that.  The current implementation is a
batch file, pdf2ocr.bat, with the following syntax on the command line:
pdf2ocr SourceRootName
where SourceRootName is the name of a PDF file without the .pdf extension.
This produces a text file with the same name except for a .txt extension.
The PDF name can include a directory path, but not embedded spaces.  For
example:
pdf2ocr c:\temp\test
When complete, the batch file prints tesseract.log to the screen -- a file
that is recreated for each conversion.
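
For readers who want to see the two-step pipeline spelled out, here is a
minimal Python sketch of the commands involved.  It is not the contents of
pdf2ocr.bat.  The executable names (gswin32c, tesseract), the tiffgray
output device, and the 300 dpi resolution are assumptions chosen for
illustration, and the function names build_commands and convert are
hypothetical.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tess_exe="tesseract", dpi=300):
    """Return (ghostscript_cmd, tesseract_cmd) for source_root + '.pdf'.

    Executable names, device, and resolution are illustrative assumptions.
    """
    if " " in source_root:
        # Mirrors the restriction described above: no embedded spaces.
        raise ValueError("source root name must not contain spaces")
    pdf_path = source_root + ".pdf"
    tif_path = source_root + ".tif"
    gs_cmd = [
        gs_exe,
        "-q", "-dNOPAUSE", "-dBATCH",  # run non-interactively, exit when done
        "-sDEVICE=tiffgray",           # assumed device: grayscale TIFF
        "-r%d" % dpi,                  # assumed rendering resolution
        "-sOutputFile=" + tif_path,
        pdf_path,
    ]
    # Tesseract takes an image and an output base name, and writes <base>.txt.
    tess_cmd = [tess_exe, tif_path, source_root]
    return gs_cmd, tess_cmd


def convert(source_root):
    """Run both steps in order, stopping if either tool reports failure."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
```

Under these assumptions, calling convert(r"c:\temp\test") would first ask
GhostScript for c:\temp\test.tif and then ask Tesseract for
c:\temp\test.txt, provided both tools can be found on the path.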

Installation consists of unzipping the pdf2ocr.zip archive to a target
directory, e.g., to one called
This directory contains the executable files, as well as three
subdirectories with support files.  The gsdata subdirectory contains many
files I gathered from an installed GhostScript directory tree.  The
tessdata subdirectory contains language support for Tesseract (I have only
distributed English files, but other languages are available from the
Google site).  The misc subdirectory contains sample files, some source
code, and this documentation.

A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
King, Jr. wrote from the Birmingham Jail.  Another sample is debate.pdf --
the legal agreement between the Bush and Kerry campaigns concerning
Presidential debates.  The two commercial OCR programs tested, Kurzweil
1000 and PDF Magic, each converted one of these files well but failed
entirely on the other (a different file for each).  Their results, as well
as those of PDF2OCR, are provided in text files.  Please understand that
Tesseract is not the best OCR available, though it is generally considered
the best free OCR at present.

In order to run the batch file from any directory, you can add the PDF2OCR
directory to the path of a console session with a command like the
following:
set path=c:\pdf2ocr;%path%
You can add the path for every console session via the Advanced tab page
of the System applet in Control Panel.

To easily convert multiple PDFs in a directory, I have also created a
utility called dir2ocr.exe.  Simply pass the directory name to process as
a parameter, e.g.,
dir2ocr c:\temp
If no parameter is passed, the current directory is assumed.  Source code
for this PowerBASIC program that calls pdf2ocr.bat is in the files
dir2ocr.bas and fn.inc, located in the misc subdirectory.
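
The idea behind dir2ocr can be sketched in a few lines of Python.  This is
not a translation of dir2ocr.bas, which I have not reproduced here; the
function names are hypothetical, and skipping names that contain spaces is
an assumption made because pdf2ocr.bat does not accept them.

```python
import os
import subprocess


def pdf_root_names(directory="."):
    """Return sorted root names (path without .pdf) of PDFs in directory."""
    roots = []
    for name in sorted(os.listdir(directory)):
        root, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # not a PDF file
        full_root = os.path.join(directory, root)
        if " " in full_root:
            continue  # assumption: skip names pdf2ocr.bat cannot handle
        roots.append(full_root)
    return roots


def convert_directory(directory="."):
    """Call pdf2ocr once per PDF found; the current directory is the default."""
    for root in pdf_root_names(directory):
        # cmd /c runs the batch file; assumes pdf2ocr.bat is on the path.
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=False)
```

With this sketch, convert_directory(r"c:\temp") would play the role of the
dir2ocr c:\temp command shown above.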

The PDF2OCR download is large, about 14 megabytes as a compressed
archive.  Other techniques of getting text from a PDF should probably be
tried first.  When other tools do not work or are unavailable, however, I
hope this helps to bridge an accessibility gap.  Feel free to enhance it
in the spirit of open source development!

Jamal Mazrui
