[access-uk] Scanned PDF to text

  • From: "David W Wood" <g3yxx@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
  • To: <bcab@xxxxxxxxxxxxx>, <access-uk@xxxxxxxxxxxxx>
  • Date: Sat, 15 Sep 2007 06:07:46 +0100

This should be of some interest to many!

Copied from another list.

Begin forwarded message:

Subject: [Promotion-technology] Fwd: Announcing PDF2OCR

Now available at
http://www.EmpowermentZone.com/pdf2ocr.zip

PDF2OCR 1.0
Released September 14, 2007
Public Domain by Jamal Mazrui

Following up on a tip from Ken Perry about the open source Tesseract-OCR
project at Google, I have tried to use this OCR engine to build a free
program for producing accessible text from an image-based PDF.  Such
files are created by scanning equipment or software printer drivers that
save only the picture of text, without the actual characters themselves.
This makes them inaccessible to most PDF viewing utilities, which
extract text but do not perform OCR on images.

I could not find an existing Windows solution on the web, but did get
useful ideas from Linux-oriented ones.  What I am calling PDF2OCR
combines Tesseract from
http://code.google.com/p/tesseract-ocr
with the GhostScript interpreter from
http://ghostscript.com

GhostScript creates a .tif file from the .pdf file of interest, and then
Tesseract creates a .txt file from that.  The current implementation is
a batch file, pdf2ocr.bat, with the following syntax on the command
line:
pdf2ocr SourceRootName
where SourceRootName is the name of a PDF file without the .pdf
extension.  This produces a text file with the same name except for a
.txt extension.  The PDF name can include a directory path, but not
embedded spaces.  For example,
pdf2ocr c:\temp\test
produces
c:\temp\test.txt
When complete, the batch file prints tesseract.log to the screen -- a
file that is recreated for each conversion.
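
The two-step flow described above could be sketched in Python roughly
as follows.  This is an illustration only, not the contents of
pdf2ocr.bat, which are not reproduced in this message.  The GhostScript
executable name (gswin32c), the tiffgray device, the 300 dpi resolution,
and the function names are all assumptions; the Tesseract call relies on
its usual "image, output base" arguments, where .txt is appended to the
output base.

```python
import subprocess


def build_commands(source_root):
    """Return the two command lines for converting source_root.pdf to text.

    source_root is the PDF name without its .pdf extension, as with pdf2ocr.
    """
    if " " in source_root:
        # The announcement says embedded spaces are not supported.
        raise ValueError("source root name must not contain spaces")
    pdf = source_root + ".pdf"
    tif = source_root + ".tif"
    # Step 1 (assumed flags): GhostScript renders the PDF pages to a TIFF.
    ghostscript = [
        "gswin32c", "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sDEVICE=tiffgray", "-r300",
        "-sOutputFile=" + tif, pdf,
    ]
    # Step 2: Tesseract reads the TIFF and writes source_root.txt.
    tesseract = ["tesseract", tif, source_root]
    return ghostscript, tesseract


def convert(source_root):
    """Run both steps in order, stopping if either one fails."""
    for command in build_commands(source_root):
        subprocess.run(command, check=True)
```

With both programs reachable on the path, convert(r"c:\temp\test") would
be expected to leave c:\temp\test.txt next to the PDF.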

Installation consists of unzipping the pdf2ocr.zip archive to a target
directory, e.g., to one called
C:\PDF2OCR
This directory contains the executable files, as well as three
subdirectories with support files.  The gsdata subdirectory contains
many files I gathered from an installed GhostScript directory tree.  The
tessdata subdirectory contains language support for Tesseract (I have
only distributed English files, but other languages are available from
the Google site).  The misc subdirectory contains sample files, some
source code, and this documentation.

A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
King, Jr. wrote from the Birmingham Jail.  Another sample is debate.pdf
-- the legal agreement between the Bush and Kerry campaigns concerning
Presidential debates.  The two commercial OCR programs tested, Kurzweil
1000 and PDF Magic, each converted one of these files well, but not the
other at all (a different one for each).  Their results, as well as
those of PDF2OCR, are provided in text files.  Please understand that
Tesseract is not the best OCR available, though it is generally
considered the best free OCR at present.

In order to run the batch file from any directory, you can add the
PDF2OCR directory to the path of a console session with a command like
the following:
set path=c:\pdf2ocr;%path%
You can add the path for every console session via the Advanced tab page
of the System applet in Control Panel.
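
If the batch file is started from a script rather than a console, the
same effect can be had for a single child process.  A minimal Python
sketch, assuming the C:\PDF2OCR install directory mentioned earlier; the
helper name is hypothetical:

```python
import os


def env_with_dir_first(directory, env=None):
    """Return a copy of env with directory prepended to PATH."""
    new_env = dict(os.environ if env is None else env)
    old_path = new_env.get("PATH", "")
    # os.pathsep is ";" on Windows, the separator used by "set path".
    new_env["PATH"] = directory + (os.pathsep + old_path if old_path else "")
    return new_env
```

The returned dictionary can be passed as the env argument of
subprocess.run, leaving the parent process's own path unchanged.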

To easily convert multiple PDFs in a directory, I have also created a
utility called dir2ocr.exe.  Simply pass the directory name to process
as a parameter, e.g.,
dir2ocr c:\temp
If no parameter is passed, the current directory is assumed.  Source
code for this PowerBASIC program that calls pdf2ocr.bat is in the files
dir2ocr.bas and fn.inc, located in the misc subdirectory.
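
The behavior of dir2ocr can be approximated in a few lines of Python.
This is a sketch of the idea only, not a translation of dir2ocr.bas,
whose source is not shown in this message.  The function names are
hypothetical, and it assumes pdf2ocr.bat can be found on the path.

```python
import os
import subprocess


def pdf_roots(directory="."):
    """Return the root names (path without .pdf) of PDFs in directory."""
    roots = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext.lower() == ".pdf":
            roots.append(os.path.join(directory, base))
    return roots


def convert_directory(directory="."):
    """Call pdf2ocr.bat once per PDF, skipping names with spaces."""
    for root in pdf_roots(directory):
        if " " in root:
            # pdf2ocr does not accept embedded spaces in the name.
            print("skipped (embedded space):", root)
            continue
        # Batch files need the command interpreter on Windows.
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=False)
```

Calling convert_directory(r"c:\temp") mirrors the dir2ocr example above,
and calling it with no argument processes the current directory.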

The PDF2OCR download is large, about 14 megabytes as a compressed
archive.  Other techniques of getting text from a PDF should probably be
tried first.  When other tools do not work or are unavailable, however,
I hope this helps to bridge an accessibility gap.  Feel free to enhance
it in the spirit of open source development!

Jamal Mazrui
jamal@xxxxxxxxxxxxxxxxxxx 
