[access-uk] Scanned PDF to text
- From: "David W Wood" <g3yxx@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
- To: <bcab@xxxxxxxxxxxxx>, <access-uk@xxxxxxxxxxxxx>
- Date: Sat, 15 Sep 2007 06:07:46 +0100
This should be of some interest to many!
Copied from another list.
Begin forwarded message:
Subject: [Promotion-technology] Fwd: Announcing PDF2OCR
Now available at
http://www.EmpowermentZone.com/pdf2ocr.zip
PDF2OCR 1.0
Released September 14, 2007
Public Domain by Jamal Mazrui
Following up on a tip from Ken Perry about the open source Tesseract-OCR
project at Google, I have tried to use this OCR engine to build a free
program for producing accessible text from an image-based PDF. Such
files are created by scanning equipment or software printer drivers that
save only the picture of text, without the actual characters themselves.
This makes them inaccessible to most PDF viewing utilities, which
extract text but do not perform OCR on images.
I could not find an existing Windows solution on the web, but did get
useful ideas from Linux-oriented ones. What I am calling PDF2OCR
combines Tesseract from
http://code.google.com/p/tesseract-ocr
with the GhostScript interpreter from
http://ghostscript.com
GhostScript creates a .tif file from the .pdf file of interest, and then
Tesseract creates a .txt file from that. The current implementation is
a batch file, pdf2ocr.bat, with the following syntax on the command
line:
pdf2ocr SourceRootName
where SourceRootName is the name of a PDF file without the .pdf
extension. This produces a text file with the same name except for a
.txt extension. The PDF name can include a directory path, but not
embedded spaces. For example,
pdf2ocr c:\temp\test
produces
c:\temp\test.txt
When complete, the batch file prints tesseract.log to the screen -- a
file that is recreated for each conversion.
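
The two-step conversion described above can be sketched in Python as
follows. This is only an illustration, not the contents of pdf2ocr.bat:
the function names are hypothetical, the GhostScript executable name
(gswin32c), the tiffgray device, and the 300 dpi resolution are
assumptions, and the version of Tesseract you have may need a different
TIFF variant.

```python
import subprocess


def output_paths(source_root):
    """Return the (pdf, tif, txt) paths for a root name such as c:\\temp\\test."""
    # The announcement says embedded spaces are not supported.
    if " " in source_root:
        raise ValueError("embedded spaces are not supported: %r" % source_root)
    return source_root + ".pdf", source_root + ".tif", source_root + ".txt"


def build_commands(source_root, gs_exe="gswin32c", tess_exe="tesseract"):
    """Build the GhostScript and Tesseract command lines (assumed flags)."""
    pdf, tif, _txt = output_paths(source_root)
    # Step 1: render the PDF pages into a TIFF image.
    gs_cmd = [gs_exe, "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray",
              "-r300", "-sOutputFile=" + tif, pdf]
    # Step 2: Tesseract takes the image and an output base name,
    # and appends .txt to that base name itself.
    tess_cmd = [tess_exe, tif, source_root]
    return gs_cmd, tess_cmd


def convert(source_root):
    """Run both steps in order, stopping if either one fails."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
```

Calling convert(r"c:\temp\test") would then be expected to leave
c:\temp\test.txt next to the PDF, provided both programs are on the
path.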
Installation consists of unzipping the pdf2ocr.zip archive to a target
directory, e.g., to one called
C:\PDF2OCR
This directory contains the executable files, as well as three
subdirectories with support files. The gsdata subdirectory contains
many files I gathered from an installed GhostScript directory tree. The
tessdata subdirectory contains language support for Tesseract (I have
only distributed English files, but other languages are available from
the Google site). The misc subdirectory contains sample files, some
source code, and this documentation.
A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
King, Jr. wrote from the Birmingham Jail. Another sample is debate.pdf
-- the legal agreement between the Bush and Kerry campaigns concerning
Presidential debates. Two commercial OCR programs tested, Kurzweil 1000
and PDF Magic, converted one of these files well, but not the other at
all (a different one for each). Their results, as well as those of
PDF2OCR, are provided in text files. Please understand that Tesseract
is not the best OCR available, though it is generally considered the
best free OCR at present.
In order to run the batch file from any directory, you can add the
PDF2OCR directory to the path of a console session with a command like
the following:
set path=c:\pdf2ocr;%path%
You can add the path for every console session via the Advanced tab page
of the System applet in Control Panel.
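
For anyone driving the batch file from a script rather than a console,
the same path change can be sketched in Python. The helper name
prepend_to_path is hypothetical, and the case-insensitive comparison is
an assumption that matches how Windows treats directory names.

```python
def prepend_to_path(path_value, directory, sep=";"):
    """Return path_value with directory in front, unless it is already listed."""
    # Drop empty entries left by stray separators.
    entries = [entry for entry in path_value.split(sep) if entry]
    # Windows directory names are case-insensitive, so compare lowercased.
    if any(entry.lower() == directory.lower() for entry in entries):
        return sep.join(entries)
    return sep.join([directory] + entries)
```

Assigning the result to os.environ["PATH"] (after importing os) changes
the path only for the running script and the programs it starts, much
as the set command above affects only one console session.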
To easily convert multiple PDFs in a directory, I have also created a
utility called dir2ocr.exe. Simply pass the directory name to process
as a parameter, e.g.,
dir2ocr c:\temp
If no parameter is passed, the current directory is assumed. Source
code for this PowerBASIC program that calls pdf2ocr.bat is in the files
dir2ocr.bas and fn.inc, located in the misc subdirectory.
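
The directory loop can be approximated in Python as shown below. This
is a hedged sketch and not a translation of dir2ocr.bas: the function
names are hypothetical, and skipping names with embedded spaces and not
descending into subdirectories are assumptions based only on the
description above.

```python
import os
import subprocess


def pdf_root_names(directory="."):
    """List PDF files in directory as root names without the .pdf extension."""
    roots = []
    for name in sorted(os.listdir(directory)):
        root, ext = os.path.splitext(name)
        full_root = os.path.join(directory, root)
        # pdf2ocr does not accept embedded spaces, so skip such paths.
        if ext.lower() == ".pdf" and " " not in full_root:
            roots.append(full_root)
    return roots


def dir2ocr(directory="."):
    """Call pdf2ocr once per PDF found; the current directory is the default."""
    for root in pdf_root_names(directory):
        # A batch file has to be started through the Windows command processor.
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=True)
```

With this sketch, dir2ocr(r"c:\temp") would invoke pdf2ocr for each
suitable PDF in c:\temp, and dir2ocr() would process the current
directory.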
The PDF2OCR download is large, about 14 megabytes as a compressed
archive. Other techniques of getting text from a PDF should probably be
tried first. When other tools do not work or are unavailable, however,
I hope this helps to bridge an accessibility gap. Feel free to enhance
it in the spirit of open source development!
Jamal Mazrui
jamal@xxxxxxxxxxxxxxxxxxx