Re: Introducing PDF2OCR and seeking testers

  • From: "Matthew2007" <matthew2007@xxxxxxxxxxx>
  • To: <programmingblind@xxxxxxxxxxxxx>
  • Date: Wed, 12 Sep 2007 22:11:35 -0700

You know, it would be kinda interesting to run your file through Acrobat Reader and K1000 and compare the results.


Thanks,
Matthew
----- Original Message ----- From: "Ian D. Nichols" <inich@xxxxxxxxxx>
To: <programmingblind@xxxxxxxxxxxxx>
Sent: Wednesday, September 12, 2007 8:14 PM
Subject: Re: Introducing PDF2OCR and seeking testers


Hi Jamal,

I have tested your new pdf2ocr utility on my PC, which, as far as I know, has not had the GhostScript program on it before now.

I downloaded pdf2ocr.zip and used FileDir to extract it to a new folder called c:\pdf2ocr. The folder then contained more than 700 files.

I used a command prompt - "pdf2ocr debate" - to run the program, and it finished in perhaps 10-15 seconds. I then had a file called "debate.txt", containing 1125 bytes. I viewed it in Notepad, and it seemed to be a single page of the legal agreement, ending in the middle of a sentence. The quality of the OCR was not very good.

I hope that is useful information for you.

All the best,

Ian

Ian D. Nichols,
Toronto, Canada

----- Original Message ----- From: "Jamal Mazrui" <empower@xxxxxxxxx>
To: <programmingblind@xxxxxxxxxxxxx>
Sent: Wednesday, September 12, 2007 7:03 PM
Subject: Introducing PDF2OCR and seeking testers


Now available at
http://www.EmpowermentZone.com/pdf2ocr.zip

Following up on a tip from Ken Perry about the open source Tesseract
project at Google, I have tried to use this OCR engine to build a free
program for producing accessible text from an image-based PDF. Such files
are created by scanning equipment or software printer drivers that save
only the picture of text, without the actual characters themselves.  This
makes them inaccessible to most PDF viewing utilities, which extract text
but do not perform OCR on images.

I could not find an existing Windows solution on the web, but did get
useful ideas from Linux-oriented ones. What I am calling PDF2OCR combines
Tesseract from
http://code.google.com/p/tesseract-ocr
with the GhostScript interpreter from
http://ghostscript.com

GhostScript creates a .tif file from the .pdf file of interest, and then
Tesseract creates a .txt file from that.  The current implementation is a
simple batch file, pdf2ocr.bat, with the following syntax on the command
line:
pdf2ocr SourceRootName
where SourceRootName is the name of a PDF file without the .pdf extension.
This produces a text file with the same name except for a .txt extension.
The PDF name can include a directory path, but not embedded spaces.  For
example,
pdf2ocr c:\temp\test
produces
c:\temp\test.txt
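
As a rough illustration of the two-step pipeline described above (not the
actual contents of pdf2ocr.bat, which are not shown in this message), the
two command lines might be assembled like this in Python.  The function
name build_commands, the gswin32c executable name, the tiffg4 device, and
the 300 dpi resolution are all assumptions made for the sketch.

```python
def build_commands(source_root, dpi=300):
    """Return (ghostscript_cmd, tesseract_cmd) as argument lists.

    source_root is the PDF name without the .pdf extension, optionally
    with a directory path, e.g. r"c:\temp\test".
    """
    # The message says embedded spaces are not supported, so reject them.
    if " " in source_root:
        raise ValueError("source root must not contain embedded spaces")

    pdf_name = source_root + ".pdf"
    tif_name = source_root + ".tif"

    # Step 1: GhostScript renders the PDF pages into a .tif image.
    # Executable name, device, and resolution are assumptions.
    gs_cmd = [
        "gswin32c", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4", "-r%d" % dpi,
        "-sOutputFile=" + tif_name,
        pdf_name,
    ]

    # Step 2: Tesseract reads the .tif and writes <source_root>.txt
    # (it appends the .txt extension to the output base name itself).
    ocr_cmd = ["tesseract", tif_name, source_root]

    return gs_cmd, ocr_cmd
```

Each list could then be passed to subprocess.run in turn, GhostScript
first and Tesseract second, to go from test.pdf to test.txt.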

I am seeking feedback on this initial test version.  I want to be sure it
works on computers that have not run the GhostScript installation program
for Windows.  The archive contains an image-based PDF for testing called
debate.pdf (the legal agreement between the Bush and Kerry campaigns
concerning Presidential debates). Please understand that Tesseract is not
the best OCR available, though it is generally considered the best free
OCR.

Installation consists of unzipping the pdf2ocr.zip archive to a directory,
e.g., to one called
C:\PDF2OCR
The target directory will contain many files that I gathered from
subdirectories of an installed GhostScript directory tree.  It will also
contain one subdirectory called tessdata, which is required by the
tesseract.exe program for language support (I have only distributed
English files, but other languages are available from the Google site).

In order to run the batch file from any directory, you can add the PDF2OCR
directory to the path of a console session with a command like the
following:
set path=c:\pdf2ocr;%path%
You can add the path for every console session via the Advanced tab page
of the System applet in Control Panel.

The pdf2ocr.zip download is large, about 14 megs, so it will probably
remain a stand-alone project, rather than being bundled with other
applications I develop.  Feel free to enhance it in the spirit of open
source development!

Jamal

__________
View the list's information and change your settings at
//www.freelists.org/list/programmingblind



