Re: Introducing PDF2OCR and seeking testers

  • From: Jamal Mazrui <empower@xxxxxxxxx>
  • To: programmingblind@xxxxxxxxxxxxx
  • Date: Thu, 13 Sep 2007 08:02:40 -0400 (EDT)

Thanks for the feedback, Ian.

I have posted a new beta (same URL) that puts all GhostScript files
except the executable in a gsdata subdirectory, parallel to the way that
Tesseract uses the tessdata subdirectory for support.  This should make
it easier to explore the PDF2OCR folder or run conversions from there
and find the results.  If anyone discovers that GhostScript can no
longer find certain files, please let me know.

As suggested for comparison, I ran Kurzweil 1000 on the debate.pdf file.
The result, which is now included in the PDF2OCR directory, is, indeed,
much better.  Google readily acknowledges that Tesseract lacks layout
analysis at this time, which I suspect is part of the problem.  Ways of
improving the OCR seem to be under active development, however, as
described at
http://code.google.com/p/tesseract-ocr/wiki/ReadMe

On the other hand, PDF Magic was not able to perform OCR on the sample
PDF (one I happened to have that I knew was image-based because of
difficulty reading it with various utilities).  PDF Magic said it was
not a valid image file -- so perhaps the sample I distributed is more
challenging than average.  I welcome any other comparisons people can
report.  Anyone familiar with batch files could set one up to
process many PDFs at once, using syntax like
call pdf2ocr file1
call pdf2ocr file2
etc.
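
As a rough sketch of the same idea without listing each file by hand, a
short Python script (not part of PDF2OCR) could find every PDF in a
folder and call pdf2ocr once per file.  This assumes pdf2ocr.bat is on
the path; the function names pdf_root_names and convert_all are made up
for illustration.  Files whose path contains a space are skipped, since
the batch file does not accept embedded spaces.

```python
import subprocess
import sys
from pathlib import Path


def pdf_root_names(folder):
    """Return the root names (path without .pdf) of the PDFs in folder, sorted.

    Paths containing a space are skipped, because pdf2ocr.bat is described
    as not accepting embedded spaces.
    """
    roots = []
    for pdf in sorted(Path(folder).iterdir()):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        root = str(pdf.with_suffix(""))
        if " " in root:
            print("skipping (space in path):", pdf)
            continue
        roots.append(root)
    return roots


def convert_all(folder):
    """Run pdf2ocr once per PDF; assumes pdf2ocr.bat is on the path."""
    for root in pdf_root_names(folder):
        # "cmd /c" runs a batch file from another program on Windows.
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=False)


if __name__ == "__main__":
    # Folder to process: first argument, or the current directory.
    convert_all(sys.argv[1] if len(sys.argv) > 1 else ".")
```

Saved under a hypothetical name such as ocrall.py, it would be run as
python ocrall.py c:\temp
and should leave one .txt file next to each PDF it was able to convert.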

Jamal
On Wed, 12 Sep 2007, Ian D. Nichols wrote:

> Date: Wed, 12 Sep 2007 23:14:00 -0400
> From: Ian D. Nichols <inich@xxxxxxxxxx>
> Reply-To: programmingblind@xxxxxxxxxxxxx
> To: programmingblind@xxxxxxxxxxxxx
> Subject: Re: Introducing PDF2OCR and seeking testers
>
> Hi Jamal,
>
> I have tested your new pdf2ocr utility on my PC, which so far as I know has
> not had the GhostScript program on it before now.
>
> I downloaded pdf2ocr.zip and used FileDir to extract it to a new folder
> called c:\pdf2ocr. The folder then contained more than 700 files.
>
> I used a command prompt - "pdf2ocr debate" - to run the program, and it
> finished in perhaps 10-15 seconds.  I then had a file called "debate.txt",
> containing 1125 bytes.  I viewed it in notepad, and it seemed to be a single
> page of the legal agreement, ending in the middle of a sentence.  The
> quality of the OCR was not very good.
>
> I hope that is useful information for you.
>
> All the best,
>
> Ian
>
> Ian D. Nichols,
> Toronto, Canada
>
> ----- Original Message -----
> From: "Jamal Mazrui" <empower@xxxxxxxxx>
> To: <programmingblind@xxxxxxxxxxxxx>
> Sent: Wednesday, September 12, 2007 7:03 PM
> Subject: Introducing PDF2OCR and seeking testers
>
>
> > Now available at
> > http://www.EmpowermentZone.com/pdf2ocr.zip
> >
> > Following up on a tip from Ken Perry about the open source Tesseract
> > project at Google, I have tried to use this OCR engine to build a free
> > program for producing accessible text from an image-based PDF.  Such files
> > are created by scanning equipment or software printer drivers that save
> > only the picture of text, without the actual characters themselves.  This
> > makes them inaccessible to most PDF viewing utilities, which extract text
> > but do not perform OCR on images.
> >
> > I could not find an existing Windows solution on the web, but did get
> > useful ideas from Linux-oriented ones.  What I am calling PDF2OCR combines
> > Tesseract from
> > http://code.google.com/p/tesseract-ocr
> > with the GhostScript interpreter from
> > http://ghostscript.com
> >
> > GhostScript creates a .tif file from the .pdf file of interest, and then
> > Tesseract creates a .txt file from that.  The current implementation is a
> > simple batch file, pdf2ocr.bat, with the following syntax on the command
> > line:
> > pdf2ocr SourceRootName
> > where SourceRootName is the name of a PDF file without the .pdf extension.
> > This produces a text file with the same name except for a .txt extension.
> > The PDF name can include a directory path, but not embedded spaces.  For
> > example,
> > pdf2ocr c:\temp\test
> > produces
> > c:\temp\test.txt
> >
> > I am seeking feedback on this initial test version.  I want to be sure it
> > works on computers that have not run the GhostScript installation program
> > for Windows.  The archive contains an image-based PDF for testing called
> > debate.pdf (the legal agreement between the Bush and Kerry campaigns
> > concerning Presidential debates).  Please understand that Tesseract is not
> > the best OCR available, though it is generally considered the best free
> > OCR.
> >
> > Installation consists of unzipping the pdf2ocr.zip archive to a directory,
> > e.g., to one called
> > C:\PDF2OCR
> > The target directory will contain many files that I gathered from
> > subdirectories of an installed GhostScript directory tree.  It will also
> > contain one subdirectory called tessdata, which is required by the
> > tesseract.exe program for language support (I have only distributed
> > English files, but other languages are available from the Google site).
> >
> > In order to run the batch file from any directory, you can add the PDF2OCR
> > directory to the path of a console session with a command like the
> > following:
> > set path=c:\pdf2ocr;%path%
> > You can add the path for every console session via the Advanced tab page
> > of the System applet in Control Panel.
> >
> > The pdf2ocr.zip download is large, about 14 megs, so it will probably
> > remain a stand-alone project, rather than being bundled with other
> > applications I develop.  Feel free to enhance it in the spirit of open
> > source development!
> >
> > Jamal
> >
> > __________
> > View the list's information and change your settings at
> > //www.freelists.org/list/programmingblind
