RE: Introducing PDF2OCR and seeking testers

  • From: Jamal Mazrui <empower@xxxxxxxxx>
  • To: programmingblind@xxxxxxxxxxxxx
  • Date: Fri, 14 Sep 2007 07:55:14 -0400 (EDT)

Thanks for your ideas, Pete.  I thought I was doing everything to put
as much information in the .tif file as possible, but your message
prompted me to investigate further.  It turns out that the OCR seems to
be better when no resolution parameter is supplied to GhostScript,
rather than the 300 DPI value I had been using -- a surprising result.
Tesseract then reports the DPI as 196 -- a value I find odd, but a
sighted reviewer said it worked best among the several explicit DPI
values I tried between 150 and 600.  The next best after the apparent
default of 196 were 200 and 300, which were too close to call.
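
For anyone wanting to repeat this comparison, the following is a minimal
Python sketch, not the actual pdf2ocr.bat logic, of how the GhostScript
command line could be built with and without a resolution parameter.  The
executable name gswin32c, the tiffgray device, the output file names, and
the list of DPI values are assumptions for illustration; the switches
-dNOPAUSE, -dBATCH, -sDEVICE, -r, and -sOutputFile are standard
GhostScript options.

```python
from typing import List, Optional


def build_gs_command(pdf_path: str, tif_path: str, dpi: Optional[int] = None,
                     gs_exe: str = "gswin32c",
                     device: str = "tiffgray") -> List[str]:
    """Return a GhostScript argument list that renders pdf_path to tif_path.

    gs_exe and device are assumed values; the thread does not say which
    executable name or TIFF device pdf2ocr.bat actually uses.
    """
    cmd = [gs_exe, "-dNOPAUSE", "-dBATCH", f"-sDEVICE={device}"]
    if dpi is not None:
        # Only pass -r when a resolution is requested explicitly;
        # leaving it out lets GhostScript use its own default.
        cmd.append(f"-r{dpi}")
    cmd += [f"-sOutputFile={tif_path}", pdf_path]
    return cmd


if __name__ == "__main__":
    # Illustrative set of resolutions; None means "no -r switch at all".
    for dpi in (None, 150, 200, 300, 600):
        label = "default" if dpi is None else str(dpi)
        print(" ".join(build_gs_command("debate.pdf",
                                        f"debate_{label}.tif", dpi)))
```

Running each printed command and then Tesseract on the resulting .tif
files would give one text file per resolution to compare.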

I also discovered that Tesseract was only processing the first page in
the .tif file, so I made GhostScript output a .tif file for each page of
the PDF, and combined the separate text files generated by Tesseract.
This lengthens the conversion time, but accuracy is more important than
performance.
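
The per-page approach can be sketched roughly as follows in Python.  This
is an illustration under stated assumptions, not the code in pdf2ocr.bat:
the function names, the debate_001.tif style of file naming, and the
tesseract executable name are all hypothetical.  It relies on two things I
am confident of: GhostScript writes one file per page when the output file
name contains a pattern such as %03d, and Tesseract is invoked as
"tesseract image outputbase", producing outputbase.txt.

```python
import glob
import os
import subprocess


def ocr_pages(tif_pattern, tesseract_exe="tesseract"):
    """Run Tesseract on every page image matching tif_pattern.

    Returns the generated .txt paths in sorted order.  With zero-padded
    page numbers (e.g. debate_001.tif) sorted order equals page order.
    """
    txt_paths = []
    for tif in sorted(glob.glob(tif_pattern)):
        base = os.path.splitext(tif)[0]
        # Tesseract appends ".txt" to the output base name itself.
        subprocess.run([tesseract_exe, tif, base], check=True)
        txt_paths.append(base + ".txt")
    return txt_paths


def combine_text_files(txt_paths, out_path):
    """Concatenate per-page text files into one file, in the order given."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in txt_paths:
            with open(path, "r", encoding="utf-8", errors="replace") as page:
                text = page.read()
            out.write(text)
            # Make sure one page never runs into the next on the same line.
            if not text.endswith("\n"):
                out.write("\n")
```

Assuming GhostScript was run with an output name like debate_%03d.tif,
calling combine_text_files(ocr_pages("debate_*.tif"), "debate.txt") would
produce the single combined text file.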

Jamal
On Thu, 13 Sep 2007, Peter Torpey wrote:

> Date: Thu, 13 Sep 2007 09:15:56 -0400
> From: Peter Torpey <ptorpey@xxxxxxxxxxxxxxxx>
> Reply-To: programmingblind@xxxxxxxxxxxxx
> To: programmingblind@xxxxxxxxxxxxx
> Subject: RE: Introducing PDF2OCR and seeking testers
>
> Jamal,
>
> I haven't had an opportunity to look at this new utility yet, or at how you
> put it together.  Sounds like a clever use of Ghostscript.  Nice work.
>
> Anyway, I see some comments that the OCR isn't as good as might be expected.
> Of course one possibility is that the Google tool doesn't perform good OCR
> on such documents as the test document you attached.
>
> One other possibility that crossed my mind is how Ghostscript renders the
> file into a bitmap.  My question is, does the OCR quality depend on the
> resolution of the Ghostscript decomposition of the file?  For example, most
> rendering engines (such as Ghostscript) allow you to choose the resolution
> of the rendered bitmap image.  It may be that the default for Ghostscript is
> a 300 x 300 spots per inch (spi) bitmap output.  There should be options to
> render the final bitmap at other resolutions.  Perhaps 600 x 600 spi bitmap
> outputs would enable better OCR (although it might take a bit longer).
>
> It may be worth a shot testing the resolution you set in the Ghostscript
> decomposer to see if that affects the OCR quality.
>
> -- Pete
>
>
> -----Original Message-----
> From: programmingblind-bounce@xxxxxxxxxxxxx
> [mailto:programmingblind-bounce@xxxxxxxxxxxxx] On Behalf Of Jamal Mazrui
> Sent: Thursday, September 13, 2007 8:03 AM
> To: programmingblind@xxxxxxxxxxxxx
> Subject: Re: Introducing PDF2OCR and seeking testers
>
> Thanks for the feedback, Ian.
>
> I have posted a new beta (same URL) that puts all GhostScript files except
> the executable in a gsdata subdirectory, parallel to the way that Tesseract
> uses the tessdata subdirectory for support.  This should make it easier to
> explore the PDF2OCR folder or run conversions from there and find the
> results.  If anyone discovers that GhostScript can no longer find certain
> files, please let me know.
>
> As suggested for comparison, I ran Kurzweil 1000 on the debate.pdf file.
> The result, which is now included in the PDF2OCR directory, is, indeed,
> much better.  Google readily acknowledges that Tesseract lacks layout
> analysis at this time, which I suspect is part of the problem.  Ways of
> improving the OCR seem to be under active development, however, as described
> at http://code.google.com/p/tesseract-ocr/wiki/ReadMe
>
> On the other hand, PDF Magic was not able to perform OCR on the sample PDF
> (one I happened to have that I knew was image-based because of difficulty
> reading it with various utilities).  PDF Magic said it was not a valid image
> file -- so perhaps the sample I distributed is more challenging than
> average.  I welcome any other comparisons people can report.  For anyone
> familiar with batch files, one could set one up to process many PDFs at once
> using syntax like
> call pdf2ocr file1
> call pdf2ocr file2
> etc.
>
> Jamal
> On Wed, 12 Sep 2007, Ian D. Nichols wrote:
>
> > Date: Wed, 12 Sep 2007 23:14:00 -0400
> > From: Ian D. Nichols <inich@xxxxxxxxxx>
> > Reply-To: programmingblind@xxxxxxxxxxxxx
> > To: programmingblind@xxxxxxxxxxxxx
> > Subject: Re: Introducing PDF2OCR and seeking testers
> >
> > Hi Jamal,
> >
> > I have tested your new pdf2ocr utility on my PC, which so far as I
> > know has not had the GhostScript program on it before now.
> >
> > I downloaded pdf2ocr.zip and used FileDir to extract it to a new
> > folder called c:\pdf2ocr.  The folder then contained more than 700 files.
> >
> > I used a command prompt - "pdf2ocr debate" - to run the program, and
> > it finished in perhaps 10-15 seconds.  I then had a file called
> > "debate.txt", containing 1125 bytes.  I viewed it in notepad, and it
> > seemed to be a single page of the legal agreement, ending in the
> > middle of a sentence.  The quality of the OCR was not very good.
> >
> > I hope that is useful information for you.
> >
> > All the best,
> >
> > Ian
> >
> > Ian D. Nichols,
> > Toronto, Canada
> >
> > ----- Original Message -----
> > From: "Jamal Mazrui" <empower@xxxxxxxxx>
> > To: <programmingblind@xxxxxxxxxxxxx>
> > Sent: Wednesday, September 12, 2007 7:03 PM
> > Subject: Introducing PDF2OCR and seeking testers
> >
> >
> > > Now available at
> > > http://www.EmpowermentZone.com/pdf2ocr.zip
> > >
> > > Following up on a tip from Ken Perry about the open source Tesseract
> > > project at Google, I have tried to use this OCR engine to build a
> > > free program for producing accessible text from an image-based PDF.
> > > Such files are created by scanning equipment or software printer
> > > drivers that save only the picture of text, without the actual
> > > characters themselves.  This makes them inaccessible to most PDF
> > > viewing utilities, which extract text but do not perform OCR on images.
> > >
> > > I could not find an existing Windows solution on the web, but did
> > > get useful ideas from Linux-oriented ones.  What I am calling
> > > PDF2OCR combines Tesseract from
> > > http://code.google.com/p/tesseract-ocr
> > > with the GhostScript interpreter from http://ghostscript.com
> > >
> > > GhostScript creates a .tif file from the .pdf file of interest, and
> > > then Tesseract creates a .txt file from that.  The current
> > > implementation is a simple batch file, pdf2ocr.bat, with the
> > > following syntax on the command line:
> > > pdf2ocr SourceRootName
> > > where SourceRootName is the name of a PDF file without the .pdf
> > > extension.  This produces a text file with the same name except for
> > > a .txt extension.  The PDF name can include a directory path, but not
> > > embedded spaces.  For example, pdf2ocr c:\temp\test produces
> > > c:\temp\test.txt
> > >
> > > I am seeking feedback on this initial test version.  I want to be
> > > sure it works on computers that have not run the GhostScript
> > > installation program for Windows.  The archive contains an
> > > image-based PDF for testing called debate.pdf (the legal agreement
> > > between the Bush and Kerry campaigns concerning Presidential
> > > debates).  Please understand that Tesseract is not the best OCR
> > > available, though it is generally considered the best free OCR.
> > >
> > > Installation consists of unzipping the pdf2ocr.zip archive to a
> > > directory, e.g., to one called C:\PDF2OCR.  The target directory will
> > > contain many files that I gathered from subdirectories of an
> > > installed GhostScript directory tree.  It will also contain one
> > > subdirectory called tessdata, which is required by the
> > > tesseract.exe program for language support (I have only distributed
> > > English files, but other languages are available from the Google site).
> > >
> > > In order to run the batch file from any directory, you can add the
> > > PDF2OCR directory to the path of a console session with a command
> > > like the following:
> > > set path=c:\pdf2ocr;%path%
> > > You can add the path for every console session via the Advanced tab
> > > page of the System applet in Control Panel.
> > >
> > > The pdf2ocr.zip download is large, about 14 megs, so it will
> > > probably remain a stand-alone project, rather than being bundled
> > > with other applications I develop.  Feel free to enhance it in the
> > > spirit of open source development!
> > >
> > > Jamal
> > >
> > > __________
> > > View the list's information and change your settings at
> > > //www.freelists.org/list/programmingblind