Re: Introducing PDF2OCR and seeking testers

From: phil haines <fonucmo@xxxxxxxxxxxx>
To: programmingblind@xxxxxxxxxxxxx
Date: Thu, 13 Sep 2007 17:20:44 +1000 (EST)

It would.



--- Matthew2007 <matthew2007@xxxxxxxxxxx> wrote:

> You know, it would be kinda interesting to run your file through Acrobat
> Reader and K1000 and compare the results.
> 
> Thanks,
> Matthew
> ---- Original Message ----- 
> From: "Ian D. Nichols" <inich@xxxxxxxxxx>
> To: <programmingblind@xxxxxxxxxxxxx>
> Sent: Wednesday, September 12, 2007 8:14 PM
> Subject: Re: Introducing PDF2OCR and seeking testers
> 
> 
> > Hi Jamal,
> >
> > I have tested your new pdf2ocr utility on my PC, which so far as I know
> > has not had the GhostScript program on it before now.
> >
> > I downloaded pdf2ocr.zip and used FileDir to extract it to a new folder
> > called c:\pdf2ocr. The folder then contained more than 700 files.
> >
> > I used a command prompt - "pdf2ocr debate" - to run the program, and it
> > finished in perhaps 10-15 seconds.  I then had a file called
> > "debate.txt", containing 1125 bytes.  I viewed it in Notepad, and it
> > seemed to be a single page of the legal agreement, ending in the middle
> > of a sentence.  The quality of the OCR was not very good.
> >
> > I hope that is useful information for you.
> >
> > All the best,
> >
> > Ian
> >
> > Ian D. Nichols,
> > Toronto, Canada
> >
> > ----- Original Message ----- 
> > From: "Jamal Mazrui" <empower@xxxxxxxxx>
> > To: <programmingblind@xxxxxxxxxxxxx>
> > Sent: Wednesday, September 12, 2007 7:03 PM
> > Subject: Introducing PDF2OCR and seeking testers
> >
> >
> >> Now available at
> >> http://www.EmpowermentZone.com/pdf2ocr.zip
> >>
> >> Following up on a tip from Ken Perry about the open source Tesseract
> >> project at Google, I have tried to use this OCR engine to build a free
> >> program for producing accessible text from an image-based PDF.  Such
> >> files are created by scanning equipment or software printer drivers that
> >> save only the picture of text, without the actual characters themselves.
> >> This makes them inaccessible to most PDF viewing utilities, which extract
> >> text but do not perform OCR on images.
> >>
> >> I could not find an existing Windows solution on the web, but did get
> >> useful ideas from Linux-oriented ones.  What I am calling PDF2OCR
> >> combines Tesseract from
> >> http://code.google.com/p/tesseract-ocr
> >> with the GhostScript interpreter from
> >> http://ghostscript.com
> >>
> >> GhostScript creates a .tif file from the .pdf file of interest, and then
> >> Tesseract creates a .txt file from that.  The current implementation is a
> >> simple batch file, pdf2ocr.bat, with the following syntax on the command
> >> line:
> >> pdf2ocr SourceRootName
> >> where SourceRootName is the name of a PDF file without the .pdf
> >> extension.  This produces a text file with the same name except for a
> >> .txt extension.  The PDF name can include a directory path, but not
> >> embedded spaces.  For example,
> >> pdf2ocr c:\temp\test
> >> produces
> >> c:\temp\test.txt
> >>
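
For anyone who would rather script the two steps described above than use the batch file, here is a minimal Python sketch of the same pipeline. The contents of pdf2ocr.bat are not shown in this thread, so everything specific here is an assumption: the executable names (gswin32c, tesseract), the tiffg4 output device, the 300 dpi resolution, and the helper names build_commands and pdf2ocr are illustrative choices, not what the batch file actually does.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tesseract_exe="tesseract", dpi=300):
    """Build the two command lines: PDF -> TIFF (GhostScript), TIFF -> text (Tesseract).

    source_root is the PDF name without the .pdf extension, as in the
    batch-file syntax. Executable names, device, and dpi are assumptions.
    """
    pdf = source_root + ".pdf"
    tif = source_root + ".tif"
    gs_cmd = [
        gs_exe,
        "-dNOPAUSE",              # do not pause between pages
        "-dBATCH",                # exit when the input has been processed
        "-sDEVICE=tiffg4",        # assumed output device: black-and-white TIFF
        "-r%d" % dpi,             # assumed rendering resolution
        "-sOutputFile=" + tif,
        pdf,
    ]
    # Tesseract takes an image name and an output base; it appends .txt itself.
    ocr_cmd = [tesseract_exe, tif, source_root]
    return gs_cmd, ocr_cmd


def pdf2ocr(source_root, **options):
    """Run both steps in order and return the name of the resulting text file."""
    gs_cmd, ocr_cmd = build_commands(source_root, **options)
    subprocess.run(gs_cmd, check=True)   # raises CalledProcessError on failure
    subprocess.run(ocr_cmd, check=True)
    return source_root + ".txt"
```

Calling pdf2ocr(r"c:\temp\test") would then be expected to leave c:\temp\test.txt next to the PDF, provided both executables are on the path. Passing the arguments as a list rather than one command string should also sidestep the embedded-spaces limitation, though I have not tested that against this particular bundle.
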
> >> I am seeking feedback on this initial test version.  I want to be sure it
> >> works on computers that have not run the GhostScript installation program
> >> for Windows.  The archive contains an image-based PDF for testing called
> >> debate.pdf (the legal agreement between the Bush and Kerry campaigns
> >> concerning Presidential debates).  Please understand that Tesseract is
> >> not the best OCR available, though it is generally considered the best
> >> free OCR.
> >>
> >> Installation consists of unzipping the pdf2ocr.zip archive to a
> >> directory, e.g., to one called
> >> C:\PDF2OCR
> >> The target directory will contain many files that I gathered from
> >> subdirectories of an installed GhostScript directory tree.  It will also
> >> contain one subdirectory called tessdata, which is required by the
> >> tesseract.exe program for language support (I have only distributed
> >> English files, but other languages are available from the Google site).
> >>
> >> In order to run the batch file from any directory, you can add the
> >> PDF2OCR directory to the path of a console session with a command like
> >> the following:
> >> set path=c:\pdf2ocr;%path%
> >> You can add the path for every console session via the Advanced tab page
> >> of the System applet in Control Panel.
> >>
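
If the program is being driven from a script rather than a console, the same path change can be made for just that process. A small Python sketch follows, with the caveat that prepend_to_path is a hypothetical helper name and the duplicate check is my own addition, not something the set command does.

```python
import os


def prepend_to_path(directory, path_value, sep=os.pathsep):
    """Return path_value with directory placed first.

    If the directory is already listed (compared case-insensitively, as
    Windows does), the original value is returned unchanged.
    """
    entries = [entry for entry in path_value.split(sep) if entry]
    if any(entry.lower() == directory.lower() for entry in entries):
        return path_value
    return sep.join([directory] + entries)
```

Assigning os.environ["PATH"] = prepend_to_path(r"c:\pdf2ocr", os.environ.get("PATH", "")) affects only the running Python process and any programs it launches, much as the set command affects only the current console session. A permanent change still goes through the System applet.
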
> >> The pdf2ocr.zip download is large, about 14 megs, so it will probably
> >> remain a stand-alone project, rather than being bundled with other
> >> applications I develop.  Feel free to enhance it in the spirit of open
> >> source development!
> >>
> >> Jamal
> >>
> >> __________
> >> View the list's information and change your settings at
> >> //www.freelists.org/list/programmingblind
> >>
> >>
> >
> >
> 
> 



