it would

--- Matthew2007 <matthew2007@xxxxxxxxxxx> wrote:

> You know, it would be kinda interesting to run your file through
> Acrobat Reader and K1000 and compare the results.
>
> Thanks,
> Matthew
>
> ----- Original Message -----
> From: "Ian D. Nichols" <inich@xxxxxxxxxx>
> To: <programmingblind@xxxxxxxxxxxxx>
> Sent: Wednesday, September 12, 2007 8:14 PM
> Subject: Re: Introducing PDF2OCR and seeking testers
>
> > Hi Jamal,
> >
> > I have tested your new pdf2ocr utility on my PC, which so far as I
> > know has not had the GhostScript program on it before now.
> >
> > I downloaded pdf2ocr.zip and used FileDir to extract it to a new
> > folder called c:\pdf2ocr. The folder then contained more than 700
> > files.
> >
> > I used a command prompt - "pdf2ocr debate" - to run the program, and
> > it finished in perhaps 10-15 seconds. I then had a file called
> > "debate.txt", containing 1125 bytes. I viewed it in Notepad, and it
> > seemed to be a single page of the legal agreement, ending in the
> > middle of a sentence. The quality of the OCR was not very good.
> >
> > I hope that is useful information for you.
> >
> > All the best,
> >
> > Ian
> >
> > Ian D. Nichols,
> > Toronto, Canada
> >
> > ----- Original Message -----
> > From: "Jamal Mazrui" <empower@xxxxxxxxx>
> > To: <programmingblind@xxxxxxxxxxxxx>
> > Sent: Wednesday, September 12, 2007 7:03 PM
> > Subject: Introducing PDF2OCR and seeking testers
> >
> >> Now available at
> >> http://www.EmpowermentZone.com/pdf2ocr.zip
> >>
> >> Following up on a tip from Ken Perry about the open source
> >> Tesseract project at Google, I have tried to use this OCR engine to
> >> build a free program for producing accessible text from an
> >> image-based PDF. Such files are created by scanning equipment or
> >> software printer drivers that save only the picture of text,
> >> without the actual characters themselves.
> >> This makes them inaccessible to most PDF viewing utilities, which
> >> extract text but do not perform OCR on images.
> >>
> >> I could not find an existing Windows solution on the web, but did
> >> get useful ideas from Linux-oriented ones. What I am calling
> >> PDF2OCR combines Tesseract from
> >> http://code.google.com/p/tesseract-ocr
> >> with the GhostScript interpreter from
> >> http://ghostscript.com
> >>
> >> GhostScript creates a .tif file from the .pdf file of interest, and
> >> then Tesseract creates a .txt file from that. The current
> >> implementation is a simple batch file, pdf2ocr.bat, with the
> >> following syntax on the command line:
> >> pdf2ocr SourceRootName
> >> where SourceRootName is the name of a PDF file without the .pdf
> >> extension. This produces a text file with the same name except for
> >> a .txt extension. The PDF name can include a directory path, but
> >> not embedded spaces. For example,
> >> pdf2ocr c:\temp\test
> >> produces
> >> c:\temp\test.txt
> >>
> >> I am seeking feedback on this initial test version. I want to be
> >> sure it works on computers that have not run the GhostScript
> >> installation program for Windows. The archive contains an
> >> image-based PDF for testing called debate.pdf (the legal agreement
> >> between the Bush and Kerry campaigns concerning Presidential
> >> debates). Please understand that Tesseract is not the best OCR
> >> available, though it is generally considered the best free OCR.
> >>
> >> Installation consists of unzipping the pdf2ocr.zip archive to a
> >> directory, e.g., to one called
> >> C:\PDF2OCR
> >> The target directory will contain many files that I gathered from
> >> subdirectories of an installed GhostScript directory tree.
> >> It will also contain one subdirectory called tessdata, which is
> >> required by the tesseract.exe program for language support (I have
> >> only distributed English files, but other languages are available
> >> from the Google site).
> >>
> >> In order to run the batch file from any directory, you can add the
> >> PDF2OCR directory to the path of a console session with a command
> >> like the following:
> >> set path=c:\pdf2ocr;%path%
> >> You can add the path for every console session via the Advanced tab
> >> page of the System applet in Control Panel.
> >>
> >> The pdf2ocr.zip download is large, about 14 megs, so it will
> >> probably remain a stand-alone project, rather than being bundled
> >> with other applications I develop. Feel free to enhance it in the
> >> spirit of open source development!
> >>
> >> Jamal
> >>
> >> __________
> >> View the list's information and change your settings at
> >> //www.freelists.org/list/programmingblind
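
For anyone wanting to experiment with the two-step pipeline described in the original announcement (GhostScript turns the .pdf into a .tif, then Tesseract turns the .tif into a .txt), here is a minimal Python sketch of how the two command lines might be assembled. It is not the contents of pdf2ocr.bat, which are not shown in this thread. The executable names (gswin32c, tesseract), the tiffgray output device, the 300 dpi resolution, and the function name build_commands are all assumptions made for illustration.

```python
# Sketch only: builds the two command lines without running them.
# Assumed names: gswin32c (Ghostscript's Windows console program)
# and tesseract (the Tesseract command-line tool).
GHOSTSCRIPT_EXE = "gswin32c"
TESSERACT_EXE = "tesseract"


def build_commands(source_root_name, resolution=300):
    """Return (ghostscript_cmd, tesseract_cmd) for a PDF root name.

    source_root_name is the PDF path without the .pdf extension,
    mirroring the "pdf2ocr SourceRootName" syntax in the announcement.
    """
    # The announcement says embedded spaces are not supported.
    if " " in source_root_name:
        raise ValueError("the PDF name must not contain embedded spaces")

    pdf = source_root_name + ".pdf"
    tif = source_root_name + ".tif"

    # Step 1: render the PDF pages to a TIFF image.
    # The device and resolution are assumptions, not the batch file's settings.
    ghostscript_cmd = [
        GHOSTSCRIPT_EXE,
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=tiffgray",
        "-r%d" % resolution,
        "-sOutputFile=" + tif,
        pdf,
    ]

    # Step 2: OCR the image. Tesseract takes the image name and an output
    # base name, and appends .txt to the base name itself.
    tesseract_cmd = [TESSERACT_EXE, tif, source_root_name]

    return ghostscript_cmd, tesseract_cmd


if __name__ == "__main__":
    for cmd in build_commands(r"c:\temp\test"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where both programs are on the path. Building the commands separately from running them makes it easy to inspect exactly what would be executed before trying it on a real file.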