[recoll-user] Re: full text searching for parts of a known date

  • From: <jfd@xxxxxxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Wed, 25 Jul 2012 08:30:17 +0200

Alexander Beulich writes:
 > Hi jf,
 > 
 > thanks for your response. I'm happy to explain my idea in more detail:
 > 
 > I'm planning to scan and recognize (OCR) official documents 
 > (insurance, bank statements and so on) from the last years.
 > 
 > To find them later on I would like to use recoll. As a test I 
 > scanned three bank statements and made the full text available 
 > by sending them through OCR and adding the recognized text to the pdfs.
 > 
 > Then I configured recoll to index these pdfs.
 > 
 > The documents always contain the date when they got sent to me. For
 > instance two contain the string "18.04.2012" (German date format)
 > amongst lots of other strings like my name, my bank account number, the
 > name of the bank and so on. The third bank statement contains another
 > date string ("23.04.2012"). 
 > 
 > The search is returning the correct results when searching for the exact
 > dates (like "18.04.2012"). 
 > 
 > But what if I wanted to find all bank statements sent to me between the
 > 01.04.2012 and 30.04.2012?  
 > 
 > I thought it might work when doing a wildcard search like *.04.2012 AND
 > "bank statement" but that results in a long  duration of the actual
 > search process (actually I never waited for it to finish, since it took
 > more than 5 minutes without completing). 
 > 
 > Please keep in mind that the file creation date doesn't help, since I'm
 > currently handling documents dating back to 2010, 2011 and so on. They
 > will all have a creation date like XX.XX.2012. 

Thanks for the explanation. 

A string like 30.04.2012 is currently indexed as a single term, so the only
standard possibility is the one you used, a left-side wildcard, which yields
unacceptable performance because the search will have to scan the whole
term list.
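
To illustrate why the side of the wildcard matters, here is a toy model in
Python. It is not Recoll's actual index code: it only assumes that the terms
are kept in sorted order, and the function names are made up for the
example. A prefix search can jump to the first candidate and stop early,
while a suffix search such as *.04.2012 has to look at every term.

```python
import bisect


def prefix_matches(sorted_terms, prefix):
    """Terms starting with prefix: binary search, then stop at the first miss."""
    i = bisect.bisect_left(sorted_terms, prefix)
    out = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out


def suffix_matches(sorted_terms, suffix):
    """Terms ending with suffix: the sort order does not help, so scan them all."""
    return [t for t in sorted_terms if t.endswith(suffix)]


# Hypothetical term list, standing in for the index vocabulary.
terms = sorted(["01.04.2012", "18.04.2012", "18.05.2011",
                "23.04.2012", "bank", "statement"])

print(prefix_matches(terms, "18."))       # touches only the neighbouring terms
print(suffix_matches(terms, ".04.2012"))  # touches every term in the list
```

With a real index holding a very large number of terms, the second kind of
scan is what makes the query so slow.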

A modified approach would be to scan the pdf text for dates before indexing
and use, for example, the first value found as the document date. I can see
two ways to do this:

- Either post-process the pdfs and set the file's date using "touch" as
  part of the scanning process or just after.

- Or use a modified rclpdf script. The modified script should scan the
  document text for dates and set a Recoll "date" header inside the
  generated html:
     <meta name="date" content="2012-04-20 00:00:00">
  This is non-standard html, but it will be recognized by the indexer as the
  document date.

  As I could not resist doing it, I am attaching a modified rclpdf which
  should hopefully do what you need. You need to copy it somewhere, make it
  executable, and add the following to ~/.recoll/mimeconf:
    [index]
    application/pdf = exec /path/to/my/modified/rclpdf
  As I have no sample document to test with, I am not too sure if it
  actually works. You can try it on the command line: rclpdf mypdf.pdf, and
  check that a "date" header line is inserted (at the end of the header
  section). 
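
The date extraction that both approaches rely on could look roughly like the
Python sketch below. It is only an illustration and not the attached rclpdf:
the regular expression, the choice of the first valid dd.mm.yyyy match, and
all function names are assumptions. set_file_date() corresponds to the
"touch" approach, and meta_date_header() produces a header line in the form
shown above.

```python
import os
import re
from datetime import datetime

# Assumed pattern for German-style dates such as 18.04.2012 (day.month.year).
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def first_date(text):
    """Return the first valid dd.mm.yyyy date found in text, or None."""
    for day, month, year in DATE_RE.findall(text):
        try:
            return datetime(int(year), int(month), int(day))
        except ValueError:
            continue  # not a real calendar date, e.g. 31.02.2012 or OCR noise
    return None


def meta_date_header(when):
    """Format a date header line like the one shown above."""
    return '<meta name="date" content="%s">' % when.strftime("%Y-%m-%d %H:%M:%S")


def set_file_date(path, when):
    """Set the file's access and modification times, much like 'touch' does."""
    stamp = when.timestamp()  # a naive datetime is interpreted as local time
    os.utime(path, (stamp, stamp))


if __name__ == "__main__":
    sample = "Statement of 18.04.2012, booking date 23.04.2012"
    found = first_date(sample)
    if found is not None:
        print(meta_date_header(found))
```

Taking the first date in the text is a simplification. If a document mentions
an earlier date before the sending date, a different rule would be needed,
for example the date nearest the letterhead.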
  
With both approaches, you should then be able to use the normal date/time
search facility in Recoll.

Please tell us how this works.

Cheers,

jf

