[recoll-user] Re: full text searching for parts of a known date

  • From: "Alexander Beulich" <AlexBeulich@xxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Wed, 25 Jul 2012 22:25:28 +0200

Hi,

first of all many thanks for your efforts. I'm pretty stunned
by your response times.  

I'll give your suggestions a try and update the list about 
the outcome.

And I know exactly what you mean about not being able to resist
trying something that came to mind... immediately. :-)

Anyway, I will probably need a few days to test the modified rclpdf. :-(
 
Best, Alex

-------- Original Message --------
> Date: Wed, 25 Jul 2012 08:30:17 +0200
> From: jfd@xxxxxxxxxx
> To: recoll-user@xxxxxxxxxxxxx
> Subject: [recoll-user] Re: full text searching for parts of a known date

> Alexander Beulich writes:
>  > Hi jf,
>  > 
>  > thanks for your response. I'm happy to explain my idea in more detail:
>  > 
>  > I'm planning to scan and recognize (OCR) official documents
>  > (insurance, bank statements and so on) from the last few years.
>  > 
>  > To find them later on I would like to use recoll. As a test I
>  > scanned three bank statements and made the full text available
>  > by sending them through OCR and adding the recognized text to the pdfs.
>  > 
>  > Then I configured recoll to index these pdfs.
>  > 
>  > The documents always contain the date when they got sent to me. For
>  > instance two contain the string "18.04.2012" (German date format)
>  > amongst lots of other strings like my name, my bank account number, the
>  > name of the bank and so on. The third bank statement contains another
>  > date string ("23.04.2012"). 
>  > 
>  > The search is returning the correct results when searching for the
>  > exact dates (like "18.04.2012").
>  > 
>  > But what if I wanted to find all bank statements sent to me between
>  > 01.04.2012 and 30.04.2012?
>  > 
>  > I thought it might work to do a wildcard search like *.04.2012 AND
>  > "bank statement", but the search then takes very long (I never
>  > waited for it to finish, since it ran for more than 5 minutes
>  > without completing).
>  > 
>  > Please keep in mind that the file creation date doesn't help, since I'm
>  > currently handling documents dating back to 2010, 2011 and so on. They
>  > will all have a creation date like XX.XX.2012. 
> 
> Thanks for the explanation. 
> 
> A string like 30.04.2012 is currently indexed as a single term, so the only
> standard possibility is the one you used, a left-side wildcard, which yields
> unacceptable performance because the search will have to scan the whole
> term list.
> 
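
The cost difference between a trailing and a leading wildcard can be seen in a toy model. The sketch below assumes the index keeps its terms in sorted order, which lets a prefix be located by binary search; the term list and function names are hypothetical illustrations, not Recoll's actual code.

```python
from bisect import bisect_left

# Toy stand-in for an index's sorted term list (an assumption for
# illustration, not Recoll's real data structure).
TERMS = sorted([
    "18.04.2012", "23.04.2012", "05.03.2011", "account", "bank", "statement",
])

def match_prefix(terms, prefix):
    """Trailing wildcard (prefix*): jump to the first candidate, stop early."""
    start = bisect_left(terms, prefix)
    hits = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees that no later term can match
        hits.append(term)
    return hits

def match_suffix(terms, suffix):
    """Leading wildcard (*suffix): sorted order does not help, test every term."""
    return [term for term in terms if term.endswith(suffix)]

print(match_prefix(TERMS, "18."))       # ['18.04.2012']
print(match_suffix(TERMS, ".04.2012"))  # ['18.04.2012', '23.04.2012']
```

With a handful of terms both calls are instant, but only the suffix match grows linearly with the size of the whole term list.
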
> A modified approach would be to scan the pdf text for dates before indexing
> and use, for example, the first value found as the document date. I can see
> two ways to do this:
> 
> - Either post-process the pdfs and set the file's date using "touch" as
>   part of the scanning process or just after.
> 
> - Or use a modified rclpdf script. The modified script should scan the
>   document text for dates and set a Recoll "date" header inside the
>   generated html:
>      <meta name="date" content="2012-04-20 00:00:00">
>   This is non-standard html, but it will be recognized by the indexer as the
>   document date.
> 
>   As I could not resist doing it, I am attaching a modified rclpdf which
>   should hopefully do what you need. You need to copy it somewhere, make it
>   executable, and add the following to ~/.recoll/mimeconf:
>     [index]
>     application/pdf = exec /path/to/my/modified/rclpdf
>   As I have no sample document to test with, I am not too sure if it
>   actually works. You can try it on the command line: rclpdf mypdf.pdf, and
>   check that a "date" header line is inserted (at the end of the header
>   section).
>   
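
Both options above start from the same step: finding the first date in the extracted text. The following is a minimal sketch of that step and of the two ways to use the result. It is not the attached rclpdf script; the function names and the sample text are hypothetical, and it assumes dates appear in the DD.MM.YYYY form described in the thread.

```python
import os
import re
from datetime import datetime

# Matches German-style dates such as 18.04.2012 (DD.MM.YYYY).
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def first_date(text):
    """Return the first valid DD.MM.YYYY date in text as a datetime, or None."""
    for day, month, year in DATE_RE.findall(text):
        try:
            return datetime(int(year), int(month), int(day))
        except ValueError:
            continue  # skip impossible dates, e.g. an OCR misread like 31.02.2012
    return None

def touch_to_date(path, when):
    """Option 1: set the file's access and modification times, like `touch`."""
    stamp = when.timestamp()
    os.utime(path, (stamp, stamp))

def date_meta_tag(when):
    """Option 2: build the "date" header line in the format quoted above."""
    return '<meta name="date" content="%s">' % when.strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical OCR output, for illustration only.
sample = "Bank statement dated 18.04.2012, account 12345"
found = first_date(sample)
print(date_meta_tag(found))  # <meta name="date" content="2012-04-18 00:00:00">
```

Taking the first date found is only a heuristic; a statement that mentions an earlier transaction date before its own date would get the wrong value.
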
> With both approaches, you should then be able to use the normal date/time
> search facility in Recoll.
> 
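
Once documents carry the right date, the April 2012 question from the original message becomes a date-range query. The sketch below builds such a query string, assuming the installed Recoll version supports the query language's `date:` clause with ISO 8601 intervals; the helper name is hypothetical.

```python
from datetime import datetime

def recoll_date_query(first_day, last_day, phrase):
    """Build a query string from DD.MM.YYYY bounds and a search phrase.

    Assumes the Recoll query language's `date:` clause with an
    ISO 8601 interval (start/end); check this against your version.
    """
    start = datetime.strptime(first_day, "%d.%m.%Y").strftime("%Y-%m-%d")
    end = datetime.strptime(last_day, "%d.%m.%Y").strftime("%Y-%m-%d")
    return 'date:%s/%s "%s"' % (start, end, phrase)

print(recoll_date_query("01.04.2012", "30.04.2012", "bank statement"))
# date:2012-04-01/2012-04-30 "bank statement"
```
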
> Please tell us how this works.
> 
> Cheers,
> 
> jf
> 
> 
