Hi,

first of all, many thanks for your efforts. I'm pretty stunned by your
response times.

I'll give your suggestions a try and update the list about the outcome.
And I know exactly what you mean by not being able to resist trying
something that came to mind... immediately. :-)

Anyway, I will probably need a few days to test the modified rclpdf. :-(

Best,
Alex

-------- Original Message --------
> Date: Wed, 25 Jul 2012 08:30:17 +0200
> From: jfd@xxxxxxxxxx
> To: recoll-user@xxxxxxxxxxxxx
> Subject: [recoll-user] Re: full text searching for parts of a known date
>
> Alexander Beulich writes:
> > Hi jf,
> >
> > thanks for your response. I'm happy to explain my idea in more detail:
> >
> > I'm planning to scan and recognize (OCR) official documents
> > (insurance, bank statements and so on) from the last years.
> >
> > To find them later on I would like to use recoll. As a test I
> > scanned three bank statements and made the full text available
> > by sending them through OCR and adding the recognized text to the pdfs.
> >
> > Then I configured recoll to index these pdfs.
> >
> > The documents always contain the date when they got sent to me. For
> > instance two contain the string "18.04.2012" (German date format)
> > amongst lots of other strings like my name, my bank account number, the
> > name of the bank and so on. The third bank statement contains another
> > date string ("23.04.2012").
> >
> > The search returns the correct results when searching for the exact
> > dates (like "18.04.2012").
> >
> > But what if I wanted to find all bank statements sent to me between
> > 01.04.2012 and 30.04.2012?
> >
> > I thought it might work when doing a wildcard search like
> > *.04.2012 AND "bank statement" but that results in a very long search
> > (actually I never waited for it to finish, since it took more than
> > 5 minutes without completing).
> >
> > Please keep in mind that the file creation date doesn't help, since I'm
> > currently handling documents dating back to 2010, 2011 and so on. They
> > will all have a creation date like XX.XX.2012.
>
> Thanks for the explanation.
>
> A string like 30.04.2012 is currently indexed as a single term, so the
> only standard possibility is the one you used, a left-side wildcard,
> which yields unacceptable performance because the search has to scan
> the whole term list.
>
> A modified approach would be to scan the pdf text for dates before
> indexing and use, for example, the first value found as the document
> date. I can see two ways to do this:
>
> - Either post-process the pdfs and set the file's date using "touch" as
>   part of the scanning process or just after.
>
> - Or use a modified rclpdf script. The modified script should scan the
>   document text for dates and set a Recoll "date" header inside the
>   generated html:
>       <meta name="date" content="2012-04-20 00:00:00">
>   This is non-standard html, but it will be recognized by the indexer
>   as the document date.
>
> As I could not resist doing it, I am attaching a modified rclpdf which
> should hopefully do what you need. You need to copy it somewhere, make
> it executable, and add the following to ~/.recoll/mimeconf:
>     [index]
>     application/pdf = exec /path/to/my/modified/rclpdf
> As I have no sample document to test with, I am not too sure if it
> actually works. You can try it on the command line: rclpdf mypdf.pdf,
> and check that a "date" header line is inserted (at the end of the
> header section).
>
> With both approaches, you should then be able to use the normal
> date/time search facility in Recoll.
>
> Please tell us how this works.
>
> Cheers,
>
> jf
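
For reference, the date-scanning step described in the quoted message could
be sketched roughly as follows. This is only an illustration in Python, not
the attached rclpdf script: the function names (first_date, date_meta_tag,
add_date_header) and the regular expression are assumptions made up for this
sketch, and it only handles the DD.MM.YYYY form used in the example
documents.

```python
import re
from datetime import datetime
from typing import Optional

# Assumption: dates appear in German DD.MM.YYYY form, e.g. 18.04.2012.
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def first_date(text: str) -> Optional[datetime]:
    """Return the first valid DD.MM.YYYY date found in text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            return datetime(year, month, day)
        except ValueError:
            # Looks like a date but is not one (e.g. 31.02.2012); keep looking.
            continue
    return None


def date_meta_tag(text: str) -> Optional[str]:
    """Build the "date" meta header for the first date in text, if any."""
    found = first_date(text)
    if found is None:
        return None
    return '<meta name="date" content="{}">'.format(
        found.strftime("%Y-%m-%d %H:%M:%S")
    )


def add_date_header(html: str, text: str) -> str:
    """Insert the date header at the end of the html header section."""
    tag = date_meta_tag(text)
    if tag is None or "</head>" not in html:
        return html  # nothing to add, leave the html untouched
    return html.replace("</head>", tag + "\n</head>", 1)
```

Taking the first date in the text is the heuristic suggested in the quoted
message; a statement that mentions an earlier date before its own sending
date would get the wrong document date, so the result is worth checking on a
few real scans.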