Alexander Beulich writes:
> Hi jf,
>
> thanks for your response. I'm happy to explain my idea in more detail:
>
> I'm planning to scan and recognize (OCR) official documents
> (insurance, bank statements and so on) from the last years.
>
> To find them later on I would like to use recoll. As a test I
> scanned three bank statements and made the full text available
> by sending them through OCR and adding the recognized text to the pdfs.
>
> Then I configured recoll to index these pdfs.
>
> The documents always contain the date when they got sent to me. For
> instance two contain the string "18.04.2012" (German date format)
> amongst lots of other strings like my name, my bank account number, the
> name of the bank and so on. The third bank statement contains another
> date string ("23.04.2012").
>
> The search is returning the correct results when searching for the exact
> dates (like "18.04.2012").
>
> But what if I wanted to find all bank statements sent to me between the
> 01.04.2012 and 30.04.2012?
>
> I thought it might work when doing a wildcard search like *.04.2012 AND
> "bank statement" but that results in a long duration of the actual
> search process (actually I never waited for it to finish, since it took
> more than 5 minutes without completing).
>
> Please keep in mind that the file creation date doesn't help, since I'm
> currently handling documents dating back to 2010, 2011 and so on. They
> will all have a creation date like XX.XX.2012.

Thanks for the explanation.

A string like 30.04.2012 is currently indexed as a single term, so the
only standard possibility is the one you used, a left-side wildcard.
This yields unacceptable performance because the search has to scan
the whole term list.

A modified approach would be to scan the pdf text for dates before
indexing and use, for example, the first value found as the document
date.
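As an illustration of the date-scanning step, here is a minimal Python sketch. It assumes the dates appear in the DD.MM.YYYY form quoted above and that the first valid one is the date wanted; the names DATE_RE and first_date are made up for this example and are not part of Recoll.

```python
import re
from datetime import date

# Matches German-style dates such as 18.04.2012 (DD.MM.YYYY).
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def first_date(text):
    """Return the first valid DD.MM.YYYY date found in text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Looks like a date but is not one (for example 31.02.2012).
            continue
    return None
```

OCR output can contain misread digits, so a real script would probably want to restrict the accepted year range as well.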
I can see two ways to do this:

 - Either post-process the pdfs and set the file's date using "touch" as
   part of the scanning process or just after.
 - Or use a modified rclpdf script.

The modified script should scan the document text for dates and set a
Recoll "date" header inside the generated html:

    <meta name="date" content="2012-04-20 00:00:00">

This is non-standard html, but it will be recognized by the indexer as
the document date.

As I could not resist doing it, I am attaching a modified rclpdf which
should hopefully do what you need. You need to copy it somewhere, make it
executable, and add the following to ~/.recoll/mimeconf:

    [index]
    application/pdf = exec /path/to/my/modified/rclpdf

As I have no sample document to test with, I am not too sure if it
actually works. You can try it on the command line:

    rclpdf mypdf.pdf

and check that a "date" header line is inserted (at the end of the header
section).

With both approaches, you should then be able to use the normal date/time
search facility in Recoll.

Please tell us how this works.

Cheers,

jf
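For reference, the two approaches described in the message could be sketched in Python roughly as follows. This is only an illustration under stated assumptions and is not the attached rclpdf script: the function names are hypothetical, the header layout is copied from the example line in the message, and the file time is set to local midnight of the given day.

```python
import os
from datetime import date, datetime


def meta_date_header(d):
    """Build a "date" meta line in the layout shown in the message."""
    return '<meta name="date" content="%s 00:00:00">' % d.isoformat()


def set_file_date(path, d):
    """Set a file's access and modification times to local midnight of d.

    This has the same effect as running "touch" with an explicit date.
    """
    ts = datetime(d.year, d.month, d.day).timestamp()
    os.utime(path, (ts, ts))
```

For example, set_file_date("statement.pdf", date(2012, 4, 18)) would give a scanned statement the date printed on it, where "statement.pdf" is a placeholder file name.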