[recoll-user] Re: full text searching for parts of a known date

  • From: Alexander <alexbeulich@xxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Sat, 28 Jul 2012 00:43:28 +0200

Alexander Beulich <AlexBeulich@xxxxxx> wrote:

Hi JF,

thanks again for the script! I tried it, it works and
this is what it adds to the head tag:

<head>
<title></title>
<meta name="Producer" content="ABBYY FineReader 8.0 Professional Edition"/>
<meta name="CreationDate" content=""/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="date" content="2012-02-18T00:00:00">
</head>

It seems to be picked up by recoll correctly, since I was 
subsequently able to find the docs using a keyword in combination 
with the date filter of the advanced search. :-)
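For anyone following along, here is a minimal sketch of the date conversion involved, written in Python rather than in the shell of the rclpdf script. It assumes the input date is in day.month.year form (as in 18.02.2012) and that no time of day is known, so midnight is used. The function names to_index_date and date_meta_tag are made up for illustration and are not part of recoll.

```python
from datetime import datetime


def to_index_date(text, input_format="%d.%m.%Y"):
    """Convert a date string such as '18.02.2012' to 'YYYY-MM-DDTHH:MM:SS'."""
    # Assumption: the input carries no time of day, so strptime fills in 00:00:00.
    parsed = datetime.strptime(text, input_format)
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")


def date_meta_tag(text):
    """Build a meta element using the "content" attribute, as in the head shown above."""
    return '<meta name="date" content="{}">'.format(to_index_date(text))


print(date_meta_tag("18.02.2012"))
# prints: <meta name="date" content="2012-02-18T00:00:00">
```

A date that does not match the assumed input format raises ValueError, which is probably preferable to writing a malformed tag into the head.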

If you find the time to answer I have two further questions 
regarding the customisation of recoll.

1. Will recollindex pick up more than one date tag?
(I was thinking of adding all contained dates to the head)

2. Does recoll take the original CreationDate from pdftotext into account? (I 
was thinking of putting the real file creation date 
reported by stat there)

Thanks again, mate.
I really appreciate your help.

Alex

-------- Original Message --------
> Date: Fri, 27 Jul 2012 08:18:10 +0200
> From: Jean-Francois Dockes <jf@xxxxxxxxxx>
> To: "Alexander Beulich" <AlexBeulich@xxxxxx>
> Subject: Re: [recoll-user] Re: full text searching for parts of a known date

> Alexander Beulich writes:
> > Hi,
> > 
> > it's not a waste of time! I'm learning a little bit about the recoll
> > internals at least.
> > 
> > I tried the modified rclpdf script before I saw your email.
> > This is what it currently adds to the html file:
> > 
> > <meta name="Producer" content="ABBYY FineReader 8.0 Professional Edition"/>
> > <meta name="CreationDate" content=""/>
> > <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> > <meta name="date" value="18.02.2012">
> > 
> > The script is basically working then. :-)
> 
> Yes, it seems so. There are two issues with the current script: it should
> use "content", not "value", and it should change the date format to what
> the indexer expects: %Y-%m-%dT%H:%M:%S.
> 
> I am attaching a hopefully corrected script.
> 
> > Looking through the script I wondered if the CreationDate is 
> > supposed to be empty? I double checked using the original 
> > rclpdf and the output for that field was the same.
> 
> Recoll uses "date", not "CreationDate". Apparently, pdftotext uses
> "CreationDate" for PDF meta info. I'm not too sure that this was the case
> when I first wrote the script (there are probably a few things in the
> script, such as escaping HTML special characters, which are no longer
> necessary with the current pdftotext; the old one was very buggy).
> 
> Anyway, a quick test seems to indicate that "CreationDate" is most often
> empty. Generally speaking, pdf metadata is mostly useless.
> 
> And I can't seem to find a real standard for indicating dates in "meta"
> HTML elements, so why not "date" :)
> 
> > I have another question regarding testing: 
> > How does the indexer keep track of files already indexed?
> 
> It looks at the files' stat() data: date modified and size.
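
A minimal sketch of this kind of change detection, for illustration only. It assumes "date modified" corresponds to st_mtime and that the indexer keeps the previous (mtime, size) pair for each file; the function names are hypothetical and this is not recoll's actual code.

```python
import os


def file_signature(path):
    """Return the (modification time, size) pair taken from stat()."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)


def needs_reindex(path, stored_signature):
    """True when the current stat() data differs from what was recorded earlier."""
    return file_signature(path) != stored_signature
```

With a check like this, a file whose content is unchanged is skipped on the next pass, which is why a file that was already indexed has to be reindexed explicitly when only the filter script has changed.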
> 
> > My test files were indexed before and that's why I'm not sure how 
> > to make recollindex process them again?
> 
> There are several ways. You can reindex everything in bulk using
> "recollindex -z" (or just delete ~/.recoll/xapiandb). When testing this
> way, it may be convenient to set up the config so that it indexes only
> one or a few test files.
> 
> 
> Or you can force reindexing a single file by using the -e/-i recollindex
> options (erase/index):
> 
> recollindex -e /path/to/my/file
> recollindex -i /path/to/my/file
> 
> In both cases you should take care that the paths used are consistent with
> what is set in ~/.recoll/recoll.conf; the path comparison is textual.
> 
> So, for example, if there is a /home -> /usr/home symbolic link,
> recollindex will refuse to index "/home/me/myfile" if "/usr/home/me" is in
> topdirs. Using ~/ should work with the default recoll.conf.
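
The symbolic link example can be illustrated with a short sketch. It assumes that "textual" means a plain string prefix comparison against each topdirs entry, with no symlink resolution; is_under_topdirs is a hypothetical helper and not how recollindex is actually implemented.

```python
def is_under_topdirs(path, topdirs):
    """Purely textual check: is path equal to, or below, one of the topdirs entries?"""
    for top in topdirs:
        base = top.rstrip("/")
        # No symlink resolution here: only the characters of the path are compared.
        if path == base or path.startswith(base + "/"):
            return True
    return False


topdirs = ["/usr/home/me"]
print(is_under_topdirs("/home/me/myfile", topdirs))      # False
print(is_under_topdirs("/usr/home/me/myfile", topdirs))  # True
```

Both paths may name the same file on disk, but only the second one matches textually, so the path given on the command line should be spelled the same way as in topdirs.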
> 
> Cheers,
> 
> jf
> 
