[recoll-user] Re: indexing and searching customized metadata fields in pdf

  • From: Ciaran Farrell <ciaran@xxxxxxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Fri, 21 Mar 2014 08:44:28 +0100

On Thu, 2014-03-20 at 21:54 +0100, Jean-Francois Dockes wrote:
> Ciaran Farrell writes:
>  > Hi,
>  > 
>  > I have a directory with lots of PDFs (not text, they were scanned). To
>  > manage these PDFs I'd like to add custom metadata fields. With the pdfrw
>  > module in python it was quite easy to do this. exiftool shows that the
>  > fields were indeed added to the pdf. With exiftool -CustomFieldName1
>  > -CustomFieldName2 I can extract the metadata.
>  > 
>  > I'd like to have recoll do the heavy lifting for me. However, I see that
>  > by default it isn't possible to index/search on custom fields. I read
>  > through https://bitbucket.org/medoc/recoll/wiki/HandleCustomField and
>  > followed the instructions there (using exiftool to extract the metadata
>  > instead of pdfinfo - which can't do it on the commandline for me).
> 
> Just to be sure I understand: you have modified rclpdf (or made a
> modified copy somewhere), and you execute this during indexing (either it's
> modified in place, or you modified mimeconf to execute the new copy) ?

Yes. Here is the content of ~/.recoll/mimeconf:
[index]
application/pdf = exec /home/cfarrell/.recoll/rclpdf

And the corresponding custom copy of rclpdf has the following:
[...]

checkcmds exiftool pdftotext iconv awk
set `exiftool "$infile" | egrep ^Eff Date`
effDate=`printf "%s" $2`
# Run pdftotext and fix the result (add a charset tag and fix the html
# escaping)
# The strange 'BEGIN' setup is to prevent 'file' from thinking this file
# is an awk program
pdftotext $optionraw -htmlmeta -enc UTF-8 -eol unix -q "$infile" - |
iconv -f UTF-8 -t UTF-8 -c -s |
awk -v EffDate="$effDate" 'BEGIN'\
' {
  doescape = 0
  cont = ""
  charsetmeta = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"
  effdatemeta = "<meta name=\"effdate\" content=\"" EffDate "\">\n"
}
{
  $0 = cont $0
  cont = ""
  # Insert charset meta tag at end of header
  if(doescape == 0 && $0 ~ /<\/head>/) {
    match($0, /<\/head>/)
    part1 = substr($0, 0, RSTART-1)
    part2 = substr($0, RSTART, length($0))
    $0 =  part1 charsetmeta effdatemeta part2
  }
[...]
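
One thing that may be worth checking in the snippet above: the unquoted pattern in egrep ^Eff Date is passed as two separate arguments, so egrep treats Date as a file name rather than as part of the pattern. Even with the pattern quoted, splitting the exiftool line on whitespace would leave the literal word "Date" in $2 and the value in $4. The following Python sketch illustrates the splitting; the sample line is an assumed example in exiftool's default "Tag Name : value" layout, and parse_exiftool_line is a hypothetical helper, not part of exiftool or Recoll.

```python
def parse_exiftool_line(line):
    """Split one 'Tag Name : value' line into (tag, value).

    Only the first colon separates tag from value, so values that
    contain colons themselves (timestamps, for instance) stay intact.
    """
    tag, sep, value = line.partition(":")
    if not sep:
        raise ValueError("not a tag line: %r" % line)
    return tag.strip(), value.strip()


# Assumed sample line; the real output depends on the PDF being read.
sample = "Eff Date                        : 2012-12-01"

# Whitespace splitting, which is what `set` does in the shell:
# words[0] is "Eff", words[1] is "Date", words[2] is ":" and
# words[3] is the value.
words = sample.split()

# Splitting on the first colon keeps the two-word tag name together.
tag, value = parse_exiftool_line(sample)
```

Splitting on the colon rather than on whitespace also keeps working if the value itself contains spaces.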

Then I also modified ~/.recoll/fields as follows (just the last line was
added - i.e. effdate - the others were there already):

(in [prefixes]):
filename = XSFN
rclUnsplitFN = XSFS
xapyear = Y
effdate = XYPDFP

(in [stored]):
rclbes=
recipient=
effdate=


Then I rebuilt the index, but as mentioned, it didn't work.

> 
>  > However, whereas I see CustomFieldName appearing in the GUI (e.g. in the
>  > advanced search window), no results are returned, irrespective of what I
>  > do. For example, (on the commandline) recoll -t effDate:2012-12-01
>  > should certainly have returned something (I can return results if I do
>  > something like recoll -t fileType:pdf).
> 
> I think that the fact that the field appears in the GUI just means that it
> was found from the "fields" file, not necessarily that there is anything in
> the index for this field.
> 
> You can check if anything is stored for your field by running the indexer
> with the log at level 6: the data records are dumped, and you can see all
> stored fields.
> 
> Then, for the field to be searchable, you must also define an index prefix
> for it (in the prefixes section of the fields file). Did you do this ?
> 
> Last, I see that your field is a date. Recoll will not be able to handle
> this properly; it will just treat it as text. This could still sort of
> work, because a phrase search for 2012-12 or 2012-12-01 could return
> meaningful results, but be aware that this is not the same thing as a
> properly processed date field.

That is ok, I think. I intend to use the Python API to access the data,
so a strptime on that should work.
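
As a rough sketch of the strptime step, assuming the field is stored as plain YYYY-MM-DD text: the list of values below is made up for illustration and parse_effdate is a hypothetical helper. How the values are fetched through the Recoll Python API is deliberately left out.

```python
from datetime import date, datetime


def parse_effdate(text):
    """Return a date for 'YYYY-MM-DD' text, or None if it does not parse."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


# Hypothetical stored field values, including two that are not dates.
stored = ["2012-12-01", "2013-01-15", "Date", ""]

parsed = [parse_effdate(v) for v in stored]

# Once parsed, real date comparisons are possible, which a text-only
# index field cannot offer.
in_december_2012 = [d for d in parsed if d and (d.year, d.month) == (2012, 12)]
```

Returning None for unparseable values keeps one bad record from aborting a whole run, and makes it easy to list the documents whose field did not come through as expected.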
> 
>  > Is there any simpler way of doing it than having a customized rclpdf in
>  > e.g. ~/.recoll and editing mimeconf to exec that? If not, what could I
>  > be doing wrong (or not doing) that would stop the indexing/searching on
>  > the custom field?
> 
> There is actually another way than a modified filter: recoll can also
> execute an additional command to extract metadata. See:
> http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html#RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS

Checking it out - need some coffee first ...
> 
> But in your case, it seems more natural to modify rclpdf.
> 
> I'd check each step: check that you do have an appropriate "meta" field in
> the HTML which is generated by the modified rclpdf, check that the right
> filter is effectively executed, check that the data record does have the
> entry for your field.

I'm not sure the meta field I added (the one I posted inline) was
correct. The field names seem to be case sensitive sometimes, and
camelCase names seem to be expanded into two words (but exiftool seems
to do this too), i.e. effDate becomes Eff Date in exiftool.
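
One way to check the first step (is the meta field really in the generated HTML?) is to capture the filter's output for a single PDF and list the meta tags it contains. The sketch below uses only the Python standard library. The sample HTML is made up, modelled on the effdatemeta line in the script, and MetaCollector and meta_fields are hypothetical names. Lower-casing the names is a convenience for comparing them, not a statement about how Recoll treats case.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        # Tags without a name attribute (e.g. http-equiv) are skipped.
        if name is not None:
            self.meta[name.lower()] = attrs.get("content") or ""


def meta_fields(html_text):
    """Return a dict of meta name -> content found in html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta


# Made-up stand-in for the HTML written by the modified filter.
sample_html = (
    "<html><head><title>scan</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    '<meta name="effdate" content="2012-12-01">\n'
    "</head><body><pre>...</pre></body></html>"
)

fields = meta_fields(sample_html)
```

If the effdate entry is missing, or its content is empty or not a date, the problem is in the filter rather than in the fields file.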
> 
> And don't hesitate to ask more questions, there is no reason why this
> should not work, but I don't usually get it right the first time myself...

Great, and thanks for the help!

Ciaran
