On Thu, 2014-03-20 at 21:54 +0100, Jean-Francois Dockes wrote:
> Ciaran Farrell writes:
> > Hi,
> >
> > I have a directory with lots of PDFs (not text, they were scanned). To
> > manage these PDFs I'd like to add custom metadata fields. With the pdfrw
> > module in python it was quite easy to do this. exiftool shows that the
> > fields were indeed added to the pdf. With exiftool -CustomFieldName1
> > -CustomFieldName2 I can extract the metadata.
> >
> > I'd like to have recoll do the heavy lifting for me. However, I see that
> > by default it isn't possible to index/search on custom fields. I read
> > through https://bitbucket.org/medoc/recoll/wiki/HandleCustomField and
> > followed the instructions there (using exiftool to extract the metadata
> > instead of pdfinfo - which can't do it on the commandline for me).
>
> Just to be sure I understand: you have modified rclpdf (or made a
> modified copy somewhere), and you execute this during indexing (either it's
> modified in place, or you modified mimeconf to execute the new copy) ?

Yes - here is the content of ~/.recoll/mimeconf

[index]
application/pdf = exec /home/cfarrell/.recoll/rclpdf

And the corresponding custom copy of rclpdf has the following:

[...]
checkcmds exiftool pdftotext iconv awk

set `exiftool "$infile" | egrep ^Eff Date`
effDate=`printf "%s" $2`

# Run pdftotext and fix the result (add a charset tag and fix the html escaping
# The strange 'BEGIN' setup is to prevent 'file' from thinking this file
# is an awk program
pdftotext $optionraw -htmlmeta -enc UTF-8 -eol unix -q "$infile" - |
iconv -f UTF-8 -t UTF-8 -c -s |
awk -v EffDate="$effDate" 'BEGIN'\
' {
    doescape = 0
    cont = ""
    charsetmeta = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"
    effdatemeta = "<meta name=\"effdate\" content=\"" EffDate "\">\n"
}
{
    $0 = cont $0
    cont = ""
    # Insert charset meta tag at end of header
    if(doescape == 0 && $0 ~ /<\/head>/) {
        match($0, /<\/head>/)
        part1 = substr($0, 0, RSTART-1)
        part2 = substr($0, RSTART, length($0))
        $0 = part1 charsetmeta effdatemeta part2
    }
[...]

Then I also modified ~/.recoll/fields as follows (just the last line was
added - i.e. effdate - the others were there already):

(in [prefixes]):
filename = XSFN
rclUnsplitFN = XSFS
xapyear = Y
effdate = XYPDFP

(in [stored]):
rclbes=
recipient=
effdate=

Then I rebuilt the index, but as mentioned, it didn't work.

>
> > However, whereas I see CustomFieldName appearing in the GUI (e.g. in the
> > advanced search window), no results are returned, irrespective of what I
> > do. For example, (on the commandline) recoll -t effDate:2012-12-01
> > should certainly have returned something (I can return results if I do
> > something like recoll -t fileType:pdf).
>
> I think that the fact that the field appears in the GUI just means that it
> was found from the "fields" file, not necessarily that there is anything in
> the index for this field.
>
> You can check if anything is stored for your field by running the indexer
> with the log at level 6: the data records are dumped, and you can see all
> stored fields.
>
> Then, for the field to be searchable, you must also define an index prefix
> for it (in the prefixes section of the fields file).
> Did you do this ?
>
> Last, I see that your field is a date, but Recoll will not be able to
> handle this properly, it will just handle it as text. This could still sort
> of work because the phrase search for 2012-12 or 2012-12-01 could return
> meaningful results, but be aware that this is not the same thing as a
> properly processed date field.

That is ok - I think. I intend to use the python api to access the data,
so a strptime on that should work.

>
> > Is there any simpler way of doing it than having a customized rclpdf in
> > e.g. ~/.recoll and editing mimeconf to exec that? If not, what could I
> > be doing wrong (or not doing) that would stop the indexing/searching on
> > the custom field?
>
> There is actually another way than a modified filter: recoll can also
> execute an additional command to extract metadata. See:
> http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html#RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS

Checking it out - need some coffee first ...

>
> But in your case, it seems more natural to modify rclpdf.
>
> I'd check each step: check that you do have an appropriate "meta" field in
> the HTML which is generated by the modified rclpdf, check that the right
> filter is effectively executed, check that the data record does have the
> entry for your field.

I'm not sure the meta field I added was correct (the one I posted inline) -
the field names seem to be case sensitive sometimes, and camelCase names
seem to be expanded into two words (but exiftool seems to do this too) -
i.e. effDate becomes Eff Date in exiftool.

>
> And don't hesitate to ask more questions, there is no reason why this
> should not work, but I don't usually get it right the first time myself...

Great, and thanks for the help!

Ciaran
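
One thing that may be worth checking in the rclpdf copy posted above: the
egrep pattern is unquoted, so the shell passes "Date" to egrep as a file
name instead of as part of the pattern, and even with quoting, word
splitting by "set" would put the word "Date" in $2, not the value. A minimal
sketch of pulling the value out in Python and running strptime on it,
assuming exiftool's default "Label : value" text output and a YYYY-MM-DD
date (as in the 2012-12-01 query). The sample text and the file name in it
are made up for illustration, and extract_tag / parse_effdate are
hypothetical helper names, not part of recoll or exiftool:

```python
from datetime import datetime
from typing import Optional


def extract_tag(exiftool_output: str, label: str) -> Optional[str]:
    """Return the value printed for `label` in 'Label : value' style output.

    Assumption: one tag per line, label and value separated by the first
    colon, as in exiftool's default text output.
    """
    for line in exiftool_output.splitlines():
        # Split on the first colon only, so values containing colons survive.
        name, sep, value = line.partition(":")
        if sep and name.strip() == label:
            return value.strip()
    return None


def parse_effdate(value: str) -> datetime:
    """Parse the stored text value, assuming a YYYY-MM-DD layout."""
    return datetime.strptime(value, "%Y-%m-%d")


# Made-up sample of what `exiftool file.pdf` might print for this field.
sample = (
    "File Name                       : scan-0001.pdf\n"
    "Eff Date                        : 2012-12-01\n"
)

eff = extract_tag(sample, "Eff Date")
if eff is not None:
    print(parse_effdate(eff).date())
```

The same parse_effdate call could be applied to whatever text the python
api hands back for the effdate field, since Recoll only stores it as text.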