[recoll-user] Re: indexing and searching customized metadata fields in pdf

  • From: Jean-Francois Dockes <jfd@xxxxxxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Fri, 21 Mar 2014 13:11:11 +0100

Ciaran Farrell writes:
 > On Fri, 2014-03-21 at 10:49 +0100, Jean-Francois Dockes wrote:
 > > My guess would be that termdate is not defined as a field for some
 > > reason. When the field name is not recognized, recoll just runs the query
 > > without a field, and 2012 (without field) does index the doc.
 > > 
 > > You can check two things:
 > > 
 > >  - What exact query was run for the second query: are the terms prefixed by
 > >    the prefix defined in "fields"? If not, there is a problem in the
 > >    "fields" file.
 > 
 > This appears to have been the problem. I copied the prefixes from the
 > example here:
 > https://bitbucket.org/medoc/recoll/wiki/HandleCustomField
 > 
 > pdfpages = XYPDFP
 > 
 > Whereby of course PDFP referred to the PDFPAGES custom field filter that
 > is described in the example itself. I guess it comes down to me having
 > no idea what those prefixes are for :-)

Ok, just for the record, there is only one Xapian index for all terms,
whether they belong to a specific field or not. The prefixes are all-caps
because terms are normally lower-cased, so the change of case marks where
the prefix ends and the term begins. The prefixes partition the index into
general terms (no prefix) and terms from a given field (with the
appropriate prefix).

If you have:

[prefixes]
effdate = XYEFFDATE

in the fields file, when indexing, Recoll records terms from this field by
concatenating the prefix and the term: XYEFFDATEsometerm

When Recoll sees "effdate:someterm" in a query it asks Xapian for documents
indexed by "XYEFFDATEsometerm". 
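
To make the mapping concrete, here is a minimal Python sketch of the two
translations described above. It is an illustration only, not Recoll's
actual code: the names PREFIXES, index_term and query_term are made up
for the example, and the fallback for unknown fields follows the behaviour
described earlier in the thread (the query is run without a field).

```python
# Hypothetical illustration; none of these names exist in Recoll.
# PREFIXES mirrors the [prefixes] section of the "fields" file.
PREFIXES = {"effdate": "XYEFFDATE"}


def index_term(term, field=None, prefixes=PREFIXES):
    """Return the term as it would be recorded in the single index."""
    term = term.lower()  # terms are normally lower-cased
    prefix = prefixes.get(field, "") if field else ""
    return prefix + term


def query_term(query, prefixes=PREFIXES):
    """Translate 'field:term' into the index term to look up."""
    field, sep, term = query.partition(":")
    if sep and field in prefixes:
        return prefixes[field] + term.lower()
    # Unknown field, or no field at all: search the unprefixed term.
    return (term if sep else query).lower()


print(index_term("SomeTerm", field="effdate"))
print(query_term("effdate:someterm"))
print(query_term("termdate:2012"))
```

With only effdate declared, the last call falls back to the plain term,
which is why a query on an undeclared field can still match a document
through its general (unprefixed) terms.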

I guess that we could just as well use a colon separator inside the index,
but the old Xapian convention is to use case separation. Maybe this saved
space at some point, but as far as I know, Xapian now uses prefix
compression when storing data, so the size of the prefix should not be very
significant (it would be interesting to check that this is the case though).

Cheers,

jf


