[haiku] Full Text Indexing (was Need Some GSoC Advice)

  • From: "Alexander G. M. Smith" <agmsmith@xxxxxx>
  • To: haiku@xxxxxxxxxxxxx
  • Date: Thu, 26 Mar 2009 12:39:41 -0400 EDT

All this talk about full text indexing reminds me of the AGMSRAMFS
experiment a few years ago.  http://www.bebits.com/app/3642

It's actually pretty easy to do full text with BFS.  First you need
attributes which can store a list of things to be indexed, not just one
value.  Then set the attribute to all the keywords in the file (the
application writing the file can do this or you need a 3rd party keyword
extractor utility for unsupported file types).  Standardise on something
like "META:keywords" or "META:tags" for the attribute name so other
programs can use it.  Then use queries almost as normal.

One of the almost trivial additions in AGMSRAMFS was to accept attributes
which are arrays.  They were marked by a flag bit when the attribute's
index was created.  Fixed size attributes are easy, you just divide the
size of the attribute by the fixed size of the data type to get the number
of elements.  String attributes were handled by breaking them into words.
The index can handle strings of at most 255 characters, so you can't have
any individual word be longer, but the whole attribute is limited only to
memory space in size (BFS actually stores long attributes in an on-disk
structure that is the same as a file) so you can have many words in the
attribute and index.

When you changed an array attribute, the file system would add entries to
the index for each of the parts of the attribute, the index key being the
attribute element (usually a word for text attributes), with the index
value being a pointer to the file.  This results in quite a few entries in
the index pointing to the same file.  When you do a query, it works as
nrormal but you may get several hits for the same file, not just one,
so you'll have to filter out duplicates.  Live queries will have to
keep a running count of the number of hits per file.

- Alex


Other related posts:

  • » [haiku] Full Text Indexing (was Need Some GSoC Advice) - Alexander G. M. Smith