[haiku] Re: Need Some GSoC Advice

  • From: "François Revol" <revol@xxxxxxx>
  • To: haiku@xxxxxxxxxxxxx
  • Date: Tue, 24 Mar 2009 11:29:14 +0100 CET

> > The problem is BFS cannot index string attributes larger than 255
> > bytes; so this cannot be used for full content indexing.
> 
> Full text indexing does not work by storing the full content directly
> in the index. Each term must be indexed independently, otherwise the
> lookup won't benefit much from the index. Of course, repeated terms
> are only added once and words that are too common or too short are
> ignored.
> 
> This means that the size limit is not the big problem. What is needed
> is a mechanism for storing multiple strings in each attribute.

Yes, but this will need to be added then.
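
For illustration, the per-term approach described in the quote above could be sketched roughly like this in Python. The stop word list, the minimum length and the function name are assumptions made up for the example; nothing here is provided by BFS.

```python
import re

# Hypothetical settings for this sketch; nothing here comes from BFS.
STOP_WORDS = {"the", "and", "for", "are", "that", "with"}
MIN_LENGTH = 3


def extract_terms(text):
    """Return the distinct terms worth indexing, in sorted order."""
    words = re.findall(r"\w+", text.lower())
    # A set drops repeated terms; the filter drops short and common words.
    return sorted({w for w in words
                   if len(w) >= MIN_LENGTH and w not in STOP_WORDS})


terms = extract_terms("The holiday in France: the Christmas holiday!")
print(terms)
```

Each resulting term would then have to be stored as its own index entry, which is the part that is missing today.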

> 
> > Now, I don't see full content indexing as really mandatory. I believe
> > well weighted keyword extraction should be enough (there is already a
> > META:keyw attribute defined somewhere, for People files IIRC).
> 
> Keyword and label-type attributes also need multiple strings. Simply
> setting META:keyw to "christmas holiday france" is not good enough.
> Queries then need to use wildcards, which will give terrible
> performance. Generally when searching an index, wildcards at the end
> are ok, but prefix wildcards mean that you have to sequentially scan
> the whole index.

Well, using wildcards when searching mail (and I have a lot of it)
isn't that slow.
Of course it's not the perfect way, but at least it works.
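
The difference between the two kinds of wildcard can be sketched with a sorted Python list standing in for the index B+tree. The list, the sample keys and both function names are assumptions for the example, not the actual BFS lookup code.

```python
from bisect import bisect_left


def prefix_lookup(sorted_keys, prefix):
    """Match 'prefix*': binary search to the first candidate, then walk."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


def contains_lookup(sorted_keys, fragment):
    """Match '*fragment*': the ordering does not help, so test every key."""
    return [key for key in sorted_keys if fragment in key]


keys = sorted(["christmas", "france", "holiday", "holidays", "home"])
print(prefix_lookup(keys, "hol"))
print(contains_lookup(keys, "an"))
```

The first function only touches the matching keys plus one, while the second always visits the whole list. Whether that scan hurts in practice depends on how many entries the index holds.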

> 
> > A "spotlight" like app would then just start several queries at once
> > and merge relevant results.
> 
> Yes, for each extractor plugin there would be a search plugin. For
> instance, the one for mp3 files knows that the attributes for artist,
> album, title, year, etc should be included in the query, and that year
> is a number.

No need.
The MIME db already knows the type of each attribute for each MIME type.
Besides, indexes are also typed.
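
As a rough sketch of the idea, a generic search front end could build its query from a table of typed attributes instead of per-format plugins. The table below is a hypothetical stand-in for what the MIME database records; the attribute names, the `build_query` helper and the query string format (only loosely modeled on query syntax) are assumptions for the example.

```python
# Hypothetical stand-in for the per-type attribute info in the MIME db.
# Attribute names and types are illustrative assumptions.
ATTRIBUTE_TYPES = {
    "audio/mpeg": {
        "Audio:Artist": str,
        "Audio:Album": str,
        "Audio:Title": str,
        "Audio:Year": int,
    },
}


def build_query(mime_type, user_input):
    """Build one query string covering every attribute that can match."""
    clauses = []
    for name, kind in ATTRIBUTE_TYPES[mime_type].items():
        if kind is int:
            # Only query numeric attributes when the input is a number.
            if user_input.isdecimal():
                clauses.append(f"({name}=={int(user_input)})")
        else:
            # No escaping of quotes in user_input; this is only a sketch.
            clauses.append(f'({name}=="*{user_input}*")')
    return "||".join(clauses)


print(build_query("audio/mpeg", "2009"))
```

With a textual input such as "beatles" the year attribute is simply left out of the query, since the table says it is a number.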

François.
