[haiku] Re: Need Some GSoC Advice

  • From: "François Revol" <revol@xxxxxxx>
  • To: haiku@xxxxxxxxxxxxx
  • Date: Mon, 23 Mar 2009 23:43:52 +0100 CET

> Ankur Sethi wrote:
> 
> > (1) Create an initial index of all the data on the disk. This takes a
> > *very* long time and consumes a large amount of CPU. Sometimes,
> > seemingly at random, Spotlight will decide to build the entire index
> > from scratch, and then there is very little you can do about it.
> 
> This is one place where the BFS indexing shines. A userspace process
> that tries to keep an index of data on the disk needs to reindex from
> scratch every time it starts up if it wants to be absolutely sure its
> database is up-to-date. The reason is that it cannot know how much
> changed on disk while the process was not running. BFS does not have
> this problem because, obviously, no attributes can ever be written
> without BFS noticing.

The problem is that BFS cannot index string attributes larger than 255
bytes, so BFS indices cannot be used for full-content indexing.

Now, I don't see full-content indexing as really mandatory. I believe
well-weighted keyword extraction should be enough (there is already a
META:keyw attribute defined somewhere, for People files IIRC).

A Spotlight-like app would then just start several queries at once
and merge the relevant results.
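
To make the "several queries, then merge" idea concrete, here is a minimal
sketch in Python rather than against the real Haiku query API. Everything in
it is an assumption made for illustration: each query is stood in for by a
plain callable returning a list of paths, the weights are arbitrary, and
merge_results and run_and_merge are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def merge_results(result_sets):
    """Merge (weight, paths) pairs into one list, best match first."""
    scores = defaultdict(float)
    for weight, paths in result_sets:
        for path in paths:
            # A file matched by several queries accumulates their weights.
            scores[path] += weight
    # Highest score first; ties are broken by path so the order is stable.
    return sorted(scores, key=lambda p: (-scores[p], p))


def run_and_merge(queries):
    """Run (weight, callable) queries concurrently and merge their hits."""
    with ThreadPoolExecutor() as pool:
        futures = [(weight, pool.submit(fn)) for weight, fn in queries]
        result_sets = [(weight, fut.result()) for weight, fut in futures]
    return merge_results(result_sets)


# Hypothetical stand-ins for a file name query and a META:keyw query.
name_hits = lambda: ["/boot/home/notes.txt", "/boot/home/todo.txt"]
keyword_hits = lambda: ["/boot/home/todo.txt", "/boot/home/mail/gsoc"]

ranked = run_and_merge([(2.0, name_hits), (1.0, keyword_hits)])
```

A file that shows up in more than one query floats to the top, which is
probably the cheapest useful notion of "relevant" here.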

> It would be interesting if a plugin based extractor could be triggered
> from BFS. It doesn't have to run in kernel space; just spawn a
> userspace process, read a simple key-value format from its stdout, and
> set the attributes.

As I said, extending Translation kit addons could be an easy first 
step.

The registrar would spawn an indexer app when new files appear.
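
The "spawn a userspace process and read key-value pairs from its stdout"
part could look roughly like the sketch below. It is in Python and is not
tied to the registrar or the Translation Kit. The "key=value per line"
format, the function names and the stand-in extractor command are all
assumptions; writing the attributes back to the file is left out.

```python
import subprocess
import sys


def parse_key_values(text):
    """Parse 'key=value' lines; blank or malformed lines are skipped."""
    attrs = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def run_extractor(command, path):
    """Run a (hypothetical) extractor on a file and return its output."""
    proc = subprocess.run(
        command + [path],
        capture_output=True,
        text=True,
        check=True,   # raise if the extractor exits with an error
        timeout=30,   # do not let a stuck plugin block the indexer
    )
    return parse_key_values(proc.stdout)


# Stand-in extractor: a tiny child process that prints two pairs.
FAKE_EXTRACTOR = [
    sys.executable,
    "-c",
    "import sys; print('META:keyw=gsoc haiku'); print('path=' + sys.argv[1])",
]

attrs = run_extractor(FAKE_EXTRACTOR, "/tmp/example.txt")
```

Keeping the extractor in a separate process means a crashing or hanging
plugin only costs one file, not the whole indexer.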

> There would still be a problem with attributes getting out-of-date
> when extractor plugins change. The extractor could solve this by
> making a query for existing files with the mime-types handled by the
> changed plugins, and update the attributes on those.

Another problem is when someone manually fixes the attributes (because
they are wrong, or to add more info): reindexing the file would lose
those changes.
This could perhaps be handled by comparing a last_indexed date
attribute with the last_modified one.
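
One possible reading of that comparison, sketched in Python. The
assumptions are mine: a manual attribute edit bumps last_modified, the
indexer stamps last_indexed after its own writes, and hand-edited values
should win over extracted ones. plan_reindex and apply_plan are hypothetical
names. Note that a content change also bumps last_modified, so this sketch
cannot tell the two cases apart and simply plays safe.

```python
from datetime import datetime


def plan_reindex(last_indexed, last_modified):
    """Decide how the indexer may treat a file's existing attributes."""
    if last_indexed is None:
        return "index"      # never indexed: nothing to protect
    if last_modified > last_indexed:
        return "merge"      # touched since indexing: keep existing values
    return "overwrite"      # untouched since indexing: safe to replace


def apply_plan(plan, existing, extracted):
    """Combine existing attributes with freshly extracted ones."""
    if plan == "merge":
        # Existing (possibly hand-fixed) values win; the extractor
        # only fills in attributes that are missing.
        return {**extracted, **existing}
    return dict(extracted)


indexed_at = datetime(2009, 3, 20)
plan = plan_reindex(indexed_at, datetime(2009, 3, 23))
merged = apply_plan(
    plan,
    {"META:keyw": "hand-fixed"},
    {"META:keyw": "auto", "Audio:Artist": "someone"},
)
```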

François.
