[haiku] Re: Need Some GSoC Advice

From: Truls Becken <truls.becken@xxxxxxxxx>
To: haiku@xxxxxxxxxxxxx
Date: Tue, 24 Mar 2009 08:50:01 +0100

François Revol wrote:

>> Ankur Sethi wrote:
>>
>>> (1) Create an initial index of all the data on the disk. This takes
>>> a *very* long time and consumes a large amount of CPU. Sometimes,
>>> seemingly at random, Spotlight will decide to build the entire
>>> index from scratch, and then there is very little you can do about
>>> it.
>>
>> This is one place where the BFS indexing shines. A userspace process
>> that tries to keep an index of data on the disk needs to reindex from
>> scratch every time it starts up if it wants to be absolutely sure its
>> database is up-to-date. The reason is that it cannot know how much
>> changed on disk while the process was not running. BFS does not have
>> this problem because, obviously, no attributes can ever be written
>> without BFS noticing.
>
> The problem is BFS cannot index string attributes larger than 255
> bytes; so this cannot be used for full content indexing.

Full text indexing does not work by storing the full content directly
in the index. Each term must be indexed independently, otherwise the
lookup won't benefit much from the index. Of course, repeated terms
are only added once and words that are too common or too short are
ignored.
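
As a rough illustration of the per-term approach described above, here
is a minimal Python sketch. The stop-word list, the minimum length and
the function names are assumptions made for the example, not anything
BFS or an existing indexer defines.

```python
import re

# Assumed values for illustration; a real indexer would tune both.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}
MIN_LENGTH = 3


def extract_terms(text):
    """Return the distinct index terms in text.

    Repeated words collapse because the result is a set; words that are
    too short or too common are dropped.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) >= MIN_LENGTH and w not in STOP_WORDS}


def build_index(documents):
    """Map each term to the set of document names that contain it."""
    index = {}
    for name, text in documents.items():
        for term in extract_terms(text):
            index.setdefault(term, set()).add(name)
    return index
```

Looking up a word is then a single dictionary access on the term,
rather than a scan through the stored text of every file.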

This means that the size limit is not the big problem. What is needed
is a mechanism for storing multiple strings in each attribute.

> Now, I don't see full content indexing as really mandatory. I believe
> well weighted keyword extraction should be enough (there is already a
> META:keyw attribute defined somewhere, for People files IIRC).

Keyword and label-type attributes also need multiple strings. Simply
setting META:keyw to "christmas holiday france" is not good enough,
because a query for "holiday" then has to put wildcards on both sides
of the term, which gives terrible performance. Generally when
searching an index, wildcards at the end are ok, but leading wildcards
mean that you have to sequentially scan the whole index.
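
The difference can be sketched in Python with a sorted list standing in
for the index. This is only a model of an ordered index, not how BFS
implements its B+trees, and both function names are made up for the
example.

```python
import bisect


def trailing_wildcard_search(sorted_terms, prefix):
    """Match "prefix*": binary search to the first candidate, then walk
    forward only while the entries still start with the prefix."""
    start = bisect.bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


def leading_wildcard_search(sorted_terms, suffix):
    """Match "*suffix": the sort order gives no starting point, so every
    entry has to be examined."""
    return [term for term in sorted_terms if term.endswith(suffix)]
```

With the keywords stored as one string, "holiday" sits in the middle of
the value, so only the second kind of search can find it.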

> A "spotlight" like app would then just start several queries at once
> and merge relevant results.

Yes, for each extractor plugin there would be a search plugin. For
instance, the one for mp3 files knows that the attributes for artist,
album, title, year, etc. should be included in the query, and that
year is a number.
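
A minimal Python sketch of what such a search plugin might look like.
Everything here is hypothetical: the class, the attribute names and the
run_query callback are placeholders for illustration, not an existing
Haiku API.

```python
class Mp3SearchPlugin:
    """Knows which attributes an mp3 extractor would have written."""

    # Hypothetical attribute names, chosen only for this example.
    TEXT_ATTRIBUTES = ["Audio:Artist", "Audio:Album", "Media:Title"]
    NUMBER_ATTRIBUTES = ["Media:Year"]

    def build_clauses(self, term):
        """Return (attribute, value) pairs to query for one search term."""
        clauses = [(attr, term) for attr in self.TEXT_ATTRIBUTES]
        # The year is a number, so only query it when the term is numeric.
        if term.isdecimal():
            clauses += [(attr, int(term)) for attr in self.NUMBER_ATTRIBUTES]
        return clauses


def search(plugins, term, run_query):
    """Run every plugin's clauses and merge the results without duplicates.

    run_query is supplied by the caller and stands in for whatever
    actually executes a query; it takes one clause and returns paths.
    """
    merged = []
    seen = set()
    for plugin in plugins:
        for clause in plugin.build_clauses(term):
            for path in run_query(clause):
                if path not in seen:
                    seen.add(path)
                    merged.append(path)
    return merged
```

The front end only needs the list of plugins; each plugin keeps the
knowledge about its own attributes and their types.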
