[haiku-development] Re: TrackerGrep [was Re: missing -lm?]

From: "Alexander G. M. Smith" <agmsmith@xxxxxxxxxx>
To: haiku-development@xxxxxxxxxxxxx
Date: Sat, 03 May 2008 12:19:09 -0400 EDT

Axel Dörfler wrote on Sat, 03 May 2008 13:13:26 +0200 CEST:
> "Alexander G. M. Smith" <agmsmith@xxxxxxxxxx> wrote:
> > The applications (or a background thread running a Translator)
> > would have to be adapted to write a standard "META:keywords"
> > attribute with all the keywords for a modified document.
> > 
> > The file system would need only a few changes.  When adding an
> > attribute to the index, each word in the attribute is added
> > separately to the index, with a back pointer to the file as usual.
> > Thus the file is in the index multiple times.  Deleting the
> > keywords is similar to adding.  Searching (and live queries) is a
> > bit different, since you may get the same file multiple times.  The
> > simple solution is to have user level code filter out the
> > duplicates.
> 
> But that would result in a pretty bad solution: first of all what 
> keywords? And second of all, the BFS query mechanism hasn't been 
> written for arbitrary matches, it has been written for direct lookup. 
> Using pattern matching (which Tracker does by default) makes searching 
> anything really slow. Having many keywords put into those indices would 
> slow it down considerably more.
> And it would still not even work like what a user expects (similarity 
> search, ignoring diacritics, etc.).

I can answer most of those objections.

The application decides on the keywords.  When it saves a file, it also
updates the META:keywords attribute (or META:keyw or whatever it will be).
It can also simplify the search by converting the keywords to lower case
without diacritical marks before saving them in the attribute.
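
A minimal sketch of that normalization step, in Python for brevity.  The
function names are hypothetical, and storing the keywords as a sorted,
de-duplicated list is my assumption; the attribute format isn't decided.

```python
import unicodedata

def normalize_keyword(word: str) -> str:
    """Lower-case a keyword and drop its diacritical marks."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", word)
    # Keep only the base characters, discarding the combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def keywords_for_attribute(words):
    """Return the normalized keywords, sorted and without repeats."""
    return sorted({normalize_keyword(w) for w in words})
```

An application would run something like this over the document's words
just before writing the attribute.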

Searching is faster when every word has its own entry in the index.  If
I look for "META:keywords=happy*" then the index can jump straight to
the entries starting with "happy" (it iterates from "happy" up to, but
not including, "happz").  It doesn't iterate through all files!
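
To illustrate the range scan, here is a rough sketch that models the
index as a sorted list of (keyword, file) pairs.  That is only a stand-in
for the real BFS B+tree, and build_index and prefix_query are made-up
names.

```python
import bisect

def build_index(files):
    """files maps a file name to its keywords; one index entry per keyword."""
    return sorted((kw, name) for name, kws in files.items() for kw in kws)

def prefix_query(index, prefix):
    """Return the files having a keyword that starts with prefix."""
    # Binary search for the first entry whose keyword is >= prefix.
    start = bisect.bisect_left(index, (prefix,))
    hits = []
    for i in range(start, len(index)):
        keyword, name = index[i]
        if not keyword.startswith(prefix):
            break  # past the end of the "happy".."happz" range
        hits.append(name)
    return hits

files = {
    "a.txt": ["happy", "cat"],
    "b.txt": ["happiness", "dog"],
    "c.txt": ["sad", "happy"],
}
index = build_index(files)
```

Only the entries inside the matching range are visited; the "cat", "dog"
and "sad" entries are never looked at.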

François suggested using the whole keyword string as the entry in the
index, as it is now.  To find happy things you'd search for the keyword
anywhere in the middle of the string, using "META:keywords=*happy*".
This would indeed have to iterate through all of the index, which would
be slow.  That's the advantage of having each keyword added separately
to the index.
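
For comparison, a sketch of the whole-string variant, again with a sorted
list standing in for the index (substring_query is a hypothetical name).
Because the word can sit anywhere in the string, the sort order doesn't
help and every entry has to be examined.

```python
def substring_query(whole_string_index, word):
    """Scan every (keyword_string, file) entry for word; also count the work."""
    examined = 0
    hits = []
    for keyword_string, name in whole_string_index:
        examined += 1  # no way to skip entries: the match may be mid-string
        if word in keyword_string:
            hits.append(name)
    return hits, examined

whole_index = sorted([
    ("cat happy", "a.txt"),
    ("dog happiness", "b.txt"),
    ("sad happy", "c.txt"),
])
found, examined_count = substring_query(whole_index, "happy")
```

Here examined_count always equals the size of the index, no matter how
few files match.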

I only see a couple of disadvantages to my idea.  One is the need to
filter out multiple returns of the same file in a query.  The other is
that there are three copies of each word: one in the original document,
one in the attribute and one in the index.  A separate index server
(like Mac OS X's Spotlight) reduces that to two copies of each word.
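
The duplicate filtering could be as simple as the following sketch.  It
assumes query results arrive as a stream of file identifiers (plain names
here; a real query would hand back something like an entry_ref), and
unique_files is a made-up name.

```python
def unique_files(hits):
    """Yield each file once, in the order it was first returned."""
    seen = set()
    for name in hits:
        if name not in seen:
            seen.add(name)
            yield name
```

For example, list(unique_files(["a.txt", "b.txt", "a.txt"])) gives
["a.txt", "b.txt"].  A live query would need to keep the seen set around
for as long as the query stays open.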

In any case, having a META:keywords attribute would be useful by itself
even if it is badly indexed.  It's much faster to search through strings
in an index than to open all the files on the drive and look at their
attributes.

- Alex

P.S. I'd like to search my old e-mails for keywords, and grep is
just too annoyingly slow!

Follow-Ups:
- [haiku-development] Re: TrackerGrep [was Re: missing -lm?]
  - From: Ingo Weinhold

References:
- [haiku-development] Re: TrackerGrep [was Re: missing -lm?]
  - From: Axel Dörfler
