[haiku] Re: Need Some GSoC Advice

  • From: "Cyan" <cyanh256@xxxxxxxxxxxx>
  • To: haiku@xxxxxxxxxxxxx
  • Date: Mon, 23 Mar 2009 22:54:25 GMT

Humdinger <humdingerb@xxxxxxxxxxxxxx> wrote:
> > A correct read/write implementation would have to update the 
> > indexes regardless.
> IC. Then let's hope for correct implementations. :)

Would it be technically feasible (with modifications to the file
system driver if necessary) to perform a query for all files which
*lack* a particular attribute?

That could possibly work around problems like this -- allowing
identification of files which don't yet have the index data
attributes at all. (Present but outdated index data could be caught
using a version number attribute.)
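
For illustration, a rough sketch of the version-number idea -- the
"Index:Version" attribute name is made up, and it only helps if that
attribute is actually indexed; files lacking it entirely still slip
through, which is exactly why a "lacks attribute" query would be
useful:

    // Find files whose (hypothetical) indexed int32 attribute
    // "Index:Version" is older than the current indexer version.
    #include <Query.h>
    #include <Volume.h>
    #include <VolumeRoster.h>
    #include <Entry.h>
    #include <Path.h>
    #include <stdio.h>

    int main()
    {
        BVolumeRoster roster;
        BVolume bootVolume;
        roster.GetBootVolume(&bootVolume);

        BQuery query;
        query.SetVolume(&bootVolume);
        // Pretend the current indexer version is 3.
        query.SetPredicate("Index:Version < 3");

        if (query.Fetch() != B_OK) {
            fprintf(stderr, "query failed -- attribute not indexed?\n");
            return 1;
        }

        entry_ref ref;
        while (query.GetNextRef(&ref) == B_OK) {
            BPath path(&ref);
            printf("outdated index data: %s\n", path.Path());
        }
        return 0;
    }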

It would also be useful when adding checksum attributes to each file
for data integrity checking -- currently I do this by grinding
through all ~1 million files on the disk and checking each one for
the presence of the attribute. Needless to say, the procedure is not
particularly quick.
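
For the curious, that brute-force pass looks roughly like the sketch
below (the "Data:CRC32" attribute name is just an example). A query
for "files lacking this attribute" would replace the entire walk with
a single Fetch():

    // Walk a directory tree and report files missing the attribute.
    #include <Directory.h>
    #include <Entry.h>
    #include <Node.h>
    #include <Path.h>
    #include <fs_attr.h>
    #include <stdio.h>

    static void ScanDirectory(BDirectory& dir)
    {
        BEntry entry;
        while (dir.GetNextEntry(&entry) == B_OK) {
            BNode node(&entry);
            attr_info info;
            if (node.GetAttrInfo("Data:CRC32", &info) != B_OK) {
                BPath path(&entry);
                printf("missing checksum: %s\n", path.Path());
            }
            if (entry.IsDirectory()) {
                BDirectory subdir(&entry);
                ScanDirectory(subdir);
            }
        }
    }

    int main()
    {
        BDirectory root("/boot/home");
        ScanDirectory(root);
        return 0;
    }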


As for the indexer itself, it sounds like a great idea! I think the
index data would need to be stored as BFS attributes as others have
suggested -- a separate database would be a kludge IMO, and likely
to lead to sync problems.

The only problem that comes to mind is the 256-byte limitation for
indexed attributes on BFS -- does this also exist in OpenBFS?

If not (or if it's practical to remove), maybe the entire raw text of
non-text documents could be stored as a separate BFS-indexed
"raw_text" attribute? Then a query for something like
"((mime_type == text/plain) && (file_data [contains] search_string))
|| ((mime_type == *) && (raw_text [contains] search_string))" would
perform the necessary search.
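
In BQuery terms, the second half of that predicate might look roughly
like the sketch below. The "raw_text" attribute is hypothetical (the
MIME type is really backed by the BEOS:TYPE attribute), and file data
itself isn't queryable at all -- which is the whole reason for storing
the extracted text as an indexed attribute:

    // Query for files whose hypothetical indexed "raw_text"
    // attribute contains the search string.
    #include <Query.h>
    #include <Volume.h>
    #include <VolumeRoster.h>
    #include <Entry.h>
    #include <Path.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        const char* searchString = (argc > 1) ? argv[1] : "haiku";

        BVolumeRoster roster;
        BVolume bootVolume;
        roster.GetBootVolume(&bootVolume);

        BQuery query;
        query.SetVolume(&bootVolume);

        // raw_text CONTAINS searchString (case-insensitive),
        // built in the usual push (RPN) order.
        query.PushAttr("raw_text");
        query.PushString(searchString, true);
        query.PushOp(B_CONTAINS);

        if (query.Fetch() != B_OK) {
            fprintf(stderr, "query failed -- raw_text not indexed?\n");
            return 1;
        }

        entry_ref ref;
        while (query.GetNextRef(&ref) == B_OK) {
            BPath path(&ref);
            printf("match: %s\n", path.Path());
        }
        return 0;
    }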


Another thing that comes to mind: how should files be handled when
the MIME type is not yet set? I suppose the sniffer should just run
on the file before the parsing task is deferred to the appropriate
Translation Kit plug-in?
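
Something like this is what I have in mind -- sniff only when
BEOS:TYPE is missing, write the guess back so it only happens once,
then dispatch to the plug-in:

    // Ensure a file has a MIME type, sniffing it if necessary.
    // 'type' must be at least B_MIME_TYPE_LENGTH bytes long.
    #include <Mime.h>
    #include <Node.h>
    #include <NodeInfo.h>
    #include <Entry.h>
    #include <string.h>

    status_t EnsureMimeType(const entry_ref& ref, char* type,
        size_t size)
    {
        BNode node(&ref);
        BNodeInfo info(&node);

        if (info.GetType(type) == B_OK && type[0] != '\0')
            return B_OK;  // already set, nothing to do

        BMimeType guessed;
        status_t status = BMimeType::GuessMimeType(&ref, &guessed);
        if (status != B_OK)
            return status;

        strlcpy(type, guessed.Type(), size);
        return info.SetType(type);  // store it so we only sniff once
    }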


(!!Feature creep alert!!)

It would also be nice if the indexer could index data of different
"types" (but still possibly represented as text).
We already need to differentiate between types of text data (e.g.,
ID3 tag fields versus the raw text of a document) so that the user
can search for, say, a particular song name as well as a general
string.

However, if the indexer is flexible enough, maybe it could also
handle other kinds of data, such as MIDI note data and tempo derived
from MP3 files using a pitch detection algorithm. This would let the
user recall a particular song by whistling the tune into the search
application, or allow a music player to better assemble an automatic
playlist by matching tempos, etc.
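
Purely illustrative, but such a plug-in could publish its derived
data as an ordinary indexed attribute -- the "Audio:BPM" name and the
analysis step are invented here:

    // Create an index for a derived attribute and tag one file.
    #include <Node.h>
    #include <TypeConstants.h>
    #include <Volume.h>
    #include <VolumeRoster.h>
    #include <fs_index.h>
    #include <stdio.h>

    int main()
    {
        BVolumeRoster roster;
        BVolume bootVolume;
        roster.GetBootVolume(&bootVolume);

        // Create the index once per volume; this fails harmlessly
        // if the index already exists.
        fs_create_index(bootVolume.Device(), "Audio:BPM",
            B_FLOAT_TYPE, 0);

        // Pretend an analysis pass produced this tempo for one file.
        float bpm = 128.0f;
        BNode node("/boot/home/music/example.mp3");
        if (node.InitCheck() != B_OK)
            return 1;

        node.WriteAttr("Audio:BPM", B_FLOAT_TYPE, 0, &bpm,
            sizeof(bpm));
        printf("tagged example.mp3 with tempo %.1f\n", bpm);
        return 0;
    }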

Images could also get the same treatment: machine vision could be
used to identify the class of image (drawing, photograph, etc.),
search for images similar to a reference image (possibly by deriving
a text-based "fingerprint" from the image), or search for text
obtained using OCR.


Needless to say, all these features are creeping to the extreme, but
maybe it wouldn't be too much effort to make an indexer which could
be extended by third parties using Translation Kit plug-ins in
similarly exotic directions?
