[haiku-gsoc] [hcd09] Assorted Questions About the Indexing Daemon

  • From: Ankur Sethi <get.me.ankur@xxxxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Tue, 16 Jun 2009 12:57:51 +0530

Hi,

I have spent the past week playing and experimenting with the Haiku
and CLucene APIs, and I'm starting work on the indexing daemon. I had
an email discussion with Rene, and he suggested I bring these few
issues up on the ML.

1. To what extent can timestamps on files be trusted? What happens
when the user tinkers with the system time?

2. Writing data translators to extract text from PDF, ODF, etc. seems
like a nice idea; that way, other apps can benefit from the code as
well. Would that be a good permanent solution, or should the indexing
daemon implement its own plugin API? (There's a rough sketch of the
translator approach after this list.)

3. To store the indices, the daemon will create a folder called .index
on every volume it indexes. This way, old indices survive a reinstall
of Haiku, and multiple Haiku installations on a single computer can
share the same indices. I hope this is acceptable.

4. I feel it's best if we do not index removable media by default. In
case the user does want his removable devices indexed, the indices for
those go in /boot/home/config/index/, so we don't pollute USB devices
with junk.

5. Rene thinks storing all indices in /boot/home/config/index/ should
be fine, regardless of whether the volume is removable or not. Would
this be a better option? (The second sketch after this list shows the
per-volume decision from points 3 and 4.)

6. Indexing the first 100KB of data from any file should be more than
enough; 250KB tops. Thoughts?
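
For question 2, here's a rough sketch of how the daemon could hand
text extraction off to the Translation Kit. It assumes a translator
for the source format (say, PDF) is installed and offers
B_TRANSLATOR_TEXT as an output format; ExtractText is just an
illustrative name:

#include <DataIO.h>
#include <File.h>
#include <TranslatorFormats.h>
#include <TranslatorRoster.h>

status_t
ExtractText(const char* path, BMallocIO& textOut)
{
	BFile file(path, B_READ_ONLY);
	status_t status = file.InitCheck();
	if (status != B_OK)
		return status;

	// Let whichever installed translator understands this file
	// convert it to plain text; the result lands in textOut.
	BTranslatorRoster* roster = BTranslatorRoster::Default();
	return roster->Translate(&file, NULL, NULL, &textOut,
		B_TRANSLATOR_TEXT);
}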
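
And for questions 3-5, the decision from points 3 and 4 would boil
down to something like the following. GetIndexPath and the
one-subfolder-per-volume layout under config/index/ are only
illustrative:

#include <Directory.h>
#include <FindDirectory.h>
#include <Path.h>
#include <StorageDefs.h>
#include <Volume.h>

status_t
GetIndexPath(const BVolume& volume, BPath& indexPath)
{
	if (!volume.IsRemovable()) {
		// Keep the index on the volume itself so it survives
		// reinstalls and can be shared between installations.
		BDirectory rootDir;
		volume.GetRootDirectory(&rootDir);
		indexPath.SetTo(&rootDir, ".index");
		return indexPath.InitCheck();
	}

	// Removable media: keep the index under the user's config
	// directory instead of writing to the device.
	status_t status = find_directory(B_USER_CONFIG_DIRECTORY,
		&indexPath);
	if (status != B_OK)
		return status;

	char name[B_FILE_NAME_LENGTH];
	volume.GetName(name);
	indexPath.Append("index");
	indexPath.Append(name);
	return indexPath.InitCheck();
}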

I indexed about 650MB of Project Gutenberg texts using CLucene,
indexing the entire files in the first test run and only the first
100KB of each file in the second. In both cases, the only fields I
added to the index for each file were the file contents and the path.
In the first case, the index was more or less the same size as the
indexed content (as expected), and once a few extra fields are added,
the index will grow much larger than the content it covers. In the
second case, the index was just 85MB. The quality of the search
results was more or less the same in both cases.

The point is that indexing entire files would needlessly fill up the
disk, and that indexing even the first 100KB of text is good enough in
practice. A sketch of the truncated indexing follows.
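
If anyone wants to see what the truncated run looked like, here's a
minimal sketch against the CLucene 0.9.x API. The field names match
what I described above; IndexFile, the buffer handling, and the UTF-8
conversion are simplified for illustration:

#include <CLucene.h>
#include <File.h>
#include <StorageDefs.h>

using namespace lucene::document;
using namespace lucene::index;

static const int32 kMaxIndexedBytes = 100 * 1024;

void
IndexFile(IndexWriter& writer, const char* path)
{
	BFile file(path, B_READ_ONLY);
	char* buffer = new char[kMaxIndexedBytes + 1];
	ssize_t bytes = file.Read(buffer, kMaxIndexedBytes);
	if (bytes > 0) {
		buffer[bytes] = '\0';

		// CLucene works on wide strings internally.
		TCHAR* contents = new TCHAR[bytes + 1];
		lucene_utf8towcs(contents, buffer, bytes + 1);
		TCHAR wpath[B_PATH_NAME_LENGTH];
		lucene_utf8towcs(wpath, path, B_PATH_NAME_LENGTH);

		Document doc;
		doc.add(*new Field(_T("path"), wpath,
			Field::STORE_YES | Field::INDEX_UNTOKENIZED));
		doc.add(*new Field(_T("contents"), contents,
			Field::STORE_NO | Field::INDEX_TOKENIZED));
		writer.addDocument(&doc);
		delete[] contents;
	}
	delete[] buffer;
}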

(BTW, the project is called Beacon :) )

-- 
Ankur Sethi (GeneralMaximus)
