[haiku-gsoc] Re: [hcd09] Assorted Questions About the Indexing Daemon

On 2009-06-16 at 09:27:51 [+0200], Ankur Sethi <get.me.ankur@xxxxxxxxx> 
wrote:
> I have spent the past week playing and experimenting with the Haiku and 
> CLucene APIs. I'm starting work on the indexing daemon. I had an email 
> discussion with Rene, and he says I should discuss these few issues on 
> the ML.
> 
> 1. To what extent can timestamps on files be trusted? What happens
> when the user tinkers with the system time?

I think they cannot be trusted. So while timestamps are probably a good 
mechanism for a first-level filter, a second mechanism should kick in to 
catch the cases where they lie. :-}
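To illustrate one possible shape of such a two-level check (the names and the record layout are my own invention, not anything the daemon has yet): the cheap timestamp comparison runs first, and only when it signals a change does the daemon fall back to comparing a stored content hash, so a file whose timestamp lies does not trigger needless re-indexing.

```cpp
#include <cassert>
#include <ctime>
#include <functional>
#include <string>

// Hypothetical record the daemon could keep per indexed file.
struct IndexEntry {
	time_t lastIndexedMTime;
	size_t contentHash;	// std::hash of the content; a real daemon
						// would likely use a stronger digest.
};

// First level: cheap mtime comparison. Second level: compare a hash of
// the current content against the one stored at index time, so a
// timestamp changed by clock tinkering alone does not force a reindex.
bool NeedsReindex(time_t currentMTime, const std::string& content,
	const IndexEntry& entry)
{
	if (currentMTime == entry.lastIndexedMTime)
		return false;	// first-level filter: timestamp unchanged
	// Timestamp differs -- could be a real edit or just clock tinkering.
	// Second level: check the content hash before doing expensive work.
	return std::hash<std::string>{}(content) != entry.contentHash;
}
```

Note this still trusts an *unchanged* timestamp, so a periodic full verification pass would be needed to catch edits made while the clock was rolled back.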

> 2. Writing data translators to extract text from PDF, ODF etc. seems
> like a nice idea. That way, other apps may also benefit from the code. 
> Would it be a good permanent solution or should the indexing daemon 
> implement its own plugin API?

We already have examples of text Translators. The Translator API can be 
extended via the extension message protocol in any way you like, so I 
think going this route should be the most useful in the long term. In the 
end, you are the one most knowledgeable about whether you can indeed use 
Translators, or whether you have to roll your own solution.
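For comparison, should Translators turn out not to fit, a daemon-private plugin API would not need to be much more than a registry keyed by MIME type. This is purely a hypothetical sketch of that alternative, with invented names:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical shape a daemon-private plugin API could take: each
// plugin registers a text extractor for the MIME types it understands.
using TextExtractor = std::function<std::string(const std::string& raw)>;

class ExtractorRegistry {
public:
	void Register(const std::string& mimeType, TextExtractor extractor)
	{
		fExtractors[mimeType] = std::move(extractor);
	}

	// Returns extracted plain text, or an empty string if no plugin
	// claims this MIME type.
	std::string Extract(const std::string& mimeType,
		const std::string& raw) const
	{
		auto it = fExtractors.find(mimeType);
		return it == fExtractors.end() ? std::string() : it->second(raw);
	}

private:
	std::map<std::string, TextExtractor> fExtractors;
};
```

The Translator route keeps the extraction code reusable by other apps, whereas something like the above keeps it private to the daemon; that reuse is the main argument for Translators.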

> 3. To store the indices, the daemon will create a folder called .index
> on every volume it indexes. This way, old indices are not lost when the 
> user reinstalls Haiku and multiple Haiku installations on a single 
> computer can use the same indices. I hope this is acceptable?
> 
> 4. I feel it's best if we do not index removable media by default. In
> case the user does want to index his removable devices, the indices for 
> those go in /boot/home/config/index/. So, no polluting the USB devices 
> with junk.
> 
> 5. Rene thinks storing all indices in /boot/home/config/index/ should
> be fine, regardless of whether the volume is removable or not. Would this 
> be a better option?

I am undecided. In any case, this aspect is something that is easily 
changed later on, no? It depends a bit on whether the indices are 
user-specific. If two users of the same machine will for some reason need 
their own indices, then /boot/home/config/index sounds like a good plan.

> 6. Indexing 100KB of data from any file should be more than enough.
> 250KB tops. Thoughts?
> 
> I indexed about 650megs of Project Gutenberg texts using CLucene, 
> indexing the entire files in the first test run and indexing only the 
> first 100KB in the second test run. In both cases, the only fields I 
> added to the index for each file were the file contents and path. In the 
> first case, the index was more or less the same size as the indexed 
> content (as expected) but once I add a few extra fields to the index, the 
> index will grow much larger than the content it indexes. In the second 
> case, the index was just 85MB. The quality of search results in both 
> cases was more or less the same.
> 
> The point is that indexing entire files will needlessly fill up the HDD 
> and that indexing even 100KB of text is good enough in practice.

I think the fact that the index will be as large as or larger than the 
content if whole files are indexed makes it clear that the indexed content 
needs to be limited. Maybe one could also index the first 60 K of large 
files and another 40 K taken from chunks spread across the rest of the 
file.
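A minimal sketch of that sampling scheme (the 60 K / 40 K figures are the ones from this mail; the chunk size and function name are my own assumptions): take the head of the file verbatim, then evenly spaced chunks from the remainder, never exceeding the combined budget.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Take the first `headSize` bytes verbatim, then up to `tailBudget`
// bytes as evenly spaced chunks from the rest of the file. Small files
// are indexed whole.
std::string SampleForIndexing(const std::string& file,
	size_t headSize = 60 * 1024, size_t tailBudget = 40 * 1024,
	size_t chunkSize = 4 * 1024)
{
	if (file.size() <= headSize + tailBudget)
		return file;	// small file: index everything

	std::string sample = file.substr(0, headSize);
	size_t chunkCount = tailBudget / chunkSize;
	size_t rest = file.size() - headSize;
	size_t stride = rest / chunkCount;	// spacing between chunk starts

	for (size_t i = 0; i < chunkCount; i++) {
		size_t start = headSize + i * stride;
		sample += file.substr(start,
			std::min(chunkSize, file.size() - start));
	}
	return sample;
}
```

Since the chunks are spread over the whole file, a search term that only occurs near the end of a large document still has a chance of being indexed, which indexing only the head cannot offer.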

Best regards,
-Stephan
