[haiku-gsoc] Re: [hcd09] Assorted Questions About the Indexing Daemon

  • From: Ankur Sethi <get.me.ankur@xxxxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Tue, 16 Jun 2009 23:30:57 +0530

Thanks for the input :)

For now, this is what I'm going with. I'm sure changes can be made later on.

1. Just using UNIX time for now (isn't that what
BStatable::GetModificationTime() would give me?) since it does not
depend on the timezone.

2. Translation Kit it is.

3. Storing indices in /boot/common/data/index/ for now. Seems more plausible.

4/5. For now, I'm not indexing USB devices.

6. Stephan's idea is really nice. For now, I'm indexing just the first
100KB. I'll change it later to index several small chunks from larger
files.
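On point 1, here is a rough sketch of why a UNIX timestamp is convenient for deciding whether a file needs re-indexing. As far as I can tell, BStatable::GetModificationTime() fills in a time_t, i.e. seconds since the epoch in UTC. The sketch uses Python's os.stat() as a stand-in for the Haiku call, and the names modification_time and needs_reindex are made up for illustration.

```python
import os


def modification_time(path):
    """Return the file's modification time as whole seconds since the
    UNIX epoch (1970-01-01 00:00:00 UTC). The value does not change
    when the system timezone changes."""
    return int(os.stat(path).st_mtime)


def needs_reindex(path, indexed_at):
    """True if the file was modified after the timestamp recorded when
    it was last indexed. Both values are plain epoch seconds, so the
    comparison is a simple integer comparison."""
    return modification_time(path) > indexed_at
```

Storing the integer next to each indexed document would be enough to skip unchanged files on the next pass.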

Stephan wrote:
> Maybe one can also index the first 60 K of large files and index
> another 40 K from chunks of the rest of the file.
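A minimal sketch of how that sampling could look, assuming the remaining 40 K is taken as equally spaced chunks from the rest of the file. The 4 K chunk size and the name sample_for_indexing are my own assumptions, not something decided in this thread, and the sketch works on raw bytes only (text extraction is left out).

```python
import io

HEAD_BYTES = 60 * 1024    # contiguous prefix, per Stephan's suggestion
TAIL_BUDGET = 40 * 1024   # total bytes sampled from the rest of the file
CHUNK_BYTES = 4 * 1024    # hypothetical chunk size, not from the thread


def sample_for_indexing(stream, size, head=HEAD_BYTES,
                        budget=TAIL_BUDGET, chunk=CHUNK_BYTES):
    """Return at most head + budget bytes from a seekable binary stream
    that holds `size` bytes."""
    stream.seek(0)
    # Small files are indexed in full.
    if size <= head + budget:
        return stream.read(size)

    parts = [stream.read(head)]
    if budget <= 0 or chunk <= 0:
        return parts[0]

    chunk = min(chunk, budget)
    count = budget // chunk          # number of chunks to sample
    stride = (size - head) // count  # distance between chunk starts

    # Because size - head > count * chunk, stride >= chunk, so the
    # chunks never overlap and the last one ends inside the file.
    for i in range(count):
        stream.seek(head + i * stride)
        parts.append(stream.read(chunk))
    return b"".join(parts)


if __name__ == "__main__":
    data = bytes(range(256)) * 4096           # 1 MiB of dummy data
    sample = sample_for_indexing(io.BytesIO(data), len(data))
    print(len(sample))                        # 102400 bytes, i.e. 100 K
```

Cutting at fixed byte offsets can split a word or a multi-byte character at a chunk boundary, so a real version would probably trim each chunk to whitespace before handing it to the analyzer.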

Ingo wrote:
> That's actually not at all what I expected. What data are stored in the
> indices exactly? Are the positions of the contained words stored for each
> file? Otherwise I can't really believe that beyond the 100 KB limit there
> will be a lot more different words.

Well, CLucene comes with a set of analyzers, or you can write your
own. I ran these tests with the StandardAnalyzer, which doesn't do
much (at least that's what I make of it). That might explain the index
size. Moreover, these tests indexed several thousand small text files,
which bloats the index. I'm sure the same test with a smaller
number of files (but the same total size of 600 MB) will result in a
much smaller index.

I'll create a project on OSDrawer once I have some code written.

-- 
Ankur Sethi (GeneralMaximus)
