[haiku-gsoc] Re: [hcd09] Assorted Questions About the Indexing Daemon

  • From: "Axel Dörfler" <axeld@xxxxxxxxxxxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Tue, 16 Jun 2009 23:25:27 +0200 CEST

Ankur Sethi <get.me.ankur@xxxxxxxxx> wrote:
> 3. Storing indices in /boot/common/data/index/ for now. Seems more
> plausible.
> 4/5. For now, I'm not indexing USB devices.
> 6. Stephan's idea is really nice. For now, I'm indexing just the first
> 100KB. I'll change it later to index several small chunks from larger
> files.

Well, if you insist on it - this can all be fixed later ;-)
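
[Editorial illustration, not part of the original mail.] The two strategies
from point 6 (index only the first 100KB, versus index several small chunks
spread over a larger file) could be sketched roughly as follows. This is not
the daemon's actual code: the function names, the chunk size and the chunk
count are all assumptions picked for illustration, and the bytes returned
would still have to be handed to whatever analyzer/indexer is in use.

```python
import os

PREFIX_LIMIT = 100 * 1024  # the 100KB cap mentioned in the thread


def read_prefix(path, limit=PREFIX_LIMIT):
    """Return at most the first `limit` bytes of the file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(limit)


def read_sampled_chunks(path, chunk_size=10 * 1024, chunk_count=10):
    """Return up to `chunk_count` chunks of `chunk_size` bytes each,
    spread evenly over the file (hypothetical helper).

    Files no larger than chunk_size * chunk_count are returned whole,
    as a single chunk.
    """
    size = os.path.getsize(path)
    if size <= chunk_size * chunk_count:
        with open(path, "rb") as f:
            return [f.read()]

    # Evenly spaced start offsets: the first chunk starts at offset 0 and
    # the last chunk never runs past the end of the file.
    stride = (size - chunk_size) // max(chunk_count - 1, 1)
    chunks = []
    with open(path, "rb") as f:
        for i in range(chunk_count):
            f.seek(i * stride)
            chunks.append(f.read(chunk_size))
    return chunks
```

With the assumed defaults, both variants read at most 100KB per file, so the
I/O cost is similar; the sampled variant just sees text from the middle and
end of large files as well.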

> Ingo wrote:
> > That's actually not at all what I expected. What data are stored in
> > the indices exactly? Are the positions of the contained words stored
> > for each file? Otherwise I can't really believe that beyond the
> > 100 KB limit there will be a lot more different words.
> Well, CLucene comes with a set of analyzers, or you can write your
> own. I ran these tests with the StandardAnalyzer, which doesn't do
> much (at least that's what I make of it). That might explain the
> index size. Moreover, these tests indexed several thousand small text
> files, which bloats up the index size. I'm sure the same test with a
> smaller number of files (but the same total size of 600megs) will
> result in a much smaller index.

Just another thing one can look into once the thing is basically
working (only to re-enable full document scanning afterwards, of course).
If the files were short anyway, I don't really understand how the index
size could differ...

Bye,
   Axel.

