[haiku-gsoc] Re: [hcd09] Assorted Questions About the Indexing Daemon

  • From: Ingo Weinhold <ingo_weinhold@xxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Tue, 16 Jun 2009 13:37:08 +0200

On 2009-06-16 at 11:03:17 [+0200], Axel Dörfler <axeld@xxxxxxxxxxxxxxxx> 
wrote:
> Ankur Sethi <get.me.ankur@xxxxxxxxx> wrote:
> > 1. To what extent can timestamps on files be trusted? What happens
> > when the user tinkers with the system time?
> 
> The only problem is time zone changes, as the times of the files are
> not absolute in BeOS or Haiku: they will change with the time zone,
> probably also depending on your local/GMT setting.

I wonder how other OSs do that. I would find it obvious for the FS to store 
a canonical time (i.e. GMT) on disk and convert from/to the current time 
zone as needed.
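
To illustrate the idea, here is a minimal Python sketch. It is not how BFS or
any Haiku component actually handles timestamps; the helper name
`to_display` and the fixed hour offsets are assumptions made for the example.
The stored value is a single UTC-based number, and only the displayed string
depends on the time zone in effect when the file is viewed.

```python
from datetime import datetime, timedelta, timezone


def to_display(epoch_seconds, utc_offset_hours):
    """Render a stored UTC timestamp in a zone given as a fixed hour offset.

    Hypothetical helper: a real system would use named time zones with
    DST rules rather than a plain offset.
    """
    zone = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, zone).strftime("%Y-%m-%d %H:%M:%S")


# What would be written to disk: seconds since the epoch, in UTC.
stored_mtime = int(datetime(2009, 6, 16, 11, 37, 8, tzinfo=timezone.utc).timestamp())

# Changing the time zone changes only the rendering, not the stored value.
shown_in_berlin = to_display(stored_mtime, 2)
shown_in_utc = to_display(stored_mtime, 0)
```

Because comparisons between two files would use the stored numbers directly,
a time zone change would not reorder them.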

[...]
> > I indexed about 650 MB of Project Gutenberg texts using CLucene,
> > indexing the entire files in the first test run and indexing only the
> > first 100 KB in the second test run. In both cases, the only fields I
> > added to the index for each file were the file contents and path. In
> > the first case, the index was more or less the same size as the
> > indexed content (as expected) but once I add a few extra fields to
> > the index, the index will grow much larger than the content it
> > indexes. In the second case, the index was just 85 MB. The quality of
> > search results in both cases was more or less the same.

That's actually not at all what I expected. What data are stored in the 
indices exactly? Are the positions of the contained words stored for each 
file? Otherwise I can't really believe that there will be many more 
distinct words beyond the 100 KB limit.
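
The reasoning can be checked with a toy inverted index. The Python sketch
below is only an illustration and says nothing about CLucene's real on-disk
format; the function names, the synthetic 50-word vocabulary, and the token
counts are all assumptions. It compares an index that records only which
file contains a term with one that also records every position.

```python
from collections import defaultdict


def build_index(docs, with_positions):
    """Map each term to {doc_id: [positions]} or to a set of doc_ids."""
    index = defaultdict(dict) if with_positions else defaultdict(set)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(doc_id, []).append(position)
            else:
                index[term].add(doc_id)
    return index


def posting_count(index, with_positions):
    """Count stored entries: one per position, or one per (term, doc) pair."""
    if with_positions:
        return sum(len(p) for per_doc in index.values() for p in per_doc.values())
    return sum(len(doc_ids) for doc_ids in index.values())


# Synthetic document: a small vocabulary repeated many times.
vocabulary = [f"word{i}" for i in range(50)]
full_text = " ".join(vocabulary[i % 50] for i in range(10_000))
truncated_text = " ".join(full_text.split()[:1_000])

results = {}
for label, text in (("full", full_text), ("truncated", truncated_text)):
    for with_positions in (False, True):
        index = build_index({"book": text}, with_positions)
        results[(label, with_positions)] = (
            len(index),                            # distinct terms
            posting_count(index, with_positions),  # stored entries
        )
```

In this toy setup truncation leaves the number of distinct terms unchanged,
so the index without positions stays the same size, while the positional
index shrinks in proportion to the text that is dropped.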

CU, Ingo
