[haiku-gsoc] Re: Full Text Search and Indexing -- Looking for opinions/comments

  • From: Johannes Wischert <johanneswi@xxxxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Mon, 25 May 2009 19:27:20 +0200

On Mon, May 25, 2009 at 8:17 AM, Ryan Leavengood <leavengood@xxxxxxxxx> wrote:
> On Sat, May 23, 2009 at 9:33 AM, Ankur Sethi <get.me.ankur@xxxxxxxxx> wrote:
>>
>
>> 2. The Indexing Daemon: Will keep the database in sync as files change
>> on disk. It's starting to dawn on me that this might be an area that
>> would require a lot of thought.
>

> The trick will be finding the right mix between keeping things updated
> all the time (which will slow down the overall throughput and speed of
> the system), and periodically doing updates. It would probably be a
> good idea to make this tunable in some way, so you (and others) can
> literally test and find the sweet spot. It might also be smart to
> eventually tie into the future power management framework and perform
> batch updates when the user is definitely idle (and it must be able to
> stop immediately when the user returns.) Though having a very low
> priority for your updating thread(s) may be fine too.
>

How about queuing some of the work while the machine is under heavy
load, and only indexing it when the user actually wants to search for
something (or when the load goes down, of course)?
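Very roughly, something like this is what I mean (plain C++ sketch;
SystemIsBusy() and IndexFile() are just placeholders, not existing API):

#include <deque>
#include <string>

struct IndexJob {
    std::string path;   // file that changed and still needs (re)indexing
};

class DeferredIndexQueue {
public:
    void Push(const IndexJob& job) { fPending.push_back(job); }

    // Called periodically from a low-priority daemon thread, and also right
    // before answering a query, so the results are up to date.
    void Drain(bool userIsSearching)
    {
        while (!fPending.empty()) {
            if (!userIsSearching && SystemIsBusy())
                return;                 // back off, retry when the load drops
            IndexFile(fPending.front());
            fPending.pop_front();
        }
    }

private:
    // Placeholder: plug in a real CPU/disk load check here.
    bool SystemIsBusy() const { return false; }
    // Placeholder: hand the file to the indexing library.
    void IndexFile(const IndexJob& job) {}

    std::deque<IndexJob> fPending;
};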

> Another thing that would be cool is to have the database already set
> up for a default Haiku install. This means that default applications,
> documentation, etc will be immediately indexed and searchable out of
> the box. Combining this with the right setting in the tuning described
> above and the user may NEVER have to endure a long period of harddrive
> scanning.
>

An attribute that allows the indexer to ignore the file would be good, too.
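Something along these lines, say (the attribute name "index:ignore" is
just made up for illustration):

#include <Node.h>
#include <TypeConstants.h>

// Mark a file so the indexing daemon skips it.
status_t SetIndexIgnore(const char* path, bool ignore)
{
    BNode node(path);
    status_t status = node.InitCheck();
    if (status != B_OK)
        return status;

    int32 value = ignore ? 1 : 0;
    ssize_t written = node.WriteAttr("index:ignore", B_INT32_TYPE, 0,
        &value, sizeof(value));
    return written == (ssize_t)sizeof(value) ? B_OK : (status_t)written;
}

// ...and the daemon would check it before indexing a file:
bool ShouldIgnore(BNode& node)
{
    int32 value = 0;
    return node.ReadAttr("index:ignore", B_INT32_TYPE, 0, &value,
        sizeof(value)) == (ssize_t)sizeof(value) && value != 0;
}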

>> The indexing daemon will have a set of plugins that will convert data
>> from different file formats (PDF, ODF, DOC etc.) to a format
>> compatible with the indexing library.
>
> As I'm sure you are aware, this is just like Spotlight (and of course
> is a fine idea.) In the article I linked above it mentioned that
> Spotlight (and apparently Google too) only care about the first 100k
> bytes of each file. So that would probably be a good limit to impose
> on our system. No reason to completely process multi-megabyte PDFs,
> etc.
>

I don't think that's a good idea, since many PDFs are multi-megabyte
files these days (because of embedded graphics, fonts and so on), and I
still want to be able to find a word even if it is in the last chapter
of such a PDF.
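If memory use is the worry, the text-extraction plugins could simply
stream the extracted text to the indexer in bounded chunks instead of
loading whole files. Rough sketch (both functions are hypothetical
placeholders, not existing API):

#include <stddef.h>
#include <sys/types.h>

// Placeholder: pulls the next chunk of plain text out of the document
// (e.g. via a PDF-to-text converter); returns bytes written, 0 at end.
ssize_t ExtractNextTextChunk(char* buffer, size_t size);

// Placeholder: hands one chunk of text to the indexing library.
void FeedToIndexer(const char* text, size_t length);

void IndexWholeDocument()
{
    char buffer[64 * 1024];     // constant memory, however big the file is
    ssize_t length;
    while ((length = ExtractNextTextChunk(buffer, sizeof(buffer))) > 0)
        FeedToIndexer(buffer, length);
}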
