[haiku-gsoc] Re: Full Text Search and Indexing -- Looking for opinions/comments

  • From: Ryan Leavengood <leavengood@xxxxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Mon, 25 May 2009 02:17:04 -0400

On Sat, May 23, 2009 at 9:33 AM, Ankur Sethi <get.me.ankur@xxxxxxxxx> wrote:
>
> I have been looking around for already available information retrieval
> libraries. The two major projects I found are CLucene and Xapian.
> CLucene is the more popular of the two, but I think building it will
> require GCC4. Xapian is under the GPL, so I don't know if that will be
> acceptable. The last option is, of course, writing one from scratch,
> which may not be a good idea given the project timeline.

In general I do not recommend writing something from scratch when there
are already good options. As Axel and Ingo have said, needing GCC4
should not be an issue, so CLucene is probably a good choice. I know
the Java version and the Ruby clone (Ferret) are pretty popular.
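
To give a rough idea of what feeding one file into CLucene might look
like, here is a little sketch. I'm going from memory of the 0.9.x API
(which mirrors Java Lucene), and the index path, field names and text
are just placeholders, so treat it as illustrative rather than
definitive:

#include <CLucene.h>

using namespace lucene::analysis::standard;
using namespace lucene::document;
using namespace lucene::index;

int main()
{
    StandardAnalyzer analyzer;
    // Create (or overwrite) an index in a directory of our choosing.
    IndexWriter writer("/boot/home/config/index", &analyzer, true);

    // One Document per file: store the path, index the extracted text.
    Document doc;
    doc.add(*new Field(_T("path"), _T("/boot/home/notes.txt"),
        Field::STORE_YES | Field::INDEX_UNTOKENIZED));
    doc.add(*new Field(_T("contents"), _T("text pulled out of the file"),
        Field::STORE_NO | Field::INDEX_TOKENIZED));
    writer.addDocument(&doc);

    writer.close();
    return 0;
}

The search side stays just as close to the Java API (IndexSearcher,
QueryParser and friends), so the Lucene documentation should carry over
pretty well.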

> 2. The Indexing Daemon: Will keep the database in sync as files change
> on disk. It's starting to dawn on me that this might be an area that
> would require a lot of thought.

I have probably said this before, but I'll say it again: PLEASE be
sure that this daemon (or server, in the BeOS/Haiku vocabulary) does
not constantly grind the hard drive or otherwise impede the user
experience. It probably goes without saying, but my experience with
indexing tools on Windows and Linux has not been good. I hate hearing
my hard drive going crazy and having the system feel sluggish from all
the I/O. Mac OS X Spotlight seems to do it right (probably no surprise,
since the creator of BFS, Dominic Giampaolo, worked on Spotlight:
http://daringfireball.net/2004/07/spotlight_on_spotlight).

Haiku may have some advantage in this area because of what BFS
provides, as well as the kernel's node monitoring services.
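
Just to illustrate, hooking the daemon into the node monitor could look
something like this (the watched path and application signature are
placeholders, error checking is omitted, and catching content changes
to individual files would need a watch_node() call per file):

#include <Application.h>
#include <Node.h>
#include <NodeMonitor.h>

class IndexerApp : public BApplication {
public:
    IndexerApp()
        : BApplication("application/x-vnd.example-indexer")
    {
        // Watch a directory for entries being created, removed or moved.
        BNode node("/boot/home");
        node_ref ref;
        node.GetNodeRef(&ref);
        watch_node(&ref, B_WATCH_DIRECTORY, this);
    }

    virtual void MessageReceived(BMessage* message)
    {
        if (message->what == B_NODE_MONITOR) {
            int32 opcode;
            if (message->FindInt32("opcode", &opcode) == B_OK) {
                switch (opcode) {
                    case B_ENTRY_CREATED:
                    case B_ENTRY_MOVED:
                        // queue the entry for (re)indexing
                        break;
                    case B_ENTRY_REMOVED:
                        // drop the entry from the index
                        break;
                }
            }
            return;
        }
        BApplication::MessageReceived(message);
    }
};

int main()
{
    IndexerApp app;
    app.Run();
    return 0;
}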

The trick will be finding the right balance between keeping things
updated all the time (which will cut into the overall throughput and
responsiveness of the system) and doing periodic batch updates. It
would probably be a good idea to make this tunable in some way, so you
(and others) can literally test and find the sweet spot. It might also
be smart to eventually tie into the future power management framework
and perform batch updates when the user is definitely idle (stopping
immediately when the user returns). Though simply giving your updating
thread(s) a very low priority may be fine too.
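
For the low priority route the kernel API makes that part easy; just a
sketch, with a hypothetical worker function:

#include <OS.h>

// Hypothetical worker: pull queued files and feed them to the indexer.
static int32
index_worker(void* /*data*/)
{
    // ... index queued files here ...
    return 0;
}

int main()
{
    // Run the updater at low priority so interactive work always wins.
    thread_id worker = spawn_thread(index_worker, "index worker",
        B_LOW_PRIORITY, NULL);
    resume_thread(worker);

    status_t result;
    wait_for_thread(worker, &result);
    return 0;
}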

Another thing that would be cool is to have the database already set
up for a default Haiku install. That way the default applications,
documentation, etc. would be immediately indexed and searchable out of
the box. Combine that with the right tuning settings described above
and the user may NEVER have to endure a long period of hard drive
scanning.

> The indexing daemon will have a set of plugins that will convert data
> from different file formats (PDF, ODF, DOC etc.) to a format
> compatible with the indexing library.

As I'm sure you are aware, this is just like Spotlight (and of course
is a fine idea). The article I linked above mentions that Spotlight
(and apparently Google too) only looks at the first 100 KB or so of
each file. That would probably be a good limit to impose on our
system; there is no reason to completely process multi-megabyte PDFs,
etc.
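
Enforcing that limit in the daemon could be as simple as capping the
read before the data ever reaches a plugin; roughly (the function and
the exact limit are just illustrative):

#include <File.h>

// Read at most the first 100 KB of a file before handing it to a plugin.
ssize_t
ReadHead(const char* path, char* buffer, size_t maxBytes = 100 * 1024)
{
    BFile file(path, B_READ_ONLY);
    if (file.InitCheck() != B_OK)
        return file.InitCheck();
    return file.Read(buffer, maxBytes);
}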

> 3. A Set of Commandline Tools: Refreshing the database, querying the
> database from the commandline, forcing updates etc.
>
> 4. The GUI Front End: Full text indexing functionality will be
> integrated into the already existing search UI. I was also thinking of
> a very minimal UI that could be quickly brought up/dismissed with a
> simple keystroke.

Sounds good.

Regards,
Ryan
