[dokuwiki] Re: Search Index

  • From: Andreas Gohr <andi@xxxxxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Wed, 10 Aug 2005 12:43:55 +0200

Harry Fuecks writes:

This is what I would prefer. The biggest problem I see currently with
this approach is to design an efficient index which is updatable on a per
document basis. (Suggestions welcome)

I guess that depends a lot on how you want to search. Implementing something based purely on matching words probably isn't too difficult. At a first guess, would think having a file for each word might be the way to go (ignore issues of UTF-8 + filesystem or disk space usage for a moment).

This has two problems.

It would create a huge number of files. I think we can expect about 5000
to 25000 unique words for an average wiki (just guessing here). I once
wrote a MySQL-based search engine for one of my websites (look for xinabse
on splitbrain.org). It currently has 1411 single pages indexed (English
and German) and 88319 words. Having this in single files wouldn't
be the greatest idea, I think.

The second problem is updating this kind of index: it's only efficient in one direction. Deleting or updating the index for a single document would mean searching each of these files to remove the document's entry... However, I'm not sure how to really solve this problem. I guess at least two indexes would be needed to have fast resolution both ways.
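
A rough sketch of what "two indexes" could look like, in Python purely for illustration (this is not DokuWiki code, and the class and method names are made up). One mapping goes from word to pages for searching, the other from page to words, so that removing or updating a page only touches the words that page actually contained. It keeps everything in memory and ignores on-disk storage, UTF-8 tokenizing and locking.

```python
from collections import defaultdict


class TwoWayIndex:
    """Toy in-memory index: word -> pages and page -> words (hypothetical names)."""

    def __init__(self):
        self.word_to_pages = defaultdict(set)  # inverted index, used for searching
        self.page_to_words = {}                # forward index, used for updates

    def remove(self, page):
        # The forward index lists exactly which word entries mention this page,
        # so there is no need to scan every word in the index.
        for word in self.page_to_words.pop(page, set()):
            pages = self.word_to_pages[word]
            pages.discard(page)
            if not pages:
                del self.word_to_pages[word]

    def update(self, page, text):
        # Re-indexing a page is "remove the old entries, add the new ones".
        self.remove(page)
        words = set(text.lower().split())  # naive whitespace tokenizer
        self.page_to_words[page] = words
        for word in words:
            self.word_to_pages[word].add(page)

    def lookup(self, word):
        # Return a copy so callers cannot modify the index by accident.
        return set(self.word_to_pages.get(word.lower(), set()))
```

The price is that every page's word list is stored twice, which is the trade-off for being fast in both directions.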

Problem with this index design though is it wouldn't help you find
phrases in a document usefully. Perhaps it would be worth adding the
"word count" of the word in the document, so if someone searches for
"dokuwiki performance" you can assign a higher rank if you find those
two words right next to each other in a document. At the same time,
can imagine that could make searches an order of magnitude slower.

Phrase search could be done the old way. First use the fast index to get all documents containing all the wanted words, then grep these results for the exact phrase. I think this should still be pretty fast.
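
As a sketch of that two-step approach (again Python for illustration only; `phrase_search`, `word_to_pages` and `page_text` are hypothetical names, and the page texts are assumed to be available as plain strings):

```python
def phrase_search(phrase, word_to_pages, page_text):
    """Return the sorted names of pages containing the exact phrase.

    word_to_pages: dict mapping a lowercase word to a set of page names.
    page_text:     dict mapping a page name to its raw text.
    """
    words = phrase.lower().split()
    if not words:
        return []

    # Step 1: fast index lookup, keeping only pages that contain every word.
    candidates = set(word_to_pages.get(words[0], set()))
    for word in words[1:]:
        candidates &= word_to_pages.get(word, set())

    # Step 2: "grep" just the candidates for the exact phrase. This is a
    # plain substring match on whitespace-normalized text, like grep would do.
    needle = " ".join(words)
    return sorted(
        page
        for page in candidates
        if needle in " ".join(page_text[page].lower().split())
    )
```

Only the pages that survive step 1 are ever read, so the cost of the slow scan depends on how many pages share all the words, not on the size of the wiki.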


Anyway - should point out I'm guessing here. Don't have a lot of
experience with building search engines. Those Tim Bray articles are
worth reading though - there's some good thoughts on what type of
information you want to have in an index file.

I had a first look at these documents - they provide some nice background info but unfortunately aren't very technical...


Andi
