[dokuwiki] Re: Search Index

  • From: Harry Fuecks <hfuecks@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Wed, 10 Aug 2005 13:11:55 +0200

On 8/10/05, Andreas Gohr <andi@xxxxxxxxxxxxxx> wrote:
> Harry Fuecks writes:
> 
> >> This is what I would prefer. The biggest problem I see currently with
> >> this approach is to design an efficient index which is updatable on a per
> >> document basis. (Suggestions welcome)
> >
> > I guess that depends a lot on how you want to search. Implementing
> > something based purely on matching words probably isn't too difficult.
> > At a first guess, I would think having a file for each word might be the
> > way to go (ignore issues of UTF-8 + filesystem or disk space usage for
> > a moment).
> 
> This has two problems.
> 
> It would create a huge amount of files.

That's where my disclaimer comes in: "ignoring filesystem or disk space
usage for a moment" ;)

> I think we can expect about 5000 to 25000 unique words for an average
> wiki (just guessing here). I wrote a MySQL-based search engine for one of
> my websites once (look for xinabse on splitbrain.org); it currently has
> 1411 single pages indexed (English and German) and 88319 words - keeping
> these in single files wouldn't be the greatest idea, I think.

There may be a smarter way to do that - rather than using entire words,
use the first "x" letters of each word, so one file contains multiple
words. Ideally this should adapt to the number of records in the file,
to avoid giving PHP large files to parse, but that might be very tricky
to implement. And there's still the issue of non-ASCII characters and
whether the filesystem will support them - perhaps it would be better to
build the filenames from ord() values?
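
Something along these lines is what I'm imagining - just a rough sketch,
and the function names, the ./data/index/ path and the storage format
are all made up:

  <?php
  // Bucket words into index files by the ord() values of their first two
  // bytes, so filenames stay ASCII-safe whatever the word's encoding.
  function idx_filename($word) {
      $dir = './data/index';                      // hypothetical index dir
      $a = ord(substr($word, 0, 1));
      $b = strlen($word) > 1 ? ord(substr($word, 1, 1)) : 0;
      return $dir.'/w_'.$a.'_'.$b.'.idx';         // e.g. w_119_105.idx for "wi..."
  }

  // Each bucket file holds a serialized array of word => list of page ids
  function idx_add_word($word, $pageid) {
      $file  = idx_filename($word);
      $index = file_exists($file) ? unserialize(file_get_contents($file)) : array();
      if (!isset($index[$word])) $index[$word] = array();
      if (!in_array($pageid, $index[$word])) $index[$word][] = $pageid;
      file_put_contents($file, serialize($index));
  }
  ?>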

> 
> The second problem is updating this kind of index - it's only efficient in
> one direction. Deleting or updating the index for a single document would
> mean searching each of these files to remove the document's entries...
> However, I'm not sure how to really solve this problem. I guess at least
> two indexes would be needed to have fast resolution both ways.

Good point. Perhaps this can be done by preserving the last
"index_update" file: for updates you then compare the current word list
with the last "index_update" list to identify deletions. Unfortunately
that's likely to make it harder to re-build the entire index in case of
problems - if you lose the last index_update file, you're in trouble.
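
Roughly like this (again just a sketch - the per-page word-list file and
the idx_remove_word() helper are invented, idx_add_word() as in the
earlier sketch):

  <?php
  // Keep the word list from the last indexing run per page and diff it
  // against the new one, so only the affected bucket files are touched.
  function idx_update_page($pageid, $newwords) {
      $listfile = './data/index/'.$pageid.'.words';  // last "index_update" list
      $oldwords = file_exists($listfile)
                ? explode("\n", trim(file_get_contents($listfile)))
                : array();

      $deleted = array_diff($oldwords, $newwords);   // words gone from the page
      $added   = array_diff($newwords, $oldwords);   // words new to the page

      foreach ($deleted as $w) idx_remove_word($w, $pageid); // not shown here
      foreach ($added   as $w) idx_add_word($w, $pageid);

      // remember the current word list for the next update
      file_put_contents($listfile, join("\n", $newwords));
  }
  ?>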

> I had a first look at these documents - they provide some nice background
> info but unfortunately aren't very technical...

Perhaps Plucene is worth investigating further. There may be some other
Perl modules to get ideas from, too. A quick search on CPAN for "index"
turned up http://cpan.uwinnipeg.ca/htdocs/Search-Indexer/README.html:

"The indexer uses three files in BerkeleyDB format : a) a mapping from
words to wordIds; b) a mapping from wordIds to lists of documents ; c)
a mapping from pairs (docId, wordId) to lists of positions within the
document. This third file holds detailed information and therefore is
quite big ; but it allows us to quickly retrieve "exact phrases"
(sequences of adjacent words) in the document."
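
In PHP terms I imagine the structure looks something like this - pure
guesswork on my part, not how Search::Indexer actually stores things,
and plain arrays instead of BerkeleyDB:

  <?php
  $word2id   = array('wiki' => 1, 'search' => 2);          // a) word -> wordId
  $id2docs   = array(1 => array(10, 12), 2 => array(12));  // b) wordId -> docIds
  $positions = array(                                      // c) "docId:wordId" -> positions
      '12:1' => array(0, 57),
      '12:2' => array(1),
  );

  // "Exact phrase" lookup for two words: find docs containing both, then
  // check whether the second word appears directly after the first.
  $w1 = $word2id['wiki'];
  $w2 = $word2id['search'];
  foreach (array_intersect($id2docs[$w1], $id2docs[$w2]) as $doc) {
      foreach ($positions["$doc:$w1"] as $p) {
          if (in_array($p + 1, $positions["$doc:$w2"])) {
              echo "phrase 'wiki search' found in doc $doc\n";
              break;
          }
      }
  }
  ?>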

Will have a look around
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
