[dokuwiki] Re: Search Index

  • From: Harry Fuecks <hfuecks@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Mon, 8 Aug 2005 16:11:37 +0200

Another project to look at might be PLucene, a Perl port of
Lucene (with no Java dependencies, I believe):

http://search.cpan.org/dist/Plucene/
http://www.perl.com/pub/a/2004/02/19/plucene.html

Also recommend reading this by Tim Bray:
http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

Regarding the actual building of search indexes, I personally think it
has to be done as a "batch" job (e.g. via a cron job every X
minutes). For those without access to cron, there's pseudocron:
http://www.bitfolge.de/pseudocron-en.html - basically you use
something like an image embedded in the page to run a PHP script "in
the background", so users never notice that the server is doing some
serious number crunching and experience no delay.
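To make that concrete, here's a rough sketch of what the batch job
itself might look like - everything here (paths, file layout, the
tokenizing) is made up for illustration, not DokuWiki's actual
internals:

  <?php
  // Batch indexer, run from cron (or pseudocron) every X minutes.
  // Rebuilds the whole word index from scratch on each run.
  $pageDir   = '/var/www/dokuwiki/data/pages';    // assumed layout
  $indexFile = '/var/www/dokuwiki/data/cache/word.idx';

  $index = array(); // word => list of page ids
  foreach (glob($pageDir . '/*.txt') as $file) {
      $pageId = basename($file, '.txt');
      $words  = preg_split('/\W+/', strtolower(file_get_contents($file)),
                           -1, PREG_SPLIT_NO_EMPTY);
      foreach (array_unique($words) as $w) {
          $index[$w][] = $pageId;
      }
  }

  // Write to a temp file and rename() over the old index, so a search
  // running at the same moment never reads a half-written file.
  file_put_contents($indexFile . '.tmp', serialize($index));
  rename($indexFile . '.tmp', $indexFile);
  ?>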

The other approach is to have updates to a given page trigger an
update to the search indices. This probably results in more
efficient execution - instead of one process scanning massive amounts
of data, you get incremental processing of a small set of data (from
a single page). The problem is that it can be very hard to implement
without risking race conditions, where two sets of updates from
different pages compete with each other to update the indices.
That said, for a wiki where there are only a few updates
going on, this may not be a real problem. It may also be avoidable
depending on the actual design of the search indices and the data they
contain; in particular, if there are relationships to maintain - if an
update to page X means that related updates have to be made for pages
Y and Z - it gets hard.
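If you did go the trigger-on-save route, the obvious defence is an
exclusive lock around the read-modify-write of the index - something
like the following sketch (again, the index format and function name
are invented for illustration):

  <?php
  // Called from the page-save handler. flock() serialises writers,
  // but it's advisory only, and every save blocks on the same lock.
  function updateIndexForPage($pageId, $text, $indexFile) {
      if (!file_exists($indexFile)) touch($indexFile);
      $fp = fopen($indexFile, 'r+');
      flock($fp, LOCK_EX);               // wait for exclusive access

      $raw   = stream_get_contents($fp);
      $index = $raw ? unserialize($raw) : array();

      // Drop the page from every word list, then re-add its current words
      foreach ($index as $w => $pages) {
          $index[$w] = array_diff($pages, array($pageId));
      }
      $words = preg_split('/\W+/', strtolower($text),
                          -1, PREG_SPLIT_NO_EMPTY);
      foreach (array_unique($words) as $w) {
          $index[$w][] = $pageId;
      }

      ftruncate($fp, 0);
      rewind($fp);
      fwrite($fp, serialize($index));
      flock($fp, LOCK_UN);
      fclose($fp);
  }
  ?>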

A middle ground might be: when a page gets updated, it places some kind
of "update message", containing instructions for how to update the
indices, in a "queue" (which might simply be a directory ordered by
filemtime). An "offline" (or "out-of-band", like pseudocron) job
processes these changes and is the only process allowed to modify the
indices, which avoids most of the trouble with file locking. That
could work out pretty efficiently, although it will need careful
design - it's potentially easy to break a system like this, and hard
to debug when it is broken.
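In code, the queue could be as simple as one file per save plus a
consumer that sorts by mtime - a sketch, with the directory name and
message format invented:

  <?php
  // Producer: called from the page-save handler, just drops a message.
  function queueIndexUpdate($queueDir, $pageId) {
      $msg = array('page' => $pageId, 'time' => time());
      // uniqid() stops two quick saves from overwriting each other
      file_put_contents($queueDir . '/' . uniqid('upd_', true),
                        serialize($msg));
  }

  // Consumer: the offline job - the ONLY process touching the indices.
  function processQueue($queueDir) {
      $jobs = glob($queueDir . '/upd_*');
      usort($jobs, function ($a, $b) {   // oldest first
          return filemtime($a) - filemtime($b);
      });
      foreach ($jobs as $job) {
          $msg = unserialize(file_get_contents($job));
          // ... apply the index update for $msg['page'] here ...
          unlink($job);  // remove only after the update succeeded
      }
  }
  ?>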

One other implementation point there - if updates to the page are
going to be used to trigger something, I'd strongly recommend aiming
early for a code design that's easily "pluginable": there could be
demand for building other types of indexes when a page gets updated
(e.g. a list of the other pages it links to).
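By "pluginable" I mean something like a simple hook registry on the
save event, so new index builders can be bolted on without touching
the core save code - a sketch, with all the names invented:

  <?php
  // Anything interested in page saves registers a callback here.
  $GLOBALS['saveHooks'] = array();

  function registerSaveHook($callback) {
      $GLOBALS['saveHooks'][] = $callback;
  }

  // The core calls this once, after writing the page to disk.
  function firePageSaved($pageId, $text) {
      foreach ($GLOBALS['saveHooks'] as $hook) {
          call_user_func($hook, $pageId, $text);
      }
  }

  // Example plugin: record which pages this page links to
  registerSaveHook(function ($pageId, $text) {
      preg_match_all('/\[\[([^\]|#]+)/', $text, $m);  // [[wiki links]]
      file_put_contents("data/meta/$pageId.links",
                        implode("\n", array_unique($m[1])));
  });
  ?>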

A side note - was reading this article
http://www.zend.com/pecl/tutorials/sdo.php - it isn't really ready
for use except for those willing to install "bleeding edge" PHP
versions, but it sounds like it would handle management of search
indexes pretty well, helping avoid race conditions - but maybe I
misunderstood.
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
