[dokuwiki] Re: Search Index

  • From: "Chris Smith" <chris@xxxxxxxxxxxxx>
  • To: <dokuwiki@xxxxxxxxxxxxx>
  • Date: Thu, 11 Aug 2005 08:39:39 +0100

Hi,

First let me say, I don't know a whole lot about search engine
implementation, so feel free to take this with a pinch of salt or to fall
about laughing.

A couple of left field ideas.

Why not use the parser to drive your search index updates? A straight
update based on the text files isn't guaranteed to give you an accurate set
of terms, because the raw files still contain wiki syntax. Using the parser
would also make it easy to assign different weightings to individual terms
based on their context (e.g. within a header, within a level 1 block).
There wouldn't be a lot to write, as most modes could reuse the same method.
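
To make that a bit more concrete, here is a rough sketch in Python (not
DokuWiki code). It assumes the parser hands over (context, text) pairs; the
context names, the weight values and the function names are all made up for
illustration.

```python
import re
from collections import Counter

# Hypothetical weights per parser context; the values are illustrative only.
CONTEXT_WEIGHTS = {"header": 5, "text": 1}

def terms_for(context, text):
    """Shared handler most modes could reuse: term -> weight for one chunk."""
    weight = CONTEXT_WEIGHTS.get(context, 1)
    out = Counter()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        out[word] += weight
    return out

def render_search(instructions):
    """Aggregate term weights over the (context, text) pairs of one page."""
    total = Counter()
    for context, text in instructions:
        total.update(terms_for(context, text))
    return total

# Assumed parser output for a tiny page: one header, one paragraph.
page = [
    ("header", "Search Index"),
    ("text", "The search index maps terms to pages."),
]
weights = render_search(page)
```

A term that appears in both the header and the body simply accumulates both
weights, so header words float to the top without any special casing.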

After each page modification, a search update would then consist of running
the old revision and the new revision through the renderer with a mode* of
"search" (rather than xhtml). Each individual mode would output term +
weight, and the overall output would be an aggregation of those terms and
weights. A single page update would be the new revision (terms & weights)
less the old revision (terms & weights), applied to the index. For minor
updates that would be virtually no work at all. If the indexing were somehow
spawned by the page update rather than running with it, it should fit pretty
easily within 8MB and 30 seconds.
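
The "new less old" step could look something like the Python sketch below.
It assumes each revision has already been reduced to a term -> weight
mapping and that the index is held in memory as term -> {page: weight};
both structures and the function names are my own invention, not anything
that exists in DokuWiki.

```python
def page_delta(old_terms, new_terms):
    """Return term -> weight change (new minus old), dropping unchanged terms."""
    delta = {}
    for term in set(old_terms) | set(new_terms):
        change = new_terms.get(term, 0) - old_terms.get(term, 0)
        if change:
            delta[term] = change
    return delta

def apply_delta(index, page_id, delta):
    """Apply one page's delta to an index of term -> {page_id: weight}."""
    for term, change in delta.items():
        postings = index.setdefault(term, {})
        weight = postings.get(page_id, 0) + change
        if weight > 0:
            postings[page_id] = weight
        else:
            # The term no longer occurs on this page: drop the posting,
            # and drop the term itself once no page refers to it.
            postings.pop(page_id, None)
            if not postings:
                del index[term]

old = {"search": 6, "index": 6, "slow": 1}
new = {"search": 6, "index": 6, "fast": 1}
index = {
    "search": {"wiki:start": 6},
    "index": {"wiki:start": 6},
    "slow": {"wiki:start": 1},
}
delta = page_delta(old, new)
apply_delta(index, "wiki:start", delta)
```

For this one-word edit the delta holds only two entries, so only two index
terms are touched, which is the "virtually no work" case.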

For those so inclined, a keyword syntax plugin could be made to provide
heavily weighted words that don't display in the xhtml output.

On search indexes: one page index file, plus one file for each first
character of a term (or its file system safe equivalent), doesn't sound too
bad given the numbers you are talking about. For a search on two or three
terms that's not many files to process. Each file consists of lines of the
form term - idx to page,weight; ... (text or binary). Simple and possibly
slow, but it would get an interface in place.
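
As a sketch of the text variant of that layout (Python, and only my guess at
the details): the ".idx" extension, the "_" bucket for awkward first
characters and the exact separators are assumptions, and page idx is taken
to be a line number in the page index file.

```python
def bucket_name(term):
    """File for a term: one bucket per file system safe first character."""
    first = term[0].lower()
    safe = first if first.isascii() and first.isalnum() else "_"
    return safe + ".idx"

def format_line(term, postings):
    """Serialise {page_idx: weight} as 'term - idx,weight; idx,weight'."""
    pairs = "; ".join(f"{idx},{weight}" for idx, weight in sorted(postings.items()))
    return f"{term} - {pairs}"

def parse_line(line):
    """Inverse of format_line: return (term, {page_idx: weight})."""
    term, _, rest = line.partition(" - ")
    postings = {}
    for pair in rest.split(";"):
        if not pair.strip():
            continue
        idx, weight = pair.strip().split(",")
        postings[int(idx)] = int(weight)
    return term, postings

# Page index file: the line number is the idx stored in the term files.
pages = ["wiki:start", "wiki:syntax"]
line = format_line("search", {0: 6, 1: 2})
term, postings = parse_line(line)
hits = {pages[idx]: weight for idx, weight in postings.items()}
```

A search for "search" would then open only the one bucket file, scan for the
matching line and resolve the idx values through the page index.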

Throw in some localisation scripts to eliminate common words from your index
and to match words to their derivatives, equate synonyms, etc.
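
Something along these lines, as a deliberately naive Python sketch. The
stop word list, the synonym table and the suffix stripping are placeholders
I made up for English; a real localisation script would need proper rules
per language.

```python
# All three tables are hypothetical examples, not a real word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}
SYNONYMS = {"lookup": "search"}
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Crude derivative matching: strip one common suffix, keep >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalise(words):
    """Drop common words, reduce derivatives, then equate synonyms."""
    out = []
    for word in words:
        word = word.lower()
        if word in STOP_WORDS:
            continue
        word = stem(word)
        out.append(SYNONYMS.get(word, word))
    return out

terms = normalise(["The", "indexing", "of", "pages", "and", "lookups"])
```

The same normalise step would have to run on both the indexed terms and the
search query, otherwise the two would never match.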




Chris

*I wish we could change renderer's "mode" to "format" and avoid confusion
with syntax modes.

-- 
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
