Harry Fuecks writes:
This is what I would prefer. The biggest problem I currently see with this approach is designing an efficient index that can be updated on a per-document basis. (Suggestions welcome)
I guess that depends a lot on how you want to search. Implementing something based purely on matching words probably isn't too difficult. At a first guess, I would think having a file for each word might be the way to go (ignoring issues of UTF-8 + filesystem or disk space usage for a moment).
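To make the file-per-word idea concrete, here's a minimal sketch in Python (the `INDEX_DIR` name and the append-only one-page-id-per-line file format are my own assumptions, not anything DokuWiki does). Note that this naive version only ever appends, which is exactly why per-document updates are hard: removing or re-indexing a page means rewriting every word file that mentions it.

```python
import os
import re

INDEX_DIR = "index"  # hypothetical directory holding one file per word

def tokenize(text):
    # Lowercase and split on non-word characters.
    return [w for w in re.split(r"\W+", text.lower()) if w]

def index_page(page_id, text):
    """Append this page's id to the index file of every word it contains."""
    os.makedirs(INDEX_DIR, exist_ok=True)
    for word in set(tokenize(text)):
        with open(os.path.join(INDEX_DIR, word), "a") as f:
            f.write(page_id + "\n")

def search(word):
    """Return the ids of all pages containing the word."""
    try:
        with open(os.path.join(INDEX_DIR, word.lower())) as f:
            return set(f.read().split())
    except FileNotFoundError:
        return set()
```

Lookup is then just one file read per query word, which is the appealing part; the update story is the weak part.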
This has a big problem, though: it would create a huge number of files. I'd expect roughly 5000 to 25000 unique words for an average wiki (just guessing here). I once wrote a MySQL-based search engine for one of my websites (look for xinabse on splitbrain.org); it currently has 1411 pages indexed (English and German) and 88319 words. Storing those as individual files wouldn't be the greatest idea, I think.
The problem with this index design, though, is that it wouldn't help you usefully find phrases in a document. Perhaps it would be worth adding the "word count" of each word in the document, so if someone searches for "dokuwiki performance" you can assign a higher rank if you find those two words right next to each other in a document. At the same time, I can imagine that could make searches an order of magnitude slower.
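As a sketch of that ranking idea: to actually detect two query words sitting next to each other you'd need word positions, not just counts, so the following assumes positions are stored (that, and the size of the adjacency bonus, are my assumptions).

```python
import re

def positions(text):
    """Map each word to the list of positions where it occurs."""
    pos = {}
    for i, w in enumerate(re.split(r"\W+", text.lower())):
        if w:
            pos.setdefault(w, []).append(i)
    return pos

def proximity_score(text, query_words):
    """Count occurrences of the query words, with a bonus whenever
    two consecutive query words appear directly adjacent in the text."""
    pos = positions(text)
    score = sum(len(pos.get(w, [])) for w in query_words)
    for a, b in zip(query_words, query_words[1:]):
        for p in pos.get(a, []):
            if p + 1 in pos.get(b, []):
                score += 10  # arbitrary adjacency bonus
    return score
```

So "dokuwiki performance" as a literal phrase would outrank a page where the two words are far apart, at the cost of the extra position lookups per match.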
Anyway, I should point out I'm guessing here. I don't have a lot of experience with building search engines. Those Tim Bray articles are worth reading, though; there are some good thoughts in them on what type of information you want to have in an index file.