On 8/10/05, Andreas Gohr <andi@xxxxxxxxxxxxxx> wrote:
> Harry Fuecks writes:
>
> >> This is what I would prefer. The biggest problem I see currently with
> >> this approach is to design an efficient index which is updatable on a
> >> per document basis. (Suggestions welcome)
> >
> > I guess that depends a lot on how you want to search. Implementing
> > something based purely on matching words probably isn't too difficult.
> > At a first guess, I would think having a file for each word might be
> > the way to go (ignore issues of UTF-8 + filesystem or disk space usage
> > for a moment).
>
> This has two problems.
>
> It would create a huge amount of files.

That's where my disclaimer comes in: "ignoring filesystem or disk space
usage for a moment" ;)

> I think we can expect about 5000 to 25000 unique words for an average
> wiki (just guessing here). I wrote a mysql based search engine for one
> of my websites once (look for xinabse on splitbrain.org); it currently
> has 1411 single pages indexed (English and German) and 88319 words -
> having this in single files wouldn't be the greatest idea, I think.

There may be a smarter way to do that - rather than using entire words,
use the first "x" letters, so one file contains multiple words. Ideally
this should adapt to the number of records in the file, to avoid giving
PHP large files to parse, but that might be very tricky to implement.
And there's still the issue of non-ASCII characters and whether the
filesystem will support them. Perhaps it would be better to use the
ord() function?

> The second problem is how to update this kind of index - it's only
> efficient in one direction. Deleting or updating the index for a single
> document would mean searching each of these files to remove the
> document's entry... However, I'm not sure how to really solve this
> problem. I guess at least two indexes would be needed to have fast
> resolution both ways.

Good point. Perhaps this can be done by preserving the last
"index_update" file.
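To make the idea concrete, here is a rough sketch (in Python rather than
PHP, purely for illustration - names, file layout and tokenizer are all
my own assumptions, not DokuWiki code). It buckets words into index
files by the ord() value of their first character, and keeps a saved
per-page word list so an update or delete only touches the buckets whose
words actually changed, instead of scanning every file:

```python
# Hypothetical sketch: prefix-bucketed inverted index with per-page
# word lists so updates/deletes don't scan every bucket file.
import os
import re
import tempfile

# Assumed layout: <dir>/w<ord>.idx buckets and <dir>/page_<id>.words lists.
INDEX_DIR = tempfile.mkdtemp(prefix="idx_sketch_")

def bucket_path(word):
    """Map a word to a bucket file via ord() of its first character,
    sidestepping non-ASCII filename issues."""
    return os.path.join(INDEX_DIR, "w%d.idx" % ord(word[0]))

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def load_bucket(path):
    """Bucket file format (assumed): word<TAB>pageid,pageid,... per line."""
    entries = {}
    if os.path.exists(path):
        for line in open(path, encoding="utf-8"):
            word, _, pages = line.rstrip("\n").partition("\t")
            entries[word] = set(filter(None, pages.split(",")))
    return entries

def save_bucket(path, entries):
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(entries):
            if entries[word]:
                f.write("%s\t%s\n" % (word, ",".join(sorted(entries[word]))))

def update_page(page_id, text):
    new_words = tokenize(text)
    # Word list saved at the previous index update (the "index_update" idea).
    wl_path = os.path.join(INDEX_DIR, "page_%s.words" % page_id)
    old_words = set()
    if os.path.exists(wl_path):
        old_words = set(open(wl_path, encoding="utf-8").read().split())
    # Only buckets for removed words need the page taken out...
    for word in old_words - new_words:
        path = bucket_path(word)
        entries = load_bucket(path)
        entries.get(word, set()).discard(page_id)
        save_bucket(path, entries)
    # ...and only buckets for added words need the page put in.
    for word in new_words - old_words:
        path = bucket_path(word)
        entries = load_bucket(path)
        entries.setdefault(word, set()).add(page_id)
        save_bucket(path, entries)
    with open(wl_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(new_words)))

def lookup(word):
    return load_bucket(bucket_path(word)).get(word.lower(), set())
```

The per-page word-list file is exactly the single point of failure noted
below: lose it and you can no longer tell which bucket entries belong to
an old version of the page.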
For updates you then need to compare the current word list with the last
"index_update" list to identify deletions. Unfortunately, that's likely
to make it harder to re-build the entire index in case of problems - if
you lose the last index_update file, you're in trouble.

> I had a first look at these documents - they provide some nice
> background info but unfortunately aren't very technical...

Perhaps Plucene is worth more investigation. There may be some other
Perl modules to get ideas from. A quick search on CPAN for "index"
turned up http://cpan.uwinnipeg.ca/htdocs/Search-Indexer/README.html:

"The indexer uses three files in BerkeleyDB format: a) a mapping from
words to wordIds; b) a mapping from wordIds to lists of documents; c) a
mapping from pairs (docId, wordId) to lists of positions within the
document. This third file holds detailed information and therefore is
quite big; but it allows us to quickly retrieve "exact phrases"
(sequences of adjacent words) in the document."

Will have a look around

--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
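P.S. For anyone curious how those three mappings combine to answer an
exact-phrase query, here is a minimal in-memory sketch (Python dicts
standing in for the BerkeleyDB files; the names are mine, not
Search::Indexer's API):

```python
# In-memory analogue of Search::Indexer's three BerkeleyDB files.
word_ids = {}   # (a) word -> wordId
postings = {}   # (b) wordId -> set of docIds
positions = {}  # (c) (docId, wordId) -> list of positions in the doc

def index_doc(doc_id, text):
    for pos, word in enumerate(text.lower().split()):
        wid = word_ids.setdefault(word, len(word_ids))
        postings.setdefault(wid, set()).add(doc_id)
        positions.setdefault((doc_id, wid), []).append(pos)

def phrase_search(phrase):
    """Docs containing the words as an exact phrase, via mapping (c)."""
    wids = [word_ids.get(w) for w in phrase.lower().split()]
    if not wids or None in wids:
        return set()
    # Candidate docs must contain every word (mapping b)...
    docs = set.intersection(*(postings[w] for w in wids))
    hits = set()
    for doc in docs:
        # ...and the words' positions must be adjacent (mapping c).
        for p in positions[(doc, wids[0])]:
            if all(p + i in positions[(doc, w)] for i, w in enumerate(wids)):
                hits.add(doc)
                break
    return hits
```

Mapping (c) is what makes the phrase check cheap: no document text is
re-read at query time, only position lists are compared.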