[dokuwiki] Re: search improvements

On 31 August at 01:58, Chris Smith wrote:

> Hi,

  Hi Chris,


> My preference would be to run with the utf8 algorithm.

  +1

> For my test wiki two similar search terms producing similar results, but 
> selected from opposite ends of an 11,000+ word index resulted in a 
> doubling of the search time.  Guy found similar results for single 
> search terms at opposite ends of his ~10,000 word index.  It would seem 
> the bigger the wiki, the more words likely to get in the index, the 
> slower, on average, searching is likely to be.  Ideas on improving this 
> are welcome :-)

  Use a db to store these indexes? :-)

  AFAIU, a search on a word/pagename first finds the line number at
  which the word appears in word.idx, then reads the line with the
  same number from index.idx (line numbers in word.idx match those in
  index.idx), which lists one or more pageid:hits-in-page entries.
  As the content of a wiki grows, both index.idx and word.idx
  inflate. I have a 100 MB wiki with 6k files, and had to build the
  index using an external shell script (no php binary available on
  the target host), but it was useless, as php needed more than 80 MB
  of memory to process the two 19 MB index files.
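
  To make the two-file lookup concrete, here is a rough Python sketch
  of it as I understand it. The function names are made up, and the
  whitespace-separated "pageid:hits" layout of an index.idx line is an
  assumption borrowed from the example format further down, not
  necessarily what the real files contain. It streams both files line
  by line instead of loading them whole:

```python
from itertools import islice


def find_word_line(word_lines, word):
    """Return the 0-based line number of `word` in word.idx, or None."""
    for lineno, line in enumerate(word_lines):
        if line.strip() == word:
            return lineno
    return None


def parse_postings(line):
    """Parse 'pageid:hits pageid:hits ...' into a {pageid: hits} dict.

    Assumed layout: whitespace-separated tokens, hits after the last colon.
    """
    postings = {}
    for token in line.split():
        pageid, hits = token.rsplit(":", 1)
        postings[pageid] = int(hits)
    return postings


def lookup(word_lines, index_lines, word):
    """Look `word` up in word.idx, then read the same line of index.idx."""
    lineno = find_word_line(word_lines, word)
    if lineno is None:
        return {}
    # Skip to line `lineno` of index.idx without keeping earlier lines.
    line = next(islice(index_lines, lineno, lineno + 1), "")
    return parse_postings(line)
```

  Both arguments can be open file objects, so memory use stays flat,
  but a word near the end of word.idx still costs a scan of nearly
  both files.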

  How about merging the index.idx and word.idx files to:

    index.idx
      word1 pageid1:hits pageid2:hits pageid5:hits pageid6:hits
      word2 pageid2:hits pageid3:hits pageid4:hits
      word3 pageid3:hits

  that is, with the word itself as the first column instead of
  nothing (where the key is implicitly the line number), and thus
  avoid loading a word.idx file that grows at least O(n)?
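
  A lookup against the merged layout could then be a single pass over
  one file. A minimal sketch in Python, assuming each line is
  "word pageid:hits pageid:hits ..." separated by spaces
  (lookup_merged is a made-up name):

```python
def lookup_merged(index_lines, word):
    """Scan a merged index for `word` and return its {pageid: hits} dict.

    Assumed line layout: the word, then whitespace-separated
    'pageid:hits' tokens. Returns {} when the word is not indexed.
    """
    for line in index_lines:
        head, _, rest = line.strip().partition(" ")
        if head != word:
            continue
        postings = {}
        for token in rest.split():
            pageid, hits = token.rsplit(":", 1)
            postings[pageid] = int(hits)
        return postings
    return {}
```

  This is still a linear scan, so a word at the end of the file stays
  slower to find than one at the start, but only one file is read and
  nothing has to be held in memory besides the current line.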

  Maybe the page.idx file should store a unique number (id) for each
  page, instead of using the line number in this file as the pageid
  (pid). This could let us call a page by its id (pid) and shorten
  very long URLs.
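
  As a sketch of the explicit-id idea, assuming a hypothetical
  page.idx layout of "id pagename" per line (both the layout and the
  function names below are invented for illustration):

```python
def load_pages(lines):
    """Read 'id pagename' lines into a {id: pagename} dict."""
    pages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pid, name = line.split(None, 1)
        pages[int(pid)] = name
    return pages


def add_page(pages, name):
    """Return the id of `name`, allocating a new one if it is unknown.

    New ids are one above the highest id in use, so existing ids are
    never renumbered.
    """
    for pid, existing in pages.items():
        if existing == name:
            return pid
    pid = max(pages, default=0) + 1
    pages[pid] = name
    return pid
```

  With ids stored explicitly, removing a page's line does not shift
  the ids of the pages after it, which is what would make a pid safe
  to use in a short URL.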

-- 
  bug

--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
