[dokuwiki] Re: First working indexing function

  • From: "Chris Smith" <chris@xxxxxxxxxxxxx>
  • To: <dokuwiki@xxxxxxxxxxxxx>
  • Date: Sun, 14 Aug 2005 23:20:39 +0100

> I tested with two pages (wiki:syntax and wiki:parser) so
> far. The biggest problem currently is splitting the raw data
> into words. It needed about 20 seconds to index wiki:parser
> (the biggest available page). Of these 20 seconds, 17 were
> used in utf8_stripspecials(). So we definitely need some
> tuning here. Does anyone have an idea?

I get a 35-40% improvement by splitting the page into tokens, combining the
token list, and then applying utf8_stripspecials() only to that list of
tokens. Strangely, I ended up with only 880 words this way, compared with
1270 words found using the original method; possibly that accounts for some
of the time saving.
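In rough code, the idea is something like the following. This is only a
sketch of the approach, not the source linked below; the whitespace split is
simplified, and the utf8_stripspecials() signature is assumed from
inc/utf8.php.

function idx_tokenize_first($text){
    // split on whitespace first - this is the cheap part
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // combine duplicate tokens so the expensive call runs once per unique token
    $tokens = array_unique($tokens);
    $words = array();
    foreach($tokens as $t){
        // only now strip the special characters
        $t = utf8_stripspecials($t, ' ');
        foreach(explode(' ', $t) as $w){
            if($w !== '') $words[] = $w;
        }
    }
    return $words;
}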

On my system, the original idx_getPageWords takes about 6.1 seconds to run
on wiki:parser.  After the change it took about 3.9 seconds.  The majority
of that time is taken up in splitting the page into tokens (3.1 seconds).

I hacked together a quick indexer extension of the renderer class; it runs
in 1.5 seconds and finds 1040 words (a rough outline follows below).  The
missing words here are most likely from URLs.

Rather than post the code around, I have added source to
http://wiki.jalakai.co.uk/wiki:searching
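For anyone skimming the archive, the shape of such a renderer extension is
roughly this. It is only a sketch of the idea, not the source at the link
above; Doku_Renderer and its cdata() hook are assumed from
inc/parser/renderer.php.

require_once(DOKU_INC.'inc/parser/renderer.php');

class Doku_Renderer_index extends Doku_Renderer {
    var $words = array();

    // the parser hands plain text through cdata()
    function cdata($text){
        // strip specials per text chunk instead of over the whole raw page
        $text = utf8_stripspecials($text, ' ');
        foreach(explode(' ', $text) as $w){
            if($w !== '') $this->words[] = $w;
        }
    }
}

Links and the like come through other renderer methods, which is presumably
where the words from URLs go missing.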

Cheers,

Chris

-- 
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
