[dokuwiki] Re: search improvements

Andreas Gohr wrote:
On Sat, 26 Aug 2006 22:48:01 +0100
Chris Smith <chris@xxxxxxxxxxxxx> wrote:


Also this morning when I attacked the problem again, the upshot was:
- the main problem was aligning the start and end of the context
snippet with a utf-8 character boundary.

Okay. I'm not sure I understand why this problem can't simply be
solved by stripping broken multibyte sequences from the start and
end of the context. But your clever reindex function already
solves it anyway.
The strip function isn't as fast as the reindex, and it looks at every character in the string when only the first and last can be a problem.
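
To illustrate the point about only the two ends mattering, here is a minimal sketch in Python (not DokuWiki's PHP, and not the actual patch). The names `is_continuation`, `align_start`, `align_end` and `snippet` are made up for this example. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx, so each cut point can be nudged onto a character boundary without scanning the rest of the string.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte is a UTF-8 continuation byte (10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def align_start(data: bytes, pos: int) -> int:
    """Move a start offset forward until it sits on a character boundary."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def align_end(data: bytes, pos: int) -> int:
    """Move an exclusive end offset backward so no character is cut in half."""
    while 0 < pos < len(data) and is_continuation(data[pos]):
        pos -= 1
    return pos


def snippet(data: bytes, start: int, end: int) -> str:
    """Cut data[start:end], adjusting only the two ends to valid boundaries."""
    s = align_start(data, start)
    e = max(s, align_end(data, end))
    return data[s:e].decode("utf-8")


if __name__ == "__main__":
    text = "aé€b".encode("utf-8")  # byte lengths: 1 + 2 + 3 + 1
    print(snippet(text, 2, 6))     # both raw offsets are adjusted if needed
```

Each adjustment moves at most three bytes, since a UTF-8 character is at most four bytes long, so the cost does not depend on the snippet length.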
- a secondary problem is that in multibyte utf-8 text the number of characters returned in the snippet will be less than 100, perhaps as low as 33 or even 25 in some alphabets/writing systems.

I don't really think this is a problem. We could increase the context
size a little bit, eg. to 70 bytes and we should get enough context for
all languages.

I have solved that problem in one of the later patches, i.e. the context is counted in utf8 characters rather than bytes. It's not included in Guy's analysis, but my testing indicated it's not significantly different from the other new algorithms, i.e. using any one of the three, ft_snippet is no longer a significant contributor to the page execution time.
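
For anyone following along, a rough sketch of what counting in characters rather than bytes can look like. It is written in Python for brevity, it is not the code from the patch, and `chars_in_budget`, `advance_chars`, `retreat_chars` and `context` are hypothetical names. The first helper just reproduces the arithmetic behind the 33 and 25 figures: a 100-byte budget holds 100 // 3 characters of a 3-byte script and 100 // 4 of a 4-byte one.

```python
def chars_in_budget(byte_budget: int, bytes_per_char: int) -> int:
    """How many whole characters fit in a fixed byte budget."""
    return byte_budget // bytes_per_char


def advance_chars(data: bytes, pos: int, n: int) -> int:
    """Byte offset n characters after pos (pos must be on a boundary)."""
    while n > 0 and pos < len(data):
        pos += 1
        # Skip the continuation bytes (10xxxxxx) of the current character.
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
        n -= 1
    return pos


def retreat_chars(data: bytes, pos: int, n: int) -> int:
    """Byte offset n characters before pos (pos must be on a boundary)."""
    while n > 0 and pos > 0:
        pos -= 1
        # Step back over continuation bytes to reach the lead byte.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        n -= 1
    return pos


def context(data: bytes, match_start: int, match_end: int, n_chars: int) -> str:
    """The match plus n_chars characters of context on each side."""
    start = retreat_chars(data, match_start, n_chars)
    end = advance_chars(data, match_end, n_chars)
    return data[start:end].decode("utf-8")


if __name__ == "__main__":
    print(chars_in_budget(100, 3), chars_in_budget(100, 4))
    text = "日本語のテキスト".encode("utf-8")  # 8 characters, 3 bytes each
    print(context(text, 9, 12, 2))           # match is the 4th character
```

Because the walk starts at the match and stops after a fixed number of characters, the snippet has the same visible length in every script, and the boundary problem does not arise in the first place.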


Sorry about the large number of patches. I never really intended to spend much time on this. However, shortly after each time I put it down, I'd get another idea. :-)

Cheers,

Chris
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
