[dokuwiki] Re: search improvements
- From: Chris Smith <chris@xxxxxxxxxxxxx>
- To: dokuwiki@xxxxxxxxxxxxx
- Date: Sun, 27 Aug 2006 23:01:03 +0100
Andreas Gohr wrote:
On Sat, 26 Aug 2006 22:48:01 +0100
Chris Smith <chris@xxxxxxxxxxxxx> wrote:
Also this morning when I attacked the problem again, the upshot was:
- the main problem was aligning the start and end of the context
snippet with a utf-8 character boundary.
Okay. I'm not sure if I understand why this is a problem which can't be
simply solved by stripping broken multibyte sequences from the start and
end of the context. But your clever reindex function already
solves it anyway.
The strip function isn't as fast as the reindex and it looks at all the
characters in the string, when only the first and last can be a problem.
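To illustrate the point about only the two ends needing attention, here is a minimal sketch in Python. It is not the DokuWiki code; the function names and the byte-offset interface are assumptions made for the example. It only nudges the start and end offsets onto character boundaries and never scans the middle of the string.

```python
from typing import Tuple


def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def align_to_utf8_boundaries(data: bytes, start: int, end: int) -> Tuple[int, int]:
    """Hypothetical helper: adjust [start, end) so the slice holds whole characters."""
    start = max(0, min(start, len(data)))
    end = max(start, min(end, len(data)))
    # Move the start forward past any continuation bytes of a cut character.
    while start < end and is_continuation(data[start]):
        start += 1
    # Move the end backward until it no longer splits a character.
    while start < end < len(data) and is_continuation(data[end]):
        end -= 1
    return start, end


data = "aé日本語b".encode("utf-8")
start, end = align_to_utf8_boundaries(data, 2, 10)
snippet = data[start:end].decode("utf-8")
```

Each loop runs at most three times for valid UTF-8, so the cost does not depend on the length of the context.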
- a secondary problem is that in multibyte utf-8 text the number of
characters returned in the snippet will be less than 100 - perhaps as
low as 33 or even 25 in some alphabets/writing systems.
I don't really think this is a problem. We could increase the context
size a little bit, e.g. to 70 bytes and we should get enough context for
all languages.
I have solved that problem in one of the later patches, i.e. the context
is counted in utf8 characters rather than bytes. It's not included in
Guy's analysis, but my testing indicated it's not significantly different
from the other new algorithms, i.e. using any one of the three,
ft_snippet is no longer a significant contributor to the page execution
time.
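As a rough illustration of counting context in characters rather than bytes, here is a small Python sketch. It is not the patch itself; the helper names and the width parameter are assumptions for the example. It walks outward from the match by whole characters, so the snippet has the same number of characters of context whatever the writing system.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def back_chars(data: bytes, pos: int, count: int) -> int:
    """Hypothetical helper: step back `count` characters from a boundary."""
    while count > 0 and pos > 0:
        pos -= 1
        while pos > 0 and is_continuation(data[pos]):
            pos -= 1
        count -= 1
    return pos


def forward_chars(data: bytes, pos: int, count: int) -> int:
    """Hypothetical helper: step forward `count` characters from a boundary."""
    size = len(data)
    while count > 0 and pos < size:
        pos += 1
        while pos < size and is_continuation(data[pos]):
            pos += 1
        count -= 1
    return pos


def char_context(data: bytes, match_start: int, match_end: int, width: int) -> str:
    # The context is `width` characters on each side, not `width` bytes.
    start = back_chars(data, match_start, width)
    end = forward_chars(data, match_end, width)
    return data[start:end].decode("utf-8")


data = "日本語のテキストを検索する".encode("utf-8")
needle = "検索".encode("utf-8")
match_start = data.find(needle)
result = char_context(data, match_start, match_start + len(needle), 3)
```

Both offsets always land on character boundaries, so no separate stripping step is needed afterwards.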
Sorry about the large number of patches. I never really intended to
spend much time on this. However, shortly after each time I put it
down, I'd get another idea. :-)
Cheers,
Chris
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist