[dokuwiki] Re: search improvements
- From: Chris Smith <chris@xxxxxxxxxxxxx>
- To: dokuwiki@xxxxxxxxxxxxx
- Date: Sat, 26 Aug 2006 22:48:01 +0100
Andreas Gohr wrote:
I really like your patch. Cool idea, but I'm not sure if it is needed.
If I understand you correctly you're concerned about the OFFSET_CAPTURE
option returning byte indexes only, right? But since the matching is
done using the /u modifier it should already return a byte index pointing
to a UTF-8 character boundary!? So the only problem with messing up
boundaries should happen when selecting the 50 bytes surrounding context
of a found snippet, shouldn't it?
Andi
Maybe ... though I have forgotten my original reasoning - I guess it may
have been flawed, here goes...
The context selection amounts are only 50 bytes if I use substr(). If I
use utf8_substr() they would be 50 UTF-8 characters. But then when I come
to plug the offset back into preg_match, I don't know the byte offset.
That would mean using two utf8_substr() calls, one for the "pre" snippet
and one for the "post" snippet, so that I could then run strlen on the
match + post snippet to work out the new byte offset.
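
To make the bookkeeping above concrete, here is a rough sketch of it. The sample text and variable names are made up for illustration, and a /u regex stands in for utf8_substr() so the snippet runs outside DokuWiki; it is not the code from the patch.

```php
<?php
// Hypothetical sample text; "ü", "ß" and "ö" are two bytes each in UTF-8.
$text = "Grüße aus Köln für alle";

// PREG_OFFSET_CAPTURE reports the match position in bytes, not characters.
preg_match('/Köln/u', $text, $m, PREG_OFFSET_CAPTURE);
list($match, $byteOffset) = $m[0];

// Character-based "post" context: up to 5 characters after the match,
// taken with a /u regex as a stand-in for utf8_substr().
$after = substr($text, $byteOffset + strlen($match));
preg_match('/^.{0,5}/us', $after, $p);
$post = $p[0];

// strlen() counts bytes, so this is the byte offset to hand to the
// next preg_match() call.
$nextOffset = $byteOffset + strlen($match . $post);
```

Here the match sits at character 10 but byte 12, and the five-character post snippet is six bytes long, which is why the extra strlen() step is needed.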
Looking at the utf8_ functions, these use preg_match to do a substring,
and I can feel any efficiency gains slipping away.
Also this morning when I attacked the problem again, the upshot was:
- the main problem was aligning the start and end of the context snippet
with a utf-8 character boundary.
- a secondary problem is that in multibyte UTF-8 text the number of
characters returned in the snippet will be fewer than 100 - perhaps as
low as 33 or even 25 in some alphabets/writing systems.
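
The 33 and 25 figures fall out of the byte width of the characters. A small sketch, assuming a 100 byte snippet cut with substr() from text made up of a single repeated character; the helper name is hypothetical:

```php
<?php
// Count the complete UTF-8 sequences in a byte string. The pattern runs
// without /u on purpose, so a character cut in half at the end of the
// snippet is simply not counted instead of making the match fail.
function count_complete_chars($bytes) {
    $pattern = '/[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]'
             . '|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/';
    return preg_match_all($pattern, $bytes, $unused);
}

// 100 bytes of context cut from text of 1, 2, 3 and 4 byte characters.
$ascii = count_complete_chars(substr(str_repeat("a", 100), 0, 100));
$two   = count_complete_chars(substr(str_repeat("é", 100), 0, 100));
$three = count_complete_chars(substr(str_repeat("あ", 100), 0, 100));
$four  = count_complete_chars(substr(str_repeat("\xF0\x9D\x84\x9E", 100), 0, 100));
```

With three byte characters the last byte of the snippet is also the first byte of a truncated character, which is the boundary problem from the first point.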
The solution to the first problem is utf8_correctIdx (which perhaps should
be called utf8_bytealignCharacter). That is a very quick and simple
solution, especially when compared to using utf8_substr.
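
As an illustration of the byte-alignment idea (a sketch only, not the utf8_correctIdx code from the patch): UTF-8 continuation bytes always look like 10xxxxxx, so an index can be moved off them with a short loop. The function name and the $forward parameter are hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Move a byte index to the nearest UTF-8 character boundary.
// $forward = false steps back to the start of the character the index
// falls in; $forward = true steps ahead to the start of the next one.
function align_to_char_boundary($str, $idx, $forward = false) {
    $len = strlen($str);
    if ($idx <= 0) return 0;
    if ($idx >= $len) return $len;
    $step = $forward ? 1 : -1;
    // Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF).
    while ($idx > 0 && $idx < $len && (ord($str[$idx]) & 0xC0) === 0x80) {
        $idx += $step;
    }
    return $idx;
}

// Usage: cut a context window around byte 30 without splitting a character.
$text  = str_repeat("あ", 20);                           // 60 bytes
$start = align_to_char_boundary($text, 30 - 10);         // 20 -> 18
$end   = align_to_char_boundary($text, 30 + 10, true);   // 40 -> 42
$snippet = substr($text, $start, $end - $start);
```

Each adjustment touches at most three bytes, which is where the speed advantage over a regex-based substring comes from.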
The second problem is more difficult. Perhaps it can be solved with a
config setting to set the context size in bytes - though that seems hackish.
I'll write another version of the algorithm using the utf8_ functions and
do some profiling to see how it compares with the other two.
I am open to any other ideas.
Cheers,
Chris
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist