[dokuwiki] Re: search improvements

Andreas Gohr wrote:


I really like your patch. Cool idea, but I'm not sure it is needed. If I understand you correctly, you're concerned about the OFFSET_CAPTURE option returning byte indexes only, right? But since the matching is done using the /u modifier, it should already return a byte index pointing to a UTF-8 character boundary!? So the only place where boundaries could get messed up should be when selecting the 50 bytes of surrounding context for a found snippet, shouldn't it?

Andi


Maybe... though I have forgotten my original reasoning. I guess it may have been flawed, but here goes...


The context selection amounts are only 50 bytes if I use substr(). If I use utf8_substr(), they would be 50 UTF-8 characters. But then, when I come to plug the offset back into preg_match, I don't know the byte amount. That would mean using two utf8_substr() calls, one for the "pre" snippet and one for the "post" snippet, so that I could then run strlen on the match + post snippet to ascertain the new offset.
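To make the bookkeeping concrete, here is a minimal sketch in Python rather than PHP. It is not the DokuWiki code: the function name snippet_and_next_offset and its parameters are made up for illustration, and it assumes the match offsets already sit on character boundaries. It takes the "pre" and "post" context in characters, then measures match + post in bytes to get the offset for the next search.

```python
def snippet_and_next_offset(data: bytes, start: int, end: int, ctx_chars: int = 50):
    """Return (snippet, next_byte_offset) for a match at data[start:end].

    start and end are byte offsets, assumed to lie on UTF-8 character
    boundaries. Context is measured in characters, not bytes.
    """
    before = data[:start].decode("utf-8")
    # Guard against ctx_chars == 0, since before[-0:] would be the whole string.
    pre = before[-ctx_chars:] if ctx_chars > 0 else ""
    match = data[start:end].decode("utf-8")
    post = data[end:].decode("utf-8")[:ctx_chars]
    # The byte length of match + post gives the offset to resume searching from.
    next_offset = start + len((match + post).encode("utf-8"))
    return pre + match + post, next_offset
```

Note that this sketch decodes everything before and after the match just to pick a few characters, which is the kind of extra work that eats into any efficiency gain.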

Looking at the utf8_ functions, they use preg_match to do a substring, and I can feel any efficiency gains slipping away.

Also, this morning when I attacked the problem again, the upshot was:
- the main problem is aligning the start and end of the context snippet with a UTF-8 character boundary.
- a secondary problem is that in multibyte UTF-8 text the number of characters returned in the snippet will be fewer than 100, perhaps as low as 33 or even 25 in some alphabets/writing systems.
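The 33 and 25 figures follow from characters that take 3 or 4 bytes each in UTF-8. A small Python sketch (the helper name chars_in_byte_window is hypothetical, used only for this illustration) counts how many whole characters fit into a 100-byte window:

```python
def chars_in_byte_window(text: str, byte_budget: int = 100) -> int:
    """Count how many whole characters of text fit into byte_budget UTF-8 bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if used + size > byte_budget:
            break  # the next character would be cut in half
        used += size
        count += 1
    return count


ascii_count = chars_in_byte_window("a" * 200)              # 1 byte per character
cyrillic_count = chars_in_byte_window("я" * 200)           # 2 bytes per character
cjk_count = chars_in_byte_window("語" * 200)                # 3 bytes per character
gothic_count = chars_in_byte_window("\U00010330" * 200)    # 4 bytes per character
```

With a 100-byte budget these come out as 100, 50, 33 and 25 characters respectively.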


The solution to the first problem is utf8_correctIdx (which perhaps should be called utf8_bytealignCharacter). That is a very quick and simple solution, especially when compared to using utf8_substr.
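The idea behind such an index correction can be sketched as follows. This is a Python illustration under my own assumptions, not the actual utf8_correctIdx from the patch: the name align_to_char_boundary and the forward flag are invented here. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a byte index is moved until it no longer points at one.

```python
def align_to_char_boundary(data: bytes, idx: int, forward: bool = False) -> int:
    """Move a byte index so it points at the start of a UTF-8 character.

    Moves backward by default, or forward if forward is True.
    Assumes data is valid UTF-8.
    """
    idx = max(0, min(idx, len(data)))  # clamp into the valid range
    step = 1 if forward else -1
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while 0 < idx < len(data) and (data[idx] & 0xC0) == 0x80:
        idx += step
    return idx
```

For a context snippet, the start index would typically be moved one way and the end index the other, so that neither end cuts a character in half. Each call inspects at most three bytes, since a UTF-8 character has at most three continuation bytes.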

The second problem is more difficult. Perhaps it can be solved with a config setting to set the context size in bytes - though that seems hackish.

I'll write another version of the algorithm using the utf8_ functions and do some profiling to see how it compares with the other two.

I am open to any other ideas.

Cheers,

Chris



--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
