[dokuwiki] Re: search improvements

  • From: Chris Smith <chris@xxxxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Sat, 26 Aug 2006 10:15:05 +0100

Andreas Gohr wrote:

On Sat, 26 Aug 2006 09:21:12 +0100
Chris Smith <chris@xxxxxxxxxxxxx> wrote:

Andreas Gohr wrote:
I noticed the use of some strlen calls there. Are they
used in a UTF-8-safe way, or is it possible that they
split a multibyte char? If that could happen, we should add a check
to strip invalid UTF-8 chars from the beginning and end of the snippet -
this would be a nice addition to the utf-8 lib.
Yes, I think that is the best solution, adjusting the strings to
ensure they always start/end at utf-8 character boundaries. I'll see
what I can come up with.

I just pushed a patch adding a function from Harry's utf8 library to strip bad bytes.

Andi

That may not entirely fix the problem. I'm not certain whether preg_match misbehaves when the offset it is asked to start at doesn't fall on a proper UTF-8 character boundary. I'm working on a fix that adjusts the snippet start and end indexes to the nearest UTF-8 boundary before calling substr and preg_match. That should mean that although I'm still dealing with byte indexes and byte lengths, those numbers will always correspond to UTF-8 character boundaries.
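For anyone curious what "adjust to the nearest UTF-8 boundary" means in practice, here is a rough sketch of the idea (in Python rather than DokuWiki's PHP, and the function name is mine, not from the actual patch). In UTF-8, continuation bytes always match the bit pattern 10xxxxxx, so a byte index can be walked backwards until it no longer points at one:

```python
def snap_to_utf8_boundary(data: bytes, idx: int) -> int:
    """Move a byte index backwards until it lands on a UTF-8
    character boundary, i.e. not on a continuation byte (10xxxxxx)."""
    while 0 < idx < len(data) and (data[idx] & 0xC0) == 0x80:
        idx -= 1
    return idx

# Example: "é" encodes to two bytes (0xC3 0xA9) in UTF-8.
s = "caf\u00e9!".encode("utf-8")     # b'caf\xc3\xa9!'
print(snap_to_utf8_boundary(s, 4))   # 4 points mid-character -> snapped back to 3
print(snap_to_utf8_boundary(s, 5))   # 5 starts '!' -> already a boundary
```

Once both the start and end indexes have been snapped like this, byte-oriented calls such as substr (and a preg_match offset) can never land inside a multibyte character, which is exactly the property the snippet code needs.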

I'll send it through once I have finished checking it.

Cheers,

Chris
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
