Andreas Gohr wrote:
> On Sat, 26 Aug 2006 09:21:12 +0100 Chris Smith <chris@xxxxxxxxxxxxx> wrote:
>> Andreas Gohr wrote:
>>> I noticed the use of some strlen calls there. Are they
>>> used in a UTF-8 safe way there or would it be possible that they
>>> split a multibyte char? If that could happen we should add a check
>>> to strip invalid UTF-8 chars from beginning and end of the snippet -
>>> this would be a nice addition to the utf-8 lib.
>>
>> Yes, I think that is the best solution, adjusting the strings to
>> ensure they always start/end at utf-8 character boundaries. I'll see
>> what I can come up with.
>
> I just pushed a patch adding a function from Harry's utf8 library to strip bad bytes.
>
> Andi

That may not entirely fix the problem. I am not certain if preg_match will break down if not asked to start at a proper utf8 character boundary using offset. I am working on a fix to adjust the snippet start and end indexes to the nearest utf8 boundary before using substr and preg_match. That should mean that although I am dealing with byte indexes and byte lengths, those numbers will always correspond to utf8 character boundaries. I'll send it through once I have finished checking it.
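
To make the boundary adjustment concrete, here is a rough sketch of the idea only, not the actual patch. It is written in Python rather than PHP so the byte handling is easy to follow, and the names align_to_boundary and safe_snippet are made up for illustration. It assumes the input is already valid UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def align_to_boundary(data: bytes, index: int, forward: bool = False) -> int:
    """Move a byte index onto a character boundary (hypothetical helper).

    Steps backward by default, forward if requested. Assumes valid UTF-8.
    """
    # Clamp so that 0 and len(data) are always treated as boundaries.
    index = max(0, min(index, len(data)))
    step = 1 if forward else -1
    while 0 < index < len(data) and is_continuation(data[index]):
        index += step
    return index


def safe_snippet(data: bytes, start: int, length: int) -> bytes:
    """Cut a byte range, shrinking it so no multibyte character is split."""
    begin = align_to_boundary(data, start, forward=True)
    end = align_to_boundary(data, start + length, forward=False)
    return data[begin:max(begin, end)]


if __name__ == "__main__":
    text = "aé€b".encode("utf-8")
    # Byte 2 is in the middle of "é", so the snippet start moves forward.
    print(safe_snippet(text, 2, 4).decode("utf-8"))
```

The start index moves forward and the end index moves backward, so the snippet can only shrink and never picks up a partial character at either end.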
Cheers,
Chris
--
DokuWiki mailing list - more info at http://wiki.splitbrain.org/wiki:mailinglist