[dokuwiki] Re: search improvements

  • From: Andreas Gohr <andi@xxxxxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Sat, 26 Aug 2006 14:19:34 +0200

> > I just pushed a patch adding a function from Harry's utf8 library to
> > strip bad bytes.
> >
> That may not entirely fix the problem. I am not certain if preg_match
> will break down if not asked to start at a proper utf8 character
> boundary using offset. I am working on a fix to adjust the snippet
> start and end indexes to the nearest utf8 boundary before using substr
> and preg_match. That should mean that although I am dealing with
> byte indexes and byte lengths, those numbers will always correspond
> to utf8 character boundaries.

I really like your patch. It's a cool idea, but I'm not sure it is needed.
If I understand you correctly, you're concerned about the OFFSET_CAPTURE
option returning byte indexes only, right? But since the matching is
done using the /u modifier, it should already return a byte index pointing
to a UTF-8 character boundary. So the only place where boundaries could
get messed up should be when selecting the 50 bytes of context
surrounding a found snippet, shouldn't it?
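
To make the boundary question concrete, here is a minimal sketch in Python
(not the DokuWiki code, and not the patch under discussion) of widening a
byte-indexed match by a fixed number of context bytes and then snapping both
ends to UTF-8 character boundaries. The function names and the pad parameter
are made up for illustration, and the input is assumed to be valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def snap_back(data: bytes, idx: int) -> int:
    # Clamp to the buffer, then move left until idx sits on the
    # first byte of a character (or on the start/end of the buffer).
    idx = max(0, min(idx, len(data)))
    while 0 < idx < len(data) and is_continuation(data[idx]):
        idx -= 1
    return idx


def snap_forward(data: bytes, idx: int) -> int:
    # Clamp to the buffer, then move right past any continuation
    # bytes so the slice ends after a complete character.
    idx = max(0, min(idx, len(data)))
    while idx < len(data) and is_continuation(data[idx]):
        idx += 1
    return idx


def context(data: bytes, start: int, end: int, pad: int = 50) -> str:
    # start/end are byte offsets of a match that already lies on
    # character boundaries; only the padded ends need adjusting.
    lo = snap_back(data, start - pad)
    hi = snap_forward(data, end + pad)
    return data[lo:hi].decode("utf-8")
```

For example, with data = "ééé match ééé".encode("utf-8") the word "match"
occupies bytes 7 to 12. The naive slice data[5:14] cuts two characters in
half and fails to decode, while context(data, 7, 12, pad=2) widens the range
to bytes 4 to 15 and returns "é match é". The snippet can therefore be
slightly longer than the requested padding, never shorter.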

Andi

-- 
http://www.splitbrain.org
