> > I just pushed a patch adding a function from Harry's utf8 library
> > to strip bad bytes.
>
> That may not entirely fix the problem. I am not certain if preg_match
> will break down if not asked to start at a proper utf8 character
> boundary using offset. I am working on a fix to adjust the snippet
> start and end indexes to the nearest utf8 boundary before using substr
> and preg_match. That should mean that although I am dealing with
> byte indexes and byte lengths, those numbers will always correspond
> to utf8 character boundaries.

I really like your patch. Cool idea, but I'm not sure it is needed. If
I understand you correctly, you're concerned that the OFFSET_CAPTURE
option returns byte indexes only, right? But since the matching is done
with the /u modifier, it should already return a byte index that points
to a UTF-8 character boundary. So the only place where boundaries can
get messed up is when selecting the 50 bytes of surrounding context
around a found snippet, shouldn't it?

Andi

--
http://www.splitbrain.org

--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
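P.S. For anyone following along: the boundary adjustment described
above comes down to skipping UTF-8 continuation bytes, which always
have the bit pattern 10xxxxxx (i.e. byte & 0xC0 == 0x80). A minimal
sketch, written in Python for readability; in PHP the equivalent test
would be `(ord($s[$i]) & 0xC0) == 0x80`, and the function name here is
just an illustration, not anything from the patch:

```python
def snap_to_utf8_boundary(data: bytes, offset: int) -> int:
    """Move a byte offset left until it lands on the first byte
    of a UTF-8 character (or on the end of the string)."""
    # Continuation bytes match 10xxxxxx, so walking backwards past
    # them reaches the lead byte of the character.
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


# "aéb" is the bytes a, C3, A9, b: offset 2 sits inside "é",
# so it gets snapped back to the lead byte at offset 1.
print(snap_to_utf8_boundary("aéb".encode("utf-8"), 2))  # 1
```

Applying this to both the snippet start and end byte indexes before
calling substr should guarantee the extracted context is valid UTF-8.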