On Sat, 26 Aug 2006 22:48:01 +0100
Chris Smith <chris@xxxxxxxxxxxxx> wrote:

> Also this morning when I attacked the problem again, the upshot was:
> - the main problem was aligning the start and end of the context
> snippet with a utf-8 character boundary.

Okay. I'm not sure I understand why this is a problem that can't simply
be solved by stripping broken multibyte sequences from the start and end
of the context. But your clever reindex function already solves it anyway.

> - a secondary problem is in multibyte utf-8 text the number of
> characters returned in the snippet will be less than 100 - perhaps as
> low as 33 or even 25 in some alphabets/writing systems.

I don't really think this is a problem. We could increase the context
size a little bit, e.g. to 70 bytes, and we should get enough context
for all languages.

Andi
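
For reference, the "strip broken multibyte sequences from both ends" idea could be sketched roughly as below. This is only an illustration in Python, not the code under discussion: the function name trim_to_utf8_boundaries is made up for the example, and it assumes the snippet was cut out of otherwise valid UTF-8 text.

```python
def trim_to_utf8_boundaries(data: bytes) -> bytes:
    """Drop partial UTF-8 sequences from both ends of a byte slice.

    Hypothetical helper; assumes `data` was sliced from valid UTF-8.
    """
    start, end = 0, len(data)

    # Leading continuation bytes (10xxxxxx) belong to a character whose
    # lead byte was cut off before the slice, so skip them.
    while start < end and (data[start] & 0xC0) == 0x80:
        start += 1

    # Walk back from the end to the last lead (or ASCII) byte.
    i = end - 1
    while i >= start and (data[i] & 0xC0) == 0x80:
        i -= 1

    if i >= start:
        lead = data[i]
        # The lead byte tells us how long the sequence should be.
        if lead >= 0xF0:
            need = 4
        elif lead >= 0xE0:
            need = 3
        elif lead >= 0xC0:
            need = 2
        else:
            need = 1
        # If the final sequence is incomplete, drop it entirely.
        if end - i < need:
            end = i

    return data[start:end]


if __name__ == "__main__":
    text = "日本語のテキスト".encode("utf-8")  # 3 bytes per character
    snippet = text[1:-1]                       # cut mid-character on both sides
    print(trim_to_utf8_boundaries(snippet).decode("utf-8"))  # 本語のテキス
```

The same example also shows the second point: the 22-byte cut above yields only 6 characters, so a byte-based context size gives fewer visible characters in multibyte scripts.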