[dokuwiki] UTF Normalization

  • From: "Harry Fuecks" <hfuecks@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Tue, 21 Mar 2006 16:01:31 +0100

Wondering if Andi / anyone has an opinion on this yet.

Some relevant links:

http://annevankesteren.nl/2005/05/unicode
http://diveintomark.org/archives/2004/07/06/nfc
http://cvs.sourceforge.net/viewcvs.py/wikipedia/phase3/includes/normal/
<< MediaWiki's implementation, which they run on all input

Now I haven't fully grasped the issue, but suffice it to say that
ASCII characters can be represented in UTF-8 in multiple ways
(so-called overlong encodings), and that UTF-8 can also be used to
represent non-Unicode characters.

For example, a newline character could be represented with:

0x0A   << the normal way - same as ASCII
0xc0 0x8A
0xe0 0x80 0x8A
0xf0 0x80 0x80 0x8A
0xf8 0x80 0x80 0x80 0x8A
0xfc 0x80 0x80 0x80 0x80 0x8A
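
All of those multi-byte forms decode back to U+000A if a lenient
decoder accepts them, so one option would be to reject
non-shortest-form sequences outright. Here's a rough, untested sketch
of such a check in PHP (not meant as a patch - it only looks at lead
bytes and assumes the string is otherwise structurally valid UTF-8):

<?php
// Sketch: scan a string for overlong UTF-8 lead bytes.
// A real validator would also check continuation bytes, surrogates,
// out-of-range code points, etc.
function utf8_has_overlong($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($str[$i]);
        if ($b < 0x80) continue;                   // plain ASCII
        if ($b == 0xC0 || $b == 0xC1) return true; // 2-byte overlong
        if ($b == 0xE0 && $i + 1 < $len
            && ord($str[$i + 1]) < 0xA0) return true; // 3-byte overlong
        if ($b == 0xF0 && $i + 1 < $len
            && ord($str[$i + 1]) < 0x90) return true; // 4-byte overlong
        if ($b >= 0xF8) return true;               // 5/6-byte forms
    }
    return false;
}

var_dump(utf8_has_overlong("\x0a"));         // bool(false) - plain newline
var_dump(utf8_has_overlong("\xc0\x8a"));     // bool(true)  - 2-byte form
var_dump(utf8_has_overlong("\xe0\x80\x8a")); // bool(true)  - 3-byte form

MediaWiki's normal/ code linked above does something much more
thorough, of course.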

What I haven't figured out is how much of an issue this is for an
application like DokuWiki.

Normalizing such characters to a single representation could help
searching, for example.
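
The NFC article above is really about a second flavour of the same
problem: combining characters. Two strings can look identical on
screen but differ byte-for-byte, so a naive substring search misses
matches. Quick illustration (made-up strings, just to show the
effect):

<?php
// Two ways of writing "café":
// precomposed U+00E9 vs "e" + combining acute (U+0301)
$page  = "caf\xc3\xa9";   // NFC form
$query = "cafe\xcc\x81";  // decomposed (NFD) form

var_dump($page === $query);      // bool(false) - the bytes differ
var_dump(strpos($page, $query)); // bool(false) - byte search misses it

Normalizing both the stored text and the query to one form (NFC, say,
which is what MediaWiki does on input) would make that match.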

There may also be security issues here that aren't already handled by
DokuWiki's utf8_strip (i.e. in places where utf8_strip isn't being
used). Input validation, for example?
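
The textbook example would be an overlong encoding of "<" sneaking
past a filter that only looks for the literal byte, and then getting
expanded by some lenient decoder further down the line. Again just a
sketch, and whether the built-in checks catch it presumably depends
on the PCRE version in use:

<?php
$payload = "\xc0\xbcscript\xc0\xbe"; // overlong "<" ... overlong ">"

var_dump(strpos($payload, '<'));     // bool(false) - a byte-level
                                     // filter sees nothing wrong

// Cheap validation step: the /u modifier makes preg_match fail on
// malformed UTF-8 (assuming the PCRE build rejects overlong forms).
var_dump(preg_match('//u', $payload)); // false for malformed input,
                                       // 1 otherwise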

Another question is where non-normal-form characters would come from
in the first place. Presumably browsers stick to the normal form of
something like a newline, but perhaps that's not always the case (or
it could happen when you copy and paste from another application into
your browser).

Anyway - just pondering out loud right now. Any thoughts appreciated.
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
