[dokuwiki] Re: syntax plugin bug? (unicode range regexp)

  • From: Christopher Smith <chris@xxxxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Wed, 21 Oct 2009 15:59:35 +0100


On 21 Oct 2009, at 13:02, Michiel Kamermans wrote:

Hi

I'm trying to write a syntax plugin to replace ([\x{4E00}-\x{9FFFF} \x{3005}\x{30F6}]+)\(([\x{3040}-\x{30FF}]+)\) with something sensible, but it would appear that this regexp makes the syntax plugin system screw up big time. Even something as simple as:

function connectTo($mode) {
    $this->Lexer->addSpecialPattern("[\x{4E00}-\x{9FFFF}]+", $mode, 'plugin_myplugin');
}

seems to completely kill off any and all syntax parsing, instead showing me the plain unparsed document text. Is it possible that the lexer doesn't take unicodeness into account? Could it be that the pattern matching is missing the 'u' pattern modifier? (That modifier, PCRE_UTF8, makes PCRE treat both the pattern and the subject as UTF-8 strings rather than as a series of bytes.)


Yes and Yes.

It's been a long time since I looked at this. But once, a while back, I did try to add the u flag to the lexer. As I recall, it broke. I guess things could have changed.

You can experiment with it yourself - inc/parser/lexer.php, line #213. Add a 'u' into both strings.

-        return ($this->_case ? "msS" : "msSi");
+        return ($this->_case ? "msSu" : "msSiu");
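
The effect of the 'u' modifier can be checked outside DokuWiki with plain preg_match(). The following is only a minimal sketch, not the lexer's actual code path: the sample text and variable names are made up for illustration, the range is assumed to end at \x{9FFF} (the end of the CJK Unified Ideographs block), and the space inside the original character class is left out. Without 'u', PCRE refuses to compile an \x{...} value above 0xFF, which would be consistent with the whole lexer pattern failing and the page coming out unparsed.

```php
<?php
// Illustrative sample text (an assumption, not from the original post).
// This file must be saved as UTF-8.
$text = "漢字(かんじ)";

// Without 'u': PCRE works byte-wise, so \x{4E00} (> 0xFF) cannot be compiled.
// preg_match() raises a warning and returns false; @ hides the warning here.
$withoutU = @preg_match('/[\x{4E00}-\x{9FFF}]+/', $text, $m1);

// With 'u': pattern and subject are both treated as UTF-8.
$withU = preg_match('/[\x{4E00}-\x{9FFF}]+/u', $text, $m2);

// The full pattern from the question (range end and missing space assumed),
// again with 'u': group 1 is the kanji run, group 2 the kana in parentheses.
$full = '/([\x{4E00}-\x{9FFF}\x{3005}\x{30F6}]+)\(([\x{3040}-\x{30FF}]+)\)/u';
$fullResult = preg_match($full, $text, $m3);

var_dump($withoutU);   // bool(false): the pattern failed to compile
var_dump($withU);      // int(1)
var_dump($fullResult); // int(1)
```

If the first call returns false while the other two return 1, the pattern itself is fine and only the missing modifier is the problem, which is what the one-line lexer change above tries to address.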

- Chris
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
