[dokuwiki] Re: utf-8 flag for the lexer (formerly acronym bug)

From: Harry Fuecks <hfuecks@xxxxxxxxx>
To: dokuwiki@xxxxxxxxxxxxx
Date: Wed, 21 Sep 2005 22:09:28 +0200

> The lexer is set to handle other inline options flags, just not (?u).  It
> shouldn't be difficult or problematic to extend it.

I guess I need to actually look at the code again ;-)

> The lexer needs to take look aheads and look behinds out of the pattern to
> escape special characters in the pattern itself.  The placeholder it uses
> consists of "<<<<" and ">>>>" meaning if the last character in the look
> ahead is ">" the replacement will be incorrect.  The problem should only
> affect ">" and only if it is the last character of the look ahead/behind
> pattern.

> This can be overcome by using a rarely used non-printable character from
> \x1 - \x19. I tried \x1 and it seems to work ok (\x0 didn't).  Can anyone
> think of any potential problems in using one of these?

Think that would be a good solution.
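
A rough sketch of the placeholder collision described above, written in
Python rather than the lexer's PHP. The helper names (protect, restore,
round_trip) and the regular expressions are hypothetical and only
approximate the substitution; they are not the actual Doku_Lexer code,
and the escaping step that happens between the two calls is left out.

```python
import re

# Hypothetical matcher for simple (non-nested) look aheads / look behinds.
LOOKAROUND = re.compile(r"\(\?(<?[=!])(.*?)\)")


def protect(pattern, open_mark, close_mark):
    """Swap each look ahead/behind for a placeholder-wrapped copy."""
    return LOOKAROUND.sub(
        lambda m: f"{open_mark}{m.group(1)}{m.group(2)}{close_mark}", pattern
    )


def restore(pattern, open_mark, close_mark):
    """Turn the placeholders back into look ahead/behind groups."""
    marker = re.compile(
        re.escape(open_mark) + r"(.*?)" + re.escape(close_mark), re.S
    )
    return marker.sub(lambda m: f"(?{m.group(1)})", pattern)


def round_trip(pattern, open_mark, close_mark):
    """Protect then restore; a correct scheme returns the input unchanged."""
    return restore(protect(pattern, open_mark, close_mark), open_mark, close_mark)


# With "<<<<" / ">>>>" a trailing ">" in the look ahead is pushed outside it.
broken = round_trip("foo(?=bar>)", "<<<<", ">>>>")   # "foo(?=bar)>"

# With a non-printable marker (assumed here: \x01 on both sides) it survives.
fixed = round_trip("foo(?=bar>)", "\x01", "\x01")    # "foo(?=bar>)"
```

The second call only works as long as \x01 never appears in a pattern,
which is the open question raised in the quoted text.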

> > Otherwise, some general ideas for things I think should be improved in
> > the parser at some point:
> >
> > - Modes should be better organised - perhaps a class hierarchy that
> > distinguishes block from inline level formatting. Made some comments
> > related to that at the bottom of this page -
> > http://wiki.ioslo.net/dokuwiki/csv - which illustrates the problem.
> > Better organisation would probably help plugin development. That said,
> > is this now solved by getType() ? Not up to date there
>
> the whole PType thing needs more work.  I had already commented on that page
> ;)

You're way ahead ;-)

>
> > - The parser needs to report states it failed to exit, e.g. an opening
> > <del> tag with no closing tag.
> >
>
> I agree with this.  In particular it should exit all states in order at the
> end of the document.

That might not be too hard to add - it would mean calling
Doku_LexerStateStack::leave() from somewhere inside Doku_Lexer, once
the end of the raw text is reached, and repeating until leave()
returns false. The code there gets a little tricky though.
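
A minimal sketch of that unwinding loop, in Python rather than PHP. The
StateStack class below is a stand-in that assumes leave() pops one state
and returns false once only the starting state is left; the
close_open_states() helper and the mode names in the usage lines are
made up for illustration and are not part of Doku_Lexer.

```python
class StateStack:
    """Stand-in for a lexer state stack (assumed behaviour, not the real class)."""

    def __init__(self, start):
        self._stack = [start]

    def get_current(self):
        return self._stack[-1]

    def enter(self, state):
        self._stack.append(state)

    def leave(self):
        # Refuse to pop the starting state; signal that with False.
        if len(self._stack) == 1:
            return False
        self._stack.pop()
        return True


def close_open_states(stack):
    """Call leave() until it returns False; return the states that were still open."""
    unclosed = []
    while True:
        current = stack.get_current()
        if not stack.leave():
            break
        unclosed.append(current)  # innermost state first
    return unclosed


# Usage: the raw text ended inside a <del> that sits inside a list.
stack = StateStack("base")
stack.enter("listblock")
stack.enter("deleted")
still_open = close_open_states(stack)  # ["deleted", "listblock"]
```

The returned list could feed either a report of unclosed syntax or a
silent clean-up, depending on which policy is preferred.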

It may also have a surprising effect on some documents or lead to
XHTML issues - it should be the case, though I could be wrong, that
nobody can produce bad XHTML by writing bad wiki syntax. At the same
time, I think it will frustrate users if they suddenly get a ton of
error messages about badly formed wiki syntax - a policy of "write
garbage, get garbage" is friendlier than "write garbage, get nothing
but an error message".

One other point while I think of it - the parser probably needs
reviewing against PHP 4.4.0 and PHP 5.0.5, both of which change the
way notices are handled (more E_NOTICEs in 4.4.0 and fatal errors in
5.0.5). That is probably easiest to check with the test suite (which
will need the latest version of SimpleTest though). It may also be
worth comparing again with the original code the lexer came from
(also part of SimpleTest):
http://cvs.sourceforge.net/viewcvs.py/simpletest/simpletest/parser.php?rev=1.70&view=log
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
