Efficient string parsing

  • From: Wolfgang Pupp <wolfgang.pupp@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Wed, 24 Oct 2012 00:15:52 +0200

Hello list,

I'm about to implement a simple parser in LuaJIT, and now considering
on the best approach.
The target language is somewhat similar to Lua, but more of a
data-description-thingy (no loops, no conditionals).
The input chunks will probably always remain <10k lines (much, much
lower most of the time)- but there's gonna be a lot of them.

Design criteria are:
 - No dependencies on dynamically linked libraries/C- code (ffi-usage
should be optional, at most)
 - Not too complicated
 - As fast as possible

The main criteria is troublesome (I'd immediately go for LPeg if I could :/ ).
I've recently tested a markdown-parser implementation in LuaJIT, and
that made me worried about performance (took about 900ms compared to
<10ms of the C-binding). That code made extensive use of string.*
functions and string concatenation, which I'm now trying to avoid like
the plague.

My current plan is to traverse the input byte-wise, keep some
context-information (surrounding expression, its start-index, etc.)
and to detect keywords/operators/... via string.byte- comparison
instead of string.* functions.
Am I on the right track? Might I as well take substrings and compare
those directly (without fear of triggering allocations all the time)?

I'd be very thankful for any input- I am not too familiar with parser
design in general, and all the information I found only suggested to
not use string functions/concatenation since it prevents JIT
compilation, but no alternatives...

--Wolfgang

PS: pointers to relevant/helpful literature on the topic are always appreciated!

Other related posts: