Re: LuaJIT in realtime applications

Luke Gorrie wrote:
> I'm interested in using LuaJIT in a realtime application. In particular I'd
> like to use LuaJIT code instead of iptables-style patterns in a network
> forwarding engine, and I'd like to be able to establish some upper-bounds
> on processing time for my own peace of mind. For example, to be able to be
> confident that rules of a certain basic complexity level would never take
> more than (say) 50us to execute.
> 
> Is this a realistic notion?

With soft-realtime you need to specify more parameters: the worst
case latency under 'usual' operating conditions, the tolerable
worst case latency under adverse conditions and the acceptable
probability of that happening. And of course the often forgotten
maximum acceptable bandwidth consumed by the memory allocator and
the garbage collector.
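
To make those parameters concrete, here is a minimal sketch of a
harness that estimates the worst observed latency and the fraction
of calls over a budget. The rule function, the 50us budget and the
iteration count are placeholders, and os.clock() is only a coarse
CPU-time clock, so treat the numbers as rough.

```lua
-- Hypothetical rule standing in for a real forwarding rule.
local function rule(pkt_len, port)
  return pkt_len <= 1500 and port ~= 23
end

-- Call fn n times; return the worst observed latency (seconds) and the
-- fraction of calls that exceeded the budget (seconds).
-- Timing every call individually perturbs the result and keeps the loop
-- from being compiled, so this only gives a rough upper estimate.
local function measure(fn, n, budget)
  local worst, over = 0, 0
  for i = 1, n do
    local t0 = os.clock()
    fn(64 + i % 1400, i % 65536)
    local dt = os.clock() - t0
    if dt > worst then worst = dt end
    if dt > budget then over = over + 1 end
  end
  return worst, over / n
end

local worst, p_over = measure(rule, 100000, 50e-6)
print(string.format("worst %.1f us, fraction over budget %.6f",
                    worst * 1e6, p_over))
```

Run it once under 'usual' load and once under deliberately adverse
load (memory pressure, many live objects) to get both figures.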

But the real question is: do you want to find these parameters for
a specific implementation (with LuaJIT) or do you have strict
bounds for these numbers and want to shape the implementation to
match them?

> I'm guessing that GC is the main issue to be concerned about.

As Thomas already said: avoiding allocations is the simplest
recipe.

But the incremental GC in LuaJIT 2.0 (same as Lua 5.1) is not that
bad. It does have some atomic pauses that may be of concern:

- Stacks are traversed atomically -- don't create huge stacks
  (deep recursion).

- Each table is traversed atomically -- don't create huge tables
  (millions of elements). Or consider using FFI structures.

- Tables that hit a write barrier will be re-marked atomically --
  this is usually not an issue, unless they are huge (see above).

- The list of userdata objects is traversed atomically -- don't
  create too many of them. Or consider using FFI cdata.

- Userdata and FFI cdata finalizers may be invoked at any GC
  checkpoint -- don't write long-running finalizer functions.

IMHO it's pretty easy to avoid these issues in your code. [The
planned new GC for LuaJIT 2.1 will eliminate most of these pauses
or try to reduce their impact.]
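
As an illustration of the "use FFI structures" advice, here is a
sketch that keeps a large rule set in one preallocated cdata array
instead of a Lua table. The struct layout and field names are made
up for the example. A cdata array holds no GC references, so the
collector has nothing to traverse inside it.

```lua
local ffi = require("ffi")

-- Hypothetical rule record; the fields are illustrative only.
ffi.cdef[[
typedef struct { uint32_t src_ip; uint16_t port; uint8_t action; } rule_t;
]]

local NRULES = 1000000

-- One cdata object for all rules, allocated once up front.
local rules = ffi.new("rule_t[?]", NRULES)

for i = 0, NRULES - 1 do
  rules[i].src_ip = i
  rules[i].port = i % 65536
  rules[i].action = i % 2
end

-- The lookup only reads fields of the preallocated array, so it does
-- not allocate on the forwarding path.
local function action_for(idx)
  return rules[idx].action
end

print(action_for(3))
```

The same idea applies to per-packet scratch state: allocate it once
and reuse it rather than creating a fresh table per packet.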

You can reduce the length of each incremental GC step with the
"setstepmul" parameter. But note that your throughput will suffer
if the value is too low. You really need to measure the GC step
duration within your application, since it depends a lot on the
mix of objects, cache behavior etc.
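
A rough way to do that measurement, assuming os.clock() is precise
enough on your platform: set the step multiplier, create a
representative heap, then time individual explicit steps. The value
100 and the junk tables below are placeholders for your real
settings and object mix.

```lua
-- Lower the step multiplier; the call returns the previous value
-- (200 by default in Lua 5.1). 100 is an arbitrary example value.
local old_stepmul = collectgarbage("setstepmul", 100)

-- Create some garbage so the collector has work to do.
local junk = {}
for i = 1, 100000 do junk[i] = { i } end
junk = nil

-- Drive the collector one explicit step at a time and record the
-- longest step. collectgarbage("step", 0) returns true once the
-- current collection cycle has finished.
local worst, steps = 0, 0
local done
repeat
  local t0 = os.clock()
  done = collectgarbage("step", 0)
  local dt = os.clock() - t0
  if dt > worst then worst = dt end
  steps = steps + 1
until done

print(string.format("%d steps, longest %.1f us", steps, worst * 1e6))
```

Repeat with different "setstepmul" values and compare both the
longest step and the total time to see the throughput trade-off.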

The builtin allocator is a variant of dlmalloc. I'm sure someone
else has already figured out the worst case pauses this might
incur.

The JIT compiler is mostly incremental, too. The recording phase
(which invokes most optimizations on-the-fly) is fully incremental.
There are some non-incremental phases, like the LOOP, SPLIT and
SINK optimization passes (these are pretty fast) and the backend
assembler. They are linearly bounded by the maximum trace size
(-Omaxrecord=x). We're talking about a couple of microseconds, so
this shouldn't be too much of a concern.
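
If you want a tighter bound on those non-incremental phases, the
trace size limit can also be set from Lua. The sketch below uses
jit.opt.start() with the maxrecord parameter, which caps the number
of IR instructions recorded per trace (documented default: 4000).
The value 2000 is arbitrary; pick one by measuring.

```lua
local jit = require("jit")

-- Same effect as passing -Omaxrecord=2000 on the command line.
-- 2000 is an illustrative value, not a recommendation.
jit.opt.start("maxrecord=2000")

-- A small hot loop so that a trace actually gets compiled.
local function sum(n)
  local s = 0
  for i = 1, n do s = s + i end
  return s
end

print(sum(1000))
```

Smaller traces shorten each compile but may cost some throughput,
since long code paths get split across more traces.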

I'm leaving out the discussion of OS/cache latencies here, since
you need to take care of these, anyway.

--Mike
