Tuning numerical computations for LuaJIT (was Re: [ANN] Sci-1.0-beta1)

  • From: Mike Pall <mike-1209@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Thu, 20 Sep 2012 14:28:42 +0200

Markus Walther wrote:
> On Thu, Sep 20, 2012 at 1:35 AM, Geoff Leyland
> <geoff_leyland@xxxxxxxxxxx> wrote:
> > https://github.com/vection/LuaFFT would appear to be just that.
> 
> That port is pure Lua... It's likely that FFI would help quite a bit
> to make it faster.

Probably. But the code is inefficient in many ways.

> Still, according to FFTW benchmarks, Ooura's FFT is quite a bit faster
> than kissfft, so might be a better base for a LuaJIT port:

Maybe or maybe not. It looks rather branchy.

The following things are needed to get the best speed out for
numerical computations with LuaJIT (in order of importance):

* Reduce number of unbiased/unpredictable branches.
  - Heavily biased branches (>95% in one direction) are fine.
  - Prefer branch-free algorithms.
    - Use math.min() and math.max().
    - Use bit.*, e.g. for conditional index computations.

* Use FFI data structures.
  - Use int32_t, avoid uint32_t data types.
  - Use double, avoid float data types.
  - Metamethods are fine, but don't overdose them.

* Call C functions only via the FFI.
  - Avoid calling trivial functions, better rewrite them in Lua.
  - Avoid callbacks -- use pull-style APIs (read/write) and
    iterators instead.

* Use plain 'for i=start,stop,step do ... end' loops.
  - Prefer plain array indexing, e.g. 'a[i+2]'.
  - Avoid pointer arithmetic.

* Find the right balance for unrolling.
  - Avoid inner loops with low iteration count (< 10).
  - Only unroll loops if the loop body has not too many instructions.
  - Consider using templates instead of hand-unrolling (see GSL Shell).
  - You may have to experiment a bit.

* Define and call only 'local' (!) functions within a module.

* Cache often-used functions from other modules in upvalues.
  - E.g. local sin = math.sin ... local function foo() return 2*sin(x) end
  - Don't do this for FFI C functions, cache the namespace
    instead, e.g. local lib = ffi.load("lib").

* Avoid inventing your own dispatch mechanisms.
  - Prefer to use built-in mechanisms, e.g. metamethods.

* Do not try to second-guess the JIT compiler.
  - It's perfectly ok to write 'z = x[a+b] + y[a+b]'.
    - Do not try CSE by hand, e.g. 'local c = a+b'.
  - It's perfectly ok to write 'a[i][j] = a[i][j] * a[i][j+1]'.
    - Do not try to cache partial FFI struct/array references
      (e.g. a[i]) unless they are long-lived (e.g. in a big loop).

* Be careful with aliasing, esp. when using multiple arrays.
  - LuaJIT uses strict type-based disambiguation, but there are
    limits to this due to C99 conformance.
  - E.g. in 'x[i] = a[i] + c[i]; y[i] = a[i] + d[i]' the load of
    a[i] needs to be done twice, because x could alias a.
    It *does* make sense to use 'do local t = a[i] ... end' here.

* Reduce the number of live temporary variables.
  - Best to initialize on definition, e.g. 'local y = f(x)'
    - Yes, this means you should interleave this with other code.
    - Do not hoist variable definitions to the start of a
      function -- Lua is not JavaScript nor K&R C.
  - Use 'do local y = f(x) ... end' to bound variable lifetimes.

* Do not intersperse expensive or uncompiled operations.
  - print() is not compiled, use io.write().
  - E.g. avoid assert(type(x) == "number", "x is a "..mytype(x)")
    - The problem is not the assert() or the condition (basically
      free). The problem is the string concatenation, which has to
      be executed every time, even if the assertion never fails!
  - Watch the output of -jv and -jdump.

You need to take all of these factors into account before deciding
on a certain algorithm. An advanced algorithm, that's fast in
theory, may be slower than a simpler algorithm, if the simpler
algorithm has much fewer unbiased branches.

--Mike

Other related posts: