Re: Tuning numerical computations for LuaJIT (was Re: [ANN] Sci-1.0-beta1)

  • From: Evan Wies <evan@xxxxxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Thu, 20 Sep 2012 10:21:37 -0400

Yes, good stuff.  I just put it on the wiki:
http://wiki.luajit.org/Numerical-Computing-Performance-Guide

-Evan

On 09/20/2012 10:04 AM, Coda Highland wrote:
On Thu, Sep 20, 2012 at 5:28 AM, Mike Pall <mike-1209@xxxxxxxxxx> wrote:
Markus Walther wrote:
On Thu, Sep 20, 2012 at 1:35 AM, Geoff Leyland
<geoff_leyland@xxxxxxxxxxx> wrote:
https://github.com/vection/LuaFFT would appear to be just that.
That port is pure Lua... It's likely that FFI would help quite a bit
to make it faster.
Probably. But the code is inefficient in many ways.

Still, according to FFTW benchmarks, Ooura's FFT is quite a bit faster
than kissfft, so might be a better base for a LuaJIT port:
Maybe or maybe not. It looks rather branchy.

The following things are needed to get the best speed out for
numerical computations with LuaJIT (in order of importance):

* Reduce number of unbiased/unpredictable branches.
   - Heavily biased branches (>95% in one direction) are fine.
   - Prefer branch-free algorithms.
     - Use math.min() and math.max().
     - Use bit.*, e.g. for conditional index computations.

* Use FFI data structures.
   - Use int32_t, avoid uint32_t data types.
   - Use double, avoid float data types.
   - Metamethods are fine, but don't overdose them.

* Call C functions only via the FFI.
   - Avoid calling trivial functions, better rewrite them in Lua.
   - Avoid callbacks -- use pull-style APIs (read/write) and
     iterators instead.

* Use plain 'for i=start,stop,step do ... end' loops.
   - Prefer plain array indexing, e.g. 'a[i+2]'.
   - Avoid pointer arithmetic.

* Find the right balance for unrolling.
   - Avoid inner loops with low iteration count (< 10).
   - Only unroll loops if the loop body has not too many instructions.
   - Consider using templates instead of hand-unrolling (see GSL Shell).
   - You may have to experiment a bit.

* Define and call only 'local' (!) functions within a module.

* Cache often-used functions from other modules in upvalues.
   - E.g. local sin = math.sin ... local function foo() return 2*sin(x) end
   - Don't do this for FFI C functions, cache the namespace
     instead, e.g. local lib = ffi.load("lib").

* Avoid inventing your own dispatch mechanisms.
   - Prefer to use built-in mechanisms, e.g. metamethods.

* Do not try to second-guess the JIT compiler.
   - It's perfectly ok to write 'z = x[a+b] + y[a+b]'.
     - Do not try CSE by hand, e.g. 'local c = a+b'.
   - It's perfectly ok to write 'a[i][j] = a[i][j] * a[i][j+1]'.
     - Do not try to cache partial FFI struct/array references
       (e.g. a[i]) unless they are long-lived (e.g. in a big loop).

* Be careful with aliasing, esp. when using multiple arrays.
   - LuaJIT uses strict type-based disambiguation, but there are
     limits to this due to C99 conformance.
   - E.g. in 'x[i] = a[i] + c[i]; y[i] = a[i] + d[i]' the load of
     a[i] needs to be done twice, because x could alias a.
     It *does* make sense to use 'do local t = a[i] ... end' here.

* Reduce the number of live temporary variables.
   - Best to initialize on definition, e.g. 'local y = f(x)'
     - Yes, this means you should interleave this with other code.
     - Do not hoist variable definitions to the start of a
       function -- Lua is not JavaScript nor K&R C.
   - Use 'do local y = f(x) ... end' to bound variable lifetimes.

* Do not intersperse expensive or uncompiled operations.
   - print() is not compiled, use io.write().
   - E.g. avoid assert(type(x) == "number", "x is a "..mytype(x)")
     - The problem is not the assert() or the condition (basically
       free). The problem is the string concatenation, which has to
       be executed every time, even if the assertion never fails!
   - Watch the output of -jv and -jdump.

You need to take all of these factors into account before deciding
on a certain algorithm. An advanced algorithm, that's fast in
theory, may be slower than a simpler algorithm, if the simpler
algorithm has much fewer unbiased branches.

--Mike

This is amazingly useful and I am saving this for later.

Is this information posted on a website somewhere in a format that's
as easy to read and understand as this?

/s/ Adam


Other related posts: