Re: Array performance with 2.0.0-beta10 versus git HEAD

  • From: Peter Colberg <peter@xxxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Tue, 28 Aug 2012 14:24:20 -0400

On Tue, Aug 28, 2012 at 07:27:36PM +0200, Mike Pall wrote:
> Peter Colberg wrote:
> > Why is there such a significant difference with 2.0.0-beta10, and not
> > with git HEAD? Could this actually be caused by the absolute versus
> > relative script path?
> 
> Trace selection is probabilistic and memory layout influences the
> algorithm. The key problem is the innermost loop with a low, but
> non-constant iteration count (from the compiler's perspective).
> Sometimes it gets unrolled, sometimes not, depending on what's
> compiled first.

I tried three cases for the innermost loop: manual unrolling, a loop
with a constant upper bound, and the naïve version with the # operator.
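For reference, the three variants could look roughly like this. It is a minimal sketch, not the original benchmark. The data layout (1000 particles, each a three-element table) and the function names are hypothetical.

```lua
-- Hypothetical test data: 1000 particles, each a table of 3 coordinates.
local a = {}
for i = 1, 1000 do a[i] = {1, 2, 3} end

-- 1. Manual unrolling: fixed to three dimensions.
local function scale_unrolled(a)
  for i = 1, #a do
    local v = a[i]
    v[1] = 2 * v[1]
    v[2] = 2 * v[2]
    v[3] = 2 * v[3]
  end
end

-- 2. Innermost loop with a constant upper bound.
local function scale_const(a)
  for i = 1, #a do
    for j = 1, 3 do
      a[i][j] = 2 * a[i][j]
    end
  end
end

-- 3. Naive version: innermost bound taken from the # operator.
local function scale_len(a)
  for i = 1, #a do
    for j = 1, #a[i] do
      a[i][j] = 2 * a[i][j]
    end
  end
end

scale_unrolled(a)
scale_const(a)
scale_len(a)
```

Each call doubles every coordinate, so after the three calls every particle holds {8, 16, 24}.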

So manual unrolling or a constant iteration count would avoid trace
selection, and thus result in predictable run-time independent of
the memory layout, correct?

The reason for not unrolling manually is to have dimension-independent
code, suitable for both two- and three-dimensional systems of particles.

Is the iteration count constant from the compiler's perspective if I
rewrite the loop as follows?

local dim = #a[1]

for r = 1, 500 do
    for i = 1, #a do
        for j = 1, dim do
            a[i][j] = 2 * a[i][j]
        end
    end
end

> Anyway, I'm sure there are plenty of examples for FFI vector and
> array classes out there, that don't suffer from this problem. E.g.
> properly unrolled arithmetic metamethods, no 1-based indexing,
> structs instead of tables as wrappers, bounds checks dynamically
> compiled only for debugging, etc.

Thanks, I will search again in depth for such array classes.
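Along the lines suggested above, a struct-based vector type might look like the following. This is a minimal sketch that needs LuaJIT's ffi module; the type name vec3, the dot method and the zero-based array of 1000 positions are assumptions for illustration, not an existing library.

```lua
local ffi = require("ffi")

-- Hypothetical 3-component vector stored as a C struct, not a Lua table.
ffi.cdef[[
typedef struct { double x, y, z; } vec3;
]]

local vec3
local mt = {
  -- Arithmetic metamethods written out per component (unrolled, no loop).
  __add = function(a, b) return vec3(a.x + b.x, a.y + b.y, a.z + b.z) end,
  -- Scalar on the left: called as __mul(2, v) for "2 * v".
  __mul = function(s, v) return vec3(s * v.x, s * v.y, s * v.z) end,
  __index = {
    dot = function(a, b) return a.x * b.x + a.y * b.y + a.z * b.z end,
  },
}
vec3 = ffi.metatype("vec3", mt)

-- Zero-based C array of structs instead of a table of tables.
local n = 1000
local pos = ffi.new("vec3[?]", n)
for i = 0, n - 1 do
  pos[i] = vec3(1, 2, 3)
end

-- Scale all positions; each element is copied back into the array.
for i = 0, n - 1 do
  pos[i] = 2 * pos[i]
end
```

Afterwards every element of pos holds (2, 4, 6). The component count is fixed by the struct, so no inner loop is left for the trace compiler to unroll or not.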

I was actually surprised to see that the bounds checks have no
influence on the performance (maybe the above “benchmark” is too
trivial and flawed…). Is ABCelim in lj_opt_fold.c responsible for
eliminating such bounds checks?

> > I built LuaJIT on a current Debian wheezy (x86_64) using
> >
> > make amalg "CFLAGS=-fPIC -DLUAJIT_ENABLE_LUA52COMPAT -DLUAJIT_CPU_SSE2"
> 
> On x64, -DLUAJIT_CPU_SSE2 is pointless. And -fPIC is not useful.
> The shared library is compiled with -fPIC, anyway.

Thanks, I will omit -fPIC. LUAJIT_CPU_SSE2 was a leftover from
compiling for x86, to avoid the different rounding behaviour of x87.

Peter
