Re: Allocation sinking in git HEAD

  • From: Mike Pall <mike-1207@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Wed, 4 Jul 2012 22:10:05 +0200

Adam Strzelecki wrote:
> Unfortunately it seems to bring absolutely no boost for
> multiplication of 4x4 matrices (struct of 16 elements), are we
> running short on registers here?

It was running short of PHIs -- I've bumped the limit now (was
probably too low, anyway).

> However, newer compilers like GCC 4.7 or Intel C++ 12.x have
> really decent auto parallelisation that can make your app run
> 3-5x faster than scalar code, depending on the emitted
> instruction set (SSE vs AVX).

I guess you mean auto-vectorization: automatically turning scalar
code into SIMD code, i.e. without manual insertion of builtins.
Parallelization is something very different: running code in
parallel on different CPU cores.
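
To make the distinction concrete, here is a minimal C sketch of the
kind of scalar loop an auto-vectorizer targets. The function name
mul_scalar is made up for illustration. Nothing in the source mentions
SIMD; whether the compiler actually emits SSE or AVX instructions for
it depends on the compiler version and optimization flags.

```c
#include <stddef.h>

/* Plain scalar element-wise multiply (illustrative name, not from the post).
   The loop body has no SIMD builtins; an auto-vectorizing compiler may or
   may not turn it into SIMD instructions. The restrict qualifiers promise
   that the arrays do not overlap, which makes vectorization easier. */
void mul_scalar(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```

Either way, this runs on a single core. Parallelization would instead
split the index range across several threads.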

However, auto-vectorization is a very, very hard problem. A couple
of years ago, even the best compilers could only auto-vectorize
simple textbook examples, but failed on anything remotely
interesting. This has changed a bit, of course. But it would take an
extraordinary amount of research and lots of code to become
competitive in this area.

> If we want to match recent C++ compilers in this matter we would
> need to bring parallelisation to LuaJIT as well, at least for 2
> or more assignments in a row that come from the same expression,
> i.e.:
>   __mul = function(a, b)
>     return vec4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w)
>   end

The logical first step is to add SIMD builtins and hand-vectorize
the code. The SIMD builtins etc. are one of the future extensions
mentioned in the roadmap.
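
As a rough illustration of what hand-vectorizing means, here is a
sketch in C using the GCC/Clang vector extensions. This is not LuaJIT
code and says nothing about how LuaJIT SIMD builtins would look; the
names vec4f and vec4_mul are made up for the example.

```c
/* GCC/Clang vector extension: a 16-byte vector holding four floats.
   (vec4f and vec4_mul are illustrative names, not part of any API.) */
typedef float vec4f __attribute__((vector_size(16)));

/* Hand-vectorized component-wise multiply: a single vector expression
   stands in for the four scalar multiplies a.x * b.x ... a.w * b.w. */
static inline vec4f vec4_mul(vec4f a, vec4f b)
{
    return a * b;
}
```

The programmer states the vector operation explicitly here, so the
compiler does not have to discover it from scalar code.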

--Mike
