Re: alleviate the load of the GC

  • From: Laurent Deniau <Laurent.Deniau@xxxxxxx>
  • To: "<luajit@xxxxxxxxxxxxx>" <luajit@xxxxxxxxxxxxx>
  • Date: Wed, 2 Sep 2015 16:09:59 +0000

On Sep 2, 2015, at 3:30 PM, Stefano <phd.st.p@xxxxxxxxx>
wrote:

On 2 September 2015 at 10:49, Laurent Deniau <Laurent.Deniau@xxxxxxx> wrote:
Motivation:
Overloading other operators (beyond syntactic sugar) leads to some
optimisation like sharing temporaries, building lazy expressions, or flatten
expressions that alleviate a lot the GC. C++ libraries are using these kind
of optimisations for decades now with success. To be a bit provocative,
overloading operators has little sense without being able to overload '=',
except basic syntactic sugar. The operator '=' should be seen as a component
of the expression (or statement), and not as a special case, and there is no
reason to not be able to overload it.

Experience:
I have observed with LuaJIT (and other GC-ed languages including C/C++ and
BoehmGC) that I can get a speed up of x10-100 when I can properly manage and
share temporaries within expressions but this requires some _local_
"finalisation" when the results is assigned somewhere. The 'where' is not
important (local, global, key, whatever) because what matters is to know if
the user can reuse (assigned the resulting temporaries) or not (no access to
the temporary), that is if the result is semantically anchored. Without the
reuse of temporaries, the GC gets quadratically slower (specially with large
objects not sinked for LuaJIT).

Best,
Laurent.

I believe that what you are referring to is mainly the application of
expression template techniques as in the Eigen C++ matrix/vector
algebra library (http://eigen.tuxfamily.org/).
By overloading the arithmetic and assignment operators it is possible
to have expressions of vectors such as:
x += y + z + w;
that gets 'expanded' at compile time to a single loop and with no need
of temporaries.

Not only. I have been using template expressions back in 1997 in my SL++
library (the equivalent of Blitz++ for Linear Algebra). So the idea is not new,
but it is not the only application of overloading the '=' operator. The move
semantic of C++ is going that way too, in a different manner. So the need is
there for two decades, with many applications. So my question is not if it is
useful (I know the answer for 20 years) , but if it is feasible in LuaJIT
implementation as part of my own patch. I do not say that it should be part of
LuaJIT nor of Lua, but if I can make a patch and show how to use it in Lua,
check feasibility and inconsistency of the approach and so on, I could apply it
for our needs and maybe other people would find it useful.

Previous versions of my sci.alg (http://scilua.org/sci_alg.html)
module employed similar techniques and you could write:
x:set(x + y + z + w)

Not always possible if x is a new local variable holding subexpressions.

Here is a concrete example:

local function track_drift(m, e)
local pz = var( e.L/sqrt(1 + (2/e.b)*m.ps + m.ps^2 - m.px^2 - m.py^2) )
m.x = m.x + m.px*m.pz
m.y = m.y + m.py*m.pz
m.s = m.s + (1/e.b + m.ps)*m.pz - (1-e.T)*e.LD/e.b
pz:release()
end

var is what I want to avoid
release could be omitted and rely on the GC with little load.
values manipulated there are a mix of polynomials (cdata >128 bytes) and reals
depending on the values stored in m.

With release, it needs only 3 temporaries, without release it uses the same 3
temporaries plus one new per call, without var and optimisation in operators it
used 25 new temporaries per call.

I tested different implementations, both via cdata and tables, that
would rely heavily on allocation sinking to achieve good performance.
The micro benchmarks (like iterating the above thousand times)
revealed competitive performance, indeed in some benchmarks it was
faster than Eigen.
However, when employed in more complex simulations it ended up putting
too much of a burden on the LuaJIT compiler so I have ditched the idea
for now (I suspect further LuaJIT optimisations and hyperblock
scheduling could make it work).

I observed the same, that is why I moved all the algebra in C and wrap things
into LuaJIT, otherwise even simple code is not anymore compiled. The code for
matrices and complex matrices is there

https://github.com/MethodicalAcceleratorDesign/MAD/blob/master/lua/matrix.lua

and the benchmark show good results (LuaJIT reports 95-100% compiled from run
to run).

For now I am resorting to plain old loops for expressions like the
above, but I am working on extending the Lua syntax to allow for
things like:
x[] = x[] + y[] + z[] + w[]
The way this is implemented is via code-transform: a new Lua file with
inlined expressions is generated and executed.

Apologies if I am off-topic, it seemed to me that this is what you
wanted to achieve and wanted to share my (long) experience and
conclusions.

Btw, your proposal would introduce issues with function calls for
instance (__assign being called for function arguments?)

Why? Semantically f(x*y) should be equivalent to local _tmp = x*y ; f(_tmp)

(you are somehow right as I was planning to follow the __assign proposal by
something about _ENV ;-) But one thing after the other. If I cannot even
understand how to modify LuaJIT for basic stuff about __assign, I doubt that I
could do something for _ENV which is not even there…)

Best,
Laurent.


Other related posts: