Re: Few questions about LuaJIT internals on 64 bit

From: Mike Pall <mikelj-1703@xxxxxxx>
To: luajit@xxxxxxxxxxxxx
Date: Sat, 25 Mar 2017 19:31:52 +0100

Peter Cawley wrote:

In the long term, I'm not sure anybody has a great answer
- as pointers get larger, NaN packing gets harder, and it may well
become impractical, which would then require a major overhaul of
LuaJIT.

Well, I've been thinking about a solution for this issue. First,
some context:

Choosing the right object reference representation for a dynamic
language VM is crucial. It's hard to overhaul, once the VM gets
more complex. Also, this design choice influences what kind of
optimizations you need in a VM to get acceptable performance.

E.g. with slots that hold pure untagged pointers to objects, every
pointer in a slot needs to be dereferenced on access. Both for the
type check and to access the value. And you need to allocate every
object on the heap, including integers and other number types.
Performance will suffer until you do lots of unboxing optimizations,
pre-allocation of common objects (low integers) etc. And it's very
hard to eliminate that overhead completely. It's a simple design
and it shows.

OTOH a tagged value design with a fixed slot size has to make a
trade-off between the slot size and the ability to inline some
object types in a slot, e.g. integers. Not all objects will fit,
no matter how big you make the slots. And not all objects in a
language have value semantics either. Bigger slots offer more
options, but also introduce more CPU and memory overhead.

IMHO there's a sweet spot for a fixed-size tagged value design and
that's at the 64 bits with NaN-tagging that LuaJIT currently uses.
But, of course, it has to box 64 bit integers and it would
eventually have to box (full) 64 bit pointers. Which makes it
problematic, going forward.

Around 10 years ago (*), I experimented with different slot sizes
and 128 bit slots showed noticeably lower performance for the core
operations. That's probably less of a concern with a simple
interpreter, where the major overhead is elsewhere. But with an
advanced JIT compiler, you'll certainly notice.

(*) Yes, one would need to redo this experiment with current-
generation desktop, server and embedded systems.

Also, while we seem to have plenty of memory space nowadays,
memory bandwidth is still a major bottleneck. This is less of an
issue with stacks, which mainly need to fit into the L1 cache. But
if you want to use the same slot design for stacks and aggregates,
then you'll pay the price on aggregates. A mixed-size design is
conceivable, but then you might as well go for a variable-size
design.

A variable-size slot design needs one extra indirection: An offset
into a separate area. And you really want to avoid another
indirection. So, here's the laundry list: let's inline _all_
objects with value semantics (up to a sane limit). This completely
eliminates the need for boxing/unboxing, too. The tag lives
together with the offset, to move type checks to an earlier
dependency chain. Then let's work on passing values more
efficiently between interpreter operations. Finally, design a
generic aggregate that can hold variable-sized objects and can be
specialized for same-typed values.

While we're at it ... if we're already having the indirections,
we might as well execute SSA form directly, instead of paying for
register allocation and traditional bytecode generation.

Actually, there's quite a bit of literature about that topic. The
key issues are the indirections, the memory waste and the parallel
PHI assignment property. OK, so we've already agreed to pay the
price for the indirections. The memory waste can be reduced, but
it's mostly a non-issue as we're only talking about the stack, not
the aggregates. And the PHI assignments can be linearized for the
few uncommon functions that need it.

A little-known fact is that one can do a one-pass translator from
source code to linear SSA form for most imperative languages, too.
There are only a few complications with Lua -- it needs a bit of
backtracking and upvalue immutability is tricky to handle (though
that might be considered an optimization).

Putting everything together, the plan would be to translate Lua
source code to an untyped executable SSA form, interpret that with
variable-sized slots, record the observed types and turn it into
either a typed bytecode (for the non-JIT platforms) or machine
code. Trace vs. method-based compiler is an orthogonal design
choice, though the former probably works better in such a setting.

Summing up, this solves not just the issues with the fixed-size
tagged value design, but you also get dynamic typing with
efficient unboxed value types (integers, FP numbers, complex
numbers, structs), just like in C/C++.

There are interesting implications for GC, handling of temporaries
and memory allocation in general. E.g. the separate area mentioned
above can also be seen as a GC nursery. Also, scoped finalization
is trivial. There are other nice consequences of this design, but
this message is already too long.

Now ... if you're wondering what this entails: Yes, this would be
a major departure from the current execution model of LuaJIT. It
would be more or less a complete rewrite.

Which poses the question whether one wouldn't want to choose or
design a dynamic language plus its ecosystem that better fits that
execution model, to get the most out of it.

--Mike

Follow-Ups:
- Re: Few questions about LuaJIT internals on 64 bit
  - From: Demi Obenour
- Re: Few questions about LuaJIT internals on 64 bit
  - From: Dibyendu Majumdar

References:
- Few questions about LuaJIT internals on 64 bit
  - From: Blaise R.
- Re: Few questions about LuaJIT internals on 64 bit
  - From: Peter Cawley

Re: Few questions about LuaJIT internals on 64 bit

Other related posts: