Consoles and interpreters (was Re: FYI: No JIT on Windows 8 for ARM)

  • From: Mike Pall <mike-1205@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Fri, 11 May 2012 00:27:10 +0200

Tomas Lundell wrote:
> I don't recall we tested the dual-number mode, so I don't know if it works
> on the consoles or what the performance would be like.

You tested dual-number mode.

> Double arithmetic isn't *that* slow on consoles, so I would
> hazard to guess the winner is whichever mode incurs the least
> load-hit-stores.

PPC has strictly segregated integer and FP register banks. You can
only transfer from one to the other via memory. This means all
double-to-integer conversions have to go through memory and incur
the dreaded 40-cycle load-hit-store penalty. Yes, forty cycles!

Let's take a look at a trivial loop that goes through an array and
how it's run with an interpreter:

  -- [... code for filling the array omitted ...]
  local x = 0
  for i=1,100 do x = x + a[i] end

This will incur an extra 40-cycle delay if 'i' is a double that
needs to be converted to an integer for array indexing. That's
obviously the case for single-number mode. In dual-number mode
'i' is kept as an integer, so there's no extra penalty.
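
To see why the conversion is needed at all: a number key with an
integral value has to address the same array slot, no matter how the
number is represented. A minimal sketch of the visible behavior (the
table contents are made up for illustration, they're not from the
loop above):

```lua
-- Hypothetical example data, not from the original post.
local a = { 10, 20, 30 }

-- A double with an integral value must hit the same slot as the
-- integer. So an interpreter that holds 'i' as a double has to
-- convert it (and check that it's integral) before indexing.
local i = 3.0
local hit = a[i]          -- same element as a[3]

-- A non-integral key is still a valid table key, but it's not an
-- array index and doesn't change the sequence length.
a[2.5] = "not an array slot"
local len = #a
```

Here 'hit' is the third element and 'len' stays at 3. How the
conversion is carried out is up to the implementation. On PPC it's
the round-trip through memory described above.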

You still pay for the other l-h-s penalties, but they can overlap
a bit, so this probably costs only 2*40 cycles:

   _________ +40
  |         V
  | tmp = a[i]  -- +40 for single-number mode, +0 for dual-number mode
  |   \_______
  |  _____+40 \ +40
  | |     V    V
  | | x = x + tmp
  | |_|
  |_____ +40
  |     V
  | i = i + 1
  | if i <= 100 then goto ::loop:: end
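
For reference, the pseudo-code in the diagram can be written out as
runnable Lua. This is only a sketch: it needs goto (LuaJIT 2.x or
Lua 5.2+), and the array fill is an assumption, since the original
fill code is omitted above.

```lua
-- Assumed fill: a[k] = k. The original fill code isn't shown.
local a = {}
for k = 1, 100 do a[k] = k end

local x = 0
local i = 1
::loop::
local tmp = a[i]      -- array load, indexed by i
x = x + tmp           -- needs x from the previous iteration, plus tmp
i = i + 1             -- needs i from the previous iteration
if i <= 100 then goto loop end
```

With the assumed fill, x ends up as the sum of 1..100. The comments
mark the same dependencies the arrows in the diagram point at.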

Then there's also indirect branch prediction for bytecode dispatch
(the capability differs between Cell and Xenon). Mispredictions add
a penalty of up to 3*20 cycles for this loop. I guess this means the
loop runs at 2*40+3*20 = 140 cycles per iteration in the worst case.
The few cycles for the actual computations and the overhead of the
bytecode dispatch don't matter that much in comparison.
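
The estimate can be spelled out as a tiny model, in case you want to
plug in other numbers. The penalties are the ones quoted above. The
function and variable names are made up, and the single-number figure
assumes the extra conversion penalty doesn't overlap with the others,
which may well be too pessimistic.

```lua
-- Figures from the text: 40 cycles per load-hit-store, up to 20
-- cycles per dispatch. All names here are made up for illustration.
local LHS_PENALTY      = 40
local DISPATCH_PENALTY = 20

-- Worst-case cycles per iteration. 'extra_lhs' counts conversion
-- penalties on top of the two non-overlapping ones from the diagram.
local function worst_case_cycles(extra_lhs)
  return (2 + extra_lhs) * LHS_PENALTY + 3 * DISPATCH_PENALTY
end

local dual = worst_case_cycles(0)    -- 2*40 + 3*20
-- Assumption: the extra conversion penalty doesn't overlap at all.
local single = worst_case_cycles(1)

-- Rough time per iteration at a nominal 3 GHz (3 cycles per ns).
local function ns_per_iteration(cycles)
  return cycles / 3
end

print(dual, single, ns_per_iteration(dual))
```

The dual-number result matches the 140 cycles above. Under the stated
assumption, single-number mode comes out 40 cycles worse.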

Yes, this is all due to the @$%&)" design of these chips, which
penalizes all interpreters. On top of this, the console
manufacturers had the brilliant idea to ban JIT compilers, which
wouldn't suffer from all of this. :-/

Now you know why a Lua interpreter runs dog-slow on consoles, even
though they're clocked at around 3 GHz, similar to your desktop PC.

