I am pleased to inform you that I have found the solution. The geometric mean
of the duration per run is now stable at 5.8 seconds (with ASLR switched on)
and at 5.6 seconds (with ASLR switched off). The corresponding slowdown
factor compared to the C++ version is now only 1.1.
The following steps led me to the solution. First, I temporarily switched to
LuaJIT 2.1 (commit 570e758) and instrumented the application with
luaJIT_profile_start. This allowed me to confirm that the percentage of
compiled code is actually highest in the first run and then decreases over
time. Interestingly, the duration did not increase linearly with the
percentage, but remained consistently good until the compiled code percentage
was about 45%, and then suddenly doubled. In general, performance was worse
with LuaJIT 2.1, and the probability of the speed suddenly halving was much higher than with
LuaJIT 2.0. I also could not reproduce the effect reported by Laurent Deniau,
according to which the profiler stabilizes the behavior of LuaJIT and leads to
higher performance.
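For reference, the compiled-code percentage can also be sampled from the Lua side via LuaJIT 2.1's jit.profile module (which wraps the same luaJIT_profile_start API). A minimal sketch; run_benchmark is a placeholder for one benchmark run, and the vmstate letters are as documented by LuaJIT:

```lua
-- Minimal sketch (LuaJIT 2.1 only): count profiler samples by VM state.
-- vmstate is a single letter: N = compiled (native) code, I = interpreted,
-- C = C code, G = garbage collector, J = JIT compiler.
local profile = require("jit.profile")

local counts, total = {}, 0
profile.start("i10", function(thread, samples, vmstate)  -- sample every 10 ms
  counts[vmstate] = (counts[vmstate] or 0) + samples
  total = total + samples
end)

run_benchmark()  -- placeholder: one run of the workload under test

profile.stop()
print(string.format("compiled code: %.1f%% of samples",
                    100 * (counts.N or 0) / math.max(total, 1)))
```

Comparing this percentage across successive runs makes the decay over time directly visible.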
I then started looking in the tracer source code of LuaJIT 2.0 for reasons why
good traces are abandoned. This led me to a set of parameters in lj_jit.h, line
96ff. From my university studies I remembered queueing theory, where similar
non-linear effects appear in multilevel queueing systems and incorrectly
dimensioned parameters can have a devastating effect on throughput.
In an extensive series of tests I then varied a subset of these parameters
until I reached a stable minimum duration (i.e. no sudden slowdown). The
above-mentioned results were achieved with the following parameter values:
- maxtrace = 100'000 (instead of 1'000)
- maxrecord = 40'000 (instead of 4'000)
- maxside = 1'000 (instead of 100)
- sizemcode = 64 (instead of 32)
- maxmcode = 4'096 (instead of 512)
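As far as I can tell, these parameters do not strictly require patching lj_jit.h and rebuilding: they can also be overridden at runtime through LuaJIT's jit.opt module (or the equivalent -O command-line flags), which makes this kind of experimentation much cheaper. A sketch with the values above (the comments paraphrase the parameter descriptions in lj_jit.h):

```lua
-- Override the trace/mcode limits at runtime instead of patching lj_jit.h.
-- Equivalent to: luajit -Omaxtrace=100000 -Omaxrecord=40000 ...
local jit_opt = require("jit.opt")
jit_opt.start(
  "maxtrace=100000",  -- max. number of traces in the cache
  "maxrecord=40000",  -- max. number of recorded IR instructions
  "maxside=1000",     -- max. number of side traces of a root trace
  "sizemcode=64",     -- size of each machine code area (KB)
  "maxmcode=4096"     -- max. total size of all machine code areas (KB)
)
```

This must run before the hot code is first traced, since already-compiled traces are not re-recorded.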
It would be presumptuous to assume that this value set is already optimal,
but it at least shows that the approach works and that good, stable
performance can be achieved with the LuaJIT tracer. I assume that the
previously observed effects might have the following causes (bear in mind
that I'm not an expert on LuaJIT internals):
- since the trace memory is limited, traces must be abandoned, or no new
traces can be allocated, when it is full;
- when the cache is full, access becomes more expensive;
- since the length of a trace is limited, the JIT will be less efficient in
applications that tend to require longer traces;
- since the memory area for machine code is limited, the effort to find a free
area grows with the length of the traces and the number of traces.
Best
R.
_______________________________
From: rochus.keller@xxxxxxxxxx
Sent on: Wed, 15 Jul 2020 23:19:45 +0200
To: luajit@xxxxxxxxxxxxx
Cc:
Subject: Fwd: Inexplicable behaviour of JIT (sudden speed-down of the same code
that was running fast at first) in a long-running Lua program, update 2
Just wanted to let you know that I observe exactly the same issue with LuaJIT
2.1 as well.
No other changes to my application or test setup; I just replaced the LuaJIT
2.0 used for my tests so far with the most recent LuaJIT 2.1 commit
570e758 and rebuilt everything. The results look pretty much the same as
reported in
http://software.rochus-keller.info/St80LjVirtualMachine_0.6_Performance_Report_2020-07-14.pdf
Best
R.
_______________________________
From: rochus.keller@xxxxxxxxxx
Sent on: Wed, 15 Jul 2020 00:40:01 +0200
To: luajit@xxxxxxxxxxxxx
Cc:
Subject: Fwd: Inexplicable behaviour of JIT (sudden speed-down of the same code
that was running fast at first) in a long-running Lua program, update
Meanwhile I managed to speed up the display code which is used by both my C++
and my Lua version of the Smalltalk-80 VM. One run of the benchmark now takes
only 5.1 seconds with the C++ version and 9.6 seconds (best case) with the Lua
version. Here is the code: https://github.com/rochus-keller/Smalltalk (build
with StVirtualMachine.pro and StLjVirtualMachine.pro; there is a precompiled
Windows version).
And the observed issue is now becoming much more apparent. Here is a report of
my test runs conducted so far:
http://software.rochus-keller.info/St80LjVirtualMachine_0.6_Performance_Report_2020-07-14.pdf
In most cases the first run was fast (around 11 seconds) and the subsequent
runs were slow (around 21 seconds). But there are many cases where the first
two runs were fast. I only had one case where all runs were fast, and another
case where all runs were between 14 and 15 seconds.
So LuaJIT can obviously run the benchmark in ~11 seconds several times in a
row, but then suddenly shifts into a lower gear for no apparent reason. The
time gap between the runs does not seem to be relevant. I also find it
remarkable that the slowdown is again about 10 seconds per run (see my first
post, where the former version of the VM slowed down from 40 to 50 seconds).
Remember that it is the Smalltalk interpreter which is written in Lua and which
continuously runs in a loop. The benchmark is actually Smalltalk (byte)code
which is executed by the interpreter. Neither the Lua context, nor the
interpreter, nor the benchmark change during the sequence of runs. For each new
sequence of runs the application (and with it LuaJIT) is re-started.
Has anyone observed this behaviour of LuaJIT before? Is there an explanation
for it and possibly a work-around?
Thanks
R.