On 16 Jul 2020, at 03:03, Chris <celess22@xxxxxxxxx> wrote:
It’s because the slowdowns are not in LuaJIT itself.
Each run will behave the same inside LuaJIT. The calls to the OS may be faster
or slower. Some of the OS calls can cause subsequent calls to become slower
temporarily, even across launches of the executable.
We experienced significant performance variations between runs too, but once
you disable your OS's Address Space Layout Randomisation, performance is
almost deterministic from run to run. Unfortunately this is only a temporary
solution for debugging, and it does not explain degradation during a single
run (unless a trace flush occurs).
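For reference, on Linux ASLR can be turned off for a single benchmark run without changing the system-wide setting. A sketch (the `bench.lua` script name is hypothetical, standing in for whatever workload you are timing):

```shell
# Run one process with address-space layout randomisation disabled
# (util-linux setarch sets the ADDR_NO_RANDOMIZE personality flag):
setarch "$(uname -m)" -R luajit bench.lua

# Or disable ASLR system-wide until reboot (requires root; 0 = off, 2 = default):
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
```

The per-process form is usually preferable, since leaving ASLR off system-wide weakens a security mitigation.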
For the same reason, any change in the code can lead to better or worse
performance. See section 4.1.3 in
https://github.com/MethodicalAcceleratorDesign/MADdocs/blob/master/luajit/luajit-doc.pdf
[disclaimer: this is our understanding of the cause, not a proof].
I also recently posted that using the profiler (e.g. -jp=vl) stabilises the
problem and even increases performance significantly. I don't really know
why, though, as I would expect the opposite...
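For anyone wanting to reproduce this, the profiler mentioned is the stock `jit.p` module bundled with LuaJIT 2.1; besides the `luajit -jp=vl script.lua` command line, it can be driven from inside the program:

```lua
-- Start LuaJIT's bundled sampling profiler (same as `luajit -jp=vl`):
-- 'v' shows VM states (compiled/interpreted/GC), 'l' profiles per line.
local profiler = require("jit.p")
profiler.start("vl")
-- ... run the workload whose timing varies ...
profiler.stop()   -- dumps the profile to stdout
```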
Hope this helps a bit.
Best regards,
Laurent.
Some of this can be mitigated by altering how the more unusual parts of
LuaJIT behave. Those parts, like the compiled-code memory allocator and
manager, are rare enough in other programs that you will see results that
seem really strange.
----
There is only one other part of LuaJIT that I know of that can alter behavior
from run to run, based on a pseudo-randomly generated number. It's often an
interplay between that seed, which can first be seeded in an actual call to get
memory for the code cache, and the same seeded number then being used as part
of the trace code.
This seed can cause different trace-code choices between runs because it may
get seeded differently, or advanced to a different pseudo-random number by
differences in memory layout when the program is launched. The layout affects
the number of times the generator is run, because it changes how many times a
block has to be tested for being within the 2GB boundary.
In my code I separate these tests and use the randomness to alter the
heuristics of the throttle, rather than affect the trace code directly, as
best I can remember.
So in the end:
The net effect is that if you only have the stock code, you can change the
‘maxcode’ option once at the top of a file and see whether you get different
results, both from changing the number of OS calls and from changing the
pseudo-randomness.
This setting will in effect vary both reasons for getting different results
per run, and in my opinion will let you know whether one of these is the
source of your problem. If it is, then mystery solved, and you are where I am.
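For concreteness: assuming the ‘maxcode’ option above refers to LuaJIT's standard machine-code area options (`maxmcode`, the total mcode limit in KB, and `sizemcode`, the size of each mcode block in KB; the exact option meant here is my guess), the experiment could be set up once at the top of a file like this:

```lua
-- Hypothetical experiment: shrink the machine-code area so the mcode
-- allocator talks to the OS at a different rate. Stock LuaJIT 2.1
-- defaults are around sizemcode=32 and maxmcode=512 (both in KB).
local jit = require("jit")
jit.opt.start("sizemcode=8", "maxmcode=64")  -- deliberately small values
```

If run-to-run variance changes noticeably with these values, that points at the code-cache allocation path rather than at the traces themselves.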
My code tries to rewire things to account for all of these long-standing issues.
From: rochus.keller@xxxxxxxxxx
Sent: Wednesday, July 15, 2020 5:41 PM
To: luajit@xxxxxxxxxxxxx
Subject: Re: Re: RE: Re: RE: Inexplicable behaviour of JIT (sudden speed-down
of the same code that was running fast at first) in a long-running Lua
program, update
@ Chris
Thanks for the additional information. At the moment I still cannot see exactly
how this relates to my observations and questions, probably because I am not
familiar enough with the internals of LuaJIT.
Obviously LuaJIT found optimal traces on the first run of the benchmark, but
seems to give them up (why else should it slow down?) on later runs for unknown
reasons. All runs use the same sequence of operations and run in the same
LuaJIT session. I don't see a reason to assume that the hotspot statistics
changed, or that there should suddenly be something blacklisted which could not
have been blacklisted long before.
Anyway, a pragmatic approach would be to just try the LuaJIT version with
your proposed changes. Can you please post a link where I can download your
LuaJIT version with the described modifications?
Best
R.
_______________________________
From: Chris
Sent on: Thu, 16 Jul 2020 01:29:25 +0200
To: luajit@xxxxxxxxxxxxx
Cc:
Subject: RE: Re: RE: Inexplicable behaviour of JIT (sudden speed-down of the
same code that was running fast at first) in a long-running Lua program, update
Sorry. What I wrote may make more sense in the context of the LJ internals.
It's my fault.
I wanted to just get the bulk of what I found in one spot for everyone on the
list, so I covered a lot of ground.
I should have written a TL;DR, but it is a complex topic. My experience may or
may not apply. But it's an example I found of LJ's run time seeming to go
through the roof, showing dramatic differences from run to run.
My understanding and observations below.
Summary:
Basically LJ is losing its sweet spot in the balance between the OS keeping
things in order and how often LJ requests virtual memory blocks. But it's not
for the bulk of the memory that Lua, the GC, or LuaJIT internals use, just for
allocations for the compiled Lua code cache.
Changing the frequency seems to keep these in balance. This requires teaching
LJ some new tricks, rebalancing other areas of code to take advantage of the
facilities, and altering heuristics to keep all this in mind.
You can often approximate this coarsely by simply changing the max code block
setting in LJ, even at run time, to a low number like 8 or 10. Likely it will
be slower, but it will effectively spend more time executing compiled and
non-compiled Lua code, because the bad teardowns will take less time and
because it will fall back to the interpreter more often/easily. The teardown
and rebuild does not cost much, but when the chatter with the VM code in the
OS hits a certain frequency, it can seem to stall every call for a while. This
is enough to make a run suddenly look very terrible, even though nearly
everything happened just right.
----
There’s a lot of jiu-jitsu going on with how the OS presents virtual memory
allocations, deferring all the work that eventually has to occur. On your own
thread you get a sweet optimized path up to a point; then, in order to
complete the call, you start causing the real work to occur on that call, on
your thread.
LuaJIT general memory:
LJ has its own memory manager compiled in. Asking for memory within an
existing VM allocation is just internal housekeeping on your side, the LJ
VM's side, or likewise on the C runtime's side for any free/alloc there.
Each allocation and deallocation of Lua memory is just another set of reads
and writes, bookkeeping its own already-allocated OS memory inside its own
memory manager, like any other read or write. The manager is pretty good about
acquiring and keeping real memory, and doesn't need to make many real OS
calls, the kind you can observe with ‘top’, a perf counter, or Task Manager.
The bulk of the interaction with memory is one-to-one CPU loads and stores and
so forth, and is just memory bandwidth, and modern CPUs are built for the
abuse. Super dramatic activity can happen up and down the cache hierarchy, and
the benchmark numbers will barely seem to move.
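A quick way to see that plain Lua allocation stays inside the VM's own manager is to watch the GC's bookkeeping rather than OS-level counters; a small illustrative sketch:

```lua
-- Allocation churn that never needs to leave the VM's built-in allocator:
-- collectgarbage("count") reports the heap the Lua GC manages, in KB.
local before = collectgarbage("count")
local t = {}
for i = 1, 100000 do t[i] = { i } end        -- lots of small allocations
local after = collectgarbage("count")
print(("Lua heap grew by about %.0f KB"):format(after - before))
t = nil
collectgarbage("collect")                    -- hand the memory back to the manager
```

The heap growth shows up here long before the process's OS-visible memory footprint moves, which is the point being made above.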
Compiled Lua code memory:
Code memory has to be handled specially so blocks can be marked as executable
on the CPU. It has its own separate code path in LuaJIT, which has to be
especially careful to never leave an uninitialized or unused block in an
executable state. It wants to keep the number of blocks low and release them
ASAP, to try to play fair with the OS, and, I assume, to release and
reallocate completely on flush so the OS can decide where the best placement
is in a long-running program. Each block must also be within 2GB of the LuaJIT
code so that relative-address calls can go between LuaJIT and the compiled Lua
code. This means that these blocks must be placed near the compiled LuaJIT in
your executable or dynamic library, and there has to be a strategy for
ensuring that happens.
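The 2GB figure comes from the x86-64 near call/jmp encoding, whose signed 32-bit displacement limits how far apart caller and callee may sit; a back-of-the-envelope check:

```lua
-- A rel32 displacement is a signed 32-bit offset from the next instruction,
-- so compiled traces must sit within +/- 2^31 bytes of the LuaJIT code.
local reach = 2^31
print(("rel32 reach: +/- %.0f bytes = %.0f MB = %.0f GB")
      :format(reach, reach / 2^20, reach / 2^30))
```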
These requirements and constraints can cause the allocator to sometimes hit
the OS very fast when things get “flushy” in the JIT. When this frequency goes
down it stops stalling, and I presume stops outstripping the recovery
mechanisms. I presume also that when things get outstripped, the OS allocator
has to do the full brunt of what could normally have been deferred, right
then, on your thread, on that call.
----
I made changes to the code to minimize or throttle the frequency, based on and
using some of the trace state to know how hard it's getting hit, while keeping
all of the normal JIT state and temporary moratoriums on blacklisting, and I
reworked some of how the allocator and relative-address discovery
functionality works. I also made keeping or tossing allocated blocks more
dynamic, so they may or may not be kept around, and this may put a lower limit
on how many blocks get sent back to the OS.
Sort of a separate electronic transmission control for the JIT trace and
flush. This has a side effect of rebalancing how long JIT vs. interpreter code
runs, sometimes has the effect of nipping the overhead at the right times on
another vector, and may achieve even lower run times for certain code.