Thanks for your summary. We've been having issues related to this code as well,
albeit different since we are concerned with x64 and a different usage model.
I have two questions/suggestions on this topic:
1) Could LuaJIT not create an allocation island with a trampoline in the middle
that performs an absolute jump to the exit function? This would avoid the issues
of having a special range of memory not completely under LuaJIT's control.
2) Could LuaJIT provide a way to set a function vector for this allocation
scheme? We would really prefer to write a customized allocator to manage this
memory, but right now we would need to patch LuaJIT to do it.
On Feb 27, 2017, at 4:47 AM, Alexey Kopytov <akopytov@xxxxxxxxx> wrote:
After many hours of profiling, tracing and debugging I thought it would be
valuable to post my findings here. Mostly as a summary for people coming from
search engines. My understanding of LuaJIT internals is expectedly limited,
so I'm also happy to be corrected.
The root cause of my LuaJIT scalability issues (manifesting themselves as
long benchmark warmup times) was a combination of many things at different
levels:
- ARM64 + many cores. I originally discovered the issue on an ARM64 server
with more than a hundred cores, and omitted this detail from my original
email because I thought it was nonessential. But that was before repeating
the same benchmarks on an X64 server and diving into the LuaJIT code. It is
(almost) a non-issue for X64 for the reasons described below
- the implementation of mcode_alloc() in LuaJIT, the function that allocates
memory for machine code. It uses mmap() to first try to allocate memory at a
specific address and, if that fails, at randomized addresses within a certain
architecture-specific range. If a certain number of attempts fail, it aborts
the trace with the "failed to allocate mcode memory" error
- the Linux mmap() scalability issues I mentioned in my original email
I started by repeating my benchmarks on an X64 server and discovering that
the issue with long warmup times doesn't exist there. Tracing showed much
higher rates of "failed to allocate mcode memory" trace aborts on ARM64,
which pointed me to mcode_alloc() and the following issues:
1. Lower allocation range. mcode_alloc() wants to allocate areas in a certain
range so that they are all mutually reachable by the CPU's relative jump
instructions. The range naturally depends on the CPU instruction set. For
X64, the jump range is +/- 2 GB. It is lower for most non-x86 architectures;
for ARM64 it is +/- 128 MB. If we want all allocated blocks to be mutually
reachable, all blocks must fit into 128 MB. We also want the static assembler
code in the process code segment to be reachable from all blocks, which
means the allowed allocation range must be centered around some "target"
address in the process code segment.
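The reachability constraint above can be made concrete with a small check: with
a +/- 128 MB branch range, any two blocks inside the 128 MB window centered on
the target are at most 128 MB apart and can therefore branch to each other and
to the static code. The helper names below are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* ARM64 unconditional branches (B/BL) reach +/- 128 MB. */
#define BRANCH_RANGE ((uintptr_t)128 << 20)

/* Two addresses can branch to each other iff they are within range. */
static int mutually_reachable(uintptr_t a, uintptr_t b) {
  uintptr_t d = a > b ? a - b : b - a;
  return d <= BRANCH_RANGE;
}

/* A block inside [target - 64MB, target + 64MB] can reach the target and
   every other block in the same window: the window spans 128 MB total. */
static int in_window(uintptr_t p, uintptr_t target) {
  uintptr_t half = BRANCH_RANGE / 2;
  return p >= target - half && p <= target + half;
}
```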
2. Even lower allocation range. The mcode_alloc() implementation for some
reason divides the jump range by 2 for randomized (but not for the initial
"targeted") mmap() calls. For ARM64 this resulted in a theoretical range for
mmap() allocations of [target - 32MB; target + 32MB], i.e. a 64 MB pool
shared by all threads. That is not much, but in fact it was even smaller,
because...
3. Tiny allocation range. The process code segment on both the ARM64 and X64
servers was in the 0x00400000-0x00499000 range, i.e. 4 MB away from the zero
address. That means there was not much to allocate below that range (and
wrap-around addresses in the upper memory space are usually reserved by the
kernel, i.e. not available for mmap()). As a result, I only had a little more
than 32 MB of memory available for mcode allocations on the ARM64 machine,
which was fairly quickly exhausted with 500+ threads.
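Putting the numbers from points 2 and 3 together: with the target at 0x00400000
and a +/- 32 MB randomized range, clamping at address zero leaves only about
36 MB. The usable_pool helper below is just a sketch of that arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* The randomized range is [target - half, target + half], but nothing
   below address zero is usable, so the pool shrinks when the target sits
   close to zero. */
static uintptr_t usable_pool(uintptr_t target, uintptr_t half) {
  uintptr_t lo = target > half ? target - half : 0;  /* clamp at zero */
  uintptr_t hi = target + half;
  return hi - lo;
}
```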
4. And this is when the Linux mmap() scalability bug came into play. During
the initial warmup period I had dozens of CPU cores all spinning in the
mcode_alloc() loop, doing 32 mmap() attempts with most (or all) of them
failing. I'm not quite sure why that activity eventually died down. I
suspected the poor quality of the PRNG being used, and got 2x lower warmup
times by using a better PRNG and seeding it for each thread. That was still
only part of the solution.
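The per-thread seeding idea can be sketched as follows. This uses an
xorshift64* generator as an example of a "better PRNG" and is an illustration
of the idea only, not the actual patch from the issues linked below:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* One PRNG state per thread, so concurrent traces don't all probe the
   same sequence of mmap() hints. */
static _Thread_local uint64_t prng_state;

static void prng_seed_thread(void) {
  /* The address of a thread-local differs per thread; mix in the clock
     so separate runs differ as well. */
  uint64_t s = (uint64_t)(uintptr_t)&prng_state ^ (uint64_t)clock();
  prng_state = s ? s : 0x9e3779b97f4a7c15ULL;  /* avoid the all-zero state */
}

/* xorshift64*: full-period on nonzero states, cheap, decent quality. */
static uint64_t prng_next(void) {
  uint64_t x = prng_state;
  x ^= x >> 12; x ^= x << 25; x ^= x >> 27;
  prng_state = x;
  return x * 0x2545f4914f6cdd1dULL;
}
```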
I proposed solutions for the above problems in the following Github issues:
- https://github.com/LuaJIT/LuaJIT/issues/282
- https://github.com/LuaJIT/LuaJIT/issues/283
- https://github.com/LuaJIT/LuaJIT/issues/284
- https://github.com/LuaJIT/LuaJIT/issues/285
With my (experimental quality!) patches, the warmup time for high-concurrency
benchmarks has been reduced from 30+ seconds to less than 5 seconds, which is
good enough for my purposes.
I think there's room for further improvements. We probably want to exclude
process code and heap segments from the mcode_alloc() allocation range to
reduce failed attempts, handle exhausted allocation pools in a more
reasonable way, use some linker magic to move the code segment higher in the
process address space, etc. I'm also looking at mmap_probe() which may
require similar fixes. But then again, my current fixes are good enough for
me.
/Alexey