After many hours of profiling, tracing and debugging I thought it would
be valuable to post my findings here. Mostly as a summary for people
coming from search engines. My understanding of LuaJIT internals is
expectedly limited, so I'm also happy to be corrected.
The root cause of my LuaJIT scalability issues (manifesting themselves
as long benchmark warmup times) was a combination of many things at
different levels:
- ARM64 + many cores. I originally discovered the issue on an ARM64
server with more than a hundred cores, and omitted this detail from
my original email as I thought it was nonessential. But that was before
repeating the same benchmarks on an X64 server and diving into the
LuaJIT code. It is (almost) a non-issue for X64 for the reasons
described below
- implementation of mcode_alloc() in LuaJIT, the function that
allocates memory for machine code. It uses mmap() to first try to
allocate memory at a specific address and, if that fails, at randomized
addresses within a certain architecture-specific range. If a certain
number of attempts fails, it aborts the trace with the "failed to
allocate mcode memory" error
- Linux mmap() scalability issues I mentioned in my original email
I started by repeating my benchmarks on an X64 server and discovering
that the issue with long warmup times doesn't exist there. Tracing
showed much higher rates of "failed to allocate mcode memory" trace
aborts on ARM64, which pointed me to mcode_alloc() and the following issues:
1. Lower allocation range. mcode_alloc() wants to allocate areas in a
certain range so that they are all mutually reachable by CPU relative
jump instructions. The range naturally depends on the CPU instruction
set. For X64, the jump range is +/- 2 GB. It is lower for most non-x86
architectures. For ARM64 it is +/- 128 MB. If we want all allocated
blocks to be mutually reachable, all blocks must fit into 128 MB. We
also want the static assembler code in the process code segment to be
reachable from all blocks, which means the allowed allocation range must
be centered around some "target" address in the process code segment.
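The reachability constraint above is plain arithmetic; a minimal sketch
(names are mine, not LuaJIT's):

```c
#include <stdint.h>

#define JUMPRANGE ((intptr_t)128 << 20)  /* ARM64: +/- 128 MB */

/* Can a relative branch at 'from' reach 'to'? */
static int branch_reachable(uintptr_t from, uintptr_t to)
{
  intptr_t d = (intptr_t)to - (intptr_t)from;
  return d >= -JUMPRANGE && d <= JUMPRANGE;
}
```

Since the distance check is symmetric, keeping every pair of blocks
within a 128 MB window makes them all mutually reachable.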
2. Even lower allocation range. The mcode_alloc() implementation for
some reason divided the jump range by 2 for randomized (but not for the
initial "targeted") mmap() calls. For ARM64 this resulted in a
theoretical range for mmap() allocations of [target - 32MB; target +
32MB], i.e. a 64 MB pool shared by all threads. That is not much, but
in practice it was even smaller, because...
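Spelling out the arithmetic of point 2 (a sketch with my own names):

```c
#include <stdint.h>

/* With a +/- 128 MB jump range, a window centered on the target that
 * keeps everything mutually reachable is at most 128 MB wide, i.e.
 * [target - 64MB, target + 64MB]. The extra division by 2 for the
 * randomized mmap() hints shrinks the shared pool to 64 MB. */
static uint64_t randomized_pool_bytes(void)
{
  uint64_t window = 128ull << 20;  /* mutual-reachability window */
  uint64_t halved = window / 2;    /* the extra division by 2 */
  return halved;                   /* [target - 32MB, target + 32MB] */
}
```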
3. Tiny allocation range. The process code segment on both the ARM64
and X64 servers was in the 0x00400000-0x00499000 range, i.e. 4 MB away
from the zero address. That means there was not much to allocate below
the target (and wrap-around addresses in the upper memory space are
usually reserved by the kernel, i.e. not available for mmap()). As a
result, I only had a little more than 32 MB of memory available for
mcode allocations on the ARM64 machine, which 500+ threads exhausted
fairly quickly.
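Why the 64 MB pool shrank to "a little more than 32 MB", with the
illustrative numbers from above (a sketch, my own names):

```c
#include <stdint.h>

/* With the code segment (and thus the target) only ~4 MB above address
 * zero, most of the lower half of [target - 32MB, target + 32MB]
 * simply does not exist in the address space. */
static uint64_t usable_pool(uint64_t target, uint64_t radius)
{
  uint64_t lo = target > radius ? target - radius : 0;  /* clamp at 0 */
  uint64_t hi = target + radius;
  return hi - lo;
}
```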
4. And this is where the Linux mmap() scalability bug came into play.
During the initial warmup period I had dozens of CPU cores all spinning
in the mcode_alloc() loop, each doing up to 32 mmap() attempts with
most (or all) of them failing. I'm not quite sure why that activity
eventually died down. I suspected the poor quality of the PRNG being
used, and got 2x lower warmup times by using a better PRNG and seeding
it separately for each thread. That was still only part of the solution.
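One way to get a better, per-thread-seeded PRNG; this is a standard
splitmix64 generator and an illustrative seeding scheme of my own, not
the actual patch:

```c
#include <stdint.h>

/* splitmix64: a small, decent-quality 64-bit PRNG. */
static uint64_t splitmix64(uint64_t *state)
{
  uint64_t z = (*state += 0x9e3779b97f4a7c15ull);
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
  return z ^ (z >> 31);
}

/* Seed each thread's state differently (here from a thread id), so
 * threads don't all probe the same sequence of mmap() hints. */
static uint64_t seed_for_thread(uint64_t thread_id)
{
  return 0x2545f4914f6cdd1dull ^ thread_id;  /* arbitrary constant */
}
```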
I proposed solutions for the above problems in the following Github issues:
- https://github.com/LuaJIT/LuaJIT/issues/282
- https://github.com/LuaJIT/LuaJIT/issues/283
- https://github.com/LuaJIT/LuaJIT/issues/284
- https://github.com/LuaJIT/LuaJIT/issues/285
With my (experimental quality!) patches the warmup time for
high-concurrency benchmarks was reduced from 30+ seconds to less than
5 seconds, which is good enough for my purposes.
I think there's room for further improvements. We probably want to
exclude process code and heap segments from the mcode_alloc() allocation
range to reduce failed attempts, handle exhausted allocation pools in a
more reasonable way, use some linker magic to move the code segment
higher in the process address space, etc. I'm also looking at
mmap_probe(), which may require similar fixes. But then again, my
current fixes are good enough for me.
/Alexey