Re: LuaJIT and mmap() scalability

  • From: Alexey Kopytov <akopytov@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Mon, 27 Feb 2017 17:45:53 +0300

After many hours of profiling, tracing and debugging I thought it would be valuable to post my findings here, mostly as a summary for people coming from search engines. My understanding of LuaJIT internals is necessarily limited, so I'm also happy to be corrected.

The root cause of my LuaJIT scalability issues (manifesting themselves as long benchmark warmup times) was a combination of many things at different levels:

- ARM64 + many cores. I originally discovered the issue on an ARM64 server with more than a hundred cores, and omitted this detail from my original email because I thought it was nonessential. That was before repeating the same benchmarks on an X64 server and diving into the LuaJIT code. It is (almost) a non-issue on X64 for the reasons described below

- the implementation of mcode_alloc() in LuaJIT, the function that allocates memory for machine code. It uses mmap() to try to allocate memory first at a specific address and, if that fails, at randomized addresses within a certain architecture-specific range. If a certain number of attempts fail, it aborts the trace with the "failed to allocate mcode memory" error

- Linux mmap() scalability issues I mentioned in my original email

I started by repeating my benchmarks on an X64 server and discovering that the issue with long warmup times doesn't exist there. Tracing showed much higher rates of "failed to allocate mcode memory" trace aborts on ARM64, which pointed me to mcode_alloc() and the following issues:

1. Lower allocation range. mcode_alloc() wants to allocate areas in a certain range so that they are all mutually reachable by CPU relative jump instructions. The range naturally depends on the CPU instruction set. For X64, the jump range is +/- 2 GB. It is lower for most non-x86 architectures. For ARM64 it is +/- 128 MB. If we want all allocated blocks to be mutually reachable, all blocks must fit into 128 MB. We also want static assembler code from the process code segment to be reachable from all blocks, which means the allowed allocation range must be centered around some "target" address in the process code segment.
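The reachability constraint boils down to a simple displacement check. The +/- 128 MB figure comes from the 26-bit signed immediate (in units of 4-byte instructions) of the AArch64 B/BL instructions, and the +/- 2 GB figure from the 32-bit signed displacement of x64 near jmp/call. The function name below is mine:

```c
#include <stdint.h>

/* AArch64 B/BL: 26-bit signed immediate, scaled by 4 => +/- 128 MB. */
#define ARM64_JUMP_RANGE ((int64_t)128 << 20)
/* x64 near jmp/call: 32-bit signed displacement => +/- 2 GB. */
#define X64_JUMP_RANGE   ((int64_t)2 << 30)

/* Can a relative branch at 'from' reach 'to' with the given range? */
static int jump_reachable(uintptr_t from, uintptr_t to, int64_t range)
{
  int64_t disp = (int64_t)to - (int64_t)from;
  return disp >= -range && disp < range;
}
```

With a 64 MB pool this check would always pass between pool blocks; the problem is keeping every block inside such a pool in the first place.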

2. Even lower allocation range. The mcode_alloc() implementation for some reason divides the jump range by 2 for the randomized (but not the initial "targeted") mmap() calls. For ARM64 this results in a theoretical allocation range of [target - 32 MB; target + 32 MB], i.e. a 64 MB pool shared by all threads. That is not much, but in fact it was even lower, because...

3. Tiny allocation range. The process code segment on both the ARM64 and X64 servers was in the 0x00400000-0x00499000 range, i.e. about 4 MB above the zero address. That means there was not much to allocate below that range (and wrap-around addresses in the upper memory space are usually reserved by the kernel, i.e. not available to mmap()). As a result, I had only a little more than 32 MB of memory available for mcode allocations on the ARM64 machine, which was fairly quickly exhausted with 500+ threads.
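Putting the numbers together: with the target near the code segment at ~4 MB and a +/- 32 MB randomized window, almost the entire lower half of the window falls below address zero. A back-of-the-envelope calculation (illustrative numbers only, taken from the message above):

```c
#include <stdint.h>

#define TARGET   ((uint64_t)0x00400000)   /* code segment at ~4 MB */
#define HALF_WIN ((uint64_t)32 << 20)     /* 32 MB each way */

/* Usable pool: the part of [target - half, target + half] that lies
 * above address 0. Addresses below zero "wrap around" into the upper
 * address space, which the kernel reserves, so they are lost. */
static uint64_t usable_pool(uint64_t target, uint64_t half)
{
  uint64_t lo = target > half ? target - half : 0;
  return (target + half) - lo;
}
```

For these numbers usable_pool() comes out to 36 MB, matching the "little more than 32 MB" observed on the ARM64 machine.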

4. And this is where the Linux mmap() scalability bug came into play. During the initial warmup period I had dozens of CPU cores all spinning in the mcode_alloc() loop, doing 32 mmap() attempts each, with most (or all) of them failing. I'm not quite sure why that activity eventually died down. I suspected poor quality of the PRNG being used, and got 2x lower warmup times by using a better PRNG and seeding it separately for each thread. That was still only part of the solution.
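One way to get well-mixed, per-thread randomized hints is a splitmix64-style generator with one state word per thread. This is my own sketch of the idea, not the patch actually submitted; the seeding helper and its constants are hypothetical:

```c
#include <stdint.h>

/* splitmix64: a small, well-mixed 64-bit PRNG (public domain algorithm
 * by Sebastiano Vigna). Keeping one state word per thread avoids both
 * weak mixing and correlated hint sequences across threads. */
static uint64_t splitmix64(uint64_t *state)
{
  uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  return z ^ (z >> 31);
}

/* Hypothetical per-thread seeding: mix something unique to the thread
 * (its id, or the address of a stack variable) into the initial state
 * so that no two threads probe the same hint sequence. */
static uint64_t seed_for_thread(uint64_t thread_unique)
{
  return 0x0123456789ABCDEFULL ^ thread_unique;
}
```

With correlated or poorly mixed hints, many threads probe the same addresses and collide in the same kernel locks; distinct per-thread streams spread the probes across the (already small) window.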

I proposed solutions for the above problems in the following Github issues:

https://github.com/LuaJIT/LuaJIT/issues/282
https://github.com/LuaJIT/LuaJIT/issues/283
https://github.com/LuaJIT/LuaJIT/issues/284
https://github.com/LuaJIT/LuaJIT/issues/285

With my (experimental quality!) patches, the warmup time for high-concurrency benchmarks was reduced from 30+ seconds to less than 5 seconds, which is good enough for my purposes.

I think there's room for further improvement. We probably want to exclude the process code and heap segments from the mcode_alloc() allocation range to reduce failed attempts, handle exhausted allocation pools more gracefully, use some linker magic to move the code segment higher in the process address space, etc. I'm also looking at mmap_probe(), which may require similar fixes. But then again, my current fixes are good enough for me.

/Alexey
