Hello,
I maintain a benchmark application that was recently migrated from
PUC Lua to LuaJIT. With some workloads the application may create lots
(thousands) of threads.
After the migration I started hitting mmap() scalability issues,
because LuaJIT uses mmap() intensively during initial trace
generation. This manifests as much lower performance during the
first 10-30 seconds of a benchmark run, with the CPU mostly idle in
workloads involving 1000+ threads.
This is apparently a long-known problem in the Linux kernel:
https://lkml.org/lkml/2013/1/2/299
In a nutshell, mmap()/munmap() calls performed concurrently by
multiple threads within the same process are serialized on a
per-process lock. There have been a number of attempts to fix this,
but apparently it is still there.
Searching the web does not indicate that anyone has encountered this
specific problem with LuaJIT.
Is there anything that can be done at the application or LuaJIT level to
circumvent or relax the mmap() bottleneck?
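One knob I am aware of, though I have not verified that it helps here, is
enlarging LuaJIT's machine-code areas so that trace compilation triggers
fewer mmap() calls. sizemcode and maxmcode are documented JIT parameters;
the values below are guesses, not tuned for this workload:

  -- Config sketch: ask LuaJIT for larger (and therefore fewer) mcode areas.
  -- sizemcode = size of each machine-code area in KB,
  -- maxmcode  = total machine-code limit in KB.
  if jit and jit.opt then
    jit.opt.start("sizemcode=512", "maxmcode=8192")
  end

(The same can be passed on the command line as -Osizemcode=512,maxmcode=8192.)
Is this the right direction, or is there something better?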
Here's a sample perf report captured during the initial period when
benchmark numbers are low:
 88.57%  0.00%  swapper   [unknown]  [k] 0x0000000080b50054
         88.22%
            0x80b50054
            secondary_start_kernel
            cpu_startup_entry
            arch_cpu_idle

 10.74%  0.00%  sysbench  sysbench   [.] worker_thread
          4.04%
             worker_thread
             lua_pcall
             lj_trace_ins
             __mmap64
             el0_svc_naked
             sys_mmap
             sys_mmap_pgoff
             vm_mmap_pgoff
             down_write_killable
             rwsem_down_write_failed_killable
             rwsem_optimistic_spin
             osq_lock

          3.88%
             worker_thread
             lua_pcall
             lj_trace_ins
             __GI___munmap
             el0_svc_naked
             sys_munmap
             down_write_killable
             rwsem_down_write_failed_killable
             rwsem_optimistic_spin
             osq_lock
Best regards,
Alexey.