The git repo for LuaJIT 2.1 now features an integrated statistical profiler with very low overhead. There are plenty of options to generate flat lists, two-level views, combined sampling, raw output and annotated source code. The profiler is able to sample stacks, VM states and zones.

Here are some simple examples (for more see the docs):

  $ luajit -jp spectralnorm.lua 4000
  53%  A
  24%  Av
  23%  Atv

  $ luajit -jp=s spectralnorm.lua 4000
  53%  A
    <- 50%  Av
    <- 50%  Atv
  24%  Atv
    <- 100%  AtAv
  23%  Av
    <- 100%  AtAv

  $ luajit -jp=l2 spectralnorm.lua 4000
  33%  spectralnorm.lua:21 < spectralnorm.lua:28
  33%  spectralnorm.lua:13 < spectralnorm.lua:27
  13%  spectralnorm.lua:7  < spectralnorm.lua:13
  12%  spectralnorm.lua:7  < spectralnorm.lua:21
   5%  spectralnorm.lua:6  < spectralnorm.lua:21
   4%  spectralnorm.lua:6  < spectralnorm.lua:13

  $ luajit -jp=a spectralnorm.lua 4000
  ====== spectralnorm.lua ======
  @@ 3 @@
       | -- contributed by Mike Pall
       |
       | local function A(i, j)
  10%  |   local ij = i+j-1
  25%  |   return 1.0 / (ij * (ij-1) * 0.5 + i)
       | end
       |
       | local function Av(x, y, N)
       |   for i=1,N do
       |     local a = 0
  32%  |     for j=1,N do a = a + x[j] * A(i, j) end
       |     y[i] = a
       |   end
       | end
  @@ 18 @@
       | local function Atv(x, y, N)
       |   for i=1,N do
       |     local a = 0
  32%  |     for j=1,N do a = a + x[j] * A(j, i) end
       |     y[i] = a
       |   end
       | end

Stack sampling works for both interpreted and JIT-compiled code. The results for JIT-compiled code may sometimes be surprising, since LuaJIT heavily optimizes and inlines Lua code.

Please note that it's not a goal to add every possible advanced functionality to the bundled high-level profiler. The low-level Lua and C APIs for the profiler are fully documented. Third-party authors are welcome to use these building blocks and offer their own extensions, such as graphical profilers or IDE integration.

Documentation is available from the 2.1 branch of the git repository in the file: doc/ext_profiler.html

Instructions on how to access the git repo are here: http://luajit.org/download.html

Feedback welcome!
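To illustrate how the flat (-jp) and split (-jp=s) views above relate to raw stack samples, here is a minimal sketch in Python. The sample data is made up to mirror the spectralnorm numbers; the real profiler instead aggregates stack dumps collected inside the LuaJIT VM:

```python
from collections import Counter, defaultdict

# Made-up stack samples (innermost frame first), standing in for the
# stack dumps the real profiler would collect.
samples = (
    [("A", "Av", "AtAv")] * 53 +
    [("A", "Atv", "AtAv")] * 53 +
    [("Atv", "AtAv")] * 48 +
    [("Av", "AtAv")] * 46
)
total = len(samples)

# Flat view (-jp): aggregate by innermost function.
flat = Counter(stack[0] for stack in samples)
for name, n in flat.most_common():
    print("%2d%%  %s" % (round(100 * n / total), name))

# Split view (-jp=s): break each function down by its direct caller.
callers = defaultdict(Counter)
for stack in samples:
    callers[stack[0]][stack[1]] += 1
for name, n in flat.most_common():
    print("%2d%%  %s" % (round(100 * n / total), name))
    for caller, c in callers[name].most_common():
        print("  <- %d%%  %s" % (round(100 * c / n), caller))
```

With this input the flat view prints 53% A, 24% Atv, 23% Av, and the split view attributes A's samples 50/50 to Av and Atv, matching the output shown above.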
----

As you might have noticed, I had to change my plans compared to the original approach presented in June. The main problem with the instrumenting profiler was finding high-precision and high-speed timing sources for all platforms. Even on x86/x64, where RDTSC is readily available, the results were disappointing: the necessary pipeline flushes shadowed the actual timings to the point where the measurements were less accurate than with a sampling profiler! Other platforms offered only inaccurate timing sources, or none accessible from user mode. And to top it off, the instrumentation added considerable overhead -- much less than traditional solutions using Lua hooks, but much more than I was willing to accept.

Ok, so I had to scrap that work and decided to go with a sampling profiler. I've previously mentioned that multi-platform support for a sampling profiler would be challenging, but it seemed more promising than pouring even more time into a dead end. I've implemented three timing/sampling mechanisms: setitimer+SIGPROF, POSIX threads and Windows threads. This should cover nearly all platforms (even the consoles).

Here's how it works: the asynchronous signal or thread only sets a flag and patches the dispatch table for the interpreter. Careful arrangement of accesses to the shared state avoids explicit synchronization -- eventual consistency is sufficient for a profiler that has to cope with timer inaccuracies, anyway. Whenever the interpreter executes the next instruction, the profiler callback is invoked synchronously (Lua callbacks are invoked on a separate coroutine). The callback samples the stack with a fast stack-dump helper and aggregates all samples. The stored samples are analyzed after the profiling run has finished.

The JIT compiler adds extra checks to the generated code at the desired granularity (function or line) when profiling is active. The machine code exits to the interpreter when the flag is set.
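The setitimer+SIGPROF variant can be sketched in Python, which exposes the same POSIX primitives. This toy profiler does the aggregation directly in the handler and only records the innermost function name per tick; the real implementation instead just sets a flag, patches the dispatch table, and defers the stack dump to a safe point in the interpreter:

```python
import signal
from collections import Counter

samples = Counter()

def tick(signum, frame):
    # Runs in signal context: record the innermost function name.
    # (LuaJIT only sets a flag here and defers the real work to the
    # next interpreter instruction -- async-signal-safety matters.)
    samples[frame.f_code.co_name] += 1

def busy():
    x = 0.0
    for i in range(1, 3_000_000):
        x += 1.0 / i
    return x

signal.signal(signal.SIGPROF, tick)
# Fire every 5 ms of consumed CPU time; ITIMER_PROF is the same
# profiling clock setitimer+SIGPROF gives LuaJIT on POSIX systems.
signal.setitimer(signal.ITIMER_PROF, 0.005, 0.005)
busy()
signal.setitimer(signal.ITIMER_PROF, 0.0)  # stop the timer

total = sum(samples.values())
for name, n in samples.most_common():
    print("%2d%%  %s" % (round(100 * n / total), name))
```

On a POSIX system nearly all ticks land in busy(), so the output is a one-line flat profile dominated by that function.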
This will eventually invoke the profiler callback. VM states are sampled when the timer expires, not when the callback is later invoked. This allows accurate reporting of VM states: interpreted vs. compiled vs. running in C code vs. garbage-collector overhead.

Note: I've dropped the originally planned basic-block mode, since the results were confusing and line profiling turned out to be cheap enough.

And a final word on a key point that's often overlooked when discussing instrumenting vs. statistical profilers: sampling a million times a second vs. a hundred times a second does not significantly improve the ability to pinpoint (and fix) the hotspots of a program. It doesn't matter whether a piece of code that takes exactly 20.7% of the runtime is shown as taking 18% or 23% -- you'll have to investigate it, anyway. A profiler is not a benchmarking tool.

--Mike