Hi Mike,

This looks great; thank you! I'm already thinking about ways to
integrate this into ZBS.

Couple of questions: would it be possible to add an option to the
high-level profiler to simply return the result instead of saving it
to a file?

require("jit.p").start(options)
...
local output = require("jit.p").stop()

Also, is the profiler always available, so that I can simply check
that the LuaJIT version is 2.1+ to guarantee profiler functionality?

To profile the interpreted code, is it sufficient to call jit.off()
before starting profiling?

Paul.

On Tue, Sep 10, 2013 at 10:26 AM, Mike Pall <mike-1309@xxxxxxxxxx> wrote:
> The git repo for LuaJIT 2.1 now features an integrated statistical
> profiler with very low overhead.
>
> There are plenty of options to generate flat lists, two-level
> views, combined sampling, raw output and annotated source code.
> The profiler is able to sample stacks, VM states and zones.
>
> Here are some simple examples (for more see the docs):
>
> $ luajit -jp spectralnorm.lua 4000
> 53%  A
> 24%  Av
> 23%  Atv
>
> $ luajit -jp=s spectralnorm.lua 4000
> 53%  A
>   <- 50%  Av
>   <- 50%  Atv
> 24%  Atv
>   <- 100%  AtAv
> 23%  Av
>   <- 100%  AtAv
>
> $ luajit -jp=l2 spectralnorm.lua 4000
> 33%  spectralnorm.lua:21 < spectralnorm.lua:28
> 33%  spectralnorm.lua:13 < spectralnorm.lua:27
> 13%  spectralnorm.lua:7 < spectralnorm.lua:13
> 12%  spectralnorm.lua:7 < spectralnorm.lua:21
>  5%  spectralnorm.lua:6 < spectralnorm.lua:21
>  4%  spectralnorm.lua:6 < spectralnorm.lua:13
>
> $ luajit -jp=a spectralnorm.lua 4000
> ====== spectralnorm.lua ======
> @@ 3 @@
>       | -- contributed by Mike Pall
>       |
>       | local function A(i, j)
>  10%  |   local ij = i+j-1
>  25%  |   return 1.0 / (ij * (ij-1) * 0.5 + i)
>       | end
>       |
>       | local function Av(x, y, N)
>       |   for i=1,N do
>       |     local a = 0
>  32%  |     for j=1,N do a = a + x[j] * A(i, j) end
>       |     y[i] = a
>       |   end
>       | end
> @@ 18 @@
>       | local function Atv(x, y, N)
>       |   for i=1,N do
>       |     local a = 0
>  32%  |     for j=1,N do a = a + x[j] * A(j, i) end
>       |     y[i] = a
>       |   end
>       | end
>
> Stack sampling works for both interpreted and JIT-compiled code.
> The results for JIT-compiled code may sometimes be surprising
> since LuaJIT heavily optimizes and inlines Lua code.
>
> Please note that it's not a goal to add every possible advanced
> functionality to the bundled high-level profiler. The low-level
> Lua and C APIs for the profiler are fully documented. Third-party
> authors are welcome to use these building blocks and offer their
> own extensions, such as graphical profilers or IDE integration.
>
> Documentation is available from the 2.1 branch of the git
> repository in the file: doc/ext_profiler.html
>
> Instructions on how to access the git repo are here:
> http://luajit.org/download.html
>
> Feedback welcome!
>
> ----
>
> As you might have noticed, I had to change my plans compared to
> the original approach presented in June. The main problem with the
> instrumenting profiler was finding high-precision and high-speed
> timing sources for all platforms.
>
> Even on x86/x64, where RDTSC is readily available, the results
> were disappointing. The necessary pipeline flushes shadowed the
> actual timings up to the point where the measurements were less
> accurate than with a sampling profiler! Other platforms offered
> only inaccurate timing sources or none that are accessible from
> user mode. And to top it off, the instrumentation added
> considerable overhead. Much less than traditional solutions using
> Lua hooks, but much more than I was willing to accept.
>
> Ok, so I had to scrap that work and decided to go with a sampling
> profiler. I've previously mentioned that multi-platform support
> for a sampling profiler would be challenging. But it seemed to be
> more promising than pouring even more time into a dead end. I've
> implemented three timing/sampling mechanisms: setitimer+SIGPROF,
> POSIX threads and Windows threads. This should cover nearly all
> platforms (even the consoles).
>
> Here's how it works: the asynchronous signal or thread only sets a
> flag and patches the dispatch table for the interpreter. Careful
> arrangement of accesses to the shared state avoids explicit
> synchronization -- eventual consistency is sufficient for a
> profiler that has to cope with timer inaccuracies, anyway.
>
> Whenever the interpreter executes the next instruction, the
> profiler callback is invoked synchronously (Lua callbacks are
> invoked on a separate coroutine). The callback samples the stack
> with a fast stack dump helper and aggregates all samples. The
> stored samples are analyzed after the profiling run is finished.
>
> The JIT-compiler adds extra checks to the generated code at the
> desired granularity (function or line) when profiling is active.
> The machine code exits to the interpreter when the flag is set.
> This will eventually invoke the profiler callback.
>
> VM states are sampled when the timer expires, not when the
> callback is later invoked. This allows accurately reporting VM
> states: interpreted vs. compiled vs. running in C code vs. garbage
> collector overhead.
>
> Note: I've dropped the originally planned basic block mode since
> the results were confusing and line profiling turned out to be
> cheap enough.
>
> And a final word: One key point often overlooked when discussing
> instrumenting vs. statistical profilers: sampling a million times
> a second vs. hundred times a second does not significantly improve
> the ability to pinpoint (and fix) the hotspots of a program. It
> doesn't matter if a piece of code that takes exactly 20.7% of
> the runtime is shown as taking 18% or 23% -- you'll have to
> investigate, anyway. A profiler is not a benchmarking tool.
>
> --Mike
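
For reference, the callback-and-aggregate flow described in the quoted
message can also be driven from Lua through the low-level API, which
keeps the results in memory rather than in a file. The following is
only a rough sketch, not part of the bundled jit.p module. It assumes
LuaJIT 2.1 with the jit.profile module as described in
doc/ext_profiler.html; collect_samples, busy, by_line and by_state are
hypothetical names chosen for illustration.

```lua
-- Sketch only: aggregate profiler samples in Lua tables instead of a file.
-- Assumes LuaJIT 2.1 with the jit.profile module (see doc/ext_profiler.html).
-- collect_samples and busy are illustrative names, not part of LuaJIT.

local function collect_samples(workload)
  -- Feature check: the module is absent on runtimes without the profiler.
  local ok, profile = pcall(require, "jit.profile")
  if not ok then return nil end

  local by_line, by_state = {}, {}

  -- Called synchronously by the VM with the number of samples taken
  -- since the last callback and the VM state at sampling time.
  local function on_sample(thread, samples, vmstate)
    -- "l" formats a frame as module:line; depth 1 keeps the top frame only.
    local where = profile.dumpstack(thread, "l", 1)
    by_line[where] = (by_line[where] or 0) + samples
    by_state[vmstate] = (by_state[vmstate] or 0) + samples
  end

  profile.start("l", on_sample)  -- line granularity, default interval
  workload()
  profile.stop()

  -- Analysis happens after the run, on the aggregated tables.
  return { lines = by_line, states = by_state }
end

-- Arbitrary CPU-bound workload, just to give the profiler something to see.
local function busy()
  local x = 0
  for i = 1, 2e7 do x = x + i % 7 end
  return x
end

local result = collect_samples(busy)
if result then
  for where, n in pairs(result.lines) do print(n, where) end
  for state, n in pairs(result.states) do print(n, "vmstate " .. state) end
else
  print("jit.profile not available")
end
```

The pcall around require doubles as a feature test, which may be more
robust than comparing version numbers, though whether every 2.1 build
ships the module is a question for Mike. The keys of the states table
are the single-letter VM state codes documented in ext_profiler.html.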