Re: LuaJIT 2.1 Profiler released

  • From: Paul K <paul@xxxxxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Tue, 10 Sep 2013 11:26:19 -0700

Hi Mike,

This looks great; thank you! I'm already thinking about ways to
integrate this into ZBS. A couple of questions: would it be possible
to add an option to the high-level profiler to simply return the
result instead of saving it to a file? Something like:

require("jit.p").start(options)
...
local output = require("jit.p").stop()

Also, is the profiler always available, so that I can simply check
that the LuaJIT version is 2.1+ and take that as a guarantee that the
profiler is present?
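
If it isn't, I'd probably fall back to a guarded require, roughly
like this (just my sketch; the per-line "l" mode is taken from your
examples below):

local ok, profiler = pcall(require, "jit.p")
if ok then
  profiler.start("l")  -- per-line profiling, as in the -jp=l examples
  -- ... code to be profiled ...
  profiler.stop()
else
  -- no bundled profiler (plain Lua or LuaJIT older than 2.1)
end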

To profile the interpreted code, is it sufficient to call jit.off()
before starting profiling?
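
In other words, something along these lines (again just my guess at
how the pieces are meant to be combined):

jit.off()                    -- run everything in the interpreter
require("jit.p").start("l")
-- ... code to be profiled ...
require("jit.p").stop()
jit.on()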

Paul.

On Tue, Sep 10, 2013 at 10:26 AM, Mike Pall <mike-1309@xxxxxxxxxx> wrote:
> The git repo for LuaJIT 2.1 now features an integrated statistical
> profiler with very low overhead.
>
> There are plenty of options to generate flat lists, two-level
> views, combined sampling, raw output and annotated source code.
> The profiler is able to sample stacks, VM states and zones.
>
> Here are some simple examples (for more see the docs):
>
> $ luajit -jp spectralnorm.lua 4000
> 53%  A
> 24%  Av
> 23%  Atv
>
> $ luajit -jp=s spectralnorm.lua 4000
> 53%  A
>   <- 50%  Av
>   <- 50%  Atv
> 24%  Atv
>   <- 100%  AtAv
> 23%  Av
>   <- 100%  AtAv
>
> $ luajit -jp=l2 spectralnorm.lua 4000
> 33%  spectralnorm.lua:21 < spectralnorm.lua:28
> 33%  spectralnorm.lua:13 < spectralnorm.lua:27
> 13%  spectralnorm.lua:7 < spectralnorm.lua:13
> 12%  spectralnorm.lua:7 < spectralnorm.lua:21
>  5%  spectralnorm.lua:6 < spectralnorm.lua:21
>  4%  spectralnorm.lua:6 < spectralnorm.lua:13
>
> $ luajit -jp=a spectralnorm.lua 4000
> ====== spectralnorm.lua ======
> @@ 3 @@
>       | -- contributed by Mike Pall
>       |
>       | local function A(i, j)
>   10% |   local ij = i+j-1
>   25% |   return 1.0 / (ij * (ij-1) * 0.5 + i)
>       | end
>       |
>       | local function Av(x, y, N)
>       |   for i=1,N do
>       |     local a = 0
>   32% |     for j=1,N do a = a + x[j] * A(i, j) end
>       |     y[i] = a
>       |   end
>       | end
> @@ 18 @@
>       | local function Atv(x, y, N)
>       |   for i=1,N do
>       |     local a = 0
>   32% |     for j=1,N do a = a + x[j] * A(j, i) end
>       |     y[i] = a
>       |   end
>       | end
>
> Stack sampling works for both interpreted and JIT-compiled code.
> The results for JIT-compiled code may sometimes be surprising
> since LuaJIT heavily optimizes and inlines Lua code.
>
> Please note that it's not a goal to add every possible advanced
> functionality to the bundled high-level profiler. The low-level
> Lua and C APIs for the profiler are fully documented. Third-party
> authors are welcome to use these building blocks and offer their
> own extensions, such as graphical profilers or IDE integration.
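>
> As a rough sketch, a minimal consumer of the Lua API might look
> like this (the exact mode and format flags are described in
> doc/ext_profiler.html):
>
> local profile = require("jit.profile")
> local counts = {}
>
> -- the callback runs synchronously on a separate coroutine
> profile.start("l", function(thread, samples, vmstate)
>   -- dump only the innermost frame as module:line
>   local key = profile.dumpstack(thread, "l", 1)
>   counts[key] = (counts[key] or 0) + samples
> end)
>
> -- ... run the code to be profiled ...
>
> profile.stop()
> for loc, n in pairs(counts) do print(n, loc) end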
>
> Documentation is available from the 2.1 branch of the git
> repository in the file: doc/ext_profiler.html
>
> Instructions on how to access the git repo are here:
>   http://luajit.org/download.html
>
> Feedback welcome!
>
> ----
>
> As you might have noticed, I had to change my plans compared to
> the original approach presented in June. The main problem with the
> instrumenting profiler was finding high-precision and high-speed
> timing sources for all platforms.
>
> Even on x86/x64, where RDTSC is readily available, the results
> were disappointing. The necessary pipeline flushes shadowed the
> actual timings up to the point where the measurements were less
> accurate than with a sampling profiler! Other platforms offered
> only inaccurate timing sources or none that are accessible from
> user mode. And to top it off, the instrumentation added
> considerable overhead. Much less than traditional solutions using
> Lua hooks, but much more than I was willing to accept.
>
> Ok, so I had to scrap that work and decided to go with a sampling
> profiler. I've previously mentioned that multi-platform support
> for a sampling profiler would be challenging. But it seemed to be
> more promising than pouring even more time into a dead end. I've
> implemented three timing/sampling mechanisms: setitimer+SIGPROF,
> POSIX threads and Windows threads. This should cover nearly all
> platforms (even the consoles).
>
> Here's how it works: the asynchronous signal or thread only sets a
> flag and patches the dispatch table for the interpreter. Careful
> arrangement of accesses to the shared state avoids explicit
> synchronization -- eventual consistency is sufficient for a
> profiler that has to cope with timer inaccuracies, anyway.
>
> Whenever the interpreter executes the next instruction, the
> profiler callback is invoked synchronously (Lua callbacks are
> invoked on a separate coroutine). The callback samples the stack
> with a fast stack dump helper and aggregates all samples. The
> stored samples are analyzed after the profiling run is finished.
>
> The JIT-compiler adds extra checks to the generated code at the
> desired granularity (function or line) when profiling is active.
> The machine code exits to the interpreter when the flag is set.
> This will eventually invoke the profiler callback.
>
> VM states are sampled when the timer expires, not when the
> callback is later invoked. This allows accurately reporting VM
> states: interpreted vs. compiled vs. running in C code vs. garbage
> collector overhead.
>
> Note: I've dropped the originally planned basic block mode since
> the results were confusing and line profiling turned out to be
> cheap enough.
>
> And a final word: One key point often overlooked when discussing
> instrumenting vs. statistical profilers: sampling a million times
> a second vs. a hundred times a second does not significantly improve
> the ability to pinpoint (and fix) the hotspots of a program. It
> doesn't matter if a piece of code that takes exactly 20.7% of
> the runtime is shown as taking 18% or 23% -- you'll have to
> investigate, anyway. A profiler is not a benchmarking tool.
>
> --Mike
>
