Re: Strange performance behaviour

  • From: Henk Boom <henk@xxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Sun, 26 Aug 2012 12:50:24 -0400

On 25 August 2012 18:26, Mike Pall <mike-1208@xxxxxxxxxx> wrote:
> Henk Boom wrote:
>> I'm not sure if this is something to do with luajit or just something
>> about memory access pattern efficiency that I don't know about, but
>> either way knowing what's making it happen would help me to find out
>> what to avoid :)
> This doesn't have much to do with the compiler. You'd observe the
> same effect on any modern CPU with memory access disambiguation.
> The compiler cannot disambiguate buffer[0] from buffer[n], since n
> is not a constant (and specializing to a numeric value is usually
> not a win). Ok, so the load of buffer[0] can be forwarded from the
> preceding store to buffer[0]. But the load of buffer[n] can
> neither be hoisted nor forwarded. This in turn means none of the
> stores to buffer[0] can be eliminated, either.
> Actually LuaJIT is in no better position than a C compiler and
> I bet the code and the resulting behavior are pretty much the same.
> Modern CPUs can do better, because they know the actual addresses
> involved and their memory disambiguation logic speculates that
> buffer[n] does (n = 0) or does not (n = 1) alias buffer[0].
> The non-aliasing case (n = 1) is simpler to handle for the CPU,
> because buffer[n] is effectively read-only and the load (and the
> multiply) can be executed ahead or in parallel to everything else.
> For the aliasing case (n = 0), the preceding store to buffer[0]
> must be forwarded to the load for buffer[n]. This incurs a
> store-to-load forwarding delay, which increases the depth of the
> data dependency chain by this delay plus the multiply latency for
> each step.
> tl;dr: The performance difference you're seeing is a result of
> billions of dollars of engineering that went into the CPU next to
> your knee.
> Related reading:
> Jon Stokes
> Inside the Machine: An Illustrated Introduction to Microprocessors
> and Computer Architecture
> [It's a bit pricey and a bit dated (only up to the Core2). Or get
> the articles this book is based on from]

Thanks for the very informative answers! :o

I've thought some more about my algorithm and realized that what
I'm doing is just a disguised vector-by-sparse-matrix multiplication.
I'm effectively using an array of [row, column, value] tuples as the
representation, though, which means that I have to gradually
accumulate my answer into a vector instead of computing it
element by element. This sometimes leads to the awkward access
patterns that, I'm now realizing, I should avoid. I'm going to look
into better sparse matrix structures so I can rewrite this in a
smarter way.
