Henk Boom wrote:
> I'm not sure if this is something to do with luajit or just something
> about memory access pattern efficiency that I don't know about, but
> either way knowing what's making it happen would help me to find out
> what to avoid :)

This doesn't have much to do with the compiler. You'd observe the same effect on any modern CPU with memory access disambiguation.

The compiler cannot disambiguate buffer[0] from buffer[n], since n is not a constant (and specializing to a numeric value is usually not a win). Ok, so the load of buffer[0] can be forwarded from the preceding store to buffer[0]. But the load of buffer[n] can neither be hoisted nor forwarded. This in turn means none of the stores to buffer[0] can be eliminated, either. Actually, LuaJIT is in no better position here than a C compiler, and I bet the generated code and the resulting behavior are pretty much the same.

Modern CPUs can do better, because they know the actual addresses involved: their memory disambiguation logic speculates that buffer[n] does (n = 0) or does not (n = 1) alias buffer[0].

The non-aliasing case (n = 1) is simpler for the CPU to handle, because buffer[n] is effectively read-only and the load (and the multiply) can be executed ahead of, or in parallel with, everything else.

For the aliasing case (n = 0), the preceding store to buffer[0] must be forwarded to the load of buffer[n]. This incurs a store-to-load forwarding delay, which lengthens the data dependency chain by this delay plus the multiply latency at every step.

tl;dr: The performance difference you're seeing is the result of billions of dollars of engineering that went into the CPU next to your knee.

Related reading:
  Jon Stokes
  Inside the Machine: An Illustrated Introduction to Microprocessors
  and Computer Architecture
[It's a bit pricey and a bit dated (it only covers up to the Core 2). Or get the articles this book is based on from arstechnica.com.]

--Mike
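
P.S. If you want to see the effect in isolation, here's a minimal C sketch of the access pattern as I read it from your description. It is not your original benchmark: the buffer, the multiply constant and the iteration count are all made up. Compile with -O2 and time it with an argument of 0 versus 1:

/* Minimal sketch of the access pattern under discussion -- NOT the
 * original benchmark.  The buffer size, multiply constant and
 * iteration count are made up for illustration.
 *
 * Build and time:  cc -O2 sketch.c -o sketch
 *                  time ./sketch 0   (aliasing case)
 *                  time ./sketch 1   (non-aliasing case)
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  /* n comes from the command line, so the compiler cannot prove at
   * compile time whether buffer[n] aliases buffer[0]: the load must
   * be redone and the store kept on every iteration. */
  size_t n = (argc > 1) ? strtoul(argv[1], NULL, 10) : 0;
  if (n > 1) n = 1;  /* stay in bounds */

  static double buffer[2] = { 1.0, 1.0 };

  for (long i = 0; i < 100000000L; i++) {
    /* n == 0: the load depends on the immediately preceding store,
     *         so every iteration pays the store-to-load forwarding
     *         delay plus the multiply latency.
     * n == 1: the CPU's disambiguation logic speculates that the
     *         load does not alias the store, so the load and the
     *         multiply run ahead of / in parallel with the stores. */
    buffer[0] = buffer[n] * 1.0000001;
  }

  printf("buffer[0] = %.17g\n", buffer[0]);  /* keep the loop alive */
  return 0;
}

On the out-of-order x86 cores I'm familiar with, the n = 0 run should come out noticeably slower, since the forwarding delay plus the multiply latency sits on the critical path of every iteration, while the n = 1 run is mostly limited by store throughput and loop overhead.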