Re: Compiler load/store barrier; volatile pointer; barriers in general

  • From: Mike Pall <mike-1501@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Wed, 28 Jan 2015 10:10:37 +0100

Luke Gorrie wrote:
> What are the ordering properties of loads/stores to FFI objects? Are there
> situations where LuaJIT will reorder these?

There's only elimination, but no reordering at this time. But
there might be in the future.

> If so, is there a way to do a compiler load/store barrier in
> cases where ordering is important?

There's no explicit compiler optimization barrier. But any call to
a (dummy) C function via the FFI would effectively form such a
barrier.
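
A minimal sketch of such a dummy function, assuming it is built
into a shared library and declared on the Lua side with ffi.cdef.
The name lj_compiler_barrier, the file name and the build line are
hypothetical; none of this ships with LuaJIT.

```c
/* barrier.c: hypothetical helper, not part of LuaJIT.
 * Assumed build: cc -O2 -shared -fPIC -o libbarrier.so barrier.c
 *
 * Assumed Lua-side usage:
 *   local ffi = require("ffi")
 *   ffi.cdef("int lj_compiler_barrier(void);")
 *   local lib = ffi.load("./libbarrier.so")
 *   lib.lj_compiler_barrier()
 */

/* The body does no real work. The call itself is what matters:
 * per the reply above, any C call made through the FFI effectively
 * acts as an optimization barrier for the JIT compiler. */
int lj_compiler_barrier(void)
{
    return 0;
}
```

This only constrains the compiler. It says nothing about what the
CPU does with the loads and stores afterwards.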

> Can LuaJIT also eliminate FFI load/store operations that it thinks are
> redundant?

Yes, the compiler tries to eliminate redundant loads (L2L FWD)
or stores (DSE) and forward stored values to loads (S2L FWD).
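
As a rough illustration of those three optimizations, here is a
hand-written before/after pair in C. It is not LuaJIT output; the
function names are made up for this sketch, and it only shows the
kind of rewrite that each abbreviation stands for.

```c
/* Hand-written illustration; not generated by LuaJIT. */

/* Memory accesses as the source program wrote them. */
int as_written(int *p)
{
    int a, b, c;
    p[0] = 1;   /* store, overwritten below before anything reads it */
    p[0] = 2;   /* store */
    a = p[0];   /* load of the value that was just stored */
    b = p[1];   /* load */
    c = p[1];   /* same load again, with no store in between */
    return a + b + c;
}

/* The same accesses after the three optimizations. */
int as_optimized(int *p)
{
    int a, b;
    /* DSE: the dead store of 1 is gone. */
    p[0] = 2;
    a = 2;      /* S2L FWD: the stored value replaces the load */
    b = p[1];
    /* L2L FWD: the second load of p[1] reuses b. */
    return a + b + b;
}
```

For ordinary memory both versions are equivalent. For a device
register, every dropped access is a side effect that never happens.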

> If so, is there a way to prevent this in cases where every
> read/write must really be performed (e.g. when talking to hardware with
> MMIO and reading/writing for side effects)?

See above. Any C function call serves as an optimization barrier
for the compiler. But unless this one uses a low-level barrier
instruction, the CPU may still eliminate or reorder loads and
stores. MMIO regions are usually tagged to prevent that, but that
may have a significant impact on performance, even when you don't
want or need it.
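
A sketch of a C helper that also issues a hardware barrier, using
the C11 fence from <stdatomic.h>. The name lj_hw_barrier is
hypothetical. Whether a sequentially consistent thread fence is
sufficient for a particular device depends on the architecture and
on how the region is mapped, so treat this as a starting point.

```c
#include <stdatomic.h>

/* Hypothetical helper, not part of LuaJIT. Called through the FFI,
 * the call acts as a compiler barrier for the JIT, and the fence
 * asks the CPU to order its own loads and stores around this point. */
int lj_hw_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    return 0;
}
```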

> Generally speaking: what barriers exist in LuaJIT and is it important to
> think about them? (For example are there common operations that will create
> a barrier and prevent optimizations that the JIT could otherwise do?)

Ok, so the workaround with a C function is more like using a
hammer, because it stops all memory optimizations. Maybe more
delicate tools are needed.

I think it would be difficult to do this at the individual cdata
object level (hard to explain the constraints to developers, too).
That would leave us with stopping optimizations for either all
loads or all stores or both. Alas, at that granularity one might
as well disable both, since DSE is not that effective, anyway.

There's already an IR instruction that does that (XBAR). But so
far there's no cheap way to emit it, e.g. in a tight loop.

The issue with hardware barriers is similar in that most
architectures only offer a combined read/write barrier. And this
is what you really want most of the time, anyway.

That said, I'm willing to add an ffi.barrier() function that
would give you finer control and more efficiency in tight polling
loops. It would take an argument that is either "l", "s", "m"
(compiler barrier) or "L", "S", "M" (hardware barrier). The latter
would imply the former, of course. Patches welcome!
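
The argument scheme can be sketched in C as follows. ffi.barrier()
does not exist; barrier_sketch is a made-up name. Reading the
letters as load, store and both, and mapping them onto acquire,
release and sequentially consistent fences, are assumptions of
this sketch. A real implementation would presumably act on the JIT
compiler's IR for the lowercase kinds rather than on a C fence.

```c
#include <stdatomic.h>

/* Sketch of the proposed argument semantics only.
 * Lowercase: compiler-only barrier (atomic_signal_fence).
 * Uppercase: hardware barrier (atomic_thread_fence), which also
 * restrains the compiler, so it implies the lowercase kind.
 * Returns 0 on success, -1 for an unknown kind. */
int barrier_sketch(char kind)
{
    switch (kind) {
    case 'l': atomic_signal_fence(memory_order_acquire); return 0;
    case 's': atomic_signal_fence(memory_order_release); return 0;
    case 'm': atomic_signal_fence(memory_order_seq_cst); return 0;
    case 'L': atomic_thread_fence(memory_order_acquire); return 0;
    case 'S': atomic_thread_fence(memory_order_release); return 0;
    case 'M': atomic_thread_fence(memory_order_seq_cst); return 0;
    default:  return -1;
    }
}
```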

But I'm wary of adding an extra API function without seeing a good
use case -- most of them use a C function call in the inner loop,
anyway.

--Mike
